State of NewHash work, future directions, and discussion

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* State of NewHash work, future directions, and discussion
@ 2018-06-09 20:56 brian m. carlson
  2018-06-09 21:26 ` Ævar Arnfjörð Bjarmason
                   ` (3 more replies)
  0 siblings, 4 replies; 66+ messages in thread
From: brian m. carlson @ 2018-06-09 20:56 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 4661 bytes --]

Since there's been a lot of questions recently about the state of the
NewHash work, I thought I'd send out a summary.

== Status

I have patches to make the entire codebase work, including passing all
tests, when Git is converted to use a 256-bit hash algorithm.
Obviously, such a Git is incompatible with the current version, but it
means that we've fixed essentially all of the hard-coded 20 and 40
constants (and therefore Git doesn't segfault).

I'm working on getting a 256-bit Git to work with SHA-1 being the
default.  Currently, this involves doing things like writing transport
code, since in order to clone a repository, you need to be able to set
up the hash algorithm correctly.  I know that this was a non-goal in the
transition plan, but since the testsuite doesn't pass without it, it's
become necessary.

Some of these patches will be making their way to the list soon.
They're hanging out in the normal places in the object-id-part14 branch
(which may be rebased).

== Future Design

The work I've done necessarily involves porting everything to use
the_hash_algo.  Essentially, when the piece I'm currently working on is
complete, we'll have a transition stage 4 implementation (all NewHash).
Stage 2 and 3 will be implemented next.

My vision of how data is stored is that the .git directory is, except
for pack indices and the loose object lookup table, entirely in one
format.  It will be all SHA-1 or all NewHash.  This algorithm will be
stored in the_hash_algo.

I plan on introducing an array of hash algorithms into struct repository
(and wrapper macros) which stores, in order, the output hash, and if
used, the additional input hash.

Functions like get_oid_hex and parse_oid_hex will acquire an internal
version, which knows about parsing things (like refs) in the internal
format, and one which knows about parsing in the UI formats.  Similarly,
oid_to_hex will have an internal version that handles data in the .git
directory, and an external version that produces data in the output
format.  Translation will take place at the outer edges of the program.

The transition plan anticipates a stage 1 where accept only SHA-1 on
input and produce only SHA-1 on output, but store in NewHash.  As I've
worked with our tests, I've realized such an implementation is not
entirely possible.  We have various tools that expect to accept invalid
object IDs, and obviously there's no way to have those continue to work.
We'd have to either reject invalid data in such a case or combine stages
1 and 2.

== Compatibility with this Work

If you're working on new features and you'd like to implement the best
possible compatibility with this work, here are some recommendations:

* Assume everything in the .git directory but pack indices and the loose
  object index will be in the same algorithm and that that algorithm is
  the_hash_algo.
* For the moment, use the_hash_algo to look up the size of all
  hash-related constants.  Use GIT_MAX_* for allocations.
* If you are writing a new data format, add a version number.
* If you need to serialize an algorithm identifier into your data
  format, use the format_id field of struct git_hash_algo.  It's
  designed specifically for that purpose.
* You can safely assume that the_hash_algo will be suitably initialized
  to the correct algorithm for your repository.
* Keep using the object ID functions and struct object_id.
* Try not to use mmap'd structs for reading and writing formats on disk,
  since these are hard to make hash size agnostic.

== Discussion about an Actual NewHash

Since I'll be writing new code, I'll be writing tests for this code.
However, writing tests for creating and initializing repositories
requires that I be able to test that objects are being serialized
correctly, and therefore requires that I actually know what the hash
algorithm is going to be.  I also can't submit code for multi-hash packs
when we officially only support one hash algorithm.

I know that we have long tried to avoid discussing the specific
algorithm to use, in part because the last discussion generated more
heat than light, and settled on referring to it as NewHash for the time
being.  However, I think it's time to pick this topic back up, since I
can't really continue work in this direction without us picking a
NewHash.

If people are interested, I've done some analysis on availability of
implementations, performance, and other attributes described in the
transition plan and can send that to the list.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 867 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: State of NewHash work, future directions, and discussion
  2018-06-09 20:56 State of NewHash work, future directions, and discussion brian m. carlson
@ 2018-06-09 21:26 ` Ævar Arnfjörð Bjarmason
  2018-06-09 22:49 ` Hash algorithm analysis brian m. carlson
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 66+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-06-09 21:26 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git


On Sat, Jun 09 2018, brian m. carlson wrote:

> Since there's been a lot of questions recently about the state of the
> NewHash work, I thought I'd send out a summary.

Thanks for all your work on this.

> I know that we have long tried to avoid discussing the specific
> algorithm to use, in part because the last discussion generated more
> heat than light, and settled on referring to it as NewHash for the time
> being.  However, I think it's time to pick this topic back up, since I
> can't really continue work in this direction without us picking a
> NewHash.
>
> If people are interested, I've done some analysis on availability of
> implementations, performance, and other attributes described in the
> transition plan and can send that to the list.

Let's see it!

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Hash algorithm analysis
  2018-06-09 20:56 State of NewHash work, future directions, and discussion brian m. carlson
  2018-06-09 21:26 ` Ævar Arnfjörð Bjarmason
@ 2018-06-09 22:49 ` brian m. carlson
  2018-06-11 19:29   ` Jonathan Nieder
  2018-06-11 21:19   ` Hash algorithm analysis Ævar Arnfjörð Bjarmason
  2018-06-11 18:09 ` State of NewHash work, future directions, and discussion Duy Nguyen
  2018-06-11 19:01 ` Jonathan Nieder
  3 siblings, 2 replies; 66+ messages in thread
From: brian m. carlson @ 2018-06-09 22:49 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 6836 bytes --]

== Discussion of Candidates

I've implemented and tested the following algorithms, all of which are
256-bit (in alphabetical order):

* BLAKE2b (libb2)
* BLAKE2bp (libb2)
* KangarooTwelve (imported from the Keccak Code Package)
* SHA-256 (OpenSSL)
* SHA-512/256 (OpenSSL)
* SHA3-256 (OpenSSL)
* SHAKE128 (OpenSSL)

I also rejected some other candidates.  I couldn't find any reference or
implementation of SHA256×16, so I didn't implement it.  I didn't
consider SHAKE256 because it is nearly identical to SHA3-256 in almost
all characteristics (including performance).

I imported the optimized 64-bit implementation of KangarooTwelve.  The
AVX2 implementation was not considered for licensing reasons (it's
partially generated from external code, which falls foul of the GPL's
"preferred form for modifications" rule).

=== BLAKE2b and BLAKE2bp

These are the non-parallelized and parallelized 64-bit variants of
BLAKE2.

Benefits:
* Both algorithms provide 256-bit preimage resistance.

Downsides:
* Some people are uncomfortable that the security margin has been
  decreased from the original SHA-3 submission, although it is still
  considered secure.
* BLAKE2bp, as implemented in libb2, uses OpenMP (and therefore
  multithreading) by default.  It was no longer possible to run the
  testsuite with -j3 on my laptop in this configuration.

=== Keccak-based Algorithms

SHA3-256 is the 256-bit Keccak algorithm with 24 rounds, processing 136
bytes at a time.  SHAKE128 is an extendable output function with 24
rounds, processing 168 bytes at a time.  KangarooTwelve is an extendable
output function with 12 rounds, processing 136 bytes at a time.

Benefits:
* SHA3-256 provides 256-bit preimage resistance.
* SHA3-256 has been heavily studied and is believed to have a large
  security margin.

I noted the following downsides:
* There's a lack of a availability of KangarooTwelve in other
  implementations.  It may be the least available option in terms of
  implementations.
* Some people are uncomfortable that the security margin of
  KangarooTwelve has been decreased, although it is still considered
  secure.
* SHAKE128 and KangarooTwelve provide only 128-bit preimage resistance.

=== SHA-256 and SHA-512/256

These are the 32-bit and 64-bit SHA-2 algorithms that are 256 bits in
size.

I noted the following benefits:
* Both algorithms are well known and heavily analyzed.
* Both algorithms provide 256-bit preimage resistance.

== Implementation Support

|===
| Implementation | OpenSSL | libb2 | NSS | ACC | gcrypt | Nettle| CL  |
| SHA-1          | 🗸       |       | 🗸   | 🗸   | 🗸      | 🗸     | {1} |
| BLAKE2b        | f       | 🗸     |     |     | 🗸      |       | {2} |
| BLAKE2bp       |         | 🗸     |     |     |        |       |     |
| KangarooTwelve |         |       |     |     |        |       |     |
| SHA-256        | 🗸       |       | 🗸   |  🗸  | 🗸      | 🗸     | {1} |
| SHA-512/256    | 🗸       |       |     |     |        | 🗸     | {3} |
| SHA3-256       | 🗸       |       |     |     | 🗸      | 🗸     | {4} |
| SHAKE128       | 🗸       |       |     |     | 🗸      |       | {5} |
|===

f: future version (expected 1.2.0)
ACC: Apple Common Crypto
CL: Command-line

:1: OpenSSL, coreutils, Perl Digest::SHA.
:2: OpenSSL, coreutils.
:3: OpenSSL
:4: OpenSSL, Perl Digest::SHA3.
:5: Perl Digest::SHA3.

=== Performance Analysis

The test system used below is my personal laptop, a 2016 Lenovo ThinkPad
X1 Carbon with an Intel i7-6600U CPU (2.60 GHz) running Debian unstable.

I implemented a test tool helper to compute speed much like OpenSSL
does.  Below is a comparison of speeds.  The columns indicate the speed
in KiB/s for chunks of the given size.  The runs are representative of
multiple similar runs.

256 and 1024 bytes were chosen to represent common tree and commit
object sizes and the 8 KiB is an approximate average blob size.

Algorithms are sorted by performance on the 1 KiB column.

|===
| Implementation             | 256 B  | 1 KiB  | 8 KiB  | 16 KiB |
| SHA-1 (OpenSSL)            | 513963 | 685966 | 748993 | 754270 |
| BLAKE2b (libb2)            | 488123 | 552839 | 576246 | 579292 |
| SHA-512/256 (OpenSSL)      | 181177 | 349002 | 499113 | 495169 |
| BLAKE2bp (libb2)           | 139891 | 344786 | 488390 | 522575 |
| SHA-256 (OpenSSL)          | 264276 | 333560 | 357830 | 355761 |
| KangarooTwelve             | 239305 | 307300 | 355257 | 364261 |
| SHAKE128 (OpenSSL)         | 154775 | 253344 | 337811 | 346732 |
| SHA3-256 (OpenSSL)         | 128597 | 185381 | 198931 | 207365 |
| BLAKE2bp (libb2; threaded) |  12223 |  49306 | 132833 | 179616 |
|===

SUPERCOP (a crypto benchmarking tool;
https://bench.cr.yp.to/results-hash.html) has also benchmarked these
algorithms.  Note that BLAKE2bp is not listed, KangarooTwelve is k12,
SHA-512/256 is equivalent to sha512, SHA3-256 is keccakc512, and SHAKE128 is
keccakc256.

Information is for kizomba, a Kaby Lake system.  Counts are in cycles
per byte (smaller is better; sorted by 1536 B column):

|===
| Algorithm      | 576 B | 1536 B | 4096 B | long |
| BLAKE2b        |  3.51 |   3.10 |   3.08 | 3.07 |
| SHA-1          |  4.36 |   3.81 |   3.59 | 3.49 |
| KangarooTwelve |  4.99 |   4.57 |   4.13 | 3.86 |
| SHA-512/256    |  6.39 |   5.76 |   5.31 | 5.05 |
| SHAKE128       |  8.23 |   7.67 |   7.17 | 6.97 |
| SHA-256        |  8.90 |   8.08 |   7.77 | 7.59 |
| SHA3-256       | 10.26 |   9.15 |   8.84 | 8.57 |
|===

Numbers for genji262, an AMD Ryzen System, which has SHA acceleration:

|===
| Algorithm      | 576 B | 1536 B | 4096 B | long |
| SHA-1          |  1.87 |   1.69 |   1.60 | 1.54 |
| SHA-256        |  1.95 |   1.72 |   1.68 | 1.64 |
| BLAKE2b        |  2.94 |   2.59 |   2.59 | 2.59 |
| KangarooTwelve |  4.09 |   3.65 |   3.35 | 3.17 |
| SHA-512/256    |  5.54 |   5.08 |   4.71 | 4.48 |
| SHAKE128       |  6.95 |   6.23 |   5.71 | 5.49 |
| SHA3-256       |  8.29 |   7.35 |   7.04 | 6.81 |
|===

Note that no mid- to high-end Intel processors provide acceleration.
AMD Ryzen and some ARM64 processors do.

== Summary

The algorithms with the greatest implementation availability are
SHA-256, SHA3-256, BLAKE2b, and SHAKE128.

In terms of command-line availability, BLAKE2b, SHA-256, SHA-512/256,
and SHA3-256 should be available in the near future on a reasonably
small Debian, Ubuntu, or Fedora install.

As far as security, the most conservative choices appear to be SHA-256,
SHA-512/256, and SHA3-256.

The performance winners are BLAKE2b unaccelerated and SHA-256 accelerated.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 867 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: State of NewHash work, future directions, and discussion
  2018-06-09 20:56 State of NewHash work, future directions, and discussion brian m. carlson
  2018-06-09 21:26 ` Ævar Arnfjörð Bjarmason
  2018-06-09 22:49 ` Hash algorithm analysis brian m. carlson
@ 2018-06-11 18:09 ` Duy Nguyen
  2018-06-12  1:28   ` brian m. carlson
  2018-06-11 19:01 ` Jonathan Nieder
  3 siblings, 1 reply; 66+ messages in thread
From: Duy Nguyen @ 2018-06-11 18:09 UTC (permalink / raw)
  To: brian m. carlson, Git Mailing List

On Sat, Jun 9, 2018 at 10:57 PM brian m. carlson
<sandals@crustytoothpaste.net> wrote:
>
> Since there's been a lot of questions recently about the state of the
> NewHash work, I thought I'd send out a summary.
>
> == Status
>
> I have patches to make the entire codebase work, including passing all
> tests, when Git is converted to use a 256-bit hash algorithm.
> Obviously, such a Git is incompatible with the current version, but it
> means that we've fixed essentially all of the hard-coded 20 and 40
> constants (and therefore Git doesn't segfault).

This is so cool!

> == Future Design
>
> The work I've done necessarily involves porting everything to use
> the_hash_algo.  Essentially, when the piece I'm currently working on is
> complete, we'll have a transition stage 4 implementation (all NewHash).
> Stage 2 and 3 will be implemented next.
>
> My vision of how data is stored is that the .git directory is, except
> for pack indices and the loose object lookup table, entirely in one
> format.  It will be all SHA-1 or all NewHash.  This algorithm will be
> stored in the_hash_algo.
>
> I plan on introducing an array of hash algorithms into struct repository
> (and wrapper macros) which stores, in order, the output hash, and if
> used, the additional input hash.

I'm actually thinking that putting the_hash_algo inside struct
repository is a mistake. We have code that's supposed to work without
a repo and it shows this does not really make sense to forcefully use
a partially-valid repo. Keeping the_hash_algo a separate variable
sounds more elegant.

> If people are interested, I've done some analysis on availability of
> implementations, performance, and other attributes described in the
> transition plan and can send that to the list.

I quickly skimmed through that document. I have two more concerns that
are less about any specific hash algorithm:

- how does larger hash size affects git (I guess you covered cpu
aspect, but what about cache-friendliness, disk usage, memory
consumption)

- how does all the function redirection (from abstracting away SHA-1)
affects git performance. E.g. hashcmp could be optimized and inlined
by the compiler. Now it still probably can optimize the memcmp(,,20),
but we stack another indirect function call on top. I guess I might be
just paranoid and this is not a big deal after all.
-- 
Duy

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: State of NewHash work, future directions, and discussion
  2018-06-09 20:56 State of NewHash work, future directions, and discussion brian m. carlson
                   ` (2 preceding siblings ...)
  2018-06-11 18:09 ` State of NewHash work, future directions, and discussion Duy Nguyen
@ 2018-06-11 19:01 ` Jonathan Nieder
  2018-06-12  2:28   ` brian m. carlson
  3 siblings, 1 reply; 66+ messages in thread
From: Jonathan Nieder @ 2018-06-11 19:01 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git, Duy Nguyen

Hi,

brian m. carlson wrote:

> Since there's been a lot of questions recently about the state of the
> NewHash work, I thought I'd send out a summary.

Yay!

[...]
> I plan on introducing an array of hash algorithms into struct repository
> (and wrapper macros) which stores, in order, the output hash, and if
> used, the additional input hash.

Interesting.  In principle the four following are separate things:

 1. Hash to be used for command output to the terminal
 2. Hash used in pack files
 3. Additional hashes (beyond (2)) that we can look up using the
    translation table
 4. Additional hashes (beyond (1)) accepted in input from the command
    line and stdin

In principle, (1) and (4) would be globals, and (2) and (3) would be
tied to the repository.  I think this is always what Duy was hinting
at.

All that said, as long as there is some notion of (1) and (4), I'm
excited. :)  Details of how they are laid out in memory are less
important.

[...]
> The transition plan anticipates a stage 1 where accept only SHA-1 on
> input and produce only SHA-1 on output, but store in NewHash.  As I've
> worked with our tests, I've realized such an implementation is not
> entirely possible.  We have various tools that expect to accept invalid
> object IDs, and obviously there's no way to have those continue to work.

Can you give an example?  Do you mean commands like "git mktree"?

[...]
> If you're working on new features and you'd like to implement the best
> possible compatibility with this work, here are some recommendations:

This list is great.  Thanks for it.

[...]
> == Discussion about an Actual NewHash
>
> Since I'll be writing new code, I'll be writing tests for this code.
> However, writing tests for creating and initializing repositories
> requires that I be able to test that objects are being serialized
> correctly, and therefore requires that I actually know what the hash
> algorithm is going to be.  I also can't submit code for multi-hash packs
> when we officially only support one hash algorithm.

Thanks for restarting this discussion as well.

You can always use something like e.g. "doubled SHA-1" as a proof of
concept, but I agree that it's nice to be able to avoid some churn by
using an actual hash function that we're likely to switch to.

Sincerely,
Jonathan

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Hash algorithm analysis
  2018-06-09 22:49 ` Hash algorithm analysis brian m. carlson
@ 2018-06-11 19:29   ` Jonathan Nieder
  2018-06-11 20:20     ` Linus Torvalds
                       ` (2 more replies)
  2018-06-11 21:19   ` Hash algorithm analysis Ævar Arnfjörð Bjarmason
  1 sibling, 3 replies; 66+ messages in thread
From: Jonathan Nieder @ 2018-06-11 19:29 UTC (permalink / raw)
  To: brian m. carlson
  Cc: git, Johannes Schindelin, demerphq, Linus Torvalds, Adam Langley,
	The Keccak Team

Hi,

brian m. carlson wrote:

> == Discussion of Candidates
>
> I've implemented and tested the following algorithms, all of which are
> 256-bit (in alphabetical order):

Thanks for this.  Where can I read your code?

[...]
> I also rejected some other candidates.  I couldn't find any reference or
> implementation of SHA256×16, so I didn't implement it.

Reference: https://eprint.iacr.org/2012/476.pdf

If consensus turns toward it being the right hash function to use,
then we can pursue finding or writing a good high-quality
implementation.  But I tend to suspect that the lack of wide
implementation availability is a reason to avoid it unless we find
SHA-256 to be too slow.

[...]
> * BLAKE2bp, as implemented in libb2, uses OpenMP (and therefore
>   multithreading) by default.  It was no longer possible to run the
>   testsuite with -j3 on my laptop in this configuration.

My understanding is that BLAKE2bp is better able to make use of simd
instructions than BLAKE2b.  Is there a way to configure libb2 to take
advantage of that without multithreading?

E.g. https://github.com/sneves/blake2-avx2 looks promising for that.

[...]
> |===
> | Implementation             | 256 B  | 1 KiB  | 8 KiB  | 16 KiB |
> | SHA-1 (OpenSSL)            | 513963 | 685966 | 748993 | 754270 |
> | BLAKE2b (libb2)            | 488123 | 552839 | 576246 | 579292 |
> | SHA-512/256 (OpenSSL)      | 181177 | 349002 | 499113 | 495169 |
> | BLAKE2bp (libb2)           | 139891 | 344786 | 488390 | 522575 |
> | SHA-256 (OpenSSL)          | 264276 | 333560 | 357830 | 355761 |
> | KangarooTwelve             | 239305 | 307300 | 355257 | 364261 |
> | SHAKE128 (OpenSSL)         | 154775 | 253344 | 337811 | 346732 |
> | SHA3-256 (OpenSSL)         | 128597 | 185381 | 198931 | 207365 |
> | BLAKE2bp (libb2; threaded) |  12223 |  49306 | 132833 | 179616 |
> |===

That's a bit surprising, since my impression (e.g. in the SUPERCOP
benchmarks you cite) is that there are secure hash functions that
allow comparable or even faster performance than SHA-1 with large
inputs on a single core.  In Git we also care about performance with
small inputs, creating a bit of a trade-off.

More on the subject of blake2b vs blake2bp:

- blake2b is faster for small inputs (under 1k, say). If this is
  important then we could set a threshold, e.g. 512 bytes, for
  swtiching to blake2bp.

- blake2b is supported in OpenSSL and likely to get x86-optimized
  versions in the future. blake2bp is not in OpenSSL.

[...]
> == Summary
>
> The algorithms with the greatest implementation availability are
> SHA-256, SHA3-256, BLAKE2b, and SHAKE128.
>
> In terms of command-line availability, BLAKE2b, SHA-256, SHA-512/256,
> and SHA3-256 should be available in the near future on a reasonably
> small Debian, Ubuntu, or Fedora install.
>
> As far as security, the most conservative choices appear to be SHA-256,
> SHA-512/256, and SHA3-256.

SHA-256x16 has the same security properties as SHA-256.  Picking
between those two is a tradeoff between performance and implementation
availability.

My understanding is that all the algorithms we're discussing are
believed to be approximately equivalent in security.  That's a strange
thing to say when e.g. K12 uses fewer rounds than SHA3 of the same
permutation, but it is my understanding nonetheless.  We don't know
yet how these hash algorithms will ultimately break.

My understanding of the discussion so far:

Keccak team encourages us[1] to consider a variant like K12 instead of
SHA3.

AGL explains[2] that the algorithms considered all seem like
reasonable choices and we should decide using factors like
implementation ease and performance.

If we choose a Keccak-based function, AGL also[3] encourages using a
variant like K12 instead of SHA3.

Dscho strongly prefers[4] SHA-256, because of
- wide implementation availability, including in future hardware
- has been widely analyzed
- is fast

Yves Orton and Linus Torvalds prefer[5] SHA3 over SHA2 because of how
it is constructed.

Thanks,
Jonathan

[1] https://public-inbox.org/git/91a34c5b-7844-3db2-cf29-411df5bcf886@noekeon.org/
[2] https://public-inbox.org/git/CAL9PXLzhPyE+geUdcLmd=pidT5P8eFEBbSgX_dS88knz2q_LSw@mail.gmail.com/
[3] https://www.imperialviolet.org/2017/05/31/skipsha3.html
[4] https://public-inbox.org/git/alpine.DEB.2.21.1.1706151122180.4200@virtualbox/
[5] https://public-inbox.org/git/CA+55aFwUn0KibpDQK2ZrxzXKOk8-aAub2nJZQqKCpq1ddhDcMQ@mail.gmail.com/

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Hash algorithm analysis
  2018-06-11 19:29   ` Jonathan Nieder
@ 2018-06-11 20:20     ` Linus Torvalds
  2018-06-11 23:27       ` Ævar Arnfjörð Bjarmason
  2018-06-11 22:35     ` brian m. carlson
  2018-07-20 21:52     ` brian m. carlson
  2 siblings, 1 reply; 66+ messages in thread
From: Linus Torvalds @ 2018-06-11 20:20 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: brian m. carlson, Git Mailing List, Johannes Schindelin, demerphq,
	agl, keccak

On Mon, Jun 11, 2018 at 12:29 PM Jonathan Nieder <jrnieder@gmail.com> wrote:
>
> Yves Orton and Linus Torvalds prefer[5] SHA3 over SHA2 because of how
> it is constructed.

Yeah, I really think that it's a mistake to switch to something that
has the same problem SHA1 had.

That doesn't necessarily mean SHA3, but it does mean "bigger
intermediate hash state" (so no length extension attack), which could
be SHA3, but also SHA-512/256 or K12.

Honestly, git has effectively already moved from SHA1 to SHA1DC.

So the actual known attack and weakness of SHA1 should simply not be
part of the discussion for the next hash. You can basically say "we're
_already_ on the second hash, we just picked one that was so
compatible with SHA1 that nobody even really noticed.

The reason to switch is

 (a) 160 bits may not be enough

 (b) maybe there are other weaknesses in SHA1 that SHA1DC doesn't catch.

 (c) others?

Obviously all of the choices address (a).

But at least for me, (b) makes me go "well, SHA2 has the exact same
weak inter-block state attack, so if there are unknown weaknesses in
SHA1, then what about unknown weaknesses in SHA2"?

And no, I'm not a cryptographer. But honestly, length extension
attacks were how both md5 and sha1 were broken in practice, so I'm
just going "why would we go with a crypto choice that has that known
weakness? That's just crazy".

From a performance standpoint, I have to say (once more) that crypto
performance actually mattered a lot less than I originally thought it
would. Yes, there are phases that do care, but they are rare.

For example, I think SHA1 performance has probably mattered most for
the index and pack-file, where it's really only used as a fancy CRC.
For most individual object cases, it is almost never an issue.

From a performance angle, I think the whole "256-bit hashes are
bigger" is going to be the more noticeable performance issue, just
because things like delta compression and zlib - both of which are
very *real* and present performance issues - will have more data that
they need to work on. The performance difference between different
hashing functions is likely not all that noticeable in most common
cases as long as we're not talking orders of magnitude.

And yes, I guess we're in the "approaching an order of magnitude"
performance difference, but we should actually compare not to OpenSSL
SHA1, but to SHA1DC. See above.

Personally, the fact that the Keccak people would suggest K12 makes me
think that should be a front-runner, but whatever. I don't think the
128-bit preimage case is an issue, since 128 bits is the brute-force
cost for any 256-bit hash.

But hey, I picked sha1 to begin with, so take any input from me with
that historical pinch of salt in mind ;)

                Linus

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Hash algorithm analysis
  2018-06-09 22:49 ` Hash algorithm analysis brian m. carlson
  2018-06-11 19:29   ` Jonathan Nieder
@ 2018-06-11 21:19   ` Ævar Arnfjörð Bjarmason
  2018-06-21  8:20     ` Johannes Schindelin
  2018-06-21 22:39     ` brian m. carlson
  1 sibling, 2 replies; 66+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-06-11 21:19 UTC (permalink / raw)
  To: brian m. carlson
  Cc: git, Adam Langley, Jeff King, Mike Hommey, Brandon Williams,
	Linus Torvalds, Jonathan Nieder, Stefan Beller, Jonathan Tan,
	Junio Hamano, Johannes Schindelin


On Sat, Jun 09 2018, brian m. carlson wrote:

[Expanding the CC list to what we had in the last "what hash" thread[1]
last year].

> == Discussion of Candidates
>
> I've implemented and tested the following algorithms, all of which are
> 256-bit (in alphabetical order):
>
> * BLAKE2b (libb2)
> * BLAKE2bp (libb2)
> * KangarooTwelve (imported from the Keccak Code Package)
> * SHA-256 (OpenSSL)
> * SHA-512/256 (OpenSSL)
> * SHA3-256 (OpenSSL)
> * SHAKE128 (OpenSSL)
>
> I also rejected some other candidates.  I couldn't find any reference or
> implementation of SHA256×16, so I didn't implement it.  I didn't
> consider SHAKE256 because it is nearly identical to SHA3-256 in almost
> all characteristics (including performance).
>
> I imported the optimized 64-bit implementation of KangarooTwelve.  The
> AVX2 implementation was not considered for licensing reasons (it's
> partially generated from external code, which falls foul of the GPL's
> "preferred form for modifications" rule).
>
> === BLAKE2b and BLAKE2bp
>
> These are the non-parallelized and parallelized 64-bit variants of
> BLAKE2.
>
> Benefits:
> * Both algorithms provide 256-bit preimage resistance.
>
> Downsides:
> * Some people are uncomfortable that the security margin has been
>   decreased from the original SHA-3 submission, although it is still
>   considered secure.
> * BLAKE2bp, as implemented in libb2, uses OpenMP (and therefore
>   multithreading) by default.  It was no longer possible to run the
>   testsuite with -j3 on my laptop in this configuration.
>
> === Keccak-based Algorithms
>
> SHA3-256 is the 256-bit Keccak algorithm with 24 rounds, processing 136
> bytes at a time.  SHAKE128 is an extendable output function with 24
> rounds, processing 168 bytes at a time.  KangarooTwelve is an extendable
> output function with 12 rounds, processing 136 bytes at a time.
>
> Benefits:
> * SHA3-256 provides 256-bit preimage resistance.
> * SHA3-256 has been heavily studied and is believed to have a large
>   security margin.
>
> I noted the following downsides:
> * There's a lack of a availability of KangarooTwelve in other
>   implementations.  It may be the least available option in terms of
>   implementations.
> * Some people are uncomfortable that the security margin of
>   KangarooTwelve has been decreased, although it is still considered
>   secure.
> * SHAKE128 and KangarooTwelve provide only 128-bit preimage resistance.
>
> === SHA-256 and SHA-512/256
>
> These are the 32-bit and 64-bit SHA-2 algorithms that are 256 bits in
> size.
>
> I noted the following benefits:
> * Both algorithms are well known and heavily analyzed.
> * Both algorithms provide 256-bit preimage resistance.
>
> == Implementation Support
>
> |===
> | Implementation | OpenSSL | libb2 | NSS | ACC | gcrypt | Nettle| CL  |
> | SHA-1          | 🗸       |       | 🗸   | 🗸   | 🗸      | 🗸     | {1} |
> | BLAKE2b        | f       | 🗸     |     |     | 🗸      |       | {2} |
> | BLAKE2bp       |         | 🗸     |     |     |        |       |     |
> | KangarooTwelve |         |       |     |     |        |       |     |
> | SHA-256        | 🗸       |       | 🗸   |  🗸  | 🗸      | 🗸     | {1} |
> | SHA-512/256    | 🗸       |       |     |     |        | 🗸     | {3} |
> | SHA3-256       | 🗸       |       |     |     | 🗸      | 🗸     | {4} |
> | SHAKE128       | 🗸       |       |     |     | 🗸      |       | {5} |
> |===
>
> f: future version (expected 1.2.0)
> ACC: Apple Common Crypto
> CL: Command-line
>
> :1: OpenSSL, coreutils, Perl Digest::SHA.
> :2: OpenSSL, coreutils.
> :3: OpenSSL
> :4: OpenSSL, Perl Digest::SHA3.
> :5: Perl Digest::SHA3.
>
> === Performance Analysis
>
> The test system used below is my personal laptop, a 2016 Lenovo ThinkPad
> X1 Carbon with an Intel i7-6600U CPU (2.60 GHz) running Debian unstable.
>
> I implemented a test tool helper to compute speed much like OpenSSL
> does.  Below is a comparison of speeds.  The columns indicate the speed
> in KiB/s for chunks of the given size.  The runs are representative of
> multiple similar runs.
>
> 256 and 1024 bytes were chosen to represent common tree and commit
> object sizes and the 8 KiB is an approximate average blob size.
>
> Algorithms are sorted by performance on the 1 KiB column.
>
> |===
> | Implementation             | 256 B  | 1 KiB  | 8 KiB  | 16 KiB |
> | SHA-1 (OpenSSL)            | 513963 | 685966 | 748993 | 754270 |
> | BLAKE2b (libb2)            | 488123 | 552839 | 576246 | 579292 |
> | SHA-512/256 (OpenSSL)      | 181177 | 349002 | 499113 | 495169 |
> | BLAKE2bp (libb2)           | 139891 | 344786 | 488390 | 522575 |
> | SHA-256 (OpenSSL)          | 264276 | 333560 | 357830 | 355761 |
> | KangarooTwelve             | 239305 | 307300 | 355257 | 364261 |
> | SHAKE128 (OpenSSL)         | 154775 | 253344 | 337811 | 346732 |
> | SHA3-256 (OpenSSL)         | 128597 | 185381 | 198931 | 207365 |
> | BLAKE2bp (libb2; threaded) |  12223 |  49306 | 132833 | 179616 |
> |===
>
> SUPERCOP (a crypto benchmarking tool;
> https://bench.cr.yp.to/results-hash.html) has also benchmarked these
> algorithms.  Note that BLAKE2bp is not listed, KangarooTwelve is k12,
> SHA-512/256 is equivalent to sha512, SHA3-256 is keccakc512, and SHAKE128 is
> keccakc256.
>
> Information is for kizomba, a Kaby Lake system.  Counts are in cycles
> per byte (smaller is better; sorted by 1536 B column):
>
> |===
> | Algorithm      | 576 B | 1536 B | 4096 B | long |
> | BLAKE2b        |  3.51 |   3.10 |   3.08 | 3.07 |
> | SHA-1          |  4.36 |   3.81 |   3.59 | 3.49 |
> | KangarooTwelve |  4.99 |   4.57 |   4.13 | 3.86 |
> | SHA-512/256    |  6.39 |   5.76 |   5.31 | 5.05 |
> | SHAKE128       |  8.23 |   7.67 |   7.17 | 6.97 |
> | SHA-256        |  8.90 |   8.08 |   7.77 | 7.59 |
> | SHA3-256       | 10.26 |   9.15 |   8.84 | 8.57 |
> |===
>
> Numbers for genji262, an AMD Ryzen System, which has SHA acceleration:
>
> |===
> | Algorithm      | 576 B | 1536 B | 4096 B | long |
> | SHA-1          |  1.87 |   1.69 |   1.60 | 1.54 |
> | SHA-256        |  1.95 |   1.72 |   1.68 | 1.64 |
> | BLAKE2b        |  2.94 |   2.59 |   2.59 | 2.59 |
> | KangarooTwelve |  4.09 |   3.65 |   3.35 | 3.17 |
> | SHA-512/256    |  5.54 |   5.08 |   4.71 | 4.48 |
> | SHAKE128       |  6.95 |   6.23 |   5.71 | 5.49 |
> | SHA3-256       |  8.29 |   7.35 |   7.04 | 6.81 |
> |===
>
> Note that no mid- to high-end Intel processors provide acceleration.
> AMD Ryzen and some ARM64 processors do.
>
> == Summary
>
> The algorithms with the greatest implementation availability are
> SHA-256, SHA3-256, BLAKE2b, and SHAKE128.
>
> In terms of command-line availability, BLAKE2b, SHA-256, SHA-512/256,
> and SHA3-256 should be available in the near future on a reasonably
> small Debian, Ubuntu, or Fedora install.
>
> As far as security, the most conservative choices appear to be SHA-256,
> SHA-512/256, and SHA3-256.
>
> The performance winners are BLAKE2b unaccelerated and SHA-256 accelerated.

This is a great summary. Thanks.

In case it's not apparent from what follows, I have a bias towards
SHA-256. Reasons for that, to summarize some of the discussion the last
time around[1], and to add more details:

== Popularity

Other things being equal we should be biased towards whatever's in the
widest use & recommended fon new projects.

I fear that if e.g. git had used whatever at time was to SHA-1 as
BLAKE2b is to SHA-256 now, we might not even know that it's broken (or
had the sha1collisiondetection work to fall back on), since researchers
are less likely to look at algorithms that aren't in wide use.

SHA-256 et al were published in 2001 and has ~20k results on Google
Scholar, compared to ~150 for BLAKE2b[4], published in 2008 (but ~1.2K
for "BLAKE2").

Between the websites of Intel, AMD & ARM there are thousands of results
for SHA-256 (and existing in-silicon acceleration). There's exactly one
result on all three for BLAKE2b (on amd.com, in the context of a laundry
list of hash algorithms in some presentation.

Since BLAKE2b lost the SHA-3 competition to Keccak it seems impossible
that it'll get ever get anywhere close to the same scrutiny or support
in silicon as one of the SHA families.

Which brings me to the next section...

== Hardware acceleration

The only widely deployed HW acceleration is for the SHA-1 and SHA-256
from the SHA-2 family[5], but notably nothing from the newer SHA-3
family (released in 2015).

It seems implausible that anything except SHA-3 will get future HW
acceleration given the narrow scope of current HW acceleration
v.s. existing hash algorithms.

As noted in the thread from last year[1] most git users won't even
notice if the hashing is faster, but it does matter for some big users
(big monorepos), so having the out of purchasing hardware to make things
faster today is great, and given how these new instruction set
extensions get rolled out it seems inevitable that this'll be available
in all consumer CPUs within 5-10 years.

== Age

Similar to "popularity" it seems better to bias things towards a hash
that's been out there for a while, i.e. it would be too early to pick
SHA-3.

The hash transitioning plan, once implemented, also makes it easier to
switch to something else in the future, so we shouldn't be in a rush to
pick some newer hash because we'll need to keep it forever, we can
always do another transition in another 10-15 years.

== Conclusion

For all the above reasons I think we should pick SHA-256.

1. https://public-inbox.org/git/87y3ss8n4h.fsf@gmail.com/#t
2. https://github.com/cr-marcstevens/sha1collisiondetection
3. https://scholar.google.nl/scholar?hl=en&as_sdt=0%2C5&q=SHA-256&btnG=
4. https://scholar.google.nl/scholar?hl=en&as_sdt=0%2C5&q=BLAKE2b&btnG=
5. https://en.wikipedia.org/wiki/Intel_SHA_extensions

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Hash algorithm analysis
  2018-06-11 19:29   ` Jonathan Nieder
  2018-06-11 20:20     ` Linus Torvalds
@ 2018-06-11 22:35     ` brian m. carlson
  2018-06-12 16:21       ` Gilles Van Assche
  2018-07-20 21:52     ` brian m. carlson
  2 siblings, 1 reply; 66+ messages in thread
From: brian m. carlson @ 2018-06-11 22:35 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: git, Johannes Schindelin, demerphq, Linus Torvalds, Adam Langley,
	The Keccak Team

[-- Attachment #1: Type: text/plain, Size: 4498 bytes --]

On Mon, Jun 11, 2018 at 12:29:42PM -0700, Jonathan Nieder wrote:
> brian m. carlson wrote:
> 
> > == Discussion of Candidates
> >
> > I've implemented and tested the following algorithms, all of which are
> > 256-bit (in alphabetical order):
> 
> Thanks for this.  Where can I read your code?

https://github.com/bk2204/git.git, test-hashes branch.  You will need to
have libb2 and OPENSSL_SHA1 set.  It's a bit of a hack, so don't look
too hard.

> [...]
> > I also rejected some other candidates.  I couldn't find any reference or
> > implementation of SHA256×16, so I didn't implement it.
> 
> Reference: https://eprint.iacr.org/2012/476.pdf

Thanks for that reference.

> If consensus turns toward it being the right hash function to use,
> then we can pursue finding or writing a good high-quality
> implementation.  But I tend to suspect that the lack of wide
> implementation availability is a reason to avoid it unless we find
> SHA-256 to be too slow.

I agree.  Implementation availability is important.  Whatever we provide
is likely going to be portable C code, which is going to be slower than
an optimized implementation.

> [...]
> > * BLAKE2bp, as implemented in libb2, uses OpenMP (and therefore
> >   multithreading) by default.  It was no longer possible to run the
> >   testsuite with -j3 on my laptop in this configuration.
> 
> My understanding is that BLAKE2bp is better able to make use of simd
> instructions than BLAKE2b.  Is there a way to configure libb2 to take
> advantage of that without multithreading?

You'll notice below that I have both BLAKE2bp with and without
threading.  I recompiled libb2 to not use threading, and it still didn't
perform as well.

libb2 is written by the authors of BLAKE2, so it's the most favorable
implementation we're likely to get.

> [...]
> > |===
> > | Implementation             | 256 B  | 1 KiB  | 8 KiB  | 16 KiB |
> > | SHA-1 (OpenSSL)            | 513963 | 685966 | 748993 | 754270 |
> > | BLAKE2b (libb2)            | 488123 | 552839 | 576246 | 579292 |
> > | SHA-512/256 (OpenSSL)      | 181177 | 349002 | 499113 | 495169 |
> > | BLAKE2bp (libb2)           | 139891 | 344786 | 488390 | 522575 |
> > | SHA-256 (OpenSSL)          | 264276 | 333560 | 357830 | 355761 |
> > | KangarooTwelve             | 239305 | 307300 | 355257 | 364261 |
> > | SHAKE128 (OpenSSL)         | 154775 | 253344 | 337811 | 346732 |
> > | SHA3-256 (OpenSSL)         | 128597 | 185381 | 198931 | 207365 |
> > | BLAKE2bp (libb2; threaded) |  12223 |  49306 | 132833 | 179616 |
> > |===
> 
> That's a bit surprising, since my impression (e.g. in the SUPERCOP
> benchmarks you cite) is that there are secure hash functions that
> allow comparable or even faster performance than SHA-1 with large
> inputs on a single core.  In Git we also care about performance with
> small inputs, creating a bit of a trade-off.
> 
> More on the subject of blake2b vs blake2bp:
> 
> - blake2b is faster for small inputs (under 1k, say). If this is
>   important then we could set a threshold, e.g. 512 bytes, for
>   swtiching to blake2bp.
> 
> - blake2b is supported in OpenSSL and likely to get x86-optimized
>   versions in the future. blake2bp is not in OpenSSL.

Correct.  BLAKE2b in OpenSSL is currently 512-bit only, but it's
intended to add support for 256-bit versions soon.  I think the benefit
of sticking to one hash function altogether is significant, so I think
we should one that has good all-around performance instead of trying to
split between different ones.

> My understanding is that all the algorithms we're discussing are
> believed to be approximately equivalent in security.  That's a strange
> thing to say when e.g. K12 uses fewer rounds than SHA3 of the same
> permutation, but it is my understanding nonetheless.  We don't know
> yet how these hash algorithms will ultimately break.

With the exception of variations in preimage security, I expect that's
correct.  I think implementation availability and performance are the
best candidates for consideration.

> My understanding of the discussion so far:
> 
> Keccak team encourages us[1] to consider a variant like K12 instead of
> SHA3.

While I think K12 is an interesting algorithm, I'm not sure we're going
to get as good of performance out of it as we might want due to the lack
of implementations.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 867 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Hash algorithm analysis
  2018-06-11 20:20     ` Linus Torvalds
@ 2018-06-11 23:27       ` Ævar Arnfjörð Bjarmason
  2018-06-12  0:11         ` David Lang
  2018-06-12  0:45         ` Linus Torvalds
  0 siblings, 2 replies; 66+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-06-11 23:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jonathan Nieder, brian m. carlson, Git Mailing List,
	Johannes Schindelin, demerphq, agl, keccak


On Mon, Jun 11 2018, Linus Torvalds wrote:

> On Mon, Jun 11, 2018 at 12:29 PM Jonathan Nieder <jrnieder@gmail.com> wrote:
>>
>> Yves Orton and Linus Torvalds prefer[5] SHA3 over SHA2 because of how
>> it is constructed.
>
> Yeah, I really think that it's a mistake to switch to something that
> has the same problem SHA1 had.
>
> That doesn't necessarily mean SHA3, but it does mean "bigger
> intermediate hash state" (so no length extension attack), which could
> be SHA3, but also SHA-512/256 or K12.
>
> Honestly, git has effectively already moved from SHA1 to SHA1DC.
>
> So the actual known attack and weakness of SHA1 should simply not be
> part of the discussion for the next hash. You can basically say "we're
> _already_ on the second hash, we just picked one that was so
> compatible with SHA1 that nobody even really noticed.
>
> The reason to switch is
>
>  (a) 160 bits may not be enough
>
>  (b) maybe there are other weaknesses in SHA1 that SHA1DC doesn't catch.
>
>  (c) others?
>
> Obviously all of the choices address (a).

FWIW I updated our docs 3 months ago to try to address some of this:
https://github.com/git/git/commit/5988eb631a

> But at least for me, (b) makes me go "well, SHA2 has the exact same
> weak inter-block state attack, so if there are unknown weaknesses in
> SHA1, then what about unknown weaknesses in SHA2"?
>
> And no, I'm not a cryptographer. But honestly, length extension
> attacks were how both md5 and sha1 were broken in practice, so I'm
> just going "why would we go with a crypto choice that has that known
> weakness? That's just crazy".

What do you think about Johannes's summary of this being a non-issue for
Git in
https://public-inbox.org/git/alpine.DEB.2.21.1.1706151122180.4200@virtualbox/
?

> From a performance standpoint, I have to say (once more) that crypto
> performance actually mattered a lot less than I originally thought it
> would. Yes, there are phases that do care, but they are rare.

One real-world case is rebasing[1]. As noted in that E-Mail of mine a
year ago we can use SHA1DC v.s. OpenSSL as a stand-in for the sort of
performance difference we might expect between hash functions, although
as you note this doesn't account for the difference in length.

With our perf tests, in t/perf on linux.git:

    $ GIT_PERF_LARGE_REPO=~/g/linux GIT_PERF_REPEAT_COUNT=10 GIT_PERF_MAKE_COMMAND='if pwd | grep -q $(git rev-parse origin/master); then make -j8 CFLAGS=-O3 DC_SHA1=Y; else make -j8 CFLAGS=-O3 OPENSSL_SHA1=Y; fi' ./run origin/master~ origin/master -- p3400-rebase.sh
    Test                                                            origin/master~    origin/master
    --------------------------------------------------------------------------------------------------------
    3400.2: rebase on top of a lot of unrelated changes             1.38(1.19+0.11)   1.40(1.23+0.10) +1.4%
    3400.4: rebase a lot of unrelated changes without split-index   4.07(3.28+0.66)   4.62(3.71+0.76) +13.5%
    3400.6: rebase a lot of unrelated changes with split-index      3.41(2.94+0.38)   3.35(2.87+0.37) -1.8%

On a bigger monorepo I have here:

    Test                                                            origin/master~    origin/master
    -------------------------------------------------------------------------------------------------------
    3400.2: rebase on top of a lot of unrelated changes             1.39(1.19+0.17)   1.34(1.16+0.16) -3.6%
    3400.4: rebase a lot of unrelated changes without split-index   6.67(3.37+0.63)   6.95(3.90+0.62) +4.2%
    3400.6: rebase a lot of unrelated changes with split-index      3.70(2.85+0.45)   3.73(2.85+0.41) +0.8%

I didn't paste any numbers in that E-Mail a year ago, maybe I produced
them differently, but this is clerly not that of a "big difference". But
this is one way to see the difference.

> For example, I think SHA1 performance has probably mattered most for
> the index and pack-file, where it's really only used as a fancy CRC.
> For most individual object cases, it is almost never an issue.

Yeah there's lots of things we could optimize there, but we are going to
need to hash things to create the commit in e.g. the rebase case, but
much of that could probably be done more efficiently without switching
the hash.

> From a performance angle, I think the whole "256-bit hashes are
> bigger" is going to be the more noticeable performance issue, just
> because things like delta compression and zlib - both of which are
> very *real* and present performance issues - will have more data that
> they need to work on. The performance difference between different
> hashing functions is likely not all that noticeable in most common
> cases as long as we're not talking orders of magnitude.
>
> And yes, I guess we're in the "approaching an order of magnitude"
> performance difference, but we should actually compare not to OpenSSL
> SHA1, but to SHA1DC. See above.
>
> Personally, the fact that the Keccak people would suggest K12 makes me
> think that should be a front-runner, but whatever. I don't think the
> 128-bit preimage case is an issue, since 128 bits is the brute-force
> cost for any 256-bit hash.
>
> But hey, I picked sha1 to begin with, so take any input from me with
> that historical pinch of salt in mind ;)

1. https://public-inbox.org/git/87tw3f8vez.fsf@gmail.com/

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Hash algorithm analysis
  2018-06-11 23:27       ` Ævar Arnfjörð Bjarmason
@ 2018-06-12  0:11         ` David Lang
  2018-06-12  0:45         ` Linus Torvalds
  1 sibling, 0 replies; 66+ messages in thread
From: David Lang @ 2018-06-12  0:11 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Linus Torvalds, Jonathan Nieder, brian m. carlson,
	Git Mailing List, Johannes Schindelin, demerphq, agl, keccak

[-- Attachment #1: Type: TEXT/PLAIN, Size: 956 bytes --]

On Tue, 12 Jun 2018, Ævar Arnfjörð Bjarmason wrote:

>> From a performance standpoint, I have to say (once more) that crypto
>> performance actually mattered a lot less than I originally thought it
>> would. Yes, there are phases that do care, but they are rare.
>
> One real-world case is rebasing[1]. As noted in that E-Mail of mine a
> year ago we can use SHA1DC v.s. OpenSSL as a stand-in for the sort of
> performance difference we might expect between hash functions, although
> as you note this doesn't account for the difference in length.

when you are rebasing, how many hashes do you need to calculate? a few dozen, a 
few hundred, a few thousand, a few hundered thousand?

If the common uses of rebasing are on the low end, then the fact that the hash 
takes a bit longer won't matter much because the entire job is so fast.

And at the high end, I/O will probably dominate.

so where does it really make a human visible difference?

David Lang

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Hash algorithm analysis
  2018-06-11 23:27       ` Ævar Arnfjörð Bjarmason
  2018-06-12  0:11         ` David Lang
@ 2018-06-12  0:45         ` Linus Torvalds
  1 sibling, 0 replies; 66+ messages in thread
From: Linus Torvalds @ 2018-06-12  0:45 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Jonathan Nieder, brian m. carlson, Git Mailing List,
	Johannes Schindelin, demerphq, agl, keccak

On Mon, Jun 11, 2018 at 4:27 PM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
> >
> > And no, I'm not a cryptographer. But honestly, length extension
> > attacks were how both md5 and sha1 were broken in practice, so I'm
> > just going "why would we go with a crypto choice that has that known
> > weakness? That's just crazy".
>
> What do you think about Johannes's summary of this being a non-issue for
> Git in
> https://public-inbox.org/git/alpine.DEB.2.21.1.1706151122180.4200@virtualbox/
> ?

I agree that the fact that git internal data is structured and all
meaningful (and doesn't really have ignored state) makes it *much*
harder to attack the basic git objects, since you not only have to
generate a good hash, the end result has to also *parse* and there is
not really any hidden non-parsed data that you can use to hide the
attack.

And *if* you are using git for source code, the same is pretty much
true even for the blob objects - an attacking object will stand out
like a sore thumb in "diff" etc.

So I don't disagree with Johannes in that sense: I think git does
fundamentally tend to have some extra validation in place, and there's
a reason why the examples for both the md5 and the sha1 attack were
pdf files.

That said, even if git internal ("metadata") objects like trees and
commits tend to not have opaque parts to them and are thus pretty hard
to attack, the blob objects are still an attack vector for projects
that use git for non-source-code (and even source projects do embed
binary files - including pdf files - even though they might not be "as
interesting" to attack). So you do want to protect those too.

And hey, protecting the metadata objects is good just to protect
against annoyances. Sure, you should always sanity check the object at
receive time anyway, but even so, if somebody is able to generate a
blob object that hashes to the same hash as a metadata object (ie tree
or commit), that really could be pretty damn annoying.

And the whole "intermediate hashed state is same size as final hash
state" just _fundamentally_ means that if you find a weakness in the
hash, you can now attack that weakness without having to worry about
the attack being fundamentally more expensive.

That's essentially what SHAttered relied on. It didn't rely on a
secret and a hash and length extension, but it *did* rely on the same
mechanism that a length extension attack relies on, where you can
basically attack the state in the middle with no extra cost.

Maybe some people don't consider it a length extension attack for that
reason, but it boils down to much the same basic situation where you
can attack the internal hash state and cause a state collision. And
you can try to find the patterns that then cause that state collision
when you've found a weakness in the hash.

With SHA3 or k12, you can obviously _also_ try to attack the hash
state and cause a collision, but because the intermediate state is
much bigger than the final hash, you're just making things *way*
harder for yourself if you try that.

              Linus

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: State of NewHash work, future directions, and discussion
  2018-06-11 18:09 ` State of NewHash work, future directions, and discussion Duy Nguyen
@ 2018-06-12  1:28   ` brian m. carlson
  0 siblings, 0 replies; 66+ messages in thread
From: brian m. carlson @ 2018-06-12  1:28 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Git Mailing List

[-- Attachment #1: Type: text/plain, Size: 1236 bytes --]

On Mon, Jun 11, 2018 at 08:09:47PM +0200, Duy Nguyen wrote:
> I'm actually thinking that putting the_hash_algo inside struct
> repository is a mistake. We have code that's supposed to work without
> a repo and it shows this does not really make sense to forcefully use
> a partially-valid repo. Keeping the_hash_algo a separate variable
> sounds more elegant.

It can fairly easily be moved out if we want.

> I quickly skimmed through that document. I have two more concerns that
> are less about any specific hash algorithm:
> 
> - how does larger hash size affects git (I guess you covered cpu
> aspect, but what about cache-friendliness, disk usage, memory
> consumption)
> 
> - how does all the function redirection (from abstracting away SHA-1)
> affects git performance. E.g. hashcmp could be optimized and inlined
> by the compiler. Now it still probably can optimize the memcmp(,,20),
> but we stack another indirect function call on top. I guess I might be
> just paranoid and this is not a big deal after all.

I would have to run some numbers on this.  I probably won't get around
to doing that until Friday or Saturday.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 867 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: State of NewHash work, future directions, and discussion
  2018-06-11 19:01 ` Jonathan Nieder
@ 2018-06-12  2:28   ` brian m. carlson
  2018-06-12  2:42     ` Jonathan Nieder
  0 siblings, 1 reply; 66+ messages in thread
From: brian m. carlson @ 2018-06-12  2:28 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: git, Duy Nguyen

[-- Attachment #1: Type: text/plain, Size: 2262 bytes --]

On Mon, Jun 11, 2018 at 12:01:03PM -0700, Jonathan Nieder wrote:
> Hi,
> 
> brian m. carlson wrote:
> > I plan on introducing an array of hash algorithms into struct repository
> > (and wrapper macros) which stores, in order, the output hash, and if
> > used, the additional input hash.
> 
> Interesting.  In principle the four following are separate things:
> 
>  1. Hash to be used for command output to the terminal
>  2. Hash used in pack files
>  3. Additional hashes (beyond (2)) that we can look up using the
>     translation table
>  4. Additional hashes (beyond (1)) accepted in input from the command
>     line and stdin
> 
> In principle, (1) and (4) would be globals, and (2) and (3) would be
> tied to the repository.  I think this is always what Duy was hinting
> at.
> 
> All that said, as long as there is some notion of (1) and (4), I'm
> excited. :)  Details of how they are laid out in memory are less
> important.

I'm happy to hear suggestions on how this should or shouldn't work.  I'm
seeing these things in my head, but it can be helpful to have feedback
about what people expect out of the code before I spend a bunch of time
writing it.

> [...]
> > The transition plan anticipates a stage 1 where accept only SHA-1 on
> > input and produce only SHA-1 on output, but store in NewHash.  As I've
> > worked with our tests, I've realized such an implementation is not
> > entirely possible.  We have various tools that expect to accept invalid
> > object IDs, and obviously there's no way to have those continue to work.
> 
> Can you give an example?  Do you mean commands like "git mktree"?

I mean situations like git update-index.  We allow the user to insert
any old invalid value (and in fact check that the user can do this).
t0000 does this, for example.

> You can always use something like e.g. "doubled SHA-1" as a proof of
> concept, but I agree that it's nice to be able to avoid some churn by
> using an actual hash function that we're likely to switch to.

I have a hash that I've been using, but redoing the work would be less
enjoyable.  I'd rather write the tests only once if I can help it.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 867 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: State of NewHash work, future directions, and discussion
  2018-06-12  2:28   ` brian m. carlson
@ 2018-06-12  2:42     ` Jonathan Nieder
  0 siblings, 0 replies; 66+ messages in thread
From: Jonathan Nieder @ 2018-06-12  2:42 UTC (permalink / raw)
  To: brian m. carlson; +Cc: git, Duy Nguyen

brian m. carlson wrote:
> On Mon, Jun 11, 2018 at 12:01:03PM -0700, Jonathan Nieder wrote:

>>  1. Hash to be used for command output to the terminal
>>  2. Hash used in pack files
>>  3. Additional hashes (beyond (2)) that we can look up using the
>>     translation table
>>  4. Additional hashes (beyond (1)) accepted in input from the command
>>     line and stdin
>>
>> In principle, (1) and (4) would be globals, and (2) and (3) would be
>> tied to the repository.  I think this is always what Duy was hinting

Here, by 'always' I meant 'also'.  Sorry for the confusion.

>> at.
>>
>> All that said, as long as there is some notion of (1) and (4), I'm
>> excited. :)  Details of how they are laid out in memory are less
>> important.
>
> I'm happy to hear suggestions on how this should or shouldn't work.  I'm
> seeing these things in my head, but it can be helpful to have feedback
> about what people expect out of the code before I spend a bunch of time
> writing it.

So far you're doing pretty well. :)

I just noticed that I have some copy-edits for the
hash-function-transition doc from last year that I hadn't sent out yet
(oops).  I'll send them tonight or tomorrow morning.

[...]
>> brian m. carlson wrote:

>>> The transition plan anticipates a stage 1 where accept only SHA-1 on
>>> input and produce only SHA-1 on output, but store in NewHash.  As I've
>>> worked with our tests, I've realized such an implementation is not
>>> entirely possible.  We have various tools that expect to accept invalid
>>> object IDs, and obviously there's no way to have those continue to work.
>>
>> Can you give an example?  Do you mean commands like "git mktree"?
>
> I mean situations like git update-index.  We allow the user to insert
> any old invalid value (and in fact check that the user can do this).
> t0000 does this, for example.

I think we can forbid this in the new mode (using a test prereq to
ensure the relevant tests don't get run).  Likewise for the similar
functionality in "git mktree" and "git hash-object -w".

>> You can always use something like e.g. "doubled SHA-1" as a proof of
>> concept, but I agree that it's nice to be able to avoid some churn by
>> using an actual hash function that we're likely to switch to.
>
> I have a hash that I've been using, but redoing the work would be less
> enjoyable.  I'd rather write the tests only once if I can help it.

Thanks for the test fixes so far that make most of the test suite
hash-agnostic!

For t0000, yeah, there's no way around having to hard-code the new
hash there.

Thanks,
Jonathan

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Hash algorithm analysis
  2018-06-11 22:35     ` brian m. carlson
@ 2018-06-12 16:21       ` Gilles Van Assche
  2018-06-13 23:58         ` brian m. carlson
  0 siblings, 1 reply; 66+ messages in thread
From: Gilles Van Assche @ 2018-06-12 16:21 UTC (permalink / raw)
  To: brian m. carlson
  Cc: Jonathan Nieder, git, Johannes Schindelin, demerphq,
	Linus Torvalds, Adam Langley, Keccak Team

Hi,

On 10/06/18 00:49, brian m. carlson wrote:
> I imported the optimized 64-bit implementation of KangarooTwelve. The
> AVX2 implementation was not considered for licensing reasons (it's
> partially generated from external code, which falls foul of the GPL's
> "preferred form for modifications" rule).

Indeed part of the AVX2 code in the Keccak code package is an extension
of the implementation in OpenSSL (written by Andy Polyakov). The
assembly code is generated by a Perl script, and we extended it to fit
in the KCP's internal API.

Would it solve this licensing problem if we remap our extensions to the
Perl script, which would then become "the source"?

On 12/06/18 00:35, brian m. carlson wrote:
>> My understanding is that all the algorithms we're discussing are
>> believed to be approximately equivalent in security. That's a strange
>> thing to say when e.g. K12 uses fewer rounds than SHA3 of the same
>> permutation, but it is my understanding nonetheless. We don't know
>> yet how these hash algorithms will ultimately break. 
>
> With the exception of variations in preimage security, I expect that's
> correct. I think implementation availability and performance are the
> best candidates for consideration.

Note that we recently updated the paper on K12 (accepted at ACNS 2018),
with more details on performance and security.
https://eprint.iacr.org/2016/770

>> My understanding of the discussion so far:
>>
>> Keccak team encourages us[1] to consider a variant like K12 instead
>> of SHA3. 
>
> While I think K12 is an interesting algorithm, I'm not sure we're
> going to get as good of performance out of it as we might want due to
> the lack of implementations.

Implementation availability is indeed important. The effort to transform
an implementation of SHAKE128 into one of K12 is limited due to the
reuse of their main components (round function, sponge construction). So
the availability of SHA-3/Keccak implementations can benefit that of K12
if there is sufficient interest. E.g., the SHA-3/Keccak instructions in
ARMv8.2 can speed up K12 as well.

Kind regards,
Gilles

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Hash algorithm analysis
  2018-06-12 16:21       ` Gilles Van Assche
@ 2018-06-13 23:58         ` brian m. carlson
  2018-06-15 10:33           ` Gilles Van Assche
  0 siblings, 1 reply; 66+ messages in thread
From: brian m. carlson @ 2018-06-13 23:58 UTC (permalink / raw)
  To: Gilles Van Assche
  Cc: Jonathan Nieder, git, Johannes Schindelin, demerphq,
	Linus Torvalds, Adam Langley, Keccak Team

[-- Attachment #1: Type: text/plain, Size: 2187 bytes --]

On Tue, Jun 12, 2018 at 06:21:21PM +0200, Gilles Van Assche wrote:
> Hi,
> 
> On 10/06/18 00:49, brian m. carlson wrote:
> > I imported the optimized 64-bit implementation of KangarooTwelve. The
> > AVX2 implementation was not considered for licensing reasons (it's
> > partially generated from external code, which falls foul of the GPL's
> > "preferred form for modifications" rule).
> 
> Indeed part of the AVX2 code in the Keccak code package is an extension
> of the implementation in OpenSSL (written by Andy Polyakov). The
> assembly code is generated by a Perl script, and we extended it to fit
> in the KCP's internal API.
> 
> Would it solve this licensing problem if we remap our extensions to the
> Perl script, which would then become "the source"?

The GPLv2 requires "the preferred form of the work for making
modifications to it".  If that form is the Perl script, then yes, that
would be sufficient.  If your code is dissimilar enough that editing it
directly is better than editing the Perl script, then it might already
meet the definition.

I don't do assembly programming, so I don't know what forms one
generally wants for editing assembly.  Apparently OpenSSL wants a Perl
script, but that is, I understand, less common.  What would you use if
you were going to improve it?

> On 12/06/18 00:35, brian m. carlson wrote:
> > While I think K12 is an interesting algorithm, I'm not sure we're
> > going to get as good of performance out of it as we might want due to
> > the lack of implementations.
> 
> Implementation availability is indeed important. The effort to transform
> an implementation of SHAKE128 into one of K12 is limited due to the
> reuse of their main components (round function, sponge construction). So
> the availability of SHA-3/Keccak implementations can benefit that of K12
> if there is sufficient interest. E.g., the SHA-3/Keccak instructions in
> ARMv8.2 can speed up K12 as well.

That's good to know.  I wasn't aware that ARM was providing Keccak
instructions, but it's good to see that new chips are providing them.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 867 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Hash algorithm analysis
  2018-06-13 23:58         ` brian m. carlson
@ 2018-06-15 10:33           ` Gilles Van Assche
  0 siblings, 0 replies; 66+ messages in thread
From: Gilles Van Assche @ 2018-06-15 10:33 UTC (permalink / raw)
  To: brian m. carlson
  Cc: Jonathan Nieder, git, Johannes Schindelin, demerphq,
	Linus Torvalds, Adam Langley, Keccak Team

On 14/06/18 01:58, brian m. carlson wrote:
>>> I imported the optimized 64-bit implementation of KangarooTwelve.
>>> The AVX2 implementation was not considered for licensing reasons
>>> (it's partially generated from external code, which falls foul of
>>> the GPL's "preferred form for modifications" rule). 
>>
>> Indeed part of the AVX2 code in the Keccak code package is an
>> extension of the implementation in OpenSSL (written by Andy
>> Polyakov). The assembly code is generated by a Perl script, and we
>> extended it to fit in the KCP's internal API.
>>
>> Would it solve this licensing problem if we remap our extensions to
>> the Perl script, which would then become "the source"? 
>
> The GPLv2 requires "the preferred form of the work for making
> modifications to it". If that form is the Perl script, then yes, that
> would be sufficient. If your code is dissimilar enough that editing it
> directly is better than editing the Perl script, then it might already
> meet the definition.
>
> I don't do assembly programming, so I don't know what forms one
> generally wants for editing assembly. Apparently OpenSSL wants a Perl
> script, but that is, I understand, less common. What would you use if
> you were going to improve it?

The Perl script would be more flexible in case one needs to improve the
implementation. It allows one to use meaningful symbolic names for the
registers. My colleague Ronny, who did the extension, worked directly
with physical register names and considered the output of the Perl
script as his source. But this extension could probably be done also at
the level of the Perl script.

Kind regards,
Gilles


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Hash algorithm analysis
  2018-06-11 21:19   ` Hash algorithm analysis Ævar Arnfjörð Bjarmason
@ 2018-06-21  8:20     ` Johannes Schindelin
  2018-06-21 22:39     ` brian m. carlson
  1 sibling, 0 replies; 66+ messages in thread
From: Johannes Schindelin @ 2018-06-21  8:20 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: brian m. carlson, git, Adam Langley, Jeff King, Mike Hommey,
	Brandon Williams, Linus Torvalds, Jonathan Nieder, Stefan Beller,
	Jonathan Tan, Junio Hamano

[-- Attachment #1: Type: text/plain, Size: 10948 bytes --]

Hi Ævar,

On Mon, 11 Jun 2018, Ævar Arnfjörð Bjarmason wrote:

> On Sat, Jun 09 2018, brian m. carlson wrote:
> 
> [Expanding the CC list to what we had in the last "what hash" thread[1]
> last year].
> 
> > == Discussion of Candidates
> >
> > I've implemented and tested the following algorithms, all of which are
> > 256-bit (in alphabetical order):
> >
> > * BLAKE2b (libb2)
> > * BLAKE2bp (libb2)
> > * KangarooTwelve (imported from the Keccak Code Package)
> > * SHA-256 (OpenSSL)
> > * SHA-512/256 (OpenSSL)
> > * SHA3-256 (OpenSSL)
> > * SHAKE128 (OpenSSL)
> >
> > I also rejected some other candidates.  I couldn't find any reference or
> > implementation of SHA256×16, so I didn't implement it.  I didn't
> > consider SHAKE256 because it is nearly identical to SHA3-256 in almost
> > all characteristics (including performance).
> >
> > I imported the optimized 64-bit implementation of KangarooTwelve.  The
> > AVX2 implementation was not considered for licensing reasons (it's
> > partially generated from external code, which falls foul of the GPL's
> > "preferred form for modifications" rule).
> >
> > === BLAKE2b and BLAKE2bp
> >
> > These are the non-parallelized and parallelized 64-bit variants of
> > BLAKE2.
> >
> > Benefits:
> > * Both algorithms provide 256-bit preimage resistance.
> >
> > Downsides:
> > * Some people are uncomfortable that the security margin has been
> >   decreased from the original SHA-3 submission, although it is still
> >   considered secure.
> > * BLAKE2bp, as implemented in libb2, uses OpenMP (and therefore
> >   multithreading) by default.  It was no longer possible to run the
> >   testsuite with -j3 on my laptop in this configuration.
> >
> > === Keccak-based Algorithms
> >
> > SHA3-256 is the 256-bit Keccak algorithm with 24 rounds, processing 136
> > bytes at a time.  SHAKE128 is an extendable output function with 24
> > rounds, processing 168 bytes at a time.  KangarooTwelve is an extendable
> > output function with 12 rounds, processing 136 bytes at a time.
> >
> > Benefits:
> > * SHA3-256 provides 256-bit preimage resistance.
> > * SHA3-256 has been heavily studied and is believed to have a large
> >   security margin.
> >
> > I noted the following downsides:
> > * There's a lack of a availability of KangarooTwelve in other
> >   implementations.  It may be the least available option in terms of
> >   implementations.
> > * Some people are uncomfortable that the security margin of
> >   KangarooTwelve has been decreased, although it is still considered
> >   secure.
> > * SHAKE128 and KangarooTwelve provide only 128-bit preimage resistance.
> >
> > === SHA-256 and SHA-512/256
> >
> > These are the 32-bit and 64-bit SHA-2 algorithms that are 256 bits in
> > size.
> >
> > I noted the following benefits:
> > * Both algorithms are well known and heavily analyzed.
> > * Both algorithms provide 256-bit preimage resistance.
> >
> > == Implementation Support
> >
> > |===
> > | Implementation | OpenSSL | libb2 | NSS | ACC | gcrypt | Nettle| CL  |
> > | SHA-1          | 🗸       |       | 🗸   | 🗸   | 🗸      | 🗸     | {1} |
> > | BLAKE2b        | f       | 🗸     |     |     | 🗸      |       | {2} |
> > | BLAKE2bp       |         | 🗸     |     |     |        |       |     |
> > | KangarooTwelve |         |       |     |     |        |       |     |
> > | SHA-256        | 🗸       |       | 🗸   |  🗸  | 🗸      | 🗸     | {1} |
> > | SHA-512/256    | 🗸       |       |     |     |        | 🗸     | {3} |
> > | SHA3-256       | 🗸       |       |     |     | 🗸      | 🗸     | {4} |
> > | SHAKE128       | 🗸       |       |     |     | 🗸      |       | {5} |
> > |===
> >
> > f: future version (expected 1.2.0)
> > ACC: Apple Common Crypto
> > CL: Command-line
> >
> > :1: OpenSSL, coreutils, Perl Digest::SHA.
> > :2: OpenSSL, coreutils.
> > :3: OpenSSL
> > :4: OpenSSL, Perl Digest::SHA3.
> > :5: Perl Digest::SHA3.
> >
> > === Performance Analysis
> >
> > The test system used below is my personal laptop, a 2016 Lenovo ThinkPad
> > X1 Carbon with an Intel i7-6600U CPU (2.60 GHz) running Debian unstable.
> >
> > I implemented a test tool helper to compute speed much like OpenSSL
> > does.  Below is a comparison of speeds.  The columns indicate the speed
> > in KiB/s for chunks of the given size.  The runs are representative of
> > multiple similar runs.
> >
> > 256 and 1024 bytes were chosen to represent common tree and commit
> > object sizes and the 8 KiB is an approximate average blob size.
> >
> > Algorithms are sorted by performance on the 1 KiB column.
> >
> > |===
> > | Implementation             | 256 B  | 1 KiB  | 8 KiB  | 16 KiB |
> > | SHA-1 (OpenSSL)            | 513963 | 685966 | 748993 | 754270 |
> > | BLAKE2b (libb2)            | 488123 | 552839 | 576246 | 579292 |
> > | SHA-512/256 (OpenSSL)      | 181177 | 349002 | 499113 | 495169 |
> > | BLAKE2bp (libb2)           | 139891 | 344786 | 488390 | 522575 |
> > | SHA-256 (OpenSSL)          | 264276 | 333560 | 357830 | 355761 |
> > | KangarooTwelve             | 239305 | 307300 | 355257 | 364261 |
> > | SHAKE128 (OpenSSL)         | 154775 | 253344 | 337811 | 346732 |
> > | SHA3-256 (OpenSSL)         | 128597 | 185381 | 198931 | 207365 |
> > | BLAKE2bp (libb2; threaded) |  12223 |  49306 | 132833 | 179616 |
> > |===
> >
> > SUPERCOP (a crypto benchmarking tool;
> > https://bench.cr.yp.to/results-hash.html) has also benchmarked these
> > algorithms.  Note that BLAKE2bp is not listed, KangarooTwelve is k12,
> > SHA-512/256 is equivalent to sha512, SHA3-256 is keccakc512, and SHAKE128 is
> > keccakc256.
> >
> > Information is for kizomba, a Kaby Lake system.  Counts are in cycles
> > per byte (smaller is better; sorted by 1536 B column):
> >
> > |===
> > | Algorithm      | 576 B | 1536 B | 4096 B | long |
> > | BLAKE2b        |  3.51 |   3.10 |   3.08 | 3.07 |
> > | SHA-1          |  4.36 |   3.81 |   3.59 | 3.49 |
> > | KangarooTwelve |  4.99 |   4.57 |   4.13 | 3.86 |
> > | SHA-512/256    |  6.39 |   5.76 |   5.31 | 5.05 |
> > | SHAKE128       |  8.23 |   7.67 |   7.17 | 6.97 |
> > | SHA-256        |  8.90 |   8.08 |   7.77 | 7.59 |
> > | SHA3-256       | 10.26 |   9.15 |   8.84 | 8.57 |
> > |===
> >
> > Numbers for genji262, an AMD Ryzen System, which has SHA acceleration:
> >
> > |===
> > | Algorithm      | 576 B | 1536 B | 4096 B | long |
> > | SHA-1          |  1.87 |   1.69 |   1.60 | 1.54 |
> > | SHA-256        |  1.95 |   1.72 |   1.68 | 1.64 |
> > | BLAKE2b        |  2.94 |   2.59 |   2.59 | 2.59 |
> > | KangarooTwelve |  4.09 |   3.65 |   3.35 | 3.17 |
> > | SHA-512/256    |  5.54 |   5.08 |   4.71 | 4.48 |
> > | SHAKE128       |  6.95 |   6.23 |   5.71 | 5.49 |
> > | SHA3-256       |  8.29 |   7.35 |   7.04 | 6.81 |
> > |===
> >
> > Note that no mid- to high-end Intel processors provide acceleration.
> > AMD Ryzen and some ARM64 processors do.
> >
> > == Summary
> >
> > The algorithms with the greatest implementation availability are
> > SHA-256, SHA3-256, BLAKE2b, and SHAKE128.
> >
> > In terms of command-line availability, BLAKE2b, SHA-256, SHA-512/256,
> > and SHA3-256 should be available in the near future on a reasonably
> > small Debian, Ubuntu, or Fedora install.
> >
> > As far as security, the most conservative choices appear to be SHA-256,
> > SHA-512/256, and SHA3-256.
> >
> > The performance winners are BLAKE2b unaccelerated and SHA-256 accelerated.
> 
> This is a great summary. Thanks.
> 
> In case it's not apparent from what follows, I have a bias towards
> SHA-256. Reasons for that, to summarize some of the discussion the last
> time around[1], and to add more details:
> 
> == Popularity
> 
> Other things being equal we should be biased towards whatever's in the
> widest use & recommended fon new projects.
> 
> I fear that if e.g. git had used whatever at time was to SHA-1 as
> BLAKE2b is to SHA-256 now, we might not even know that it's broken (or
> had the sha1collisiondetection work to fall back on), since researchers
> are less likely to look at algorithms that aren't in wide use.
> 
> SHA-256 et al were published in 2001 and has ~20k results on Google
> Scholar, compared to ~150 for BLAKE2b[4], published in 2008 (but ~1.2K
> for "BLAKE2").
> 
> Between the websites of Intel, AMD & ARM there are thousands of results
> for SHA-256 (and existing in-silicon acceleration). There's exactly one
> result on all three for BLAKE2b (on amd.com, in the context of a laundry
> list of hash algorithms in some presentation.
> 
> Since BLAKE2b lost the SHA-3 competition to Keccak it seems impossible
> that it'll get ever get anywhere close to the same scrutiny or support
> in silicon as one of the SHA families.
> 
> Which brings me to the next section...
> 
> == Hardware acceleration
> 
> The only widely deployed HW acceleration is for the SHA-1 and SHA-256
> from the SHA-2 family[5], but notably nothing from the newer SHA-3
> family (released in 2015).
> 
> It seems implausible that anything except SHA-3 will get future HW
> acceleration given the narrow scope of current HW acceleration
> v.s. existing hash algorithms.
> 
> As noted in the thread from last year[1] most git users won't even
> notice if the hashing is faster, but it does matter for some big users
> (big monorepos), so having the out of purchasing hardware to make things
> faster today is great, and given how these new instruction set
> extensions get rolled out it seems inevitable that this'll be available
> in all consumer CPUs within 5-10 years.
> 
> == Age
> 
> Similar to "popularity" it seems better to bias things towards a hash
> that's been out there for a while, i.e. it would be too early to pick
> SHA-3.
> 
> The hash transitioning plan, once implemented, also makes it easier to
> switch to something else in the future, so we shouldn't be in a rush to
> pick some newer hash because we'll need to keep it forever, we can
> always do another transition in another 10-15 years.
> 
> == Conclusion
> 
> For all the above reasons I think we should pick SHA-256.
> 
> 1. https://public-inbox.org/git/87y3ss8n4h.fsf@gmail.com/#t
> 2. https://github.com/cr-marcstevens/sha1collisiondetection
> 3. https://scholar.google.nl/scholar?hl=en&as_sdt=0%2C5&q=SHA-256&btnG=
> 4. https://scholar.google.nl/scholar?hl=en&as_sdt=0%2C5&q=BLAKE2b&btnG=
> 5. https://en.wikipedia.org/wiki/Intel_SHA_extensions

I agree with that reasoning.

More importantly, my cryptography researcher colleagues agree with this
assessment, and I do trust them quite a bit (you know one of them very
well already as we, ahem, *might* be using his code for SHA-1 collision
detection all the time now *cough, cough*).

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Hash algorithm analysis
  2018-06-11 21:19   ` Hash algorithm analysis Ævar Arnfjörð Bjarmason
  2018-06-21  8:20     ` Johannes Schindelin
@ 2018-06-21 22:39     ` brian m. carlson
  1 sibling, 0 replies; 66+ messages in thread
From: brian m. carlson @ 2018-06-21 22:39 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Adam Langley, Jeff King, Mike Hommey, Brandon Williams,
	Linus Torvalds, Jonathan Nieder, Stefan Beller, Jonathan Tan,
	Junio Hamano, Johannes Schindelin

[-- Attachment #1: Type: text/plain, Size: 2700 bytes --]

On Mon, Jun 11, 2018 at 11:19:10PM +0200, Ævar Arnfjörð Bjarmason wrote:
> This is a great summary. Thanks.
> 
> In case it's not apparent from what follows, I have a bias towards
> SHA-256. Reasons for that, to summarize some of the discussion the last
> time around[1], and to add more details:

To summarize my view, I think my ordered preference of hashes is
BLAKE2b, SHA-256, and SHA3-256.

I agree with AGL that all three of these options are secure and will be
for some time.  I believe there's sufficient literature on all three of
them and there will continue to be for some time.  I've seen and read
papers from the IACR archves on all three of them, and because all three
are widely used, they'll continue to be interesting to cryptologists for
a long time to come.

I'm personally partial to having full preimage resistance, which I think
makes SHAKE128 less appealing.  SHAKE128 also has fewer crypto library
implementations than the others.

My rationale for this ordering is essentially performance.  BLAKE2b is
quite fast on all known hardware, and it is almost as fast as an
accelerated SHA-256.  The entire rationale for BLAKE2b is to give people
a secure algorithm that is faster than MD5 and SHA-1, so there's no
reason to use an insecure algorithm.  It also greatly outperforms the
other two even in pure C, which matters for the fallback implementation
we'll need to ship.

I tend to think SHA3-256 is the most conservative of these choices as
far as security.  It has had an open development process and has a large
security margin.  It has gained a lot of cryptanalysis and come out
quite well, and the versatility of the Keccak sponge construction means
that it's going to get a lot more attention.  Pretty much the only
downside is its performance relative to the other two.

I placed SHA-256 in the middle because of its potential for acceleration
on Intel hardware.  I know such changes are coming, but they won't
likely be here for another two years.

While hashing performance isn't a huge concern for Git now, I had
planned to implement an rsync-based delta algorithm for large files that
could make storing some large files in a Git repository viable (of
course, there will still be many cases where Git LFS and git-annex are
better).  The algorithm is extremely sensitive to hash performance and
would simply not be viable with an unaccelerated SHA-256 or SHA3-256,
although it would perform reasonably well with BLAKE2b.

Having said that, I'd be happy with any of the three, and would support
a consensus around any of them as well.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 867 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Hash algorithm analysis
  2018-06-11 19:29   ` Jonathan Nieder
  2018-06-11 20:20     ` Linus Torvalds
  2018-06-11 22:35     ` brian m. carlson
@ 2018-07-20 21:52     ` brian m. carlson
  2018-07-21  0:31       ` Jonathan Nieder
                         ` (3 more replies)
  2 siblings, 4 replies; 66+ messages in thread
From: brian m. carlson @ 2018-07-20 21:52 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: git, Johannes Schindelin, demerphq, Linus Torvalds, Adam Langley,
	The Keccak Team

[-- Attachment #1: Type: text/plain, Size: 1334 bytes --]

On Mon, Jun 11, 2018 at 12:29:42PM -0700, Jonathan Nieder wrote:
> My understanding of the discussion so far:
> 
> Keccak team encourages us[1] to consider a variant like K12 instead of
> SHA3.
> 
> AGL explains[2] that the algorithms considered all seem like
> reasonable choices and we should decide using factors like
> implementation ease and performance.
> 
> If we choose a Keccak-based function, AGL also[3] encourages using a
> variant like K12 instead of SHA3.
> 
> Dscho strongly prefers[4] SHA-256, because of
> - wide implementation availability, including in future hardware
> - has been widely analyzed
> - is fast
> 
> Yves Orton and Linus Torvalds prefer[5] SHA3 over SHA2 because of how
> it is constructed.

I know this discussion has sort of petered out, but I'd like to see if
we can revive it.  I'm writing index v3 and having a decision would help
me write tests for it.

To summarize the discussion that's been had in addition to the above,
Ævar has also stated a preference for SHA-256 and I would prefer BLAKE2b
over SHA-256 over SHA3-256, although any of them would be fine.

Are there other contributors who have a strong opinion?  Are there
things I can do to help us coalesce around an option?
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 867 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Hash algorithm analysis
  2018-07-20 21:52     ` brian m. carlson
@ 2018-07-21  0:31       ` Jonathan Nieder
  2018-07-21 19:52       ` Ævar Arnfjörð Bjarmason
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 66+ messages in thread
From: Jonathan Nieder @ 2018-07-21  0:31 UTC (permalink / raw)
  To: brian m. carlson
  Cc: git, Johannes Schindelin, demerphq, Linus Torvalds, Adam Langley,
	The Keccak Team

Hi,

brian m. carlson wrote:

> I know this discussion has sort of petered out, but I'd like to see if
> we can revive it.  I'm writing index v3 and having a decision would help
> me write tests for it.

Nice!  That's awesome.

> To summarize the discussion that's been had in addition to the above,
> Ævar has also stated a preference for SHA-256 and I would prefer BLAKE2b
> over SHA-256 over SHA3-256, although any of them would be fine.
>
> Are there other contributors who have a strong opinion?  Are there
> things I can do to help us coalesce around an option?

My advice would be to go with BLAKE2b.  If anyone objects, we can deal
with their objection at that point (and I volunteer to help with
cleaning up any mess in rewriting test cases that this advice causes).

Full disclosure: my preference order (speaking for myself and no one
else) is

 K12 > BLAKE2bp-256 > SHA-256x16 > BLAKE2b > SHA-256 > SHA-512/256 > SHA3-256

so I'm not just asking you to go with my favorite. ;-)

Jonathan

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Hash algorithm analysis
  2018-07-20 21:52     ` brian m. carlson
  2018-07-21  0:31       ` Jonathan Nieder
@ 2018-07-21 19:52       ` Ævar Arnfjörð Bjarmason
  2018-07-21 20:25         ` brian m. carlson
  2018-07-21 22:38       ` Johannes Schindelin
  2018-07-24 19:01       ` Edward Thomson
  3 siblings, 1 reply; 66+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-07-21 19:52 UTC (permalink / raw)
  To: brian m. carlson
  Cc: Jonathan Nieder, git, Johannes Schindelin, demerphq,
	Linus Torvalds, Adam Langley, The Keccak Team


On Fri, Jul 20 2018, brian m. carlson wrote:

> On Mon, Jun 11, 2018 at 12:29:42PM -0700, Jonathan Nieder wrote:
>> My understanding of the discussion so far:
>>
>> Keccak team encourages us[1] to consider a variant like K12 instead of
>> SHA3.
>>
>> AGL explains[2] that the algorithms considered all seem like
>> reasonable choices and we should decide using factors like
>> implementation ease and performance.
>>
>> If we choose a Keccak-based function, AGL also[3] encourages using a
>> variant like K12 instead of SHA3.
>>
>> Dscho strongly prefers[4] SHA-256, because of
>> - wide implementation availability, including in future hardware
>> - has been widely analyzed
>> - is fast
>>
>> Yves Orton and Linus Torvalds prefer[5] SHA3 over SHA2 because of how
>> it is constructed.
>
> I know this discussion has sort of petered out, but I'd like to see if
> we can revive it.  I'm writing index v3 and having a decision would help
> me write tests for it.
>
> To summarize the discussion that's been had in addition to the above,
> Ævar has also stated a preference for SHA-256 and I would prefer BLAKE2b
> over SHA-256 over SHA3-256, although any of them would be fine.
>
> Are there other contributors who have a strong opinion?  Are there
> things I can do to help us coalesce around an option?

I have a vague recollection of suggesting something similar in the past,
but can't find that E-Mail (and maybe it never happened), but for
testing purposes isn't in simplest if we just have some "test SHA-1"
algorithm where we pretent that all inputs like "STRING" are really
"PREFIX-STRING" for the purposes of hashing, or fake shortening /
lengthening the hash to test arbitrary lenghts of N (just by repeating
the hash from the beginning is probably good enough...).

That would make such patches easier to review, since we wouldn't need to
carry hundreds/thousands of lines of dense hashing code, but a more
trivial wrapper around SHA-1, and we could have some test mode where we
could compile & run tests with an arbitrary hash length to make sure
everything's future proof even after we move to NewHash.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Hash algorithm analysis
  2018-07-21 19:52       ` Ævar Arnfjörð Bjarmason
@ 2018-07-21 20:25         ` brian m. carlson
  0 siblings, 0 replies; 66+ messages in thread
From: brian m. carlson @ 2018-07-21 20:25 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Jonathan Nieder, git, Johannes Schindelin, demerphq,
	Linus Torvalds, Adam Langley, The Keccak Team

[-- Attachment #1: Type: text/plain, Size: 2925 bytes --]

On Sat, Jul 21, 2018 at 09:52:05PM +0200, Ævar Arnfjörð Bjarmason wrote:
> 
> On Fri, Jul 20 2018, brian m. carlson wrote:
> > I know this discussion has sort of petered out, but I'd like to see if
> > we can revive it.  I'm writing index v3 and having a decision would help
> > me write tests for it.
> >
> > To summarize the discussion that's been had in addition to the above,
> > Ævar has also stated a preference for SHA-256 and I would prefer BLAKE2b
> > over SHA-256 over SHA3-256, although any of them would be fine.
> >
> > Are there other contributors who have a strong opinion?  Are there
> > things I can do to help us coalesce around an option?
> 
> I have a vague recollection of suggesting something similar in the past,
> but can't find that E-Mail (and maybe it never happened), but for
> testing purposes isn't in simplest if we just have some "test SHA-1"
> algorithm where we pretent that all inputs like "STRING" are really
> "PREFIX-STRING" for the purposes of hashing, or fake shortening /
> lengthening the hash to test arbitrary lenghts of N (just by repeating
> the hash from the beginning is probably good enough...).
> 
> That would make such patches easier to review, since we wouldn't need to
> carry hundreds/thousands of lines of dense hashing code, but a more
> trivial wrapper around SHA-1, and we could have some test mode where we
> could compile & run tests with an arbitrary hash length to make sure
> everything's future proof even after we move to NewHash.

I think Stefan suggested this approach.  It is viable for testing some
aspects of the code, but not others.  It doesn't work for synthesizing
partial collisions or the bisect tests (since bisect falls back to
object ID as a disambiguator).

I had tried this approach (using a single zero-byte as a prefix), but
for whatever reason, it ended up producing inconsistent results when I
hashed.  I'm unclear what went wrong in that approach, but I finally
discarded it after spending an hour or two staring at it.  I'm not
opposed to someone else providing it as an option, though.

Also, after feedback from Eric Sunshine, I decided to adopt an approach
for my hash-independent tests series that used the name of the hash
within the tests so that we could support additional algorithms (such as
a pseudo-SHA-1).  That work necessarily involves having a name for the
hash, which is why I haven't revisited it.

As for arbitrary hash sizes, there is some code which necessarily needs
to depend on a fixed hash size.  A lot of our Perl code matches
[0-9a-f]{40}, which needs to change.  There's no reason we couldn't
adopt such testing in the future, but it might end up being more
complicated than we want.  I have strived to reduce the dependence on
fixed-size constants wherever possible, though.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 867 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Hash algorithm analysis
  2018-07-20 21:52     ` brian m. carlson
  2018-07-21  0:31       ` Jonathan Nieder
  2018-07-21 19:52       ` Ævar Arnfjörð Bjarmason
@ 2018-07-21 22:38       ` Johannes Schindelin
  2018-07-21 23:09         ` Linus Torvalds
  2018-07-21 23:59         ` brian m. carlson
  2018-07-24 19:01       ` Edward Thomson
  3 siblings, 2 replies; 66+ messages in thread
From: Johannes Schindelin @ 2018-07-21 22:38 UTC (permalink / raw)
  To: brian m. carlson
  Cc: Jonathan Nieder, git, demerphq, Linus Torvalds, Adam Langley,
	The Keccak Team

[-- Attachment #1: Type: text/plain, Size: 1847 bytes --]

Hi Brian,

On Fri, 20 Jul 2018, brian m. carlson wrote:

> On Mon, Jun 11, 2018 at 12:29:42PM -0700, Jonathan Nieder wrote:
> > My understanding of the discussion so far:
> > 
> > Keccak team encourages us[1] to consider a variant like K12 instead of
> > SHA3.
> > 
> > AGL explains[2] that the algorithms considered all seem like
> > reasonable choices and we should decide using factors like
> > implementation ease and performance.
> > 
> > If we choose a Keccak-based function, AGL also[3] encourages using a
> > variant like K12 instead of SHA3.
> > 
> > Dscho strongly prefers[4] SHA-256, because of
> > - wide implementation availability, including in future hardware
> > - has been widely analyzed
> > - is fast
> > 
> > Yves Orton and Linus Torvalds prefer[5] SHA3 over SHA2 because of how
> > it is constructed.
> 
> I know this discussion has sort of petered out, but I'd like to see if
> we can revive it.  I'm writing index v3 and having a decision would help
> me write tests for it.
> 
> To summarize the discussion that's been had in addition to the above,
> Ævar has also stated a preference for SHA-256 and I would prefer BLAKE2b
> over SHA-256 over SHA3-256, although any of them would be fine.
> 
> Are there other contributors who have a strong opinion?  Are there
> things I can do to help us coalesce around an option?

Do you really want to value contributors' opinion more than
cryptographers'? I mean, that's exactly what got us into this hard-coded
SHA-1 mess in the first place.

And to set the record straight: I do not have a strong preference of the
hash algorithm. But cryprographers I have the incredible luck to have
access to, by virtue of being a colleague, did mention their preference.

I see no good reason to just blow their advice into the wind.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Hash algorithm analysis
  2018-07-21 22:38       ` Johannes Schindelin
@ 2018-07-21 23:09         ` Linus Torvalds
  2018-07-21 23:59         ` brian m. carlson
  1 sibling, 0 replies; 66+ messages in thread
From: Linus Torvalds @ 2018-07-21 23:09 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: brian m. carlson, Jonathan Nieder, Git Mailing List, demerphq,
	Adam Langley, keccak

On Sat, Jul 21, 2018 at 3:39 PM Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
>
> Do you really want to value contributors' opinion more than
> cryptographers'? I mean, that's exactly what got us into this hard-coded
> SHA-1 mess in the first place.

Don't be silly.

Other real cryptographers consider SHA256 to be a problem.

Really. It's amenable to the same hack on the internal hash that made
for the SHAttered break.

So your argument that "cryptographers prefer SHA256" is simply not true.

Your real argument is that know at least one cryptographer that you
work with that prefers it.

Don't try to make that into some generic "cryptographers prefer it".
It's not like cryptographers have issues with blake2b either, afaik.

And blake2b really _does_ have very real advantages.

If you can actually point to some "a large percentage of
cryptographers prefer it", you'd have a point.

But as it is, you don't have data, you have an anecdote, and you try
to use that anecdote to put down other peoples opinions.

Intellectually dishonest, that is.

           Linus

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Hash algorithm analysis
  2018-07-21 22:38       ` Johannes Schindelin
  2018-07-21 23:09         ` Linus Torvalds
@ 2018-07-21 23:59         ` brian m. carlson
  2018-07-22  9:34           ` Eric Deplagne
                             ` (2 more replies)
  1 sibling, 3 replies; 66+ messages in thread
From: brian m. carlson @ 2018-07-21 23:59 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Jonathan Nieder, git, demerphq, Linus Torvalds, Adam Langley,
	The Keccak Team

[-- Attachment #1: Type: text/plain, Size: 3089 bytes --]

On Sun, Jul 22, 2018 at 12:38:41AM +0200, Johannes Schindelin wrote:
> Do you really want to value contributors' opinion more than
> cryptographers'? I mean, that's exactly what got us into this hard-coded
> SHA-1 mess in the first place.

I agree (believe me, of all people, I agree) that hard-coding SHA-1 was
a bad choice in retrospect.  But I've solicited contributors' opinions
because the Git Project needs to make a decision *for this project*
about the algorithm we're going to use going forward.

> And to set the record straight: I do not have a strong preference of the
> hash algorithm. But cryprographers I have the incredible luck to have
> access to, by virtue of being a colleague, did mention their preference.

I don't know your colleagues, and they haven't commented here.  One
person that has commented here is Adam Langley.  It is my impression
(and anyone is free to correct me if I'm incorrect) that he is indeed a
cryptographer.  To quote him[0]:

  I think this group can safely assume that SHA-256, SHA-512, BLAKE2,
  K12, etc are all secure to the extent that I don't believe that making
  comparisons between them on that axis is meaningful. Thus I think the
  question is primarily concerned with performance and implementation
  availability.

  […]

  So, overall, none of these choices should obviously be excluded. The
  considerations at this point are not cryptographic and the tradeoff
  between implementation ease and performance is one that the git
  community would have to make.

I'm aware that cryptographers tend to prefer algorithms that have been
studied longer over ones that have been studied less.  They also prefer
algorithms built in the open to ones developed behind closed doors.

SHA-256 has the benefit that it has been studied for a long time, but it
was also designed in secret by the NSA.  SHA3-256 was created with
significant study in the open, but is not as mature.  BLAKE2b has been
incorporated into standards like Argon2, but has been weakened slightly
for performance.

I'm not sure that there's a really obvious choice here.

I'm at the point where to continue the work that I'm doing, I need to
make a decision.  I'm happy to follow the consensus if there is one, but
it does not appear that there is.

I will admit that I don't love making this decision by myself, because
right now, whatever I pick, somebody is going to be unhappy.  I want to
state, unambiguously, that I'm trying to make a decision that is in the
interests of the Git Project, the community, and our users.

I'm happy to wait a few more days to see if a consensus develops; if so,
I'll follow it.  If we haven't come to one by, say, Wednesday, I'll make
a decision and write my patches accordingly.  The community is free, as
always, to reject my patches if taking them is not in the interest of
the project.

[0] https://public-inbox.org/git/CAL9PXLzhPyE+geUdcLmd=pidT5P8eFEBbSgX_dS88knz2q_LSw@mail.gmail.com/
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 867 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Hash algorithm analysis
  2018-07-21 23:59         ` brian m. carlson
@ 2018-07-22  9:34           ` Eric Deplagne
  2018-07-22 14:21             ` brian m. carlson
  2018-07-22 15:23           ` Joan Daemen
  2018-07-23 12:40           ` demerphq
  2 siblings, 1 reply; 66+ messages in thread
From: Eric Deplagne @ 2018-07-22  9:34 UTC (permalink / raw)
  To: brian m. carlson, Johannes Schindelin, Jonathan Nieder, git,
	demerphq, Linus Torvalds, Adam Langley, The Keccak Team

[-- Attachment #1: Type: text/plain, Size: 3516 bytes --]

On Sat, 21 Jul 2018 23:59:41 +0000, brian m. carlson wrote:
> On Sun, Jul 22, 2018 at 12:38:41AM +0200, Johannes Schindelin wrote:
> > Do you really want to value contributors' opinion more than
> > cryptographers'? I mean, that's exactly what got us into this hard-coded
> > SHA-1 mess in the first place.
> 
> I agree (believe me, of all people, I agree) that hard-coding SHA-1 was
> a bad choice in retrospect.  But I've solicited contributors' opinions
> because the Git Project needs to make a decision *for this project*
> about the algorithm we're going to use going forward.
> 
> > And to set the record straight: I do not have a strong preference of the
> > hash algorithm. But cryprographers I have the incredible luck to have
> > access to, by virtue of being a colleague, did mention their preference.
> 
> I don't know your colleagues, and they haven't commented here.  One
> person that has commented here is Adam Langley.  It is my impression
> (and anyone is free to correct me if I'm incorrect) that he is indeed a
> cryptographer.  To quote him[0]:
> 
>   I think this group can safely assume that SHA-256, SHA-512, BLAKE2,
>   K12, etc are all secure to the extent that I don't believe that making
>   comparisons between them on that axis is meaningful. Thus I think the
>   question is primarily concerned with performance and implementation
>   availability.
> 
>   […]
> 
>   So, overall, none of these choices should obviously be excluded. The
>   considerations at this point are not cryptographic and the tradeoff
>   between implementation ease and performance is one that the git
>   community would have to make.

  Am I completely out of the game, or the statement that
    "the considerations at this point are not cryptographic"
  is just the wrongest ?

  I mean, if that was true, would we not be sticking to SHA1 ?

> I'm aware that cryptographers tend to prefer algorithms that have been
> studied longer over ones that have been studied less.  They also prefer
> algorithms built in the open to ones developed behind closed doors.
> 
> SHA-256 has the benefit that it has been studied for a long time, but it
> was also designed in secret by the NSA.  SHA3-256 was created with
> significant study in the open, but is not as mature.  BLAKE2b has been
> incorporated into standards like Argon2, but has been weakened slightly
> for performance.
> 
> I'm not sure that there's a really obvious choice here.
> 
> I'm at the point where to continue the work that I'm doing, I need to
> make a decision.  I'm happy to follow the consensus if there is one, but
> it does not appear that there is.
> 
> I will admit that I don't love making this decision by myself, because
> right now, whatever I pick, somebody is going to be unhappy.  I want to
> state, unambiguously, that I'm trying to make a decision that is in the
> interests of the Git Project, the community, and our users.
> 
> I'm happy to wait a few more days to see if a consensus develops; if so,
> I'll follow it.  If we haven't come to one by, say, Wednesday, I'll make
> a decision and write my patches accordingly.  The community is free, as
> always, to reject my patches if taking them is not in the interest of
> the project.
> 
> [0] https://public-inbox.org/git/CAL9PXLzhPyE+geUdcLmd=pidT5P8eFEBbSgX_dS88knz2q_LSw@mail.gmail.com/
> -- 
> brian m. carlson: Houston, Texas, US
> OpenPGP: https://keybase.io/bk2204



-- 
  Eric Deplagne

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Hash algorithm analysis
  2018-07-22  9:34           ` Eric Deplagne
@ 2018-07-22 14:21             ` brian m. carlson
  2018-07-22 14:55               ` Eric Deplagne
  0 siblings, 1 reply; 66+ messages in thread
From: brian m. carlson @ 2018-07-22 14:21 UTC (permalink / raw)
  To: Eric Deplagne
  Cc: Johannes Schindelin, Jonathan Nieder, git, demerphq,
	Linus Torvalds, Adam Langley, The Keccak Team

[-- Attachment #1: Type: text/plain, Size: 1644 bytes --]

On Sun, Jul 22, 2018 at 11:34:42AM +0200, Eric Deplagne wrote:
> On Sat, 21 Jul 2018 23:59:41 +0000, brian m. carlson wrote:
> > I don't know your colleagues, and they haven't commented here.  One
> > person that has commented here is Adam Langley.  It is my impression
> > (and anyone is free to correct me if I'm incorrect) that he is indeed a
> > cryptographer.  To quote him[0]:
> > 
> >   I think this group can safely assume that SHA-256, SHA-512, BLAKE2,
> >   K12, etc are all secure to the extent that I don't believe that making
> >   comparisons between them on that axis is meaningful. Thus I think the
> >   question is primarily concerned with performance and implementation
> >   availability.
> > 
> >   […]
> > 
> >   So, overall, none of these choices should obviously be excluded. The
> >   considerations at this point are not cryptographic and the tradeoff
> >   between implementation ease and performance is one that the git
> >   community would have to make.
> 
>   Am I completely out of the game, or the statement that
>     "the considerations at this point are not cryptographic"
>   is just the wrongest ?
> 
>   I mean, if that was true, would we not be sticking to SHA1 ?

I snipped a portion of the context, but AGL was referring to the
considerations involved in choosing from the proposed ones for NewHash.
In context, he meant that the candidates for NewHash “are all secure”
and are therefore a better choice than SHA-1.

I think we can all agree that SHA-1 is weak and should be replaced.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 867 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Hash algorithm analysis
  2018-07-22 14:21             ` brian m. carlson
@ 2018-07-22 14:55               ` Eric Deplagne
  2018-07-26 10:05                 ` Johannes Schindelin
  0 siblings, 1 reply; 66+ messages in thread
From: Eric Deplagne @ 2018-07-22 14:55 UTC (permalink / raw)
  To: brian m. carlson, Eric Deplagne, Johannes Schindelin,
	Jonathan Nieder, git, demerphq, Linus Torvalds, Adam Langley,
	The Keccak Team

[-- Attachment #1: Type: text/plain, Size: 1926 bytes --]

On Sun, 22 Jul 2018 14:21:48 +0000, brian m. carlson wrote:
> On Sun, Jul 22, 2018 at 11:34:42AM +0200, Eric Deplagne wrote:
> > On Sat, 21 Jul 2018 23:59:41 +0000, brian m. carlson wrote:
> > > I don't know your colleagues, and they haven't commented here.  One
> > > person that has commented here is Adam Langley.  It is my impression
> > > (and anyone is free to correct me if I'm incorrect) that he is indeed a
> > > cryptographer.  To quote him[0]:
> > > 
> > >   I think this group can safely assume that SHA-256, SHA-512, BLAKE2,
> > >   K12, etc are all secure to the extent that I don't believe that making
> > >   comparisons between them on that axis is meaningful. Thus I think the
> > >   question is primarily concerned with performance and implementation
> > >   availability.
> > > 
> > >   […]
> > > 
> > >   So, overall, none of these choices should obviously be excluded. The
> > >   considerations at this point are not cryptographic and the tradeoff
> > >   between implementation ease and performance is one that the git
> > >   community would have to make.
> > 
> >   Am I completely out of the game, or the statement that
> >     "the considerations at this point are not cryptographic"
> >   is just the wrongest ?
> > 
> >   I mean, if that was true, would we not be sticking to SHA1 ?
> 
> I snipped a portion of the context, but AGL was referring to the
> considerations involved in choosing from the proposed ones for NewHash.
> In context, he meant that the candidates for NewHash “are all secure”
> and are therefore a better choice than SHA-1.

  Maybe a little bit sensitive, but I really did read
    "we don't care if it's weak or strong, that's not the matter".

> I think we can all agree that SHA-1 is weak and should be replaced.
> -- 
> brian m. carlson: Houston, Texas, US
> OpenPGP: https://keybase.io/bk2204



-- 
  Eric Deplagne

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Hash algorithm analysis
  2018-07-21 23:59         ` brian m. carlson
  2018-07-22  9:34           ` Eric Deplagne
@ 2018-07-22 15:23           ` Joan Daemen
  2018-07-22 18:54             ` Adam Langley
  2018-07-26 10:31             ` Johannes Schindelin
  2018-07-23 12:40           ` demerphq
  2 siblings, 2 replies; 66+ messages in thread
From: Joan Daemen @ 2018-07-22 15:23 UTC (permalink / raw)
  To: brian m. carlson, Johannes Schindelin, Jonathan Nieder, git,
	demerphq, Linus Torvalds, Adam Langley
  Cc: Keccak Team

Dear all,

I wanted to react to some statements I read in this discussion. But
first let me introduce myself. I'm Joan Daemen and I'm working in
symmetric cryptography since 1988. Vincent Rijmen and I designed
Rijndael that was selected to become AES and Guido Bertoni, Michael
Peeters and Gilles Van Assche and I (the Keccak team, later extended
with Ronny Van Keer) designed Keccak that was selected to become SHA3.
Of course as a member of the Keccak team I'm biased in this discussion
but I'll try to keep it factual.

Adam Langley says:

  I think this group can safely assume that SHA-256, SHA-512, BLAKE2, K12, etc are all secure to the extent that I don't believe that making
  comparisons between them on that axis is meaningful.

If never any cryptographic algorithms would be broken, this would be
true. Actually, one can manage the risk by going for cryptographic
algorithms with higher security assurance. In symmetric crypto one
compares security assurance of cryptographic algorithms by the amount of
third-party cryptanalysis, and a good indication of that is the number
of peer-reviewed papers published.

People tend to believe that the SHA2 functions have received more
third-party cryptanalysis than Keccak, but this is not supported by the
facts. We recently did a count of number of cryptanalysis papers for
different hash functions and found the following:

- Keccak: 35 third-party cryptanalysis papers dealing with the
permutation underlying Keccak, most of them at venues with peer review
(see https://keccak.team/third_party.html) This cryptanalysis carries
over to K12 as it is a tree hashing mode built on top of a reduced-round
Keccak variant.

- SHA-256 and SHA-512 together: we found 21 third-party cryptanalysis
papers dealing with the compression functions of SHA-256 or SHA-512.

- BLAKE2: the BLAKE2 webpage blake2.net lists 4 third-party
cryptanalysis papers. There are also a handful of cryptanalysis papers
on its predecessor BLAKE, but these results do not necessarily carry
over as the two compression functions in the different BLAKE2 variants
are different from the two compression functions in the different BLAKE
variants.

I was not surprised by the relatively low number of SHA-2 cryptanalysis
papers we found as during the SHA-3 competition all cryptanalysts were
focusing on SHA-3 candidates and after the competition attention shifted
to authenticated encryption.

Anyway, these numbers support the opinion that the safety margins taken
in K12 are better understood than those in SHA-256, SHA-512 and BLAKE2.

Adam Langley continues:

	Thus I think the question is primarily concerned with performance and implementation availability

Table 2 in our ACNS paper on K12 (available at
https://eprint.iacr.org/2016/770) shows that performance of K12 is quite
competitive. Moreover, there is a lot of code available under CC0
license in the Keccak Code Package on github
https://github.com/gvanas/KeccakCodePackage. If there is shortage of
code for some platforms in the short term, we will be happy to work on that.

In the long term, it is likely that the relative advantage of K12 will
increase as it has more potential for hardware acceleration, e.g., by
instruction set extension. This is thanks to the fact that it does not
use addition, as opposed to so-called addition-xor-rotation (ARX)
designs such as the SHA-2 and BLAKE2 families. This is already
illustrated in our Table 2 I referred to above, in the transition from
Skylake to SkylakeX.

Maybe also interesting for this discussion are the two notes we (Keccak
team) wrote on our choice to not go for ARX and the one on "open source
crypto" at https://keccak.team/2017/not_arx.html and
https://keccak.team/2017/open_source_crypto.html respectively.

Kind regards,

Joan Daemen

On 22/07/2018 01:59, brian m. carlson wrote:
> On Sun, Jul 22, 2018 at 12:38:41AM +0200, Johannes Schindelin wrote:
>> Do you really want to value contributors' opinion more than
>> cryptographers'? I mean, that's exactly what got us into this hard-coded
>> SHA-1 mess in the first place.
> I agree (believe me, of all people, I agree) that hard-coding SHA-1 was
> a bad choice in retrospect.  But I've solicited contributors' opinions
> because the Git Project needs to make a decision *for this project*
> about the algorithm we're going to use going forward.
>
>> And to set the record straight: I do not have a strong preference of the
>> hash algorithm. But cryprographers I have the incredible luck to have
>> access to, by virtue of being a colleague, did mention their preference.
> I don't know your colleagues, and they haven't commented here.  One
> person that has commented here is Adam Langley.  It is my impression
> (and anyone is free to correct me if I'm incorrect) that he is indeed a
> cryptographer.  To quote him[0]:
>
>   I think this group can safely assume that SHA-256, SHA-512, BLAKE2,
>   K12, etc are all secure to the extent that I don't believe that making
>   comparisons between them on that axis is meaningful. Thus I think the
>   question is primarily concerned with performance and implementation
>   availability.
>
>   […]
>
>   So, overall, none of these choices should obviously be excluded. The
>   considerations at this point are not cryptographic and the tradeoff
>   between implementation ease and performance is one that the git
>   community would have to make.
>
> I'm aware that cryptographers tend to prefer algorithms that have been
> studied longer over ones that have been studied less.  They also prefer
> algorithms built in the open to ones developed behind closed doors.
>
> SHA-256 has the benefit that it has been studied for a long time, but it
> was also designed in secret by the NSA.  SHA3-256 was created with
> significant study in the open, but is not as mature.  BLAKE2b has been
> incorporated into standards like Argon2, but has been weakened slightly
> for performance.
>
> I'm not sure that there's a really obvious choice here.
>
> I'm at the point where to continue the work that I'm doing, I need to
> make a decision.  I'm happy to follow the consensus if there is one, but
> it does not appear that there is.
>
> I will admit that I don't love making this decision by myself, because
> right now, whatever I pick, somebody is going to be unhappy.  I want to
> state, unambiguously, that I'm trying to make a decision that is in the
> interests of the Git Project, the community, and our users.
>
> I'm happy to wait a few more days to see if a consensus develops; if so,
> I'll follow it.  If we haven't come to one by, say, Wednesday, I'll make
> a decision and write my patches accordingly.  The community is free, as
> always, to reject my patches if taking them is not in the interest of
> the project.
>
> [0] https://public-inbox.org/git/CAL9PXLzhPyE+geUdcLmd=pidT5P8eFEBbSgX_dS88knz2q_LSw@mail.gmail.com/

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Hash algorithm analysis
  2018-07-22 15:23           ` Joan Daemen
@ 2018-07-22 18:54             ` Adam Langley
  2018-07-26 10:31             ` Johannes Schindelin
  1 sibling, 0 replies; 66+ messages in thread
From: Adam Langley @ 2018-07-22 18:54 UTC (permalink / raw)
  To: jda
  Cc: brian m. carlson, Johannes Schindelin, Jonathan Nieder,
	Git Mailing List, demerphq, Linus Torvalds, all

Somewhere upthread, Brian refers to me as a cryptographer. That's
flattering (thank you), but probably not really true even on a good
day. And certainly not true next to Joan Daemen. I do have experience
with crypto at scale and in ecosystems, though.

Joan's count of cryptanalysis papers is a reasonable way to try and
bring some quantitative clarity to an otherwise subjective topic. But
still, despite lacking any counterpoint to it, I find myself believing
that practical concerns are a stronger differentiater here.

But the world is in a position where a new, common hash function might
crystalise, and git could be the start of that. What that means for
the ecosystem is is that numerous libraries need to grow
implementations optimised for 3+ platforms and those platforms (esp
Intel) often need multiple versions (e.g. for different vector widths)
with code-size concerns pushing back at the same time. Intrinsics
still don't cut it, so that means hand-assembly and thus dealing with
gas vs Windows, CFI metadata, etc. Licensing differences mean that
code-sharing doesn't work nearly as well as one might hope.

Then complexity spreads upwards as testing matrices expand with the
combination of each signature algorithm with the new hash function,
options in numerous protocols etc.

In short, picking just one would be lovely.

For that reason, I've held back from SHA3 (which I consider distinct
from K12) because I didn't feel that it relieved enough pressure:
people who wanted more performance weren't going to be satisfied.
Other than that, I don't have strong feelings and, to be clear, K12
seems like a fine option.

But it does seem that a) there is probably not any more information to
discover that is going to alter your decision and b) waiting a short
to medium amount of time is probably not going to bring any definitive
developments either.

Cheers

AGL

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Hash algorithm analysis
  2018-07-21 23:59         ` brian m. carlson
  2018-07-22  9:34           ` Eric Deplagne
  2018-07-22 15:23           ` Joan Daemen
@ 2018-07-23 12:40           ` demerphq
  2018-07-23 12:48             ` Sitaram Chamarty
                               ` (2 more replies)
  2 siblings, 3 replies; 66+ messages in thread
From: demerphq @ 2018-07-23 12:40 UTC (permalink / raw)
  To: brian m. carlson, Johannes Schindelin, Jonathan Nieder, Git,
	Linus Torvalds, agl, keccak

On Sun, 22 Jul 2018 at 01:59, brian m. carlson
<sandals@crustytoothpaste.net> wrote:
> I will admit that I don't love making this decision by myself, because
> right now, whatever I pick, somebody is going to be unhappy.  I want to
> state, unambiguously, that I'm trying to make a decision that is in the
> interests of the Git Project, the community, and our users.
>
> I'm happy to wait a few more days to see if a consensus develops; if so,
> I'll follow it.  If we haven't come to one by, say, Wednesday, I'll make
> a decision and write my patches accordingly.  The community is free, as
> always, to reject my patches if taking them is not in the interest of
> the project.

Hi Brian.

I do not envy you this decision.

Personally I would aim towards pushing this decision out to the git
user base and facilitating things so we can choose whatever hash
function (and config) we wish, including ones not invented yet.

Failing that I would aim towards a hashing strategy which has the most
flexibility. Keccak for instance has the interesting property that its
security level is tunable, and that it can produce aribitrarily long
hashes.  Leaving aside other concerns raised elsewhere in this thread,
these two features alone seem to make it a superior choice for an
initial implementation. You can find bugs by selecting unusual hash
sizes, including very long ones, and you can provide ways to tune the
function to peoples security and speed preferences.  Someone really
paranoid can specify an unusually large round count and a very long
hash.

Also frankly I keep thinking that the ability to arbitrarily extend
the hash size has to be useful /somewhere/ in git.

cheers,
Yves
I am not a cryptographer.
-- 
perl -Mre=debug -e "/just|another|perl|hacker/"

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Hash algorithm analysis
  2018-07-23 12:40           ` demerphq
@ 2018-07-23 12:48             ` Sitaram Chamarty
  2018-07-23 12:55               ` demerphq
  2018-07-23 18:23               ` Linus Torvalds
  2018-07-23 17:57             ` Stefan Beller
  2018-07-23 18:35             ` Jonathan Nieder
  2 siblings, 2 replies; 66+ messages in thread
From: Sitaram Chamarty @ 2018-07-23 12:48 UTC (permalink / raw)
  To: demerphq, brian m. carlson, Johannes Schindelin, Jonathan Nieder,
	Git, Linus Torvalds, agl, keccak



On 07/23/2018 06:10 PM, demerphq wrote:
> On Sun, 22 Jul 2018 at 01:59, brian m. carlson
> <sandals@crustytoothpaste.net> wrote:
>> I will admit that I don't love making this decision by myself, because
>> right now, whatever I pick, somebody is going to be unhappy.  I want to
>> state, unambiguously, that I'm trying to make a decision that is in the
>> interests of the Git Project, the community, and our users.
>>
>> I'm happy to wait a few more days to see if a consensus develops; if so,
>> I'll follow it.  If we haven't come to one by, say, Wednesday, I'll make
>> a decision and write my patches accordingly.  The community is free, as
>> always, to reject my patches if taking them is not in the interest of
>> the project.
> 
> Hi Brian.
> 
> I do not envy you this decision.
> 
> Personally I would aim towards pushing this decision out to the git
> user base and facilitating things so we can choose whatever hash
> function (and config) we wish, including ones not invented yet.
> 
> Failing that I would aim towards a hashing strategy which has the most
> flexibility. Keccak for instance has the interesting property that its
> security level is tunable, and that it can produce aribitrarily long
> hashes.  Leaving aside other concerns raised elsewhere in this thread,
> these two features alone seem to make it a superior choice for an
> initial implementation. You can find bugs by selecting unusual hash
> sizes, including very long ones, and you can provide ways to tune the
> function to peoples security and speed preferences.  Someone really
> paranoid can specify an unusually large round count and a very long
> hash.
> 
> Also frankly I keep thinking that the ability to arbitrarily extend
> the hash size has to be useful /somewhere/ in git.

I would not suggest arbitrarily long hashes.  Not only would it
complicate a lot of code, it is not clear that it has any real benefit.

Plus, the code contortions required to support arbitrarily long hashes
would be more susceptible to potential bugs and exploits, simply by
being more complex code.  Why take chances?

I would suggest (a) hash size of 256 bits and (b) choice of any hash
function that can produce such a hash.  If people feel strongly that 256
bits may also turn out to be too small (really?) then a choice of 256 or
512, but not arbitrary sizes.

Sitaram
also not a cryptographer!

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Hash algorithm analysis
  2018-07-23 12:48             ` Sitaram Chamarty
@ 2018-07-23 12:55               ` demerphq
  2018-07-23 18:23               ` Linus Torvalds
  1 sibling, 0 replies; 66+ messages in thread
From: demerphq @ 2018-07-23 12:55 UTC (permalink / raw)
  To: Sitaram Chamarty
  Cc: brian m. carlson, Johannes Schindelin, Jonathan Nieder, Git,
	Linus Torvalds, agl, keccak

On Mon, 23 Jul 2018 at 14:48, Sitaram Chamarty <sitaramc@gmail.com> wrote:
> On 07/23/2018 06:10 PM, demerphq wrote:
> > On Sun, 22 Jul 2018 at 01:59, brian m. carlson
> > <sandals@crustytoothpaste.net> wrote:
> >> I will admit that I don't love making this decision by myself, because
> >> right now, whatever I pick, somebody is going to be unhappy.  I want to
> >> state, unambiguously, that I'm trying to make a decision that is in the
> >> interests of the Git Project, the community, and our users.
> >>
> >> I'm happy to wait a few more days to see if a consensus develops; if so,
> >> I'll follow it.  If we haven't come to one by, say, Wednesday, I'll make
> >> a decision and write my patches accordingly.  The community is free, as
> >> always, to reject my patches if taking them is not in the interest of
> >> the project.
> >
> > Hi Brian.
> >
> > I do not envy you this decision.
> >
> > Personally I would aim towards pushing this decision out to the git
> > user base and facilitating things so we can choose whatever hash
> > function (and config) we wish, including ones not invented yet.
> >
> > Failing that I would aim towards a hashing strategy which has the most
> > flexibility. Keccak for instance has the interesting property that its
> > security level is tunable, and that it can produce aribitrarily long
> > hashes.  Leaving aside other concerns raised elsewhere in this thread,
> > these two features alone seem to make it a superior choice for an
> > initial implementation. You can find bugs by selecting unusual hash
> > sizes, including very long ones, and you can provide ways to tune the
> > function to peoples security and speed preferences.  Someone really
> > paranoid can specify an unusually large round count and a very long
> > hash.
> >
> > Also frankly I keep thinking that the ability to arbitrarily extend
> > the hash size has to be useful /somewhere/ in git.
>
> I would not suggest arbitrarily long hashes.  Not only would it
> complicate a lot of code, it is not clear that it has any real benefit.

It has the benefit of armoring the code for the *next* hash change,
and making it clear that such decisions are arbitrary and should not
be depended on.

> Plus, the code contortions required to support arbitrarily long hashes
> would be more susceptible to potential bugs and exploits, simply by
> being more complex code.  Why take chances?

I think the benefits would outweight the risks.

> I would suggest (a) hash size of 256 bits and (b) choice of any hash
> function that can produce such a hash.  If people feel strongly that 256
> bits may also turn out to be too small (really?) then a choice of 256 or
> 512, but not arbitrary sizes.

I am aware of too many systems that cannot change their size and are
locked into woefully bad decisions that were made long ago to buy
this.

Making it a per-repo option, would eliminate assumptions and make for
a more secure and flexible tool.

Anyway, I am not going to do the work so my opinion is worth the price
of the paper I sent it on. :-)

cheers,
Yves

-- 
perl -Mre=debug -e "/just|another|perl|hacker/"

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Hash algorithm analysis
  2018-07-23 12:40           ` demerphq
  2018-07-23 12:48             ` Sitaram Chamarty
@ 2018-07-23 17:57             ` Stefan Beller
  2018-07-23 18:35             ` Jonathan Nieder
  2 siblings, 0 replies; 66+ messages in thread
From: Stefan Beller @ 2018-07-23 17:57 UTC (permalink / raw)
  To: demerphq
  Cc: brian m. carlson, Johannes Schindelin, Jonathan Nieder, git,
	Linus Torvalds, Adam Langley, keccak

On Mon, Jul 23, 2018 at 5:41 AM demerphq <demerphq@gmail.com> wrote:
>
> On Sun, 22 Jul 2018 at 01:59, brian m. carlson
> <sandals@crustytoothpaste.net> wrote:
> > I will admit that I don't love making this decision by myself, because
> > right now, whatever I pick, somebody is going to be unhappy.  I want to
> > state, unambiguously, that I'm trying to make a decision that is in the
> > interests of the Git Project, the community, and our users.
> >
> > I'm happy to wait a few more days to see if a consensus develops; if so,
> > I'll follow it.  If we haven't come to one by, say, Wednesday, I'll make
> > a decision and write my patches accordingly.  The community is free, as
> > always, to reject my patches if taking them is not in the interest of
> > the project.
>
> Hi Brian.
>
> I do not envy you this decision.
>
> Personally I would aim towards pushing this decision out to the git
> user base and facilitating things so we can choose whatever hash
> function (and config) we wish, including ones not invented yet.

By Git user base you actually mean millions of people?
(And they'll have different opinions and needs)

One of the goals of the hash transition is to pick a hash function
such that git repositories are compatible.
If users were to pick their own hashes, we would need to not just give
a SHA-1 -> <newhash> transition plan, but we'd have to make sure the
full matrix of possible hashes is interchangeable as we have no idea
of what the user would think of "safer". For example one server operator
might decide to settle on SHA2 and another would settle on blake2,
whereas a user that uses both servers as remotes settles with k12.

Then there would be a whole lot of conversion going on (you cannot talk
natively to a remote with a different hash; checking pgp signatures is
also harder as you have an abstraction layer in between).

I would rather just have the discussion now and then provide only one
conversion tool which might be easy to adapt, but after the majority
converted, it is rather left to bitrot instead of needing to support ongoing
conversions.

On the other hand, even if we'd provide a "different hashes are fine"
solution, I would think the network effect would make sure that eventually
most people end up with one hash.

One example of using different hashes successfully are transports,
like TLS, SSH. The difference there is that it is a point-to-point communication
whereas a git repository needs to be read by many parties involved; also
a communication over TLS/SSH is ephemeral unlike objects in Git.

> Failing that I would aim towards a hashing strategy which has the most
> flexibility. Keccak for instance has the interesting property that its
> security level is tunable, and that it can produce aribitrarily long
> hashes.  Leaving aside other concerns raised elsewhere in this thread,
> these two features alone seem to make it a superior choice for an
> initial implementation. You can find bugs by selecting unusual hash
> sizes, including very long ones, and you can provide ways to tune the
> function to peoples security and speed preferences.  Someone really
> paranoid can specify an unusually large round count and a very long
> hash.

I do not object to this in theory, but I would rather not want to burden
the need to write code for this on the community.

> I am not a cryptographer.

Same here. My personal preference would be blake2b
as that is the fastest IIRC.

Re-reading brians initial mail, I think we should settle on
SHA-256, as that is a conservative choice for security and
the winner in HW accelerated setups, and not to shabby in
a software implementation; it is also widely available.

Stefan

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Hash algorithm analysis
  2018-07-23 12:48             ` Sitaram Chamarty
  2018-07-23 12:55               ` demerphq
@ 2018-07-23 18:23               ` Linus Torvalds
  1 sibling, 0 replies; 66+ messages in thread
From: Linus Torvalds @ 2018-07-23 18:23 UTC (permalink / raw)
  To: Sitaram Chamarty
  Cc: demerphq, brian m. carlson, Johannes Schindelin, Jonathan Nieder,
	Git Mailing List, Adam Langley, keccak

On Mon, Jul 23, 2018 at 5:48 AM Sitaram Chamarty <sitaramc@gmail.com> wrote:
>
> I would suggest (a) hash size of 256 bits and (b) choice of any hash
> function that can produce such a hash.  If people feel strongly that 256
> bits may also turn out to be too small (really?) then a choice of 256 or
> 512, but not arbitrary sizes.

Honestly, what's the expected point of 512-bit hashes?

The _only_ point of a 512-bit hash is that it's going to grow objects
in incompressible ways, and use more memory. Just don't do it.

If somebody can break a 256-bit hash, you have two choices:

 (a) the hash function itself was broken, and 512 bits isn't the
solution to it anyway, even if it can certainly hide the problem

 (b) you had some "new math" kind of unexpected breakthrough, which
means that 512 bits might not be much  better either.

Honestly, the number of particles in the observable universe is on the
order of 2**256. It's a really really big number.

Don't make the code base more complex than it needs to be. Make a
informed technical decision, and say "256 bits is a *lot*".

The difference between engineering and theory is that engineering
makes trade-offs. Good software is well *engineered*, not theorized.

Also, I would suggest that git default to "abbrev-commit=40", so that
nobody actually *sees* the new bits by default. So the perl scripts
etc that use "[0-9a-f]{40}" as a hash pattern would just silently
continue to work.

Because backwards compatibility is important (*)

                     Linus

(*) And 2**160 is still a big big number, and hasn't really been a
practical problem, and SHA1DC is likely a good hash for the next
decade or longer.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Hash algorithm analysis
  2018-07-23 12:40           ` demerphq
  2018-07-23 12:48             ` Sitaram Chamarty
  2018-07-23 17:57             ` Stefan Beller
@ 2018-07-23 18:35             ` Jonathan Nieder
  2 siblings, 0 replies; 66+ messages in thread
From: Jonathan Nieder @ 2018-07-23 18:35 UTC (permalink / raw)
  To: demerphq
  Cc: brian m. carlson, Johannes Schindelin, Git, Linus Torvalds, agl,
	keccak

Hi Yves,

demerphq wrote:
> On Sun, 22 Jul 2018 at 01:59, brian m. carlson
> <sandals@crustytoothpaste.net> wrote:

>> I will admit that I don't love making this decision by myself, because
>> right now, whatever I pick, somebody is going to be unhappy.
[...]
> I do not envy you this decision.
>
> Personally I would aim towards pushing this decision out to the git
> user base and facilitating things so we can choose whatever hash
> function (and config) we wish, including ones not invented yet.

There are two separate pieces to this.

One is configurability at compile time.  So far that has definitely
been a goal, because we want to be ready to start the transition to
another hash, and quickly, as soon as the new hash is discovered to be
weak.  This also means that people can experiment with new hashes and
in a controlled environment (where the users can afford to build from
source), some users might prefer some bespoke hash for reasons only
known to them. ;-)

Another piece is configurability at run time.  This is a harder sell
because it has some negative effects in the ecosystem:

 - performance impact from users having to maintain a translation table
   between the different hash functions in use

 - security impact, in the form of downgrade attacks

 - dependency bloat, from Git having to be able to compute all hash
   functions permitted in that run-time configuration

The security impact can be mitigated by keeping the list of supported
hashes small (i.e. two or three instead of 10ish).  Each additional
hash function is a potential liability (just as in SSL), so they have
to earn their keep.

The performance impact is unavoidable if we encourage Git servers to
pick their favorite hash function instead of making a decision in the
project.  This can in turn affect security, since it would increase
the switching cost away from SHA-1, with the likely effect being that
most users stay on SHA-1.  I don't want to go there.

So I would say, support for arbitrary hash functions at compile time
and in file formats is important and I encourage you to hold us to
that (when reviewing patches, etc).  But in the standard Git build
configuration that most people run, I believe it is best to support
only SHA-1 + our chosen replacement hash.

Thanks,
Jonathan

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Hash algorithm analysis
  2018-07-20 21:52     ` brian m. carlson
                         ` (2 preceding siblings ...)
  2018-07-21 22:38       ` Johannes Schindelin
@ 2018-07-24 19:01       ` Edward Thomson
  2018-07-24 20:31         ` Linus Torvalds
  3 siblings, 1 reply; 66+ messages in thread
From: Edward Thomson @ 2018-07-24 19:01 UTC (permalink / raw)
  To: brian m. carlson, Jonathan Nieder, git, Johannes Schindelin,
	demerphq, Linus Torvalds, Adam Langley, The Keccak Team

On Fri, Jul 20, 2018 at 09:52:20PM +0000, brian m. carlson wrote:
> 
> To summarize the discussion that's been had in addition to the above,
> Ævar has also stated a preference for SHA-256 and I would prefer BLAKE2b
> over SHA-256 over SHA3-256, although any of them would be fine.
> 
> Are there other contributors who have a strong opinion?  Are there
> things I can do to help us coalesce around an option?

Overall, I prefer SHA-256.

I mentioned this at the contributor summit - so this may have been
captured in the notes.  But if not, when I look at this from the
perspective of my day job at Notorious Big Software Company, we would
prefer SHA-256 due to its performance characteristics and the
availability of hardware acceleration.  We think about git object ids
in a few different ways:

Obviously we use git as a version control system - we have a significant
investment in hosting repositories (for both internal Microsoft teams
and our external customers).  What may be less obvious is that often,
git blob ids are used as fingerprints: on a typical Windows machine,
you don't have the command-line hash functions (md5sum and friends),
but every developer has git installed.  So we end up calculating git
object ids in places within the development pipeline that are beyond the
scope of just version control.  Not to dwell too much on implementation
details, but this is especially advantageous for us in (say) labs where
we can ensure that particular hardware is available to speed this up as
necessary.

Switching gears, if I look at this from the perspective of the libgit2
project, I would also prefer SHA-256 or SHA3 over blake2b.  To support
blake2b, we'd have to include - and support - that code ourselves.  But
to support SHA-256, we would simply use the system's crypto libraries
that we already take a dependecy on (OpenSSL, mbedTLS, CryptoNG, or
SecureTransport).  All of those support SHA-256 and none of them include
support for blake2b.  That means if there's a problem with (say)
OpenSSL's SHA-256 implementation, then it will be fixed by their vendor.
If there's a problem with libb2, then that's now my responsibility.

This is not to suggest that one library is of higher or lower quality
than another.  And surely we would try to use the same blake2b library
that git itself is using to minimize some of this risk (so that at least
we're all in the same boat and can leverage each other's communications
to users) but even then, there will be inevitable drift between our
vendored dependencies and the upstream code.  You can see this in action
in xdiff: git's xdiff has deviated from upstream, and libgit2 has taken
git's and ours has deviated from that.

Cheers-
-ed

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Hash algorithm analysis
  2018-07-24 19:01       ` Edward Thomson
@ 2018-07-24 20:31         ` Linus Torvalds
  2018-07-24 20:49           ` Jonathan Nieder
  2018-07-24 21:13           ` Junio C Hamano
  0 siblings, 2 replies; 66+ messages in thread
From: Linus Torvalds @ 2018-07-24 20:31 UTC (permalink / raw)
  To: Edward Thomson
  Cc: brian m. carlson, Jonathan Nieder, Git Mailing List,
	Johannes Schindelin, demerphq, Adam Langley, keccak

On Tue, Jul 24, 2018 at 12:01 PM Edward Thomson
<ethomson@edwardthomson.com> wrote:
>
> Switching gears, if I look at this from the perspective of the libgit2
> project, I would also prefer SHA-256 or SHA3 over blake2b.  To support
> blake2b, we'd have to include - and support - that code ourselves.  But
> to support SHA-256, we would simply use the system's crypto libraries
> that we already take a dependecy on (OpenSSL, mbedTLS, CryptoNG, or
> SecureTransport).

I think this is probably the single strongest argument for sha256.
"It's just there".

The hardware acceleration hasn't become nearly as ubiquitous as I
would have hoped, and honestly, sha256 _needs_ hw acceleration more
than some of the alternatives in the first place.

But sha256 does have the big advantage of just having been around and
existing in pretty much every single crypto library.

So I'm not a huge fan of sha256, partly because of my disappointment
in lack of hw acceleration in releant markets (sure, it's fairly
common in ARM, but nobody sane uses ARM for development because of
_other_ reasons). And partly because I don't like how the internal
data size is the same as the final hash. But that second issue is an
annoyance with it, not a real issue - in the absence of weaknesses
it's a non-issue, and any future weaknesses might affect any other
choice too.

So hey, if people are actually at the point where the lack of choice
holds up development, we should just pick one. And despite what I've
said in this discussion, sha256 would have been my first choice, just
because it's the "obvious" choice. The exact same way that SHA1 was
the obvious choice (for pretty much the same infrastructure reasons)
back in 2005.

And maybe the hw acceleration landscape will actually improve. I think
AMD actually does do the SHA extensions in Zen/TR.

So I think Junio should just pick one. And I'll stand up and say

 "Let's just pick one.

  And sha256 is certainly the safe choice in that it won't strike
  anybody as being the _wrong_ choice per se, even if not everybody will
  necessarily agree it's the _bext_ choice".

but in the end I think Junio should be the final arbiter. I think all
of the discussed choices are perfectly fine in practice.

Btw, the one thing I *would* suggest is that the git community just
also says that the current hash is not SHA1, but SHA1DC.

Support for "plain" SHA1 should be removed entirely. If we add a lot
of new infrastructure to support a new more secure hash, we should not
have the old fallback for the known-weak one. Just make SHA1DC the
only one git can be built with.

               Linus

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Hash algorithm analysis
  2018-07-24 20:31         ` Linus Torvalds
@ 2018-07-24 20:49           ` Jonathan Nieder
  2018-07-24 21:13           ` Junio C Hamano
  1 sibling, 0 replies; 66+ messages in thread
From: Jonathan Nieder @ 2018-07-24 20:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Edward Thomson, brian m. carlson, Git Mailing List,
	Johannes Schindelin, demerphq, Adam Langley, keccak

Hi,

Linus Torvalds wrote:
> On Tue, Jul 24, 2018 at 12:01 PM Edward Thomson
> <ethomson@edwardthomson.com> wrote:

>> Switching gears, if I look at this from the perspective of the libgit2
>> project, I would also prefer SHA-256 or SHA3 over blake2b.  To support
>> blake2b, we'd have to include - and support - that code ourselves.  But
>> to support SHA-256, we would simply use the system's crypto libraries
>> that we already take a dependecy on (OpenSSL, mbedTLS, CryptoNG, or
>> SecureTransport).

Just to be clear, OpenSSL has built-in blake2b support.

[...]
> So I'm not a huge fan of sha256, partly because of my disappointment
> in lack of hw acceleration in releant markets (sure, it's fairly
> common in ARM, but nobody sane uses ARM for development because of
> _other_ reasons). And partly because I don't like how the internal
> data size is the same as the final hash. But that second issue is an
> annoyance with it, not a real issue - in the absence of weaknesses
> it's a non-issue, and any future weaknesses might affect any other
> choice too.

Thanks for saying this.  With this in mind, I think we have a clear
way forward: we should use SHA-256.

My main complaint about it is that it is not a tree hash, but the
common availability in libraries trumps that (versus SHA-256x16, say).
I also was excited about K12, both because I like a world where Keccak
gets wide hardware accelaration (improving PRNGs and other
applications) and because of Keccak team's helpfulness throughout the
process of helping us evaluate this, and it's possible that some day
in the future we may want to switch to something like it.  But today,
as mentioned in [1] and [2], there is value in settling on one
standard and SHA2-256 is the obvious standard today.

Thanks,
Jonathan

[1] https://public-inbox.org/git/CAL9PXLyNVLCCqV1ftRa3r4kuoamDZOF29HJEhv2JXrbHj1nirA@mail.gmail.com/
[2] https://public-inbox.org/git/20180723183523.GB9285@aiede.svl.corp.google.com/

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Hash algorithm analysis
  2018-07-24 20:31         ` Linus Torvalds
  2018-07-24 20:49           ` Jonathan Nieder
@ 2018-07-24 21:13           ` Junio C Hamano
  2018-07-24 22:10             ` brian m. carlson
                               ` (3 more replies)
  1 sibling, 4 replies; 66+ messages in thread
From: Junio C Hamano @ 2018-07-24 21:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Edward Thomson, brian m. carlson, Jonathan Nieder,
	Git Mailing List, Johannes Schindelin, demerphq, Adam Langley,
	keccak

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Tue, Jul 24, 2018 at 12:01 PM Edward Thomson
> <ethomson@edwardthomson.com> wrote:
>>
>> Switching gears, if I look at this from the perspective of the libgit2
>> project, I would also prefer SHA-256 or SHA3 over blake2b.  To support
>> blake2b, we'd have to include - and support - that code ourselves.  But
>> to support SHA-256, we would simply use the system's crypto libraries
>> that we already take a dependecy on (OpenSSL, mbedTLS, CryptoNG, or
>> SecureTransport).
>
> I think this is probably the single strongest argument for sha256.
> "It's just there".

Yup.  I actually was leaning toward saying "all of them are OK in
practice, so the person who is actually spear-heading the work gets
to choose", but if we picked SHA-256 now, that would not be a choice
that Brian has to later justify for choosing against everybody
else's wishes, which makes it the best choice ;-)


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Hash algorithm analysis
  2018-07-24 21:13           ` Junio C Hamano
@ 2018-07-24 22:10             ` brian m. carlson
  2018-07-30  9:06               ` Johannes Schindelin
  2018-07-25  8:30             ` [PATCH 0/2] document that NewHash is now SHA-256 Ævar Arnfjörð Bjarmason
                               ` (2 subsequent siblings)
  3 siblings, 1 reply; 66+ messages in thread
From: brian m. carlson @ 2018-07-24 22:10 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Linus Torvalds, Edward Thomson, Jonathan Nieder, Git Mailing List,
	Johannes Schindelin, demerphq, Adam Langley, keccak

[-- Attachment #1: Type: text/plain, Size: 1138 bytes --]

On Tue, Jul 24, 2018 at 02:13:07PM -0700, Junio C Hamano wrote:
> Yup.  I actually was leaning toward saying "all of them are OK in
> practice, so the person who is actually spear-heading the work gets
> to choose", but if we picked SHA-256 now, that would not be a choice
> that Brian has to later justify for choosing against everybody
> else's wishes, which makes it the best choice ;-)

This looks like a rough consensus.  And fortunately, I was going to pick
SHA-256 and implemented it over the weekend.

Things I thought about in this regard:
* When you compare against SHA1DC, most vectorized SHA-256
  implementations are indeed faster, even without acceleration.
* If we're doing signatures with OpenPGP (or even, I suppose, CMS),
  we're going to be using SHA-2, so it doesn't make sense to have our
  security depend on two separate algorithms when either one of them
  alone could break the security when we could just depend on one.

I'll be sending out some patches, probably in a few days, with SHA-256
and some test fixes.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 867 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 0/2] document that NewHash is now SHA-256
  2018-07-24 21:13           ` Junio C Hamano
  2018-07-24 22:10             ` brian m. carlson
@ 2018-07-25  8:30             ` Ævar Arnfjörð Bjarmason
  2018-07-25  8:30             ` [PATCH 1/2] doc hash-function-transition: note the lack of a changelog Ævar Arnfjörð Bjarmason
  2018-07-25  8:30             ` [PATCH 2/2] doc hash-function-transition: pick SHA-256 as NewHash Ævar Arnfjörð Bjarmason
  3 siblings, 0 replies; 66+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-07-25  8:30 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Linus Torvalds, Edward Thomson, brian m . carlson,
	Jonathan Nieder, Johannes Schindelin, demerphq, Adam Langley,
	keccak, Ævar Arnfjörð Bjarmason

On Tue, Jul 24 2018, Junio C Hamano wrote:

> Linus Torvalds <torvalds@linux-foundation.org> writes:
>
>> On Tue, Jul 24, 2018 at 12:01 PM Edward Thomson
>> <ethomson@edwardthomson.com> wrote:
>>>
>>> Switching gears, if I look at this from the perspective of the libgit2
>>> project, I would also prefer SHA-256 or SHA3 over blake2b.  To support
>>> blake2b, we'd have to include - and support - that code ourselves.  But
>>> to support SHA-256, we would simply use the system's crypto libraries
>>> that we already take a dependecy on (OpenSSL, mbedTLS, CryptoNG, or
>>> SecureTransport).
>>
>> I think this is probably the single strongest argument for sha256.
>> "It's just there".
>
> Yup.  I actually was leaning toward saying "all of them are OK in
> practice, so the person who is actually spear-heading the work gets
> to choose", but if we picked SHA-256 now, that would not be a choice
> that Brian has to later justify for choosing against everybody
> else's wishes, which makes it the best choice ;-)

Looks like it's settled then. I thought I'd do the grunt work of
updating the relevant documentation so we can officially move on from
the years-long NewHash discussion.

Ævar Arnfjörð Bjarmason (2):
  doc hash-function-transition: note the lack of a changelog
  doc hash-function-transition: pick SHA-256 as NewHash

 .../technical/hash-function-transition.txt    | 192 ++++++++++--------
 1 file changed, 102 insertions(+), 90 deletions(-)

-- 
2.17.0.290.gded63e768a


^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 1/2] doc hash-function-transition: note the lack of a changelog
  2018-07-24 21:13           ` Junio C Hamano
  2018-07-24 22:10             ` brian m. carlson
  2018-07-25  8:30             ` [PATCH 0/2] document that NewHash is now SHA-256 Ævar Arnfjörð Bjarmason
@ 2018-07-25  8:30             ` Ævar Arnfjörð Bjarmason
  2018-07-25  8:30             ` [PATCH 2/2] doc hash-function-transition: pick SHA-256 as NewHash Ævar Arnfjörð Bjarmason
  3 siblings, 0 replies; 66+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-07-25  8:30 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Linus Torvalds, Edward Thomson, brian m . carlson,
	Jonathan Nieder, Johannes Schindelin, demerphq, Adam Langley,
	keccak, Ævar Arnfjörð Bjarmason

The changelog embedded in the document pre-dates the addition of the
document to git.git (it used to be a Google Doc), so it only goes up
to 752414ae43 ("technical doc: add a design doc for hash function
transition", 2017-09-27).

Since then I made some small edits to it, which would have been worthy
of including in this changelog (but weren't). Instead of amending it
to include these, just note that future changes will be noted in the
log.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 Documentation/technical/hash-function-transition.txt | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/Documentation/technical/hash-function-transition.txt b/Documentation/technical/hash-function-transition.txt
index 4ab6cd1012..5ee4754adb 100644
--- a/Documentation/technical/hash-function-transition.txt
+++ b/Documentation/technical/hash-function-transition.txt
@@ -814,6 +814,12 @@ Incorporated suggestions from jonathantanmy and sbeller:
 * avoid loose object overhead by packing more aggressively in
   "git gc --auto"
 
+Later history:
+
+ See the history of this file in git.git for the history of subsequent
+ edits. This document history is no longer being maintained as it
+ would now be superfluous to the commit log
+
 [1] http://public-inbox.org/git/CA+55aFzJtejiCjV0e43+9oR3QuJK2PiFiLQemytoLpyJWe6P9w@mail.gmail.com/
 [2] http://public-inbox.org/git/CA+55aFz+gkAsDZ24zmePQuEs1XPS9BP_s8O7Q4wQ7LV7X5-oDA@mail.gmail.com/
 [3] http://public-inbox.org/git/20170306084353.nrns455dvkdsfgo5@sigill.intra.peff.net/
-- 
2.17.0.290.gded63e768a


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 2/2] doc hash-function-transition: pick SHA-256 as NewHash
  2018-07-24 21:13           ` Junio C Hamano
                               ` (2 preceding siblings ...)
  2018-07-25  8:30             ` [PATCH 1/2] doc hash-function-transition: note the lack of a changelog Ævar Arnfjörð Bjarmason
@ 2018-07-25  8:30             ` Ævar Arnfjörð Bjarmason
  2018-07-25 16:45               ` Junio C Hamano
  3 siblings, 1 reply; 66+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-07-25  8:30 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Linus Torvalds, Edward Thomson, brian m . carlson,
	Jonathan Nieder, Johannes Schindelin, demerphq, Adam Langley,
	keccak, Ævar Arnfjörð Bjarmason

The consensus on the mailing list seems to be that SHA-256 should be
picked as our NewHash, see the "Hash algorithm analysis" thread as of
[1]. Linus has come around to this choice and suggested Junio make the
final pick, and he's endorsed SHA-256 [3].

This follow-up change changes occurrences of "NewHash" to
"SHA-256" (or "sha256", depending on the context). The "Selection of a
New Hash" section has also been changed to note that historically we
used the the "NewHash" name while we didn't know what the new hash
function would be.

This leaves no use of "NewHash" anywhere in git.git except in the
aforementioned section (and as a variable name in t/t9700/test.pl, but
that use from 2008 has nothing to do with this transition plan).

1. https://public-inbox.org/git/20180720215220.GB18502@genre.crustytoothpaste.net/
2. https://public-inbox.org/git/CA+55aFwSe9BF8e0hLk9pp3FVD5LaVY5GRdsV3fbNtgzekJadyA@mail.gmail.com/
3. https://public-inbox.org/git/xmqqzhygwd5o.fsf@gitster-ct.c.googlers.com/

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 .../technical/hash-function-transition.txt    | 186 +++++++++---------
 1 file changed, 96 insertions(+), 90 deletions(-)

diff --git a/Documentation/technical/hash-function-transition.txt b/Documentation/technical/hash-function-transition.txt
index 5ee4754adb..8fb2d4b498 100644
--- a/Documentation/technical/hash-function-transition.txt
+++ b/Documentation/technical/hash-function-transition.txt
@@ -59,14 +59,11 @@ that are believed to be cryptographically secure.
 
 Goals
 -----
-Where NewHash is a strong 256-bit hash function to replace SHA-1 (see
-"Selection of a New Hash", below):
-
-1. The transition to NewHash can be done one local repository at a time.
+1. The transition to SHA-256 can be done one local repository at a time.
    a. Requiring no action by any other party.
-   b. A NewHash repository can communicate with SHA-1 Git servers
+   b. A SHA-256 repository can communicate with SHA-1 Git servers
       (push/fetch).
-   c. Users can use SHA-1 and NewHash identifiers for objects
+   c. Users can use SHA-1 and SHA-256 identifiers for objects
       interchangeably (see "Object names on the command line", below).
    d. New signed objects make use of a stronger hash function than
       SHA-1 for their security guarantees.
@@ -79,7 +76,7 @@ Where NewHash is a strong 256-bit hash function to replace SHA-1 (see
 
 Non-Goals
 ---------
-1. Add NewHash support to Git protocol. This is valuable and the
+1. Add SHA-256 support to Git protocol. This is valuable and the
    logical next step but it is out of scope for this initial design.
 2. Transparently improving the security of existing SHA-1 signed
    objects.
@@ -87,26 +84,26 @@ Non-Goals
    repository.
 4. Taking the opportunity to fix other bugs in Git's formats and
    protocols.
-5. Shallow clones and fetches into a NewHash repository. (This will
-   change when we add NewHash support to Git protocol.)
-6. Skip fetching some submodules of a project into a NewHash
-   repository. (This also depends on NewHash support in Git
+5. Shallow clones and fetches into a SHA-256 repository. (This will
+   change when we add SHA-256 support to Git protocol.)
+6. Skip fetching some submodules of a project into a SHA-256
+   repository. (This also depends on SHA-256 support in Git
    protocol.)
 
 Overview
 --------
 We introduce a new repository format extension. Repositories with this
-extension enabled use NewHash instead of SHA-1 to name their objects.
+extension enabled use SHA-256 instead of SHA-1 to name their objects.
 This affects both object names and object content --- both the names
 of objects and all references to other objects within an object are
 switched to the new hash function.
 
-NewHash repositories cannot be read by older versions of Git.
+SHA-256 repositories cannot be read by older versions of Git.
 
-Alongside the packfile, a NewHash repository stores a bidirectional
-mapping between NewHash and SHA-1 object names. The mapping is generated
+Alongside the packfile, a SHA-256 repository stores a bidirectional
+mapping between SHA-256 and SHA-1 object names. The mapping is generated
 locally and can be verified using "git fsck". Object lookups use this
-mapping to allow naming objects using either their SHA-1 and NewHash names
+mapping to allow naming objects using either their SHA-1 and SHA-256 names
 interchangeably.
 
 "git cat-file" and "git hash-object" gain options to display an object
@@ -116,7 +113,7 @@ object database so that they can be named using the appropriate name
 (using the bidirectional hash mapping).
 
 Fetches from a SHA-1 based server convert the fetched objects into
-NewHash form and record the mapping in the bidirectional mapping table
+SHA-256 form and record the mapping in the bidirectional mapping table
 (see below for details). Pushes to a SHA-1 based server convert the
 objects being pushed into sha1 form so the server does not have to be
 aware of the hash function the client is using.
@@ -125,19 +122,19 @@ Detailed Design
 ---------------
 Repository format extension
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
-A NewHash repository uses repository format version `1` (see
+A SHA-256 repository uses repository format version `1` (see
 Documentation/technical/repository-version.txt) with extensions
 `objectFormat` and `compatObjectFormat`:
 
 	[core]
 		repositoryFormatVersion = 1
 	[extensions]
-		objectFormat = newhash
+		objectFormat = sha256
 		compatObjectFormat = sha1
 
 The combination of setting `core.repositoryFormatVersion=1` and
 populating `extensions.*` ensures that all versions of Git later than
-`v0.99.9l` will die instead of trying to operate on the NewHash
+`v0.99.9l` will die instead of trying to operate on the SHA-256
 repository, instead producing an error message.
 
 	# Between v0.99.9l and v2.7.0
@@ -155,36 +152,36 @@ repository extensions.
 Object names
 ~~~~~~~~~~~~
 Objects can be named by their 40 hexadecimal digit sha1-name or 64
-hexadecimal digit newhash-name, plus names derived from those (see
+hexadecimal digit sha256-name, plus names derived from those (see
 gitrevisions(7)).
 
 The sha1-name of an object is the SHA-1 of the concatenation of its
 type, length, a nul byte, and the object's sha1-content. This is the
 traditional <sha1> used in Git to name objects.
 
-The newhash-name of an object is the NewHash of the concatenation of its
-type, length, a nul byte, and the object's newhash-content.
+The sha256-name of an object is the SHA-256 of the concatenation of its
+type, length, a nul byte, and the object's sha256-content.
 
 Object format
 ~~~~~~~~~~~~~
 The content as a byte sequence of a tag, commit, or tree object named
-by sha1 and newhash differ because an object named by newhash-name refers to
-other objects by their newhash-names and an object named by sha1-name
+by sha1 and sha256 differ because an object named by sha256-name refers to
+other objects by their sha256-names and an object named by sha1-name
 refers to other objects by their sha1-names.
 
-The newhash-content of an object is the same as its sha1-content, except
-that objects referenced by the object are named using their newhash-names
+The sha256-content of an object is the same as its sha1-content, except
+that objects referenced by the object are named using their sha256-names
 instead of sha1-names. Because a blob object does not refer to any
-other object, its sha1-content and newhash-content are the same.
+other object, its sha1-content and sha256-content are the same.
 
-The format allows round-trip conversion between newhash-content and
+The format allows round-trip conversion between sha256-content and
 sha1-content.
 
 Object storage
 ~~~~~~~~~~~~~~
 Loose objects use zlib compression and packed objects use the packed
 format described in Documentation/technical/pack-format.txt, just like
-today. The content that is compressed and stored uses newhash-content
+today. The content that is compressed and stored uses sha256-content
 instead of sha1-content.
 
 Pack index
@@ -255,10 +252,10 @@ network byte order):
   up to and not including the table of CRC32 values.
 - Zero or more NUL bytes.
 - The trailer consists of the following:
-  - A copy of the 20-byte NewHash checksum at the end of the
+  - A copy of the 20-byte SHA-256 checksum at the end of the
     corresponding packfile.
 
-  - 20-byte NewHash checksum of all of the above.
+  - 20-byte SHA-256 checksum of all of the above.
 
 Loose object index
 ~~~~~~~~~~~~~~~~~~
@@ -266,7 +263,7 @@ A new file $GIT_OBJECT_DIR/loose-object-idx contains information about
 all loose objects. Its format is
 
   # loose-object-idx
-  (newhash-name SP sha1-name LF)*
+  (sha256-name SP sha1-name LF)*
 
 where the object names are in hexadecimal format. The file is not
 sorted.
@@ -292,8 +289,8 @@ To remove entries (e.g. in "git pack-refs" or "git-prune"):
 Translation table
 ~~~~~~~~~~~~~~~~~
 The index files support a bidirectional mapping between sha1-names
-and newhash-names. The lookup proceeds similarly to ordinary object
-lookups. For example, to convert a sha1-name to a newhash-name:
+and sha256-names. The lookup proceeds similarly to ordinary object
+lookups. For example, to convert a sha1-name to a sha256-name:
 
  1. Look for the object in idx files. If a match is present in the
     idx's sorted list of truncated sha1-names, then:
@@ -301,8 +298,8 @@ lookups. For example, to convert a sha1-name to a newhash-name:
        name order mapping.
     b. Read the corresponding entry in the full sha1-name table to
        verify we found the right object. If it is, then
-    c. Read the corresponding entry in the full newhash-name table.
-       That is the object's newhash-name.
+    c. Read the corresponding entry in the full sha256-name table.
+       That is the object's sha256-name.
  2. Check for a loose object. Read lines from loose-object-idx until
     we find a match.
 
@@ -318,25 +315,25 @@ for all objects in the object store.
 
 Reading an object's sha1-content
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-The sha1-content of an object can be read by converting all newhash-names
-its newhash-content references to sha1-names using the translation table.
+The sha1-content of an object can be read by converting all sha256-names
+its sha256-content references to sha1-names using the translation table.
 
 Fetch
 ~~~~~
 Fetching from a SHA-1 based server requires translating between SHA-1
-and NewHash based representations on the fly.
+and SHA-256 based representations on the fly.
 
 SHA-1s named in the ref advertisement that are present on the client
-can be translated to NewHash and looked up as local objects using the
+can be translated to SHA-256 and looked up as local objects using the
 translation table.
 
 Negotiation proceeds as today. Any "have"s generated locally are
 converted to SHA-1 before being sent to the server, and SHA-1s
-mentioned by the server are converted to NewHash when looking them up
+mentioned by the server are converted to SHA-256 when looking them up
 locally.
 
 After negotiation, the server sends a packfile containing the
-requested objects. We convert the packfile to NewHash format using
+requested objects. We convert the packfile to SHA-256 format using
 the following steps:
 
 1. index-pack: inflate each object in the packfile and compute its
@@ -351,12 +348,12 @@ the following steps:
    (This list only contains objects reachable from the "wants". If the
    pack from the server contained additional extraneous objects, then
    they will be discarded.)
-3. convert to newhash: open a new (newhash) packfile. Read the topologically
+3. convert to sha256: open a new (sha256) packfile. Read the topologically
    sorted list just generated. For each object, inflate its
-   sha1-content, convert to newhash-content, and write it to the newhash
-   pack. Record the new sha1<->newhash mapping entry for use in the idx.
+   sha1-content, convert to sha256-content, and write it to the sha256
+   pack. Record the new sha1<->sha256 mapping entry for use in the idx.
 4. sort: reorder entries in the new pack to match the order of objects
-   in the pack the server generated and include blobs. Write a newhash idx
+   in the pack the server generated and include blobs. Write a sha256 idx
    file
 5. clean up: remove the SHA-1 based pack file, index, and
    topologically sorted list obtained from the server in steps 1
@@ -388,16 +385,16 @@ send-pack.
 
 Signed Commits
 ~~~~~~~~~~~~~~
-We add a new field "gpgsig-newhash" to the commit object format to allow
+We add a new field "gpgsig-sha256" to the commit object format to allow
 signing commits without relying on SHA-1. It is similar to the
-existing "gpgsig" field. Its signed payload is the newhash-content of the
-commit object with any "gpgsig" and "gpgsig-newhash" fields removed.
+existing "gpgsig" field. Its signed payload is the sha256-content of the
+commit object with any "gpgsig" and "gpgsig-sha256" fields removed.
 
 This means commits can be signed
 1. using SHA-1 only, as in existing signed commit objects
-2. using both SHA-1 and NewHash, by using both gpgsig-newhash and gpgsig
+2. using both SHA-1 and SHA-256, by using both gpgsig-sha256 and gpgsig
    fields.
-3. using only NewHash, by only using the gpgsig-newhash field.
+3. using only SHA-256, by only using the gpgsig-sha256 field.
 
 Old versions of "git verify-commit" can verify the gpgsig signature in
 cases (1) and (2) without modifications and view case (3) as an
@@ -405,24 +402,24 @@ ordinary unsigned commit.
 
 Signed Tags
 ~~~~~~~~~~~
-We add a new field "gpgsig-newhash" to the tag object format to allow
+We add a new field "gpgsig-sha256" to the tag object format to allow
 signing tags without relying on SHA-1. Its signed payload is the
-newhash-content of the tag with its gpgsig-newhash field and "-----BEGIN PGP
+sha256-content of the tag with its gpgsig-sha256 field and "-----BEGIN PGP
 SIGNATURE-----" delimited in-body signature removed.
 
 This means tags can be signed
 1. using SHA-1 only, as in existing signed tag objects
-2. using both SHA-1 and NewHash, by using gpgsig-newhash and an in-body
+2. using both SHA-1 and SHA-256, by using gpgsig-sha256 and an in-body
    signature.
-3. using only NewHash, by only using the gpgsig-newhash field.
+3. using only SHA-256, by only using the gpgsig-sha256 field.
 
 Mergetag embedding
 ~~~~~~~~~~~~~~~~~~
 The mergetag field in the sha1-content of a commit contains the
 sha1-content of a tag that was merged by that commit.
 
-The mergetag field in the newhash-content of the same commit contains the
-newhash-content of the same tag.
+The mergetag field in the sha256-content of the same commit contains the
+sha256-content of the same tag.
 
 Submodules
 ~~~~~~~~~~
@@ -497,7 +494,7 @@ Caveats
 -------
 Invalid objects
 ~~~~~~~~~~~~~~~
-The conversion from sha1-content to newhash-content retains any
+The conversion from sha1-content to sha256-content retains any
 brokenness in the original object (e.g., tree entry modes encoded with
 leading 0, tree objects whose paths are not sorted correctly, and
 commit objects without an author or committer). This is a deliberate
@@ -516,7 +513,7 @@ allow lifting this restriction.
 
 Alternates
 ~~~~~~~~~~
-For the same reason, a newhash repository cannot borrow objects from a
+For the same reason, a sha256 repository cannot borrow objects from a
 sha1 repository using objects/info/alternates or
 $GIT_ALTERNATE_OBJECT_REPOSITORIES.
 
@@ -524,20 +521,20 @@ git notes
 ~~~~~~~~~
 The "git notes" tool annotates objects using their sha1-name as key.
 This design does not describe a way to migrate notes trees to use
-newhash-names. That migration is expected to happen separately (for
+sha256-names. That migration is expected to happen separately (for
 example using a file at the root of the notes tree to describe which
 hash it uses).
 
 Server-side cost
 ~~~~~~~~~~~~~~~~
-Until Git protocol gains NewHash support, using NewHash based storage
+Until Git protocol gains SHA-256 support, using SHA-256 based storage
 on public-facing Git servers is strongly discouraged. Once Git
-protocol gains NewHash support, NewHash based servers are likely not
+protocol gains SHA-256 support, SHA-256 based servers are likely not
 to support SHA-1 compatibility, to avoid what may be a very expensive
 hash reencode during clone and to encourage peers to modernize.
 
 The design described here allows fetches by SHA-1 clients of a
-personal NewHash repository because it's not much more difficult than
+personal SHA-256 repository because it's not much more difficult than
 allowing pushes from that repository. This support needs to be guarded
 by a configuration option --- servers like git.kernel.org that serve a
 large number of clients would not be expected to bear that cost.
@@ -547,7 +544,7 @@ Meaning of signatures
 The signed payload for signed commits and tags does not explicitly
 name the hash used to identify objects. If some day Git adopts a new
 hash function with the same length as the current SHA-1 (40
-hexadecimal digit) or NewHash (64 hexadecimal digit) objects then the
+hexadecimal digit) or SHA-256 (64 hexadecimal digit) objects then the
 intent behind the PGP signed payload in an object signature is
 unclear:
 
@@ -562,7 +559,7 @@ Does this mean Git v2.12.0 is the commit with sha1-name
 e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7 or the commit with
 new-40-digit-hash-name e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7?
 
-Fortunately NewHash and SHA-1 have different lengths. If Git starts
+Fortunately SHA-256 and SHA-1 have different lengths. If Git starts
 using another hash with the same length to name objects, then it will
 need to change the format of signed payloads using that hash to
 address this issue.
@@ -574,24 +571,24 @@ supports four different modes of operation:
 
  1. ("dark launch") Treat object names input by the user as SHA-1 and
     convert any object names written to output to SHA-1, but store
-    objects using NewHash.  This allows users to test the code with no
+    objects using SHA-256.  This allows users to test the code with no
     visible behavior change except for performance.  This allows
     allows running even tests that assume the SHA-1 hash function, to
     sanity-check the behavior of the new mode.
 
- 2. ("early transition") Allow both SHA-1 and NewHash object names in
+ 2. ("early transition") Allow both SHA-1 and SHA-256 object names in
     input. Any object names written to output use SHA-1. This allows
     users to continue to make use of SHA-1 to communicate with peers
     (e.g. by email) that have not migrated yet and prepares for mode 3.
 
- 3. ("late transition") Allow both SHA-1 and NewHash object names in
-    input. Any object names written to output use NewHash. In this
+ 3. ("late transition") Allow both SHA-1 and SHA-256 object names in
+    input. Any object names written to output use SHA-256. In this
     mode, users are using a more secure object naming method by
     default.  The disruption is minimal as long as most of their peers
     are in mode 2 or mode 3.
 
  4. ("post-transition") Treat object names input by the user as
-    NewHash and write output using NewHash. This is safer than mode 3
+    SHA-256 and write output using SHA-256. This is safer than mode 3
     because there is less risk that input is incorrectly interpreted
     using the wrong hash function.
 
@@ -601,7 +598,7 @@ The user can also explicitly specify which format to use for a
 particular revision specifier and for output, overriding the mode. For
 example:
 
-git --output-format=sha1 log abac87a^{sha1}..f787cac^{newhash}
+git --output-format=sha1 log abac87a^{sha1}..f787cac^{sha256}
 
 Selection of a New Hash
 -----------------------
@@ -611,6 +608,10 @@ collisions in 2^69 operations. In August they published details.
 Luckily, no practical demonstrations of a collision in full SHA-1 were
 published until 10 years later, in 2017.
 
+It was decided that Git needed to transition to a new hash
+function. Initially no decision was made as to what function this was,
+the "NewHash" placeholder name was picked to describe it.
+
 The hash function NewHash to replace SHA-1 should be stronger than
 SHA-1 was: we would like it to be trustworthy and useful in practice
 for at least 10 years.
@@ -630,14 +631,19 @@ Some other relevant properties:
 4. As a tiebreaker, the hash should be fast to compute (fortunately
    many contenders are faster than SHA-1).
 
-Some hashes under consideration are SHA-256, SHA-512/256, SHA-256x16,
+Some hashes under consideration were SHA-256, SHA-512/256, SHA-256x16,
 K12, and BLAKE2bp-256.
 
+Eventually in July 2018 SHA-256 was chosen to be the NewHash. See the
+thread starting at <20180609224913.GC38834@genre.crustytoothpaste.net>
+for the discussion
+(https://public-inbox.org/git/20180609224913.GC38834@genre.crustytoothpaste.net/)
+
 Transition plan
 ---------------
 Some initial steps can be implemented independently of one another:
 - adding a hash function API (vtable)
-- teaching fsck to tolerate the gpgsig-newhash field
+- teaching fsck to tolerate the gpgsig-sha256 field
 - excluding gpgsig-* from the fields copied by "git commit --amend"
 - annotating tests that depend on SHA-1 values with a SHA1 test
   prerequisite
@@ -664,7 +670,7 @@ Next comes introduction of compatObjectFormat:
 - adding appropriate index entries when adding a new object to the
   object store
 - --output-format option
-- ^{sha1} and ^{newhash} revision notation
+- ^{sha1} and ^{sha256} revision notation
 - configuration to specify default input and output format (see
   "Object names on the command line" above)
 
@@ -672,7 +678,7 @@ The next step is supporting fetches and pushes to SHA-1 repositories:
 - allow pushes to a repository using the compat format
 - generate a topologically sorted list of the SHA-1 names of fetched
   objects
-- convert the fetched packfile to newhash format and generate an idx
+- convert the fetched packfile to sha256 format and generate an idx
   file
 - re-sort to match the order of objects in the fetched packfile
 
@@ -680,30 +686,30 @@ The infrastructure supporting fetch also allows converting an existing
 repository. In converted repositories and new clones, end users can
 gain support for the new hash function without any visible change in
 behavior (see "dark launch" in the "Object names on the command line"
-section). In particular this allows users to verify NewHash signatures
+section). In particular this allows users to verify SHA-256 signatures
 on objects in the repository, and it should ensure the transition code
 is stable in production in preparation for using it more widely.
 
 Over time projects would encourage their users to adopt the "early
 transition" and then "late transition" modes to take advantage of the
-new, more futureproof NewHash object names.
+new, more futureproof SHA-256 object names.
 
 When objectFormat and compatObjectFormat are both set, commands
-generating signatures would generate both SHA-1 and NewHash signatures
+generating signatures would generate both SHA-1 and SHA-256 signatures
 by default to support both new and old users.
 
-In projects using NewHash heavily, users could be encouraged to adopt
+In projects using SHA-256 heavily, users could be encouraged to adopt
 the "post-transition" mode to avoid accidentally making implicit use
 of SHA-1 object names.
 
 Once a critical mass of users have upgraded to a version of Git that
-can verify NewHash signatures and have converted their existing
+can verify SHA-256 signatures and have converted their existing
 repositories to support verifying them, we can add support for a
-setting to generate only NewHash signatures. This is expected to be at
+setting to generate only SHA-256 signatures. This is expected to be at
 least a year later.
 
 That is also a good moment to advertise the ability to convert
-repositories to use NewHash only, stripping out all SHA-1 related
+repositories to use SHA-256 only, stripping out all SHA-1 related
 metadata. This improves performance by eliminating translation
 overhead and security by avoiding the possibility of accidentally
 relying on the safety of SHA-1.
@@ -742,16 +748,16 @@ using the old hash function.
 
 Signed objects with multiple hashes
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Instead of introducing the gpgsig-newhash field in commit and tag objects
-for newhash-content based signatures, an earlier version of this design
-added "hash newhash <newhash-name>" fields to strengthen the existing
+Instead of introducing the gpgsig-sha256 field in commit and tag objects
+for sha256-content based signatures, an earlier version of this design
+added "hash sha256 <sha256-name>" fields to strengthen the existing
 sha1-content based signatures.
 
 In other words, a single signature was used to attest to the object
 content using both hash functions. This had some advantages:
 * Using one signature instead of two speeds up the signing process.
 * Having one signed payload with both hashes allows the signer to
-  attest to the sha1-name and newhash-name referring to the same object.
+  attest to the sha1-name and sha256-name referring to the same object.
 * All users consume the same signature. Broken signatures are likely
   to be detected quickly using current versions of git.
 
@@ -760,11 +766,11 @@ However, it also came with disadvantages:
   objects it references, even after the transition is complete and
   translation table is no longer needed for anything else. To support
   this, the design added fields such as "hash sha1 tree <sha1-name>"
-  and "hash sha1 parent <sha1-name>" to the newhash-content of a signed
+  and "hash sha1 parent <sha1-name>" to the sha256-content of a signed
   commit, complicating the conversion process.
 * Allowing signed objects without a sha1 (for after the transition is
   complete) complicated the design further, requiring a "nohash sha1"
-  field to suppress including "hash sha1" fields in the newhash-content
+  field to suppress including "hash sha1" fields in the sha256-content
   and signed payload.
 
 Lazily populated translation table
@@ -772,7 +778,7 @@ Lazily populated translation table
 Some of the work of building the translation table could be deferred to
 push time, but that would significantly complicate and slow down pushes.
 Calculating the sha1-name at object creation time at the same time it is
-being streamed to disk and having its newhash-name calculated should be
+being streamed to disk and having its sha256-name calculated should be
 an acceptable cost.
 
 Document History
-- 
2.17.0.290.gded63e768a


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [PATCH 2/2] doc hash-function-transition: pick SHA-256 as NewHash
  2018-07-25  8:30             ` [PATCH 2/2] doc hash-function-transition: pick SHA-256 as NewHash Ævar Arnfjörð Bjarmason
@ 2018-07-25 16:45               ` Junio C Hamano
  2018-07-25 17:25                 ` Jonathan Nieder
  2018-07-25 22:56                 ` [PATCH " brian m. carlson
  0 siblings, 2 replies; 66+ messages in thread
From: Junio C Hamano @ 2018-07-25 16:45 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Linus Torvalds, Edward Thomson, brian m . carlson,
	Jonathan Nieder, Johannes Schindelin, demerphq, Adam Langley,
	keccak

Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:

> @@ -125,19 +122,19 @@ Detailed Design
>  ---------------
>  Repository format extension
>  ~~~~~~~~~~~~~~~~~~~~~~~~~~~
> -A NewHash repository uses repository format version `1` (see
> +A SHA-256 repository uses repository format version `1` (see
>  Documentation/technical/repository-version.txt) with extensions
>  `objectFormat` and `compatObjectFormat`:
>  
>  	[core]
>  		repositoryFormatVersion = 1
>  	[extensions]
> -		objectFormat = newhash
> +		objectFormat = sha256
>  		compatObjectFormat = sha1

Whenever we said SHA1, somebody came and told us that the name of
the hash is SHA-1 (with dash).  Would we be nitpicker-prone in the
same way with "sha256" here?

> @@ -155,36 +152,36 @@ repository extensions.
>  Object names
>  ~~~~~~~~~~~~
>  Objects can be named by their 40 hexadecimal digit sha1-name or 64
> -hexadecimal digit newhash-name, plus names derived from those (see
> +hexadecimal digit sha256-name, plus names derived from those (see
>  gitrevisions(7)).

Seeing this hunk makes me respond to the above question with another
question: "having to write sha-256-name, sha-1-name, gpgsig-sha-256,
and sha-256-content is sort of ugly, no?"

I guess names with two dashes are not _too_ bad, so I dunno.

>  Selection of a New Hash
>  -----------------------
> @@ -611,6 +608,10 @@ collisions in 2^69 operations. In August they published details.
>  Luckily, no practical demonstrations of a collision in full SHA-1 were
>  published until 10 years later, in 2017.
>  
> +It was decided that Git needed to transition to a new hash
> +function. Initially no decision was made as to what function this was,
> +the "NewHash" placeholder name was picked to describe it.
> +
>  The hash function NewHash to replace SHA-1 should be stronger than
>  SHA-1 was: we would like it to be trustworthy and useful in practice
>  for at least 10 years.

This sentence needs a bit of updating to match the new paragraph
inserted above.  "should be stronger" is something said by those
who are still looking for one and/or trying to decide.  Perhaps
something like this?

	...
	the "NewHash" placeholder name was used to describe it.

	We wanted to choose a hash function to replace SHA-1 that is
	stronger than SHA-1 was, and would like it to be trustworthy
	and useful in practice for at least 10 years.

	Some other relevant properties we wanted in NewHash are:

> @@ -630,14 +631,19 @@ Some other relevant properties:
>  4. As a tiebreaker, the hash should be fast to compute (fortunately
>     many contenders are faster than SHA-1).
>  
> -Some hashes under consideration are SHA-256, SHA-512/256, SHA-256x16,
> +Some hashes under consideration were SHA-256, SHA-512/256, SHA-256x16,
>  K12, and BLAKE2bp-256.
>  
> +Eventually in July 2018 SHA-256 was chosen to be the NewHash. See the
> +thread starting at <20180609224913.GC38834@genre.crustytoothpaste.net>
> +for the discussion
> +(https://public-inbox.org/git/20180609224913.GC38834@genre.crustytoothpaste.net/)
> +


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 2/2] doc hash-function-transition: pick SHA-256 as NewHash
  2018-07-25 16:45               ` Junio C Hamano
@ 2018-07-25 17:25                 ` Jonathan Nieder
  2018-07-25 21:32                   ` Junio C Hamano
  2018-07-25 22:56                 ` [PATCH " brian m. carlson
  1 sibling, 1 reply; 66+ messages in thread
From: Jonathan Nieder @ 2018-07-25 17:25 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Ævar Arnfjörð Bjarmason, git, Linus Torvalds,
	Edward Thomson, brian m . carlson, Johannes Schindelin, demerphq,
	Adam Langley, keccak

Hi,

Junio C Hamano wrote:
> Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:

>> The consensus on the mailing list seems to be that SHA-256 should be
>> picked as our NewHash, see the "Hash algorithm analysis" thread as of
>> [1]. Linus has come around to this choice and suggested Junio make the
>> final pick, and he's endorsed SHA-256 [3].

I think this commit message focuses too much on the development
process, in a way that makes it not necessary useful to the target
audience that would be finding it with "git blame" or "git log".  It's
also not self-contained, which makes it less useful in the same way.

In other words, the commit message should be speaking for the project,
not speaking about the project.  I would be tempted to say something as
simple as

 hash-function-transition: pick SHA-256 as NewHash

 The project has decided.

 Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>

and let any Acked-bys on the message speak for themselves.
Alternatively, the commit message could include a summary of the
discussion:

 From a security perspective, it seems that SHA-256, BLAKE2, SHA3-256,
 K12, and so on are all believed to have similar security properties.
 All are good options from a security point of view.

 SHA-256 has a number of advantages:

 * It has been around for a while, is widely used, and is supported by
   just about every single crypto library (OpenSSL, mbedTLS, CryptoNG,
   SecureTransport, etc).

 * When you compare against SHA1DC, most vectorized SHA-256
   implementations are indeed faster, even without acceleration.

 * If we're doing signatures with OpenPGP (or even, I suppose, CMS),
   we're going to be using SHA-2, so it doesn't make sense to have our
   security depend on two separate algorithms when either one of them
   alone could break the security when we could just depend on one.

 So SHA-256 it is.

[...]
>> @@ -125,19 +122,19 @@ Detailed Design
>>  ---------------
>>  Repository format extension
>>  ~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> -A NewHash repository uses repository format version `1` (see
>> +A SHA-256 repository uses repository format version `1` (see
>>  Documentation/technical/repository-version.txt) with extensions
>>  `objectFormat` and `compatObjectFormat`:
>>  
>>  	[core]
>>  		repositoryFormatVersion = 1
>>  	[extensions]
>> -		objectFormat = newhash
>> +		objectFormat = sha256
>>  		compatObjectFormat = sha1
>
> Whenever we said SHA1, somebody came and told us that the name of
> the hash is SHA-1 (with dash).  Would we be nitpicker-prone in the
> same way with "sha256" here?

Regardless of how we spell it in prose, I think `sha256` as an
identifier in configuration is the spelling people will expect.  For
example, gpg ("gpg --version") calls it SHA256.

[...]
>>  Selection of a New Hash
>>  -----------------------
>> @@ -611,6 +608,10 @@ collisions in 2^69 operations. In August they published details.
>>  Luckily, no practical demonstrations of a collision in full SHA-1 were
>>  published until 10 years later, in 2017.
>>  
>> +It was decided that Git needed to transition to a new hash
>> +function. Initially no decision was made as to what function this was,
>> +the "NewHash" placeholder name was picked to describe it.
>> +
>>  The hash function NewHash to replace SHA-1 should be stronger than
>>  SHA-1 was: we would like it to be trustworthy and useful in practice
>>  for at least 10 years.
>
> This sentence needs a bit of updating to match the new paragraph
> inserted above.  "should be stronger" is something said by those
> who are still looking for one and/or trying to decide.

For what it's worth, I would be in favor of modifying the section
more heavily.  For example:

 Choice of Hash
 --------------
 In early 2005, around the time that Git was written,  Xiaoyun Wang,
 Yiqun Lisa Yin, and Hongbo Yu announced an attack finding SHA-1
 collisions in 2^69 operations. In August they published details.
 Luckily, no practical demonstrations of a collision in full SHA-1 were
 published until 10 years later, in 2017.

 Git v2.13.0 and later subsequently moved to a hardened SHA-1
 implementation by default that mitigates the SHAttered attack, but
 SHA-1 is still believed to be weak.

 The hash to replace this hardened SHA-1 should be stronger than SHA-1
 was: we would like it to be trustworthy and useful in practice
 for at least 10 years.

 Some other relevant properties:

 1. A 256-bit hash (long enough to match common security practice; not
    excessively long to hurt performance and disk usage).

 2. High quality implementations should be widely available (e.g., in
    OpenSSL and Apple CommonCrypto).

 3. The hash function's properties should match Git's needs (e.g. Git
    requires collision and 2nd preimage resistance and does not require
    length extension resistance).

 4. As a tiebreaker, the hash should be fast to compute (fortunately
    many contenders are faster than SHA-1).

 We choose SHA-256.

Changes:

- retitled since the hash function has already been selected
- added some notes about sha1dc
- when discussing wide implementation availability, mentioned
  CommonCrypto too, as an example of a non-OpenSSL library that the
  libgit2 authors care about
- named which function is chosen

We could put the runners up in the "alternatives considered" section,
but I don't think there's much to say about them here so I wouldn't.

Thanks and hope that helps,
Jonathan

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 2/2] doc hash-function-transition: pick SHA-256 as NewHash
  2018-07-25 17:25                 ` Jonathan Nieder
@ 2018-07-25 21:32                   ` Junio C Hamano
  2018-07-26 13:41                     ` [PATCH v2 " Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 66+ messages in thread
From: Junio C Hamano @ 2018-07-25 21:32 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Ævar Arnfjörð Bjarmason, git, Linus Torvalds,
	Edward Thomson, brian m . carlson, Johannes Schindelin, demerphq,
	Adam Langley, keccak

Jonathan Nieder <jrnieder@gmail.com> writes:

> Regardless of how we spell it in prose, I think `sha256` as an
> identifier in configuration is the spelling people will expect.  For
> example, gpg ("gpg --version") calls it SHA256.

OK.

> For what it's worth, I would be in favor of modifying the section
> more heavily.  For example:
> ...
> Changes:
>
> - retitled since the hash function has already been selected
> - added some notes about sha1dc
> - when discussing wide implementation availability, mentioned
>   CommonCrypto too, as an example of a non-OpenSSL library that the
>   libgit2 authors care about
> - named which function is chosen
>
> We could put the runners up in the "alternatives considered" section,
> but I don't think there's much to say about them here so I wouldn't.

All interesting ideas and good suggestions.  I'll leave 2/2 in the
mail archive and take only 1/2 for now.  I'd expect the final
version, not too soon after mulling over the suggestions raised
here, but not in too distant future to prevent us from forgetting
;-)

Thanks.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 2/2] doc hash-function-transition: pick SHA-256 as NewHash
  2018-07-25 16:45               ` Junio C Hamano
  2018-07-25 17:25                 ` Jonathan Nieder
@ 2018-07-25 22:56                 ` brian m. carlson
  1 sibling, 0 replies; 66+ messages in thread
From: brian m. carlson @ 2018-07-25 22:56 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Ævar Arnfjörð Bjarmason, git, Linus Torvalds,
	Edward Thomson, Jonathan Nieder, Johannes Schindelin, demerphq,
	Adam Langley, keccak

[-- Attachment #1: Type: text/plain, Size: 1158 bytes --]

On Wed, Jul 25, 2018 at 09:45:52AM -0700, Junio C Hamano wrote:
> Ævar Arnfjörð Bjarmason  <avarab@gmail.com> writes:
> 
> > @@ -125,19 +122,19 @@ Detailed Design
> >  ---------------
> >  Repository format extension
> >  ~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > -A NewHash repository uses repository format version `1` (see
> > +A SHA-256 repository uses repository format version `1` (see
> >  Documentation/technical/repository-version.txt) with extensions
> >  `objectFormat` and `compatObjectFormat`:
> >  
> >  	[core]
> >  		repositoryFormatVersion = 1
> >  	[extensions]
> > -		objectFormat = newhash
> > +		objectFormat = sha256
> >  		compatObjectFormat = sha1
> 
> Whenever we said SHA1, somebody came and told us that the name of
> the hash is SHA-1 (with dash).  Would we be nitpicker-prone in the
> same way with "sha256" here?

I actually have a patch to make the names "sha1" and "sha256".  My
rationale is that it's shorter and easier to type.  People can quibble
about it when I send it to the list, but that's what I'm proposing at
least.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 867 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Hash algorithm analysis
  2018-07-22 14:55               ` Eric Deplagne
@ 2018-07-26 10:05                 ` Johannes Schindelin
  0 siblings, 0 replies; 66+ messages in thread
From: Johannes Schindelin @ 2018-07-26 10:05 UTC (permalink / raw)
  To: Eric Deplagne
  Cc: brian m. carlson, Jonathan Nieder, git, demerphq, Linus Torvalds,
	Adam Langley, The Keccak Team

[-- Attachment #1: Type: text/plain, Size: 3299 bytes --]

Hi Eric,

On Sun, 22 Jul 2018, Eric Deplagne wrote:

> On Sun, 22 Jul 2018 14:21:48 +0000, brian m. carlson wrote:
> > On Sun, Jul 22, 2018 at 11:34:42AM +0200, Eric Deplagne wrote:
> > > On Sat, 21 Jul 2018 23:59:41 +0000, brian m. carlson wrote:
> > > > I don't know your colleagues, and they haven't commented here.  One
> > > > person that has commented here is Adam Langley.  It is my impression
> > > > (and anyone is free to correct me if I'm incorrect) that he is indeed a
> > > > cryptographer.  To quote him[0]:
> > > > 
> > > >   I think this group can safely assume that SHA-256, SHA-512, BLAKE2,
> > > >   K12, etc are all secure to the extent that I don't believe that making
> > > >   comparisons between them on that axis is meaningful. Thus I think the
> > > >   question is primarily concerned with performance and implementation
> > > >   availability.
> > > > 
> > > >   […]
> > > > 
> > > >   So, overall, none of these choices should obviously be excluded. The
> > > >   considerations at this point are not cryptographic and the tradeoff
> > > >   between implementation ease and performance is one that the git
> > > >   community would have to make.
> > > 
> > >   Am I completely out of the game, or the statement that
> > >     "the considerations at this point are not cryptographic"
> > >   is just the wrongest ?
> > > 
> > >   I mean, if that was true, would we not be sticking to SHA1 ?
> > 
> > I snipped a portion of the context, but AGL was referring to the
> > considerations involved in choosing from the proposed ones for NewHash.
> > In context, he meant that the candidates for NewHash “are all secure”
> > and are therefore a better choice than SHA-1.
> 
>   Maybe a little bit sensitive, but I really did read
>     "we don't care if it's weak or strong, that's not the matter".

Thank you for your concern. I agree that we need to be careful in
considering the security implications. We made that mistake before (IIRC
there was a cryptographer who was essentially shouted off the list when he
suggested *not* to hard-code SHA-1), and we should absolutely refrain from
making that same mistake again.

> > I think we can all agree that SHA-1 is weak and should be replaced.

Indeed.

So at this point, we already excluded pretty much all the unsafe options
(although it does concern me that BLAKE2b has been weakened purposefully,
I understand the reasoning, but still).

Which means that by now, considering the security implications of the
cipher is no longer a criterion that helps us whittle down the candidates
further.

So from my point of view, there are two criterions that can help us
further:

- Which cipher is the least likely to be broken (or just weakened by new
  attacks)?

- As energy considerations not only ecologically inspired, but also in
  terms of money for elecricity: which cipher is most likely to get decent
  hardware support any time soon?

Even if my original degree (prime number theory) is closer to
cryptanalysis than pretty much all other prolific core Git contributors, I
do not want you to trust *my* word on answering those questions.

Therefore, I will ask my colleagues to enter the hornet's nest that is
this mailing list.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Hash algorithm analysis
  2018-07-22 15:23           ` Joan Daemen
  2018-07-22 18:54             ` Adam Langley
@ 2018-07-26 10:31             ` Johannes Schindelin
  1 sibling, 0 replies; 66+ messages in thread
From: Johannes Schindelin @ 2018-07-26 10:31 UTC (permalink / raw)
  To: Joan Daemen
  Cc: brian m. carlson, Jonathan Nieder, git, demerphq, Linus Torvalds,
	Adam Langley, Keccak Team

Hi Joan,

On Sun, 22 Jul 2018, Joan Daemen wrote:

> I wanted to react to some statements I read in this discussion. But
> first let me introduce myself. I'm Joan Daemen and I'm working in
> symmetric cryptography since 1988. Vincent Rijmen and I designed
> Rijndael that was selected to become AES and Guido Bertoni, Michael
> Peeters and Gilles Van Assche and I (the Keccak team, later extended
> with Ronny Van Keer) designed Keccak that was selected to become SHA3.
> Of course as a member of the Keccak team I'm biased in this discussion
> but I'll try to keep it factual.

Thank you *so* much for giving your valuable time and expertise on this
subject.

I really would hate for the decision to be made due to opinions of people
who are overconfident in their abilities to judge cryptographic matters
despite clearly being out of their league (which includes me, I want to
add specifically).

On a personal note: back in the day, I have been following the Keccak with
a lot of interest, being intrigued by the deliberate deviation from the
standard primitives, and I am pretty much giddy about the fact that I am
talking to you right now.

> [... interesting, and thorough background information ...]
> 
> Anyway, these numbers support the opinion that the safety margins taken
> in K12 are better understood than those in SHA-256, SHA-512 and BLAKE2.

This is very, very useful information in my mind.

> Adam Langley continues:
> 
> 	Thus I think the question is primarily concerned with performance and implementation availability
> 
> 
> Table 2 in our ACNS paper on K12 (available at
> https://eprint.iacr.org/2016/770) shows that performance of K12 is quite
> competitive. Moreover, there is a lot of code available under CC0
> license in the Keccak Code Package on github
> https://github.com/gvanas/KeccakCodePackage. If there is shortage of
> code for some platforms in the short term, we will be happy to work on that.
> 
> In the long term, it is likely that the relative advantage of K12 will
> increase as it has more potential for hardware acceleration, e.g., by
> instruction set extension. This is thanks to the fact that it does not
> use addition, as opposed to so-called addition-xor-rotation (ARX)
> designs such as the SHA-2 and BLAKE2 families. This is already
> illustrated in our Table 2 I referred to above, in the transition from
> Skylake to SkylakeX.

I *really* hope that more accessible hardware acceleration for this
materializes at some stage. And by "more accessible", I mean commodity
hardware such as ARM or AMD/Intel processors: big hosters could relatively
easily develop appropriate FPGAs (we already do this for AI, after all).

> Maybe also interesting for this discussion are the two notes we (Keccak
> team) wrote on our choice to not go for ARX and the one on "open source
> crypto" at https://keccak.team/2017/not_arx.html and
> https://keccak.team/2017/open_source_crypto.html respectively.

I had read those posts when they came out, and still find them insightful.
Hopefully other readers of this mailing list will spend the time to read
them, too.

Again, thank you so much for a well-timed dose of domain expertise in
this thread.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH v2 2/2] doc hash-function-transition: pick SHA-256 as NewHash
  2018-07-25 21:32                   ` Junio C Hamano
@ 2018-07-26 13:41                     ` Ævar Arnfjörð Bjarmason
  2018-08-03  7:20                       ` Jonathan Nieder
  0 siblings, 1 reply; 66+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-07-26 13:41 UTC (permalink / raw)
  To: git
  Cc: Junio C Hamano, Linus Torvalds, Edward Thomson, brian m . carlson,
	Jonathan Nieder, Johannes Schindelin, demerphq, Adam Langley,
	keccak, Ævar Arnfjörð Bjarmason

From a security perspective, it seems that SHA-256, BLAKE2, SHA3-256,
K12, and so on are all believed to have similar security properties.
All are good options from a security point of view.

SHA-256 has a number of advantages:

* It has been around for a while, is widely used, and is supported by
  just about every single crypto library (OpenSSL, mbedTLS, CryptoNG,
  SecureTransport, etc).

* When you compare against SHA1DC, most vectorized SHA-256
  implementations are indeed faster, even without acceleration.

* If we're doing signatures with OpenPGP (or even, I suppose, CMS),
  we're going to be using SHA-2, so it doesn't make sense to have our
  security depend on two separate algorithms when either one of them
  alone could break the security when we could just depend on one.

So SHA-256 it is. See the "Hash algorithm analysis" thread as of
[1]. Linus has come around to this choice and suggested Junio make the
final pick, and he's endorsed SHA-256 [3].

This follow-up change changes occurrences of "NewHash" to
"SHA-256" (or "sha256", depending on the context). The "Selection of a
New Hash" section has also been changed to note that historically we
used the the "NewHash" name while we didn't know what the new hash
function would be.

This leaves no use of "NewHash" anywhere in git.git except in the
aforementioned section (and as a variable name in t/t9700/test.pl, but
that use from 2008 has nothing to do with this transition plan).

1. https://public-inbox.org/git/20180720215220.GB18502@genre.crustytoothpaste.net/
2. https://public-inbox.org/git/CA+55aFwSe9BF8e0hLk9pp3FVD5LaVY5GRdsV3fbNtgzekJadyA@mail.gmail.com/
3. https://public-inbox.org/git/xmqqzhygwd5o.fsf@gitster-ct.c.googlers.com/

Helped-by: Jonathan Nieder <jrnieder@gmail.com>
Helped-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---

On Wed, Jul 25 2018, Junio C Hamano wrote:
> Jonathan Nieder <jrnieder@gmail.com> writes:
> [...]
> All interesting ideas and good suggestions.  I'll leave 2/2 in the
> mail archive and take only 1/2 for now.  I'd expect the final
> version, not too soon after mulling over the suggestions raised
> here, but not in too distant future to prevent us from forgetting
> ;-)

Here's a v2 which uses the suggestions for both the commit message &
documentation from Jonathan and yourself. Thanks!

 .../technical/hash-function-transition.txt    | 196 +++++++++---------
 1 file changed, 99 insertions(+), 97 deletions(-)

diff --git a/Documentation/technical/hash-function-transition.txt b/Documentation/technical/hash-function-transition.txt
index 5ee4754adb..5041e57273 100644
--- a/Documentation/technical/hash-function-transition.txt
+++ b/Documentation/technical/hash-function-transition.txt
@@ -59,14 +59,11 @@ that are believed to be cryptographically secure.
 
 Goals
 -----
-Where NewHash is a strong 256-bit hash function to replace SHA-1 (see
-"Selection of a New Hash", below):
-
-1. The transition to NewHash can be done one local repository at a time.
+1. The transition to SHA-256 can be done one local repository at a time.
    a. Requiring no action by any other party.
-   b. A NewHash repository can communicate with SHA-1 Git servers
+   b. A SHA-256 repository can communicate with SHA-1 Git servers
       (push/fetch).
-   c. Users can use SHA-1 and NewHash identifiers for objects
+   c. Users can use SHA-1 and SHA-256 identifiers for objects
       interchangeably (see "Object names on the command line", below).
    d. New signed objects make use of a stronger hash function than
       SHA-1 for their security guarantees.
@@ -79,7 +76,7 @@ Where NewHash is a strong 256-bit hash function to replace SHA-1 (see
 
 Non-Goals
 ---------
-1. Add NewHash support to Git protocol. This is valuable and the
+1. Add SHA-256 support to Git protocol. This is valuable and the
    logical next step but it is out of scope for this initial design.
 2. Transparently improving the security of existing SHA-1 signed
    objects.
@@ -87,26 +84,26 @@ Non-Goals
    repository.
 4. Taking the opportunity to fix other bugs in Git's formats and
    protocols.
-5. Shallow clones and fetches into a NewHash repository. (This will
-   change when we add NewHash support to Git protocol.)
-6. Skip fetching some submodules of a project into a NewHash
-   repository. (This also depends on NewHash support in Git
+5. Shallow clones and fetches into a SHA-256 repository. (This will
+   change when we add SHA-256 support to Git protocol.)
+6. Skip fetching some submodules of a project into a SHA-256
+   repository. (This also depends on SHA-256 support in Git
    protocol.)
 
 Overview
 --------
 We introduce a new repository format extension. Repositories with this
-extension enabled use NewHash instead of SHA-1 to name their objects.
+extension enabled use SHA-256 instead of SHA-1 to name their objects.
 This affects both object names and object content --- both the names
 of objects and all references to other objects within an object are
 switched to the new hash function.
 
-NewHash repositories cannot be read by older versions of Git.
+SHA-256 repositories cannot be read by older versions of Git.
 
-Alongside the packfile, a NewHash repository stores a bidirectional
-mapping between NewHash and SHA-1 object names. The mapping is generated
+Alongside the packfile, a SHA-256 repository stores a bidirectional
+mapping between SHA-256 and SHA-1 object names. The mapping is generated
 locally and can be verified using "git fsck". Object lookups use this
-mapping to allow naming objects using either their SHA-1 and NewHash names
+mapping to allow naming objects using either their SHA-1 and SHA-256 names
 interchangeably.
 
 "git cat-file" and "git hash-object" gain options to display an object
@@ -116,7 +113,7 @@ object database so that they can be named using the appropriate name
 (using the bidirectional hash mapping).
 
 Fetches from a SHA-1 based server convert the fetched objects into
-NewHash form and record the mapping in the bidirectional mapping table
+SHA-256 form and record the mapping in the bidirectional mapping table
 (see below for details). Pushes to a SHA-1 based server convert the
 objects being pushed into sha1 form so the server does not have to be
 aware of the hash function the client is using.
@@ -125,19 +122,19 @@ Detailed Design
 ---------------
 Repository format extension
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
-A NewHash repository uses repository format version `1` (see
+A SHA-256 repository uses repository format version `1` (see
 Documentation/technical/repository-version.txt) with extensions
 `objectFormat` and `compatObjectFormat`:
 
 	[core]
 		repositoryFormatVersion = 1
 	[extensions]
-		objectFormat = newhash
+		objectFormat = sha256
 		compatObjectFormat = sha1
 
 The combination of setting `core.repositoryFormatVersion=1` and
 populating `extensions.*` ensures that all versions of Git later than
-`v0.99.9l` will die instead of trying to operate on the NewHash
+`v0.99.9l` will die instead of trying to operate on the SHA-256
 repository, instead producing an error message.
 
 	# Between v0.99.9l and v2.7.0
@@ -155,36 +152,36 @@ repository extensions.
 Object names
 ~~~~~~~~~~~~
 Objects can be named by their 40 hexadecimal digit sha1-name or 64
-hexadecimal digit newhash-name, plus names derived from those (see
+hexadecimal digit sha256-name, plus names derived from those (see
 gitrevisions(7)).
 
 The sha1-name of an object is the SHA-1 of the concatenation of its
 type, length, a nul byte, and the object's sha1-content. This is the
 traditional <sha1> used in Git to name objects.
 
-The newhash-name of an object is the NewHash of the concatenation of its
-type, length, a nul byte, and the object's newhash-content.
+The sha256-name of an object is the SHA-256 of the concatenation of its
+type, length, a nul byte, and the object's sha256-content.
 
 Object format
 ~~~~~~~~~~~~~
 The content as a byte sequence of a tag, commit, or tree object named
-by sha1 and newhash differ because an object named by newhash-name refers to
-other objects by their newhash-names and an object named by sha1-name
+by sha1 and sha256 differ because an object named by sha256-name refers to
+other objects by their sha256-names and an object named by sha1-name
 refers to other objects by their sha1-names.
 
-The newhash-content of an object is the same as its sha1-content, except
-that objects referenced by the object are named using their newhash-names
+The sha256-content of an object is the same as its sha1-content, except
+that objects referenced by the object are named using their sha256-names
 instead of sha1-names. Because a blob object does not refer to any
-other object, its sha1-content and newhash-content are the same.
+other object, its sha1-content and sha256-content are the same.
 
-The format allows round-trip conversion between newhash-content and
+The format allows round-trip conversion between sha256-content and
 sha1-content.
 
 Object storage
 ~~~~~~~~~~~~~~
 Loose objects use zlib compression and packed objects use the packed
 format described in Documentation/technical/pack-format.txt, just like
-today. The content that is compressed and stored uses newhash-content
+today. The content that is compressed and stored uses sha256-content
 instead of sha1-content.
 
 Pack index
@@ -255,10 +252,10 @@ network byte order):
   up to and not including the table of CRC32 values.
 - Zero or more NUL bytes.
 - The trailer consists of the following:
-  - A copy of the 20-byte NewHash checksum at the end of the
+  - A copy of the 20-byte SHA-256 checksum at the end of the
     corresponding packfile.
 
-  - 20-byte NewHash checksum of all of the above.
+  - 20-byte SHA-256 checksum of all of the above.
 
 Loose object index
 ~~~~~~~~~~~~~~~~~~
@@ -266,7 +263,7 @@ A new file $GIT_OBJECT_DIR/loose-object-idx contains information about
 all loose objects. Its format is
 
   # loose-object-idx
-  (newhash-name SP sha1-name LF)*
+  (sha256-name SP sha1-name LF)*
 
 where the object names are in hexadecimal format. The file is not
 sorted.
@@ -292,8 +289,8 @@ To remove entries (e.g. in "git pack-refs" or "git-prune"):
 Translation table
 ~~~~~~~~~~~~~~~~~
 The index files support a bidirectional mapping between sha1-names
-and newhash-names. The lookup proceeds similarly to ordinary object
-lookups. For example, to convert a sha1-name to a newhash-name:
+and sha256-names. The lookup proceeds similarly to ordinary object
+lookups. For example, to convert a sha1-name to a sha256-name:
 
  1. Look for the object in idx files. If a match is present in the
     idx's sorted list of truncated sha1-names, then:
@@ -301,8 +298,8 @@ lookups. For example, to convert a sha1-name to a newhash-name:
        name order mapping.
     b. Read the corresponding entry in the full sha1-name table to
        verify we found the right object. If it is, then
-    c. Read the corresponding entry in the full newhash-name table.
-       That is the object's newhash-name.
+    c. Read the corresponding entry in the full sha256-name table.
+       That is the object's sha256-name.
  2. Check for a loose object. Read lines from loose-object-idx until
     we find a match.
 
@@ -318,25 +315,25 @@ for all objects in the object store.
 
 Reading an object's sha1-content
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-The sha1-content of an object can be read by converting all newhash-names
-its newhash-content references to sha1-names using the translation table.
+The sha1-content of an object can be read by converting all sha256-names
+its sha256-content references to sha1-names using the translation table.
 
 Fetch
 ~~~~~
 Fetching from a SHA-1 based server requires translating between SHA-1
-and NewHash based representations on the fly.
+and SHA-256 based representations on the fly.
 
 SHA-1s named in the ref advertisement that are present on the client
-can be translated to NewHash and looked up as local objects using the
+can be translated to SHA-256 and looked up as local objects using the
 translation table.
 
 Negotiation proceeds as today. Any "have"s generated locally are
 converted to SHA-1 before being sent to the server, and SHA-1s
-mentioned by the server are converted to NewHash when looking them up
+mentioned by the server are converted to SHA-256 when looking them up
 locally.
 
 After negotiation, the server sends a packfile containing the
-requested objects. We convert the packfile to NewHash format using
+requested objects. We convert the packfile to SHA-256 format using
 the following steps:
 
 1. index-pack: inflate each object in the packfile and compute its
@@ -351,12 +348,12 @@ the following steps:
    (This list only contains objects reachable from the "wants". If the
    pack from the server contained additional extraneous objects, then
    they will be discarded.)
-3. convert to newhash: open a new (newhash) packfile. Read the topologically
+3. convert to sha256: open a new (sha256) packfile. Read the topologically
    sorted list just generated. For each object, inflate its
-   sha1-content, convert to newhash-content, and write it to the newhash
-   pack. Record the new sha1<->newhash mapping entry for use in the idx.
+   sha1-content, convert to sha256-content, and write it to the sha256
+   pack. Record the new sha1<->sha256 mapping entry for use in the idx.
 4. sort: reorder entries in the new pack to match the order of objects
-   in the pack the server generated and include blobs. Write a newhash idx
+   in the pack the server generated and include blobs. Write a sha256 idx
    file
 5. clean up: remove the SHA-1 based pack file, index, and
    topologically sorted list obtained from the server in steps 1
@@ -388,16 +385,16 @@ send-pack.
 
 Signed Commits
 ~~~~~~~~~~~~~~
-We add a new field "gpgsig-newhash" to the commit object format to allow
+We add a new field "gpgsig-sha256" to the commit object format to allow
 signing commits without relying on SHA-1. It is similar to the
-existing "gpgsig" field. Its signed payload is the newhash-content of the
-commit object with any "gpgsig" and "gpgsig-newhash" fields removed.
+existing "gpgsig" field. Its signed payload is the sha256-content of the
+commit object with any "gpgsig" and "gpgsig-sha256" fields removed.
 
 This means commits can be signed
 1. using SHA-1 only, as in existing signed commit objects
-2. using both SHA-1 and NewHash, by using both gpgsig-newhash and gpgsig
+2. using both SHA-1 and SHA-256, by using both gpgsig-sha256 and gpgsig
    fields.
-3. using only NewHash, by only using the gpgsig-newhash field.
+3. using only SHA-256, by only using the gpgsig-sha256 field.
 
 Old versions of "git verify-commit" can verify the gpgsig signature in
 cases (1) and (2) without modifications and view case (3) as an
@@ -405,24 +402,24 @@ ordinary unsigned commit.
 
 Signed Tags
 ~~~~~~~~~~~
-We add a new field "gpgsig-newhash" to the tag object format to allow
+We add a new field "gpgsig-sha256" to the tag object format to allow
 signing tags without relying on SHA-1. Its signed payload is the
-newhash-content of the tag with its gpgsig-newhash field and "-----BEGIN PGP
+sha256-content of the tag with its gpgsig-sha256 field and "-----BEGIN PGP
 SIGNATURE-----" delimited in-body signature removed.
 
 This means tags can be signed
 1. using SHA-1 only, as in existing signed tag objects
-2. using both SHA-1 and NewHash, by using gpgsig-newhash and an in-body
+2. using both SHA-1 and SHA-256, by using gpgsig-sha256 and an in-body
    signature.
-3. using only NewHash, by only using the gpgsig-newhash field.
+3. using only SHA-256, by only using the gpgsig-sha256 field.
 
 Mergetag embedding
 ~~~~~~~~~~~~~~~~~~
 The mergetag field in the sha1-content of a commit contains the
 sha1-content of a tag that was merged by that commit.
 
-The mergetag field in the newhash-content of the same commit contains the
-newhash-content of the same tag.
+The mergetag field in the sha256-content of the same commit contains the
+sha256-content of the same tag.
 
 Submodules
 ~~~~~~~~~~
@@ -497,7 +494,7 @@ Caveats
 -------
 Invalid objects
 ~~~~~~~~~~~~~~~
-The conversion from sha1-content to newhash-content retains any
+The conversion from sha1-content to sha256-content retains any
 brokenness in the original object (e.g., tree entry modes encoded with
 leading 0, tree objects whose paths are not sorted correctly, and
 commit objects without an author or committer). This is a deliberate
@@ -516,7 +513,7 @@ allow lifting this restriction.
 
 Alternates
 ~~~~~~~~~~
-For the same reason, a newhash repository cannot borrow objects from a
+For the same reason, a sha256 repository cannot borrow objects from a
 sha1 repository using objects/info/alternates or
 $GIT_ALTERNATE_OBJECT_REPOSITORIES.
 
@@ -524,20 +521,20 @@ git notes
 ~~~~~~~~~
 The "git notes" tool annotates objects using their sha1-name as key.
 This design does not describe a way to migrate notes trees to use
-newhash-names. That migration is expected to happen separately (for
+sha256-names. That migration is expected to happen separately (for
 example using a file at the root of the notes tree to describe which
 hash it uses).
 
 Server-side cost
 ~~~~~~~~~~~~~~~~
-Until Git protocol gains NewHash support, using NewHash based storage
+Until Git protocol gains SHA-256 support, using SHA-256 based storage
 on public-facing Git servers is strongly discouraged. Once Git
-protocol gains NewHash support, NewHash based servers are likely not
+protocol gains SHA-256 support, SHA-256 based servers are likely not
 to support SHA-1 compatibility, to avoid what may be a very expensive
 hash reencode during clone and to encourage peers to modernize.
 
 The design described here allows fetches by SHA-1 clients of a
-personal NewHash repository because it's not much more difficult than
+personal SHA-256 repository because it's not much more difficult than
 allowing pushes from that repository. This support needs to be guarded
 by a configuration option --- servers like git.kernel.org that serve a
 large number of clients would not be expected to bear that cost.
@@ -547,7 +544,7 @@ Meaning of signatures
 The signed payload for signed commits and tags does not explicitly
 name the hash used to identify objects. If some day Git adopts a new
 hash function with the same length as the current SHA-1 (40
-hexadecimal digit) or NewHash (64 hexadecimal digit) objects then the
+hexadecimal digit) or SHA-256 (64 hexadecimal digit) objects then the
 intent behind the PGP signed payload in an object signature is
 unclear:
 
@@ -562,7 +559,7 @@ Does this mean Git v2.12.0 is the commit with sha1-name
 e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7 or the commit with
 new-40-digit-hash-name e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7?
 
-Fortunately NewHash and SHA-1 have different lengths. If Git starts
+Fortunately SHA-256 and SHA-1 have different lengths. If Git starts
 using another hash with the same length to name objects, then it will
 need to change the format of signed payloads using that hash to
 address this issue.
@@ -574,24 +571,24 @@ supports four different modes of operation:
 
  1. ("dark launch") Treat object names input by the user as SHA-1 and
     convert any object names written to output to SHA-1, but store
-    objects using NewHash.  This allows users to test the code with no
+    objects using SHA-256.  This allows users to test the code with no
     visible behavior change except for performance.  This allows
     allows running even tests that assume the SHA-1 hash function, to
     sanity-check the behavior of the new mode.
 
- 2. ("early transition") Allow both SHA-1 and NewHash object names in
+ 2. ("early transition") Allow both SHA-1 and SHA-256 object names in
     input. Any object names written to output use SHA-1. This allows
     users to continue to make use of SHA-1 to communicate with peers
     (e.g. by email) that have not migrated yet and prepares for mode 3.
 
- 3. ("late transition") Allow both SHA-1 and NewHash object names in
-    input. Any object names written to output use NewHash. In this
+ 3. ("late transition") Allow both SHA-1 and SHA-256 object names in
+    input. Any object names written to output use SHA-256. In this
     mode, users are using a more secure object naming method by
     default.  The disruption is minimal as long as most of their peers
     are in mode 2 or mode 3.
 
  4. ("post-transition") Treat object names input by the user as
-    NewHash and write output using NewHash. This is safer than mode 3
+    SHA-256 and write output using SHA-256. This is safer than mode 3
     because there is less risk that input is incorrectly interpreted
     using the wrong hash function.
 
@@ -601,18 +598,22 @@ The user can also explicitly specify which format to use for a
 particular revision specifier and for output, overriding the mode. For
 example:
 
-git --output-format=sha1 log abac87a^{sha1}..f787cac^{newhash}
+git --output-format=sha1 log abac87a^{sha1}..f787cac^{sha256}
 
-Selection of a New Hash
------------------------
+Choice of Hash
+--------------
 In early 2005, around the time that Git was written,  Xiaoyun Wang,
 Yiqun Lisa Yin, and Hongbo Yu announced an attack finding SHA-1
 collisions in 2^69 operations. In August they published details.
 Luckily, no practical demonstrations of a collision in full SHA-1 were
 published until 10 years later, in 2017.
 
-The hash function NewHash to replace SHA-1 should be stronger than
-SHA-1 was: we would like it to be trustworthy and useful in practice
+Git v2.13.0 and later subsequently moved to a hardened SHA-1
+implementation by default that mitigates the SHAttered attack, but
+SHA-1 is still believed to be weak.
+
+The hash to replace this hardened SHA-1 should be stronger than SHA-1
+was: we would like it to be trustworthy and useful in practice
 for at least 10 years.
 
 Some other relevant properties:
@@ -620,8 +621,8 @@ Some other relevant properties:
 1. A 256-bit hash (long enough to match common security practice; not
    excessively long to hurt performance and disk usage).
 
-2. High quality implementations should be widely available (e.g. in
-   OpenSSL).
+2. High quality implementations should be widely available (e.g., in
+   OpenSSL and Apple CommonCrypto).
 
 3. The hash function's properties should match Git's needs (e.g. Git
    requires collision and 2nd preimage resistance and does not require
@@ -630,14 +631,15 @@ Some other relevant properties:
 4. As a tiebreaker, the hash should be fast to compute (fortunately
    many contenders are faster than SHA-1).
 
-Some hashes under consideration are SHA-256, SHA-512/256, SHA-256x16,
-K12, and BLAKE2bp-256.
+We choose SHA-256. See the thread starting at
+<20180609224913.GC38834@genre.crustytoothpaste.net> for the discussion
+(https://public-inbox.org/git/20180609224913.GC38834@genre.crustytoothpaste.net/)
 
 Transition plan
 ---------------
 Some initial steps can be implemented independently of one another:
 - adding a hash function API (vtable)
-- teaching fsck to tolerate the gpgsig-newhash field
+- teaching fsck to tolerate the gpgsig-sha256 field
 - excluding gpgsig-* from the fields copied by "git commit --amend"
 - annotating tests that depend on SHA-1 values with a SHA1 test
   prerequisite
@@ -664,7 +666,7 @@ Next comes introduction of compatObjectFormat:
 - adding appropriate index entries when adding a new object to the
   object store
 - --output-format option
-- ^{sha1} and ^{newhash} revision notation
+- ^{sha1} and ^{sha256} revision notation
 - configuration to specify default input and output format (see
   "Object names on the command line" above)
 
@@ -672,7 +674,7 @@ The next step is supporting fetches and pushes to SHA-1 repositories:
 - allow pushes to a repository using the compat format
 - generate a topologically sorted list of the SHA-1 names of fetched
   objects
-- convert the fetched packfile to newhash format and generate an idx
+- convert the fetched packfile to sha256 format and generate an idx
   file
 - re-sort to match the order of objects in the fetched packfile
 
@@ -680,30 +682,30 @@ The infrastructure supporting fetch also allows converting an existing
 repository. In converted repositories and new clones, end users can
 gain support for the new hash function without any visible change in
 behavior (see "dark launch" in the "Object names on the command line"
-section). In particular this allows users to verify NewHash signatures
+section). In particular this allows users to verify SHA-256 signatures
 on objects in the repository, and it should ensure the transition code
 is stable in production in preparation for using it more widely.
 
 Over time projects would encourage their users to adopt the "early
 transition" and then "late transition" modes to take advantage of the
-new, more futureproof NewHash object names.
+new, more futureproof SHA-256 object names.
 
 When objectFormat and compatObjectFormat are both set, commands
-generating signatures would generate both SHA-1 and NewHash signatures
+generating signatures would generate both SHA-1 and SHA-256 signatures
 by default to support both new and old users.
 
-In projects using NewHash heavily, users could be encouraged to adopt
+In projects using SHA-256 heavily, users could be encouraged to adopt
 the "post-transition" mode to avoid accidentally making implicit use
 of SHA-1 object names.
 
 Once a critical mass of users have upgraded to a version of Git that
-can verify NewHash signatures and have converted their existing
+can verify SHA-256 signatures and have converted their existing
 repositories to support verifying them, we can add support for a
-setting to generate only NewHash signatures. This is expected to be at
+setting to generate only SHA-256 signatures. This is expected to be at
 least a year later.
 
 That is also a good moment to advertise the ability to convert
-repositories to use NewHash only, stripping out all SHA-1 related
+repositories to use SHA-256 only, stripping out all SHA-1 related
 metadata. This improves performance by eliminating translation
 overhead and security by avoiding the possibility of accidentally
 relying on the safety of SHA-1.
@@ -742,16 +744,16 @@ using the old hash function.
 
 Signed objects with multiple hashes
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Instead of introducing the gpgsig-newhash field in commit and tag objects
-for newhash-content based signatures, an earlier version of this design
-added "hash newhash <newhash-name>" fields to strengthen the existing
+Instead of introducing the gpgsig-sha256 field in commit and tag objects
+for sha256-content based signatures, an earlier version of this design
+added "hash sha256 <sha256-name>" fields to strengthen the existing
 sha1-content based signatures.
 
 In other words, a single signature was used to attest to the object
 content using both hash functions. This had some advantages:
 * Using one signature instead of two speeds up the signing process.
 * Having one signed payload with both hashes allows the signer to
-  attest to the sha1-name and newhash-name referring to the same object.
+  attest to the sha1-name and sha256-name referring to the same object.
 * All users consume the same signature. Broken signatures are likely
   to be detected quickly using current versions of git.
 
@@ -760,11 +762,11 @@ However, it also came with disadvantages:
   objects it references, even after the transition is complete and
   translation table is no longer needed for anything else. To support
   this, the design added fields such as "hash sha1 tree <sha1-name>"
-  and "hash sha1 parent <sha1-name>" to the newhash-content of a signed
+  and "hash sha1 parent <sha1-name>" to the sha256-content of a signed
   commit, complicating the conversion process.
 * Allowing signed objects without a sha1 (for after the transition is
   complete) complicated the design further, requiring a "nohash sha1"
-  field to suppress including "hash sha1" fields in the newhash-content
+  field to suppress including "hash sha1" fields in the sha256-content
   and signed payload.
 
 Lazily populated translation table
@@ -772,7 +774,7 @@ Lazily populated translation table
 Some of the work of building the translation table could be deferred to
 push time, but that would significantly complicate and slow down pushes.
 Calculating the sha1-name at object creation time at the same time it is
-being streamed to disk and having its newhash-name calculated should be
+being streamed to disk and having its sha256-name calculated should be
 an acceptable cost.
 
 Document History
-- 
2.18.0.345.g5c9ce644c3


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: Hash algorithm analysis
  2018-07-24 22:10             ` brian m. carlson
@ 2018-07-30  9:06               ` Johannes Schindelin
  2018-07-30 20:01                 ` Dan Shumow
  0 siblings, 1 reply; 66+ messages in thread
From: Johannes Schindelin @ 2018-07-30  9:06 UTC (permalink / raw)
  To: brian m. carlson, Dan Shumow
  Cc: Junio C Hamano, Linus Torvalds, Edward Thomson, Jonathan Nieder,
	Git Mailing List, demerphq, Adam Langley, keccak

Hi Brian,

On Tue, 24 Jul 2018, brian m. carlson wrote:

> On Tue, Jul 24, 2018 at 02:13:07PM -0700, Junio C Hamano wrote:
> > Yup.  I actually was leaning toward saying "all of them are OK in
> > practice, so the person who is actually spear-heading the work gets to
> > choose", but if we picked SHA-256 now, that would not be a choice that
> > Brian has to later justify for choosing against everybody else's
> > wishes, which makes it the best choice ;-)
> 
> This looks like a rough consensus.

As I grew really uncomfortable with having a decision that seems to be
based on hunches by non-experts (we rejected the preference of the only
cryptographer who weighed in, after all, precisely like we did over a
decade ago), I asked whether I could loop in one of our in-house experts
with this public discussion.

Y'all should be quite familiar with his work: Dan Shumow.

Dan, thank you for agreeing to chime in publicly.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 66+ messages in thread

* RE: Hash algorithm analysis
  2018-07-30  9:06               ` Johannes Schindelin
@ 2018-07-30 20:01                 ` Dan Shumow
  2018-08-03  2:57                   ` Jonathan Nieder
  2018-09-18 15:18                   ` Joan Daemen
  0 siblings, 2 replies; 66+ messages in thread
From: Dan Shumow @ 2018-07-30 20:01 UTC (permalink / raw)
  To: Johannes Schindelin, brian m. carlson
  Cc: Junio C Hamano, Linus Torvalds, Edward Thomson, Jonathan Nieder,
	Git Mailing List, demerphq, Adam Langley, keccak@noekeon.org

Hello all.   Johannes, thanks for adding me to this discussion.

So, as one of the coauthors of the SHA-1 collision detection code, I just wanted to chime in and say I'm glad to see the move to a longer hash function.  Though, as a cryptographer, I have a few thoughts on the matter that I thought I would share.

I think that moving to SHA256 is a fine change, and I support it.

I'm not anywhere near the expert in this that Joan Daeman is.  I am someone who has worked in this space more or less peripherally.  However, I agree with Adam Langley that basically all of the finalists for a hash function replacement are about the same for the security needs of Git.  I think that, for this community, other software engineering considerations should be more important to the selection process.

I think Joan's survey of cryptanalysis papers and the numbers that he gives are interesting, and I had never seen the comparison laid out like that.  So, I think that there is a good argument to be made that SHA3 has had more cryptanalysis than SHA2.  Though, Joan, are the papers that you surveyed only focused on SHA2?  I'm curious if you think that the design/construction of SHA2, as it can be seen as an iteration of MD5/SHA1, means that the cryptanalysis papers on those constructions can be considered to apply to SHA2?  Again, I'm not an expert in this, but I do know that Marc Steven's techniques for constructing collisions also provided some small cryptanalytic improvements against the SHA2 family as well.  I also think that while the paper survey is a good way to look over all of this, the more time in the position of high profile visibility that SHA2 has can give us some confidence as well.

Also something worth pointing out is that the connection SHA2 has to SHA1 means that if Marc Steven's cryptanalysis of MD5/SHA-1 were ever successfully applied to SHA2, the SHA1 collision detection approach could be applied there as well, thus providing a drop in replacement in that situation.  That said, we don't know that there is not a similar way of addressing issues with the SHA3/Sponge design.  It's just that because we haven't seen any weaknesses of this sort in similar designs, we just don't know what a similar approach would be there yet.  I don't want to put too much stock in this argument, it's just saying "Well, we already know how SHA2 is likely to break, and we've had fixes for similar things in the past."  This is pragmatic but not inspiring or confidence building.

So, I also want to state my biases in favor of SHA2 as an employee of Microsoft.  Microsoft, being a corporation headquartered in a America, with the US Gov't as a major customer definitely prefers to defer to the US Gov't NIST standardization process.  And from that perspective SHA2 or SHA3 would be good choices.  I, personally, think that the NIST process is the best we have.  It is relatively transparent, and NIST employs a fair number of very competent cryptographers.  Also, I am encouraged by the widespread international participation that the NIST competitions and selection processes attract.

As such, and reflecting this bias, in the internal discussions that Johannes alluded to, SHA2 and SHA3 were the primary suggestions.  There was a slight preference for SHA2 because SHA3 is not exposed through the windows cryptographic APIs (though Git does not use those, so this is a nonissue for this discussion.)

I also wanted to thank Johannes for keeping the cryptographers that he discussed this with anonymous.  After all, cryptographers are known for being private.  And I wanted to say that Johannes did, in fact, accurately represent our internal discussions on the matter.

I also wanted to comment on the discussion of the "internal state having the same size as the output."  Linus referred to this several times.  This is known as narrow-pipe vs wide-pipe in the hash function design literature.  Linus is correct that wide-pipe designs are more in favor currently, and IIRC, all of the serious SHA3 candidates employed this.  That said, it did seem that in the discussion this was being equated with "length extension attacks."  And that connection is just not accurate.  Length extension attacks are primarily a motivation of the HMAC liked nested hashing design for MACs, because of a potential forgery attack.  Again, this doesn't really matter because the decision has been made despite this discussion.  I just wanted to set the record straight about this, as to avoid doing the right thing for the wrong reason (T.S. Elliot's "greatest treason.")

One other thing that I wanted to throw out there for the future is that in the crypto community there is currently a very large push to post quantum cryptography.  Whether the threat of quantum computers is real or imagined this is a hot area of research, with a NIST competition to select post quantum asymmetric cryptographic algorithms.  That is not directly of concern to the selection of a hash function.  However, if we take this threat as legitimate, quantum computers reduce the strength of symmetric crypto, both encryption and hash functions, by 1/2.  So, if this is the direction that the crypto community ultimately goes in, 512bit hashes will be seen as standard over the next decade or so.  I don't think that this should be involved in this discussion, presently.   I'm just saying that not unlike the time when SHA1 was selected, I think that the replacement of a 256bit hash is on the horizon as well.

Thanks,
Dan Shumow

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Hash algorithm analysis
  2018-07-30 20:01                 ` Dan Shumow
@ 2018-08-03  2:57                   ` Jonathan Nieder
  2018-09-18 15:18                   ` Joan Daemen
  1 sibling, 0 replies; 66+ messages in thread
From: Jonathan Nieder @ 2018-08-03  2:57 UTC (permalink / raw)
  To: Dan Shumow
  Cc: Johannes Schindelin, brian m. carlson, Junio C Hamano,
	Linus Torvalds, Edward Thomson, Git Mailing List, demerphq,
	Adam Langley, keccak@noekeon.org

Hi Dan,

Dan Shumow wrote:

[replying out of order for convenience]
> However, I agree with Adam Langley that basically all of the
> finalists for a hash function replacement are about the same for the
> security needs of Git.  I think that, for this community, other
> software engineering considerations should be more important to the
> selection process.

Thanks for this clarification, which provides some useful context to
your opinion that was previously relayed by Dscho.

[...]
> So, as one of the coauthors of the SHA-1 collision detection code, I
> just wanted to chime in and say I'm glad to see the move to a longer
> hash function.  Though, as a cryptographer, I have a few thoughts on
> the matter that I thought I would share.
>
> I think that moving to SHA256 is a fine change, and I support it.

More generally, thanks for weighing in and for explaining your
rationale.  Even (especially) having already made the decision, it's
comforting to hear a qualified person endorsing that choice.

Sincerely,
Jonathan

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v2 2/2] doc hash-function-transition: pick SHA-256 as NewHash
  2018-07-26 13:41                     ` [PATCH v2 " Ævar Arnfjörð Bjarmason
@ 2018-08-03  7:20                       ` Jonathan Nieder
  2018-08-03 16:40                         ` Junio C Hamano
                                           ` (3 more replies)
  0 siblings, 4 replies; 66+ messages in thread
From: Jonathan Nieder @ 2018-08-03  7:20 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Junio C Hamano, Linus Torvalds, Edward Thomson,
	brian m . carlson, Johannes Schindelin, demerphq, Adam Langley,
	keccak

Hi again,

Sorry for the slow review.  I finally got a chance to look this over
again.

My main nits are about the commit message: I think it still focuses
too much on the process instead of the usual "knowing what I know now,
here's the clearest explanation for why we need this patch" approach.
I can send a patch illustrating what I mean tomorrow morning.

Ævar Arnfjörð Bjarmason wrote:

> From a security perspective, it seems that SHA-256, BLAKE2, SHA3-256,
> K12, and so on are all believed to have similar security properties.
> All are good options from a security point of view.
>
> SHA-256 has a number of advantages:
>
> * It has been around for a while, is widely used, and is supported by
>   just about every single crypto library (OpenSSL, mbedTLS, CryptoNG,
>   SecureTransport, etc).
>
> * When you compare against SHA1DC, most vectorized SHA-256
>   implementations are indeed faster, even without acceleration.
>
> * If we're doing signatures with OpenPGP (or even, I suppose, CMS),
>   we're going to be using SHA-2, so it doesn't make sense to have our
>   security depend on two separate algorithms when either one of them
>   alone could break the security when we could just depend on one.
>
> So SHA-256 it is.

The above is what I wrote, so of course I'd like it. ;-)

>                    See the "Hash algorithm analysis" thread as of
> [1]. Linus has come around to this choice and suggested Junio make the
> final pick, and he's endorsed SHA-256 [3].

The above paragraph has the same problem as before of (1) not being
self-contained and (2) focusing on the process that led to this patch
instead of the benefit of the patch itself.  I think we should omit it.

> This follow-up change changes occurrences of "NewHash" to
> "SHA-256" (or "sha256", depending on the context). The "Selection of a
> New Hash" section has also been changed to note that historically we
> used the the "NewHash" name while we didn't know what the new hash
> function would be.

nit: Commit messages are usually in the imperative tense.  This is in
the past tense, I think because it is a continuation of that
discussion about process.

For this part, I think we can let the patch speak for itself.

> This leaves no use of "NewHash" anywhere in git.git except in the
> aforementioned section (and as a variable name in t/t9700/test.pl, but
> that use from 2008 has nothing to do with this transition plan).

This part is helpful --- good.

> 1. https://public-inbox.org/git/20180720215220.GB18502@genre.crustytoothpaste.net/
> 2. https://public-inbox.org/git/CA+55aFwSe9BF8e0hLk9pp3FVD5LaVY5GRdsV3fbNtgzekJadyA@mail.gmail.com/
> 3. https://public-inbox.org/git/xmqqzhygwd5o.fsf@gitster-ct.c.googlers.com/

Footnotes to the historical part --- I'd recommend removing these.

> Helped-by: Jonathan Nieder <jrnieder@gmail.com>
> Helped-by: Junio C Hamano <gitster@pobox.com>
> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>

Here I'd want to put a pile of acks, e.g.:

 Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
 Acked-by: brian m. carlson <sandals@crustytoothpaste.net>
 Acked-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
 Acked-by: Dan Shumow <danshu@microsoft.com>
 Acked-by: Junio C Hamano <gitster@pobox.com>

[...]
> --- a/Documentation/technical/hash-function-transition.txt
> +++ b/Documentation/technical/hash-function-transition.txt
> @@ -59,14 +59,11 @@ that are believed to be cryptographically secure.
>  
>  Goals
>  -----
> -Where NewHash is a strong 256-bit hash function to replace SHA-1 (see
> -"Selection of a New Hash", below):
> -
> -1. The transition to NewHash can be done one local repository at a time.
> +1. The transition to SHA-256 can be done one local repository at a time.

Yay!

[...]
>  	[extensions]
> -		objectFormat = newhash
> +		objectFormat = sha256
>  		compatObjectFormat = sha1

Yes, makes sense.

[...]
> @@ -155,36 +152,36 @@ repository extensions.
>  Object names
>  ~~~~~~~~~~~~
>  Objects can be named by their 40 hexadecimal digit sha1-name or 64
> -hexadecimal digit newhash-name, plus names derived from those (see
> +hexadecimal digit sha256-name, plus names derived from those (see
>  gitrevisions(7)).
>  
>  The sha1-name of an object is the SHA-1 of the concatenation of its
>  type, length, a nul byte, and the object's sha1-content. This is the
>  traditional <sha1> used in Git to name objects.
>  
> -The newhash-name of an object is the NewHash of the concatenation of its
> -type, length, a nul byte, and the object's newhash-content.
> +The sha256-name of an object is the SHA-256 of the concatenation of its
> +type, length, a nul byte, and the object's sha256-content.

Sensible.

[...]
>  
>  Object format
>  ~~~~~~~~~~~~~
>  The content as a byte sequence of a tag, commit, or tree object named
> -by sha1 and newhash differ because an object named by newhash-name refers to
> +by sha1 and sha256 differ because an object named by sha256-name refers to

Not about this patch: this should say SHA-1 and SHA-256, I think.
Leaving it as is in this patch as you did is the right thing.

[...]
> @@ -255,10 +252,10 @@ network byte order):
>    up to and not including the table of CRC32 values.
>  - Zero or more NUL bytes.
>  - The trailer consists of the following:
> -  - A copy of the 20-byte NewHash checksum at the end of the
> +  - A copy of the 20-byte SHA-256 checksum at the end of the

Not about this patch: a SHA-256 is 32 bytes.  Leaving that for a
separate patch as you did is the right thing, though.

[...]
> -  - 20-byte NewHash checksum of all of the above.
> +  - 20-byte SHA-256 checksum of all of the above.

Likewise.

[...]
> @@ -351,12 +348,12 @@ the following steps:
>     (This list only contains objects reachable from the "wants". If the
>     pack from the server contained additional extraneous objects, then
>     they will be discarded.)
> -3. convert to newhash: open a new (newhash) packfile. Read the topologically
> +3. convert to sha256: open a new (sha256) packfile. Read the topologically

Not about this patch: this one's more murky, since it's talking about
the object names instead of the hash function.  I guess "sha256"
instead of "SHA-256" for this could be right, but I worry it's going
to take time for me to figure out the exact distinction.  That seems
like a reason to just call it SHA-256 (but in a separate patch).

[...]
> -   sha1-content, convert to newhash-content, and write it to the newhash
> -   pack. Record the new sha1<->newhash mapping entry for use in the idx.
> +   sha1-content, convert to sha256-content, and write it to the sha256
> +   pack. Record the new sha1<->sha256 mapping entry for use in the idx.
>  4. sort: reorder entries in the new pack to match the order of objects
> -   in the pack the server generated and include blobs. Write a newhash idx
> +   in the pack the server generated and include blobs. Write a sha256 idx
>     file

Likewise.

[...]
> @@ -388,16 +385,16 @@ send-pack.
>  
>  Signed Commits
>  ~~~~~~~~~~~~~~
> -We add a new field "gpgsig-newhash" to the commit object format to allow
> +We add a new field "gpgsig-sha256" to the commit object format to allow
>  signing commits without relying on SHA-1. It is similar to the
> -existing "gpgsig" field. Its signed payload is the newhash-content of the
> -commit object with any "gpgsig" and "gpgsig-newhash" fields removed.
> +existing "gpgsig" field. Its signed payload is the sha256-content of the
> +commit object with any "gpgsig" and "gpgsig-sha256" fields removed.

That reminds me --- we need to add support for stripping these out.

[...]
> @@ -601,18 +598,22 @@ The user can also explicitly specify which format to use for a
>  particular revision specifier and for output, overriding the mode. For
>  example:
>  
> -git --output-format=sha1 log abac87a^{sha1}..f787cac^{newhash}
> +git --output-format=sha1 log abac87a^{sha1}..f787cac^{sha256}
>  
> -Selection of a New Hash
> ------------------------
> +Choice of Hash
> +--------------

Yay!

[...]
> -Some hashes under consideration are SHA-256, SHA-512/256, SHA-256x16,
> -K12, and BLAKE2bp-256.
> +We choose SHA-256. See the thread starting at
> +<20180609224913.GC38834@genre.crustytoothpaste.net> for the discussion
> +(https://public-inbox.org/git/20180609224913.GC38834@genre.crustytoothpaste.net/)

Can this reference be moved to a footnote?  It's not part of the
design, but it's a good reference.

Thanks again for getting this documented.

Sincerely,
Jonathan

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v2 2/2] doc hash-function-transition: pick SHA-256 as NewHash
  2018-08-03  7:20                       ` Jonathan Nieder
@ 2018-08-03 16:40                         ` Junio C Hamano
  2018-08-03 17:01                           ` Linus Torvalds
  2018-08-03 16:42                         ` Linus Torvalds
                                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 66+ messages in thread
From: Junio C Hamano @ 2018-08-03 16:40 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Ævar Arnfjörð Bjarmason, git, Linus Torvalds,
	Edward Thomson, brian m . carlson, Johannes Schindelin, demerphq,
	Adam Langley, keccak

Jonathan Nieder <jrnieder@gmail.com> writes:

> Sorry for the slow review.  I finally got a chance to look this over
> again.
>
> My main nits are about the commit message: I think it still focuses
> too much on the process instead of the usual "knowing what I know now,
> here's the clearest explanation for why we need this patch" approach.
> I can send a patch illustrating what I mean tomorrow morning.

I'll turn 'Will merge to next' to 'Hold' for now.

>>  Object format
>>  ~~~~~~~~~~~~~
>>  The content as a byte sequence of a tag, commit, or tree object named
>> -by sha1 and newhash differ because an object named by newhash-name refers to
>> +by sha1 and sha256 differ because an object named by sha256-name refers to
>
> Not about this patch: this should say SHA-1 and SHA-256, I think.
> Leaving it as is in this patch as you did is the right thing.

Perhaps deserves a "NEEDSWORK" comment, though.

> [...]
>> @@ -255,10 +252,10 @@ network byte order):
>>    up to and not including the table of CRC32 values.
>>  - Zero or more NUL bytes.
>>  - The trailer consists of the following:
>> -  - A copy of the 20-byte NewHash checksum at the end of the
>> +  - A copy of the 20-byte SHA-256 checksum at the end of the
>
> Not about this patch: a SHA-256 is 32 bytes.  Leaving that for a
> separate patch as you did is the right thing, though.
>
> [...]
>> -  - 20-byte NewHash checksum of all of the above.
>> +  - 20-byte SHA-256 checksum of all of the above.
>
> Likewise.

Hmph, I've always assumed since NewHash plan was written that this
part was not about tamper resistance but was about bit-flipping
detection and it was deliberate to stick to 20-byte sum, truncating
as necessary.

It definitely is a good idea to leave it for a separate patch to
update this part.

Thanks.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v2 2/2] doc hash-function-transition: pick SHA-256 as NewHash
  2018-08-03  7:20                       ` Jonathan Nieder
  2018-08-03 16:40                         ` Junio C Hamano
@ 2018-08-03 16:42                         ` Linus Torvalds
  2018-08-03 17:43                         ` Ævar Arnfjörð Bjarmason
  2018-08-03 17:45                         ` brian m. carlson
  3 siblings, 0 replies; 66+ messages in thread
From: Linus Torvalds @ 2018-08-03 16:42 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Ævar Arnfjörð Bjarmason, Git Mailing List,
	Junio C Hamano, Edward Thomson, brian m. carlson,
	Johannes Schindelin, demerphq, Adam Langley, keccak

On Fri, Aug 3, 2018 at 12:20 AM Jonathan Nieder <jrnieder@gmail.com> wrote:
>
>
> Here I'd want to put a pile of acks, e.g.:
>
>  Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
> ..

Sure, feel free to add my Ack for this.

                       Linus

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v2 2/2] doc hash-function-transition: pick SHA-256 as NewHash
  2018-08-03 16:40                         ` Junio C Hamano
@ 2018-08-03 17:01                           ` Linus Torvalds
  0 siblings, 0 replies; 66+ messages in thread
From: Linus Torvalds @ 2018-08-03 17:01 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jonathan Nieder, Ævar Arnfjörð Bjarmason,
	Git Mailing List, Edward Thomson, brian m. carlson,
	Johannes Schindelin, demerphq, Adam Langley, keccak

On Fri, Aug 3, 2018 at 9:40 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> > [...]
> >> -  - 20-byte NewHash checksum of all of the above.
> >> +  - 20-byte SHA-256 checksum of all of the above.
> >
> > Likewise.
>
> Hmph, I've always assumed since NewHash plan was written that this
> part was not about tamper resistance but was about bit-flipping
> detection and it was deliberate to stick to 20-byte sum, truncating
> as necessary.

Yeah, in fact, since this was one area where people actually had
performance issues with the hash, it might be worth considering
_weakening_ the hash.

Things like the pack index files (and just the regular file index) are
entirely local, and the SHA1 in those is really just a fancy CRC. It's
really just "good protection against disk corruption" (it happens),
and in fact you cannot use it as protection against active tampering,
since there's no secret there and any active attacker that has access
to your local filesystem could just recompute the hash anyway.

And because they are local anyway and aren't really transported
(modulo "shared repositories" where you use them across users or
legacy rsync-like synchronization), they can be handled separately
from any hashing changes. The pack and index file formats have in fact
been changed before.

It might make sense to either keep it as SHA1 (just to minimize any
changes) or if there are still issues with index file performance it
could even be made to use something fast-but-not-cryptographic like
just making it use crc32().

Remember: one of the original core git design requirements was
"corruption detection".

For normal hashed objects, that came naturally, and the normal object
store additionally has active tamper protection thanks to the
interconnected nature of the hashes and the distribution of the
objects.

But for things like packfiles and the file index, it is just a
separate checksum. There is no tamper protection anyway, since anybody
who can modify them directly can just recompute the hash checksum.

The fact that both of these things used SHA1 was more of a convenience
issue than anything else, and the pack/index file checksum is
fundamentally not cryptographic (but a cryptographic hash is obviously
by definition also a very good corruption detector).

               Linus

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v2 2/2] doc hash-function-transition: pick SHA-256 as NewHash
  2018-08-03  7:20                       ` Jonathan Nieder
  2018-08-03 16:40                         ` Junio C Hamano
  2018-08-03 16:42                         ` Linus Torvalds
@ 2018-08-03 17:43                         ` Ævar Arnfjörð Bjarmason
  2018-08-04  8:52                           ` Jonathan Nieder
  2018-08-03 17:45                         ` brian m. carlson
  3 siblings, 1 reply; 66+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2018-08-03 17:43 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: git, Junio C Hamano, Linus Torvalds, Edward Thomson,
	brian m . carlson, Johannes Schindelin, demerphq, Adam Langley,
	keccak


On Fri, Aug 03 2018, Jonathan Nieder wrote:

> Hi again,
>
> Sorry for the slow review.  I finally got a chance to look this over
> again.
>
> My main nits are about the commit message: I think it still focuses
> too much on the process instead of the usual "knowing what I know now,
> here's the clearest explanation for why we need this patch" approach.
> I can send a patch illustrating what I mean tomorrow morning.

I think it makes if you just take over 2/2 of this series (or even the
whole thing), since the meat of it is already something I copy/pasted
from you, and you've got more of an opinion / idea about how to proceed
(which is good!); it's more efficient than me trying to fix various
stuff you're pointing out at this point, I also think it makes sense
that you change the "Author" line for that, since the rest of the
changes will mainly be search-replace by me.

Perhaps it's better for readability if those search-replace changes go
into their own change, i.e. make it a three-part where 2/3 does the
search-replace, and promises that 3/3 has the full rationale etc.

> Ævar Arnfjörð Bjarmason wrote:
>
>> From a security perspective, it seems that SHA-256, BLAKE2, SHA3-256,
>> K12, and so on are all believed to have similar security properties.
>> All are good options from a security point of view.
>>
>> SHA-256 has a number of advantages:
>>
>> * It has been around for a while, is widely used, and is supported by
>>   just about every single crypto library (OpenSSL, mbedTLS, CryptoNG,
>>   SecureTransport, etc).
>>
>> * When you compare against SHA1DC, most vectorized SHA-256
>>   implementations are indeed faster, even without acceleration.
>>
>> * If we're doing signatures with OpenPGP (or even, I suppose, CMS),
>>   we're going to be using SHA-2, so it doesn't make sense to have our
>>   security depend on two separate algorithms when either one of them
>>   alone could break the security when we could just depend on one.
>>
>> So SHA-256 it is.
>
> The above is what I wrote, so of course I'd like it. ;-)
>
>>                    See the "Hash algorithm analysis" thread as of
>> [1]. Linus has come around to this choice and suggested Junio make the
>> final pick, and he's endorsed SHA-256 [3].
>
> The above paragraph has the same problem as before of (1) not being
> self-contained and (2) focusing on the process that led to this patch
> instead of the benefit of the patch itself.  I think we should omit it.
>
>> This follow-up change changes occurrences of "NewHash" to
>> "SHA-256" (or "sha256", depending on the context). The "Selection of a
>> New Hash" section has also been changed to note that historically we
>> used the the "NewHash" name while we didn't know what the new hash
>> function would be.
>
> nit: Commit messages are usually in the imperative tense.  This is in
> the past tense, I think because it is a continuation of that
> discussion about process.
>
> For this part, I think we can let the patch speak for itself.
>
>> This leaves no use of "NewHash" anywhere in git.git except in the
>> aforementioned section (and as a variable name in t/t9700/test.pl, but
>> that use from 2008 has nothing to do with this transition plan).
>
> This part is helpful --- good.
>
>> 1. https://public-inbox.org/git/20180720215220.GB18502@genre.crustytoothpaste.net/
>> 2. https://public-inbox.org/git/CA+55aFwSe9BF8e0hLk9pp3FVD5LaVY5GRdsV3fbNtgzekJadyA@mail.gmail.com/
>> 3. https://public-inbox.org/git/xmqqzhygwd5o.fsf@gitster-ct.c.googlers.com/
>
> Footnotes to the historical part --- I'd recommend removing these.
>
>> Helped-by: Jonathan Nieder <jrnieder@gmail.com>
>> Helped-by: Junio C Hamano <gitster@pobox.com>
>> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
>
> Here I'd want to put a pile of acks, e.g.:
>
>  Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
>  Acked-by: brian m. carlson <sandals@crustytoothpaste.net>
>  Acked-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
>  Acked-by: Dan Shumow <danshu@microsoft.com>
>  Acked-by: Junio C Hamano <gitster@pobox.com>
>
> [...]
>> --- a/Documentation/technical/hash-function-transition.txt
>> +++ b/Documentation/technical/hash-function-transition.txt
>> @@ -59,14 +59,11 @@ that are believed to be cryptographically secure.
>>
>>  Goals
>>  -----
>> -Where NewHash is a strong 256-bit hash function to replace SHA-1 (see
>> -"Selection of a New Hash", below):
>> -
>> -1. The transition to NewHash can be done one local repository at a time.
>> +1. The transition to SHA-256 can be done one local repository at a time.
>
> Yay!
>
> [...]
>>  	[extensions]
>> -		objectFormat = newhash
>> +		objectFormat = sha256
>>  		compatObjectFormat = sha1
>
> Yes, makes sense.
>
> [...]
>> @@ -155,36 +152,36 @@ repository extensions.
>>  Object names
>>  ~~~~~~~~~~~~
>>  Objects can be named by their 40 hexadecimal digit sha1-name or 64
>> -hexadecimal digit newhash-name, plus names derived from those (see
>> +hexadecimal digit sha256-name, plus names derived from those (see
>>  gitrevisions(7)).
>>
>>  The sha1-name of an object is the SHA-1 of the concatenation of its
>>  type, length, a nul byte, and the object's sha1-content. This is the
>>  traditional <sha1> used in Git to name objects.
>>
>> -The newhash-name of an object is the NewHash of the concatenation of its
>> -type, length, a nul byte, and the object's newhash-content.
>> +The sha256-name of an object is the SHA-256 of the concatenation of its
>> +type, length, a nul byte, and the object's sha256-content.
>
> Sensible.
>
> [...]
>>
>>  Object format
>>  ~~~~~~~~~~~~~
>>  The content as a byte sequence of a tag, commit, or tree object named
>> -by sha1 and newhash differ because an object named by newhash-name refers to
>> +by sha1 and sha256 differ because an object named by sha256-name refers to
>
> Not about this patch: this should say SHA-1 and SHA-256, I think.
> Leaving it as is in this patch as you did is the right thing.
>
> [...]
>> @@ -255,10 +252,10 @@ network byte order):
>>    up to and not including the table of CRC32 values.
>>  - Zero or more NUL bytes.
>>  - The trailer consists of the following:
>> -  - A copy of the 20-byte NewHash checksum at the end of the
>> +  - A copy of the 20-byte SHA-256 checksum at the end of the
>
> Not about this patch: a SHA-256 is 32 bytes.  Leaving that for a
> separate patch as you did is the right thing, though.
>
> [...]
>> -  - 20-byte NewHash checksum of all of the above.
>> +  - 20-byte SHA-256 checksum of all of the above.
>
> Likewise.
>
> [...]
>> @@ -351,12 +348,12 @@ the following steps:
>>     (This list only contains objects reachable from the "wants". If the
>>     pack from the server contained additional extraneous objects, then
>>     they will be discarded.)
>> -3. convert to newhash: open a new (newhash) packfile. Read the topologically
>> +3. convert to sha256: open a new (sha256) packfile. Read the topologically
>
> Not about this patch: this one's more murky, since it's talking about
> the object names instead of the hash function.  I guess "sha256"
> instead of "SHA-256" for this could be right, but I worry it's going
> to take time for me to figure out the exact distinction.  That seems
> like a reason to just call it SHA-256 (but in a separate patch).
>
> [...]
>> -   sha1-content, convert to newhash-content, and write it to the newhash
>> -   pack. Record the new sha1<->newhash mapping entry for use in the idx.
>> +   sha1-content, convert to sha256-content, and write it to the sha256
>> +   pack. Record the new sha1<->sha256 mapping entry for use in the idx.
>>  4. sort: reorder entries in the new pack to match the order of objects
>> -   in the pack the server generated and include blobs. Write a newhash idx
>> +   in the pack the server generated and include blobs. Write a sha256 idx
>>     file
>
> Likewise.
>
> [...]
>> @@ -388,16 +385,16 @@ send-pack.
>>
>>  Signed Commits
>>  ~~~~~~~~~~~~~~
>> -We add a new field "gpgsig-newhash" to the commit object format to allow
>> +We add a new field "gpgsig-sha256" to the commit object format to allow
>>  signing commits without relying on SHA-1. It is similar to the
>> -existing "gpgsig" field. Its signed payload is the newhash-content of the
>> -commit object with any "gpgsig" and "gpgsig-newhash" fields removed.
>> +existing "gpgsig" field. Its signed payload is the sha256-content of the
>> +commit object with any "gpgsig" and "gpgsig-sha256" fields removed.
>
> That reminds me --- we need to add support for stripping these out.
>
> [...]
>> @@ -601,18 +598,22 @@ The user can also explicitly specify which format to use for a
>>  particular revision specifier and for output, overriding the mode. For
>>  example:
>>
>> -git --output-format=sha1 log abac87a^{sha1}..f787cac^{newhash}
>> +git --output-format=sha1 log abac87a^{sha1}..f787cac^{sha256}
>>
>> -Selection of a New Hash
>> ------------------------
>> +Choice of Hash
>> +--------------
>
> Yay!
>
> [...]
>> -Some hashes under consideration are SHA-256, SHA-512/256, SHA-256x16,
>> -K12, and BLAKE2bp-256.
>> +We choose SHA-256. See the thread starting at
>> +<20180609224913.GC38834@genre.crustytoothpaste.net> for the discussion
>> +(https://public-inbox.org/git/20180609224913.GC38834@genre.crustytoothpaste.net/)
>
> Can this reference be moved to a footnote?  It's not part of the
> design, but it's a good reference.
>
> Thanks again for getting this documented.
>
> Sincerely,
> Jonathan

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v2 2/2] doc hash-function-transition: pick SHA-256 as NewHash
  2018-08-03  7:20                       ` Jonathan Nieder
                                           ` (2 preceding siblings ...)
  2018-08-03 17:43                         ` Ævar Arnfjörð Bjarmason
@ 2018-08-03 17:45                         ` brian m. carlson
  3 siblings, 0 replies; 66+ messages in thread
From: brian m. carlson @ 2018-08-03 17:45 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Ævar Arnfjörð Bjarmason, git, Junio C Hamano,
	Linus Torvalds, Edward Thomson, Johannes Schindelin, demerphq,
	Adam Langley, keccak

[-- Attachment #1: Type: text/plain, Size: 2397 bytes --]

On Fri, Aug 03, 2018 at 12:20:14AM -0700, Jonathan Nieder wrote:
> Ævar Arnfjörð Bjarmason wrote:
> >  Object format
> >  ~~~~~~~~~~~~~
> >  The content as a byte sequence of a tag, commit, or tree object named
> > -by sha1 and newhash differ because an object named by newhash-name refers to
> > +by sha1 and sha256 differ because an object named by sha256-name refers to
> 
> Not about this patch: this should say SHA-1 and SHA-256, I think.
> Leaving it as is in this patch as you did is the right thing.
> 
> [...]
> > @@ -255,10 +252,10 @@ network byte order):
> >    up to and not including the table of CRC32 values.
> >  - Zero or more NUL bytes.
> >  - The trailer consists of the following:
> > -  - A copy of the 20-byte NewHash checksum at the end of the
> > +  - A copy of the 20-byte SHA-256 checksum at the end of the
> 
> Not about this patch: a SHA-256 is 32 bytes.  Leaving that for a
> separate patch as you did is the right thing, though.
> 
> [...]
> > -  - 20-byte NewHash checksum of all of the above.
> > +  - 20-byte SHA-256 checksum of all of the above.
> 
> Likewise.

For the record, my code for these does use 32 bytes.  I'm fine with this
being a separate patch, though.

> [...]
> > @@ -351,12 +348,12 @@ the following steps:
> >     (This list only contains objects reachable from the "wants". If the
> >     pack from the server contained additional extraneous objects, then
> >     they will be discarded.)
> > -3. convert to newhash: open a new (newhash) packfile. Read the topologically
> > +3. convert to sha256: open a new (sha256) packfile. Read the topologically
> 
> Not about this patch: this one's more murky, since it's talking about
> the object names instead of the hash function.  I guess "sha256"
> instead of "SHA-256" for this could be right, but I worry it's going
> to take time for me to figure out the exact distinction.  That seems
> like a reason to just call it SHA-256 (but in a separate patch).

My assumption has been that when we are referring to the algorithm,
we'll use SHA-1 and SHA-256, and when we're referring to the input to
Git (in config files or in ^{sha256} notation), we use "sha1" and
"sha256".  I see this as analogous to "Git" and "git".

Otherwise, I'm fine with this document as it is.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 867 bytes --]

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH v2 2/2] doc hash-function-transition: pick SHA-256 as NewHash
  2018-08-03 17:43                         ` Ævar Arnfjörð Bjarmason
@ 2018-08-04  8:52                           ` Jonathan Nieder
  0 siblings, 0 replies; 66+ messages in thread
From: Jonathan Nieder @ 2018-08-04  8:52 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Junio C Hamano, Linus Torvalds, Edward Thomson,
	brian m . carlson, Johannes Schindelin, demerphq, Adam Langley,
	keccak

Subject: doc hash-function-transition: pick SHA-256 as NewHash

From a security perspective, it seems that SHA-256, BLAKE2, SHA3-256,
K12, and so on are all believed to have similar security properties.
All are good options from a security point of view.

SHA-256 has a number of advantages:

* It has been around for a while, is widely used, and is supported by
  just about every single crypto library (OpenSSL, mbedTLS, CryptoNG,
  SecureTransport, etc).

* When you compare against SHA1DC, most vectorized SHA-256
  implementations are indeed faster, even without acceleration.

* If we're doing signatures with OpenPGP (or even, I suppose, CMS),
  we're going to be using SHA-2, so it doesn't make sense to have our
  security depend on two separate algorithms when either one of them
  alone could break the security when we could just depend on one.

So SHA-256 it is.  Update the hash-function-transition design doc to
say so.

After this patch, there are no remaining instances of the string
"NewHash", except for an unrelated use from 2008 as a variable name in
t/t9700/test.pl.

Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: brian m. carlson <sandals@crustytoothpaste.net>
Acked-by: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Acked-by: Dan Shumow <danshu@microsoft.com>
Signed-off-by: Jonathan Nieder <jrnieder@gmail.com>
---
Hi,

Ævar Arnfjörð Bjarmason wrote:

> I think it makes if you just take over 2/2 of this series (or even the
> whole thing), since the meat of it is already something I copy/pasted
> from you, and you've got more of an opinion / idea about how to proceed
> (which is good!); it's more efficient than me trying to fix various
> stuff you're pointing out at this point, I also think it makes sense
> that you change the "Author" line for that, since the rest of the
> changes will mainly be search-replace by me.

Fair enough.  Here's that updated patch 2/2.

I'll try to make a more comprehensive set of proposed edits tomorrow,
in a fresh thread (dealing with the cksum-trailer, etc).

Brian, is your latest work in progress available somewhere (e.g. a
branch on https://git.crustytoothpaste.net/git/bmc/git) so I can make
sure any edits I make match well with it?

Thanks,
Jonathan

 .../technical/hash-function-transition.txt    | 196 +++++++++---------
 1 file changed, 98 insertions(+), 98 deletions(-)

diff --git a/Documentation/technical/hash-function-transition.txt b/Documentation/technical/hash-function-transition.txt
index 5ee4754adb..bc2ace2a6e 100644
--- a/Documentation/technical/hash-function-transition.txt
+++ b/Documentation/technical/hash-function-transition.txt
@@ -59,14 +59,11 @@ that are believed to be cryptographically secure.
 
 Goals
 -----
-Where NewHash is a strong 256-bit hash function to replace SHA-1 (see
-"Selection of a New Hash", below):
-
-1. The transition to NewHash can be done one local repository at a time.
+1. The transition to SHA-256 can be done one local repository at a time.
    a. Requiring no action by any other party.
-   b. A NewHash repository can communicate with SHA-1 Git servers
+   b. A SHA-256 repository can communicate with SHA-1 Git servers
       (push/fetch).
-   c. Users can use SHA-1 and NewHash identifiers for objects
+   c. Users can use SHA-1 and SHA-256 identifiers for objects
       interchangeably (see "Object names on the command line", below).
    d. New signed objects make use of a stronger hash function than
       SHA-1 for their security guarantees.
@@ -79,7 +76,7 @@ Where NewHash is a strong 256-bit hash function to replace SHA-1 (see
 
 Non-Goals
 ---------
-1. Add NewHash support to Git protocol. This is valuable and the
+1. Add SHA-256 support to Git protocol. This is valuable and the
    logical next step but it is out of scope for this initial design.
 2. Transparently improving the security of existing SHA-1 signed
    objects.
@@ -87,26 +84,26 @@ Non-Goals
    repository.
 4. Taking the opportunity to fix other bugs in Git's formats and
    protocols.
-5. Shallow clones and fetches into a NewHash repository. (This will
-   change when we add NewHash support to Git protocol.)
-6. Skip fetching some submodules of a project into a NewHash
-   repository. (This also depends on NewHash support in Git
+5. Shallow clones and fetches into a SHA-256 repository. (This will
+   change when we add SHA-256 support to Git protocol.)
+6. Skip fetching some submodules of a project into a SHA-256
+   repository. (This also depends on SHA-256 support in Git
    protocol.)
 
 Overview
 --------
 We introduce a new repository format extension. Repositories with this
-extension enabled use NewHash instead of SHA-1 to name their objects.
+extension enabled use SHA-256 instead of SHA-1 to name their objects.
 This affects both object names and object content --- both the names
 of objects and all references to other objects within an object are
 switched to the new hash function.
 
-NewHash repositories cannot be read by older versions of Git.
+SHA-256 repositories cannot be read by older versions of Git.
 
-Alongside the packfile, a NewHash repository stores a bidirectional
-mapping between NewHash and SHA-1 object names. The mapping is generated
+Alongside the packfile, a SHA-256 repository stores a bidirectional
+mapping between SHA-256 and SHA-1 object names. The mapping is generated
 locally and can be verified using "git fsck". Object lookups use this
-mapping to allow naming objects using either their SHA-1 and NewHash names
+mapping to allow naming objects using either their SHA-1 and SHA-256 names
 interchangeably.
 
 "git cat-file" and "git hash-object" gain options to display an object
@@ -116,7 +113,7 @@ object database so that they can be named using the appropriate name
 (using the bidirectional hash mapping).
 
 Fetches from a SHA-1 based server convert the fetched objects into
-NewHash form and record the mapping in the bidirectional mapping table
+SHA-256 form and record the mapping in the bidirectional mapping table
 (see below for details). Pushes to a SHA-1 based server convert the
 objects being pushed into sha1 form so the server does not have to be
 aware of the hash function the client is using.
@@ -125,19 +122,19 @@ Detailed Design
 ---------------
 Repository format extension
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~
-A NewHash repository uses repository format version `1` (see
+A SHA-256 repository uses repository format version `1` (see
 Documentation/technical/repository-version.txt) with extensions
 `objectFormat` and `compatObjectFormat`:
 
 	[core]
 		repositoryFormatVersion = 1
 	[extensions]
-		objectFormat = newhash
+		objectFormat = sha256
 		compatObjectFormat = sha1
 
 The combination of setting `core.repositoryFormatVersion=1` and
 populating `extensions.*` ensures that all versions of Git later than
-`v0.99.9l` will die instead of trying to operate on the NewHash
+`v0.99.9l` will die instead of trying to operate on the SHA-256
 repository, instead producing an error message.
 
 	# Between v0.99.9l and v2.7.0
@@ -155,36 +152,36 @@ repository extensions.
 Object names
 ~~~~~~~~~~~~
 Objects can be named by their 40 hexadecimal digit sha1-name or 64
-hexadecimal digit newhash-name, plus names derived from those (see
+hexadecimal digit sha256-name, plus names derived from those (see
 gitrevisions(7)).
 
 The sha1-name of an object is the SHA-1 of the concatenation of its
 type, length, a nul byte, and the object's sha1-content. This is the
 traditional <sha1> used in Git to name objects.
 
-The newhash-name of an object is the NewHash of the concatenation of its
-type, length, a nul byte, and the object's newhash-content.
+The sha256-name of an object is the SHA-256 of the concatenation of its
+type, length, a nul byte, and the object's sha256-content.
 
 Object format
 ~~~~~~~~~~~~~
 The content as a byte sequence of a tag, commit, or tree object named
-by sha1 and newhash differ because an object named by newhash-name refers to
-other objects by their newhash-names and an object named by sha1-name
+by sha1 and sha256 differ because an object named by sha256-name refers to
+other objects by their sha256-names and an object named by sha1-name
 refers to other objects by their sha1-names.
 
-The newhash-content of an object is the same as its sha1-content, except
-that objects referenced by the object are named using their newhash-names
+The sha256-content of an object is the same as its sha1-content, except
+that objects referenced by the object are named using their sha256-names
 instead of sha1-names. Because a blob object does not refer to any
-other object, its sha1-content and newhash-content are the same.
+other object, its sha1-content and sha256-content are the same.
 
-The format allows round-trip conversion between newhash-content and
+The format allows round-trip conversion between sha256-content and
 sha1-content.
 
 Object storage
 ~~~~~~~~~~~~~~
 Loose objects use zlib compression and packed objects use the packed
 format described in Documentation/technical/pack-format.txt, just like
-today. The content that is compressed and stored uses newhash-content
+today. The content that is compressed and stored uses sha256-content
 instead of sha1-content.
 
 Pack index
@@ -255,10 +252,10 @@ network byte order):
   up to and not including the table of CRC32 values.
 - Zero or more NUL bytes.
 - The trailer consists of the following:
-  - A copy of the 20-byte NewHash checksum at the end of the
+  - A copy of the 20-byte SHA-256 checksum at the end of the
     corresponding packfile.
 
-  - 20-byte NewHash checksum of all of the above.
+  - 20-byte SHA-256 checksum of all of the above.
 
 Loose object index
 ~~~~~~~~~~~~~~~~~~
@@ -266,7 +263,7 @@ A new file $GIT_OBJECT_DIR/loose-object-idx contains information about
 all loose objects. Its format is
 
   # loose-object-idx
-  (newhash-name SP sha1-name LF)*
+  (sha256-name SP sha1-name LF)*
 
 where the object names are in hexadecimal format. The file is not
 sorted.
@@ -292,8 +289,8 @@ To remove entries (e.g. in "git pack-refs" or "git-prune"):
 Translation table
 ~~~~~~~~~~~~~~~~~
 The index files support a bidirectional mapping between sha1-names
-and newhash-names. The lookup proceeds similarly to ordinary object
-lookups. For example, to convert a sha1-name to a newhash-name:
+and sha256-names. The lookup proceeds similarly to ordinary object
+lookups. For example, to convert a sha1-name to a sha256-name:
 
  1. Look for the object in idx files. If a match is present in the
     idx's sorted list of truncated sha1-names, then:
@@ -301,8 +298,8 @@ lookups. For example, to convert a sha1-name to a newhash-name:
        name order mapping.
     b. Read the corresponding entry in the full sha1-name table to
        verify we found the right object. If it is, then
-    c. Read the corresponding entry in the full newhash-name table.
-       That is the object's newhash-name.
+    c. Read the corresponding entry in the full sha256-name table.
+       That is the object's sha256-name.
  2. Check for a loose object. Read lines from loose-object-idx until
     we find a match.
 
@@ -318,25 +315,25 @@ for all objects in the object store.
 
 Reading an object's sha1-content
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-The sha1-content of an object can be read by converting all newhash-names
-its newhash-content references to sha1-names using the translation table.
+The sha1-content of an object can be read by converting all sha256-names
+its sha256-content references to sha1-names using the translation table.
 
 Fetch
 ~~~~~
 Fetching from a SHA-1 based server requires translating between SHA-1
-and NewHash based representations on the fly.
+and SHA-256 based representations on the fly.
 
 SHA-1s named in the ref advertisement that are present on the client
-can be translated to NewHash and looked up as local objects using the
+can be translated to SHA-256 and looked up as local objects using the
 translation table.
 
 Negotiation proceeds as today. Any "have"s generated locally are
 converted to SHA-1 before being sent to the server, and SHA-1s
-mentioned by the server are converted to NewHash when looking them up
+mentioned by the server are converted to SHA-256 when looking them up
 locally.
 
 After negotiation, the server sends a packfile containing the
-requested objects. We convert the packfile to NewHash format using
+requested objects. We convert the packfile to SHA-256 format using
 the following steps:
 
 1. index-pack: inflate each object in the packfile and compute its
@@ -351,12 +348,12 @@ the following steps:
    (This list only contains objects reachable from the "wants". If the
    pack from the server contained additional extraneous objects, then
    they will be discarded.)
-3. convert to newhash: open a new (newhash) packfile. Read the topologically
+3. convert to sha256: open a new (sha256) packfile. Read the topologically
    sorted list just generated. For each object, inflate its
-   sha1-content, convert to newhash-content, and write it to the newhash
-   pack. Record the new sha1<->newhash mapping entry for use in the idx.
+   sha1-content, convert to sha256-content, and write it to the sha256
+   pack. Record the new sha1<->sha256 mapping entry for use in the idx.
 4. sort: reorder entries in the new pack to match the order of objects
-   in the pack the server generated and include blobs. Write a newhash idx
+   in the pack the server generated and include blobs. Write a sha256 idx
    file
 5. clean up: remove the SHA-1 based pack file, index, and
    topologically sorted list obtained from the server in steps 1
@@ -388,16 +385,16 @@ send-pack.
 
 Signed Commits
 ~~~~~~~~~~~~~~
-We add a new field "gpgsig-newhash" to the commit object format to allow
+We add a new field "gpgsig-sha256" to the commit object format to allow
 signing commits without relying on SHA-1. It is similar to the
-existing "gpgsig" field. Its signed payload is the newhash-content of the
-commit object with any "gpgsig" and "gpgsig-newhash" fields removed.
+existing "gpgsig" field. Its signed payload is the sha256-content of the
+commit object with any "gpgsig" and "gpgsig-sha256" fields removed.
 
 This means commits can be signed
 1. using SHA-1 only, as in existing signed commit objects
-2. using both SHA-1 and NewHash, by using both gpgsig-newhash and gpgsig
+2. using both SHA-1 and SHA-256, by using both gpgsig-sha256 and gpgsig
    fields.
-3. using only NewHash, by only using the gpgsig-newhash field.
+3. using only SHA-256, by only using the gpgsig-sha256 field.
 
 Old versions of "git verify-commit" can verify the gpgsig signature in
 cases (1) and (2) without modifications and view case (3) as an
@@ -405,24 +402,24 @@ ordinary unsigned commit.
 
 Signed Tags
 ~~~~~~~~~~~
-We add a new field "gpgsig-newhash" to the tag object format to allow
+We add a new field "gpgsig-sha256" to the tag object format to allow
 signing tags without relying on SHA-1. Its signed payload is the
-newhash-content of the tag with its gpgsig-newhash field and "-----BEGIN PGP
+sha256-content of the tag with its gpgsig-sha256 field and "-----BEGIN PGP
 SIGNATURE-----" delimited in-body signature removed.
 
 This means tags can be signed
 1. using SHA-1 only, as in existing signed tag objects
-2. using both SHA-1 and NewHash, by using gpgsig-newhash and an in-body
+2. using both SHA-1 and SHA-256, by using gpgsig-sha256 and an in-body
    signature.
-3. using only NewHash, by only using the gpgsig-newhash field.
+3. using only SHA-256, by only using the gpgsig-sha256 field.
 
 Mergetag embedding
 ~~~~~~~~~~~~~~~~~~
 The mergetag field in the sha1-content of a commit contains the
 sha1-content of a tag that was merged by that commit.
 
-The mergetag field in the newhash-content of the same commit contains the
-newhash-content of the same tag.
+The mergetag field in the sha256-content of the same commit contains the
+sha256-content of the same tag.
 
 Submodules
 ~~~~~~~~~~
@@ -497,7 +494,7 @@ Caveats
 -------
 Invalid objects
 ~~~~~~~~~~~~~~~
-The conversion from sha1-content to newhash-content retains any
+The conversion from sha1-content to sha256-content retains any
 brokenness in the original object (e.g., tree entry modes encoded with
 leading 0, tree objects whose paths are not sorted correctly, and
 commit objects without an author or committer). This is a deliberate
@@ -516,7 +513,7 @@ allow lifting this restriction.
 
 Alternates
 ~~~~~~~~~~
-For the same reason, a newhash repository cannot borrow objects from a
+For the same reason, a sha256 repository cannot borrow objects from a
 sha1 repository using objects/info/alternates or
 $GIT_ALTERNATE_OBJECT_REPOSITORIES.
 
@@ -524,20 +521,20 @@ git notes
 ~~~~~~~~~
 The "git notes" tool annotates objects using their sha1-name as key.
 This design does not describe a way to migrate notes trees to use
-newhash-names. That migration is expected to happen separately (for
+sha256-names. That migration is expected to happen separately (for
 example using a file at the root of the notes tree to describe which
 hash it uses).
 
 Server-side cost
 ~~~~~~~~~~~~~~~~
-Until Git protocol gains NewHash support, using NewHash based storage
+Until Git protocol gains SHA-256 support, using SHA-256 based storage
 on public-facing Git servers is strongly discouraged. Once Git
-protocol gains NewHash support, NewHash based servers are likely not
+protocol gains SHA-256 support, SHA-256 based servers are likely not
 to support SHA-1 compatibility, to avoid what may be a very expensive
 hash reencode during clone and to encourage peers to modernize.
 
 The design described here allows fetches by SHA-1 clients of a
-personal NewHash repository because it's not much more difficult than
+personal SHA-256 repository because it's not much more difficult than
 allowing pushes from that repository. This support needs to be guarded
 by a configuration option --- servers like git.kernel.org that serve a
 large number of clients would not be expected to bear that cost.
@@ -547,7 +544,7 @@ Meaning of signatures
 The signed payload for signed commits and tags does not explicitly
 name the hash used to identify objects. If some day Git adopts a new
 hash function with the same length as the current SHA-1 (40
-hexadecimal digit) or NewHash (64 hexadecimal digit) objects then the
+hexadecimal digit) or SHA-256 (64 hexadecimal digit) objects then the
 intent behind the PGP signed payload in an object signature is
 unclear:
 
@@ -562,7 +559,7 @@ Does this mean Git v2.12.0 is the commit with sha1-name
 e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7 or the commit with
 new-40-digit-hash-name e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7?
 
-Fortunately NewHash and SHA-1 have different lengths. If Git starts
+Fortunately SHA-256 and SHA-1 have different lengths. If Git starts
 using another hash with the same length to name objects, then it will
 need to change the format of signed payloads using that hash to
 address this issue.
@@ -574,24 +571,24 @@ supports four different modes of operation:
 
  1. ("dark launch") Treat object names input by the user as SHA-1 and
     convert any object names written to output to SHA-1, but store
-    objects using NewHash.  This allows users to test the code with no
+    objects using SHA-256.  This allows users to test the code with no
     visible behavior change except for performance.  This allows
     allows running even tests that assume the SHA-1 hash function, to
     sanity-check the behavior of the new mode.
 
- 2. ("early transition") Allow both SHA-1 and NewHash object names in
+ 2. ("early transition") Allow both SHA-1 and SHA-256 object names in
     input. Any object names written to output use SHA-1. This allows
     users to continue to make use of SHA-1 to communicate with peers
     (e.g. by email) that have not migrated yet and prepares for mode 3.
 
- 3. ("late transition") Allow both SHA-1 and NewHash object names in
-    input. Any object names written to output use NewHash. In this
+ 3. ("late transition") Allow both SHA-1 and SHA-256 object names in
+    input. Any object names written to output use SHA-256. In this
     mode, users are using a more secure object naming method by
     default.  The disruption is minimal as long as most of their peers
     are in mode 2 or mode 3.
 
  4. ("post-transition") Treat object names input by the user as
-    NewHash and write output using NewHash. This is safer than mode 3
+    SHA-256 and write output using SHA-256. This is safer than mode 3
     because there is less risk that input is incorrectly interpreted
     using the wrong hash function.
 
@@ -601,27 +598,31 @@ The user can also explicitly specify which format to use for a
 particular revision specifier and for output, overriding the mode. For
 example:
 
-git --output-format=sha1 log abac87a^{sha1}..f787cac^{newhash}
+git --output-format=sha1 log abac87a^{sha1}..f787cac^{sha256}
 
-Selection of a New Hash
------------------------
+Choice of Hash
+--------------
 In early 2005, around the time that Git was written,  Xiaoyun Wang,
 Yiqun Lisa Yin, and Hongbo Yu announced an attack finding SHA-1
 collisions in 2^69 operations. In August they published details.
 Luckily, no practical demonstrations of a collision in full SHA-1 were
 published until 10 years later, in 2017.
 
-The hash function NewHash to replace SHA-1 should be stronger than
-SHA-1 was: we would like it to be trustworthy and useful in practice
-for at least 10 years.
+Git v2.13.0 and later subsequently moved to a hardened SHA-1
+implementation by default that mitigates the SHAttered attack, but
+SHA-1 is still believed to be weak.
+
+The hash to replace this hardened SHA-1 should be stronger than SHA-1
+was: we would like it to be trustworthy and useful in practice for at
+least 10 years.
 
 Some other relevant properties:
 
 1. A 256-bit hash (long enough to match common security practice; not
    excessively long to hurt performance and disk usage).
 
-2. High quality implementations should be widely available (e.g. in
-   OpenSSL).
+2. High quality implementations should be widely available (e.g., in
+   OpenSSL and Apple CommonCrypto).
 
 3. The hash function's properties should match Git's needs (e.g. Git
    requires collision and 2nd preimage resistance and does not require
@@ -630,14 +631,13 @@ Some other relevant properties:
 4. As a tiebreaker, the hash should be fast to compute (fortunately
    many contenders are faster than SHA-1).
 
-Some hashes under consideration are SHA-256, SHA-512/256, SHA-256x16,
-K12, and BLAKE2bp-256.
+We choose SHA-256.
 
 Transition plan
 ---------------
 Some initial steps can be implemented independently of one another:
 - adding a hash function API (vtable)
-- teaching fsck to tolerate the gpgsig-newhash field
+- teaching fsck to tolerate the gpgsig-sha256 field
 - excluding gpgsig-* from the fields copied by "git commit --amend"
 - annotating tests that depend on SHA-1 values with a SHA1 test
   prerequisite
@@ -664,7 +664,7 @@ Next comes introduction of compatObjectFormat:
 - adding appropriate index entries when adding a new object to the
   object store
 - --output-format option
-- ^{sha1} and ^{newhash} revision notation
+- ^{sha1} and ^{sha256} revision notation
 - configuration to specify default input and output format (see
   "Object names on the command line" above)
 
@@ -672,7 +672,7 @@ The next step is supporting fetches and pushes to SHA-1 repositories:
 - allow pushes to a repository using the compat format
 - generate a topologically sorted list of the SHA-1 names of fetched
   objects
-- convert the fetched packfile to newhash format and generate an idx
+- convert the fetched packfile to sha256 format and generate an idx
   file
 - re-sort to match the order of objects in the fetched packfile
 
@@ -680,30 +680,30 @@ The infrastructure supporting fetch also allows converting an existing
 repository. In converted repositories and new clones, end users can
 gain support for the new hash function without any visible change in
 behavior (see "dark launch" in the "Object names on the command line"
-section). In particular this allows users to verify NewHash signatures
+section). In particular this allows users to verify SHA-256 signatures
 on objects in the repository, and it should ensure the transition code
 is stable in production in preparation for using it more widely.
 
 Over time projects would encourage their users to adopt the "early
 transition" and then "late transition" modes to take advantage of the
-new, more futureproof NewHash object names.
+new, more futureproof SHA-256 object names.
 
 When objectFormat and compatObjectFormat are both set, commands
-generating signatures would generate both SHA-1 and NewHash signatures
+generating signatures would generate both SHA-1 and SHA-256 signatures
 by default to support both new and old users.
 
-In projects using NewHash heavily, users could be encouraged to adopt
+In projects using SHA-256 heavily, users could be encouraged to adopt
 the "post-transition" mode to avoid accidentally making implicit use
 of SHA-1 object names.
 
 Once a critical mass of users have upgraded to a version of Git that
-can verify NewHash signatures and have converted their existing
+can verify SHA-256 signatures and have converted their existing
 repositories to support verifying them, we can add support for a
-setting to generate only NewHash signatures. This is expected to be at
+setting to generate only SHA-256 signatures. This is expected to be at
 least a year later.
 
 That is also a good moment to advertise the ability to convert
-repositories to use NewHash only, stripping out all SHA-1 related
+repositories to use SHA-256 only, stripping out all SHA-1 related
 metadata. This improves performance by eliminating translation
 overhead and security by avoiding the possibility of accidentally
 relying on the safety of SHA-1.
@@ -742,16 +742,16 @@ using the old hash function.
 
 Signed objects with multiple hashes
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Instead of introducing the gpgsig-newhash field in commit and tag objects
-for newhash-content based signatures, an earlier version of this design
-added "hash newhash <newhash-name>" fields to strengthen the existing
+Instead of introducing the gpgsig-sha256 field in commit and tag objects
+for sha256-content based signatures, an earlier version of this design
+added "hash sha256 <sha256-name>" fields to strengthen the existing
 sha1-content based signatures.
 
 In other words, a single signature was used to attest to the object
 content using both hash functions. This had some advantages:
 * Using one signature instead of two speeds up the signing process.
 * Having one signed payload with both hashes allows the signer to
-  attest to the sha1-name and newhash-name referring to the same object.
+  attest to the sha1-name and sha256-name referring to the same object.
 * All users consume the same signature. Broken signatures are likely
   to be detected quickly using current versions of git.
 
@@ -760,11 +760,11 @@ However, it also came with disadvantages:
   objects it references, even after the transition is complete and
   translation table is no longer needed for anything else. To support
   this, the design added fields such as "hash sha1 tree <sha1-name>"
-  and "hash sha1 parent <sha1-name>" to the newhash-content of a signed
+  and "hash sha1 parent <sha1-name>" to the sha256-content of a signed
   commit, complicating the conversion process.
 * Allowing signed objects without a sha1 (for after the transition is
   complete) complicated the design further, requiring a "nohash sha1"
-  field to suppress including "hash sha1" fields in the newhash-content
+  field to suppress including "hash sha1" fields in the sha256-content
   and signed payload.
 
 Lazily populated translation table
@@ -772,7 +772,7 @@ Lazily populated translation table
 Some of the work of building the translation table could be deferred to
 push time, but that would significantly complicate and slow down pushes.
 Calculating the sha1-name at object creation time at the same time it is
-being streamed to disk and having its newhash-name calculated should be
+being streamed to disk and having its sha256-name calculated should be
 an acceptable cost.
 
 Document History
-- 
2.18.0.597.ga71716f1ad


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: Hash algorithm analysis
  2018-07-30 20:01                 ` Dan Shumow
  2018-08-03  2:57                   ` Jonathan Nieder
@ 2018-09-18 15:18                   ` Joan Daemen
  2018-09-18 15:32                     ` Jonathan Nieder
  2018-09-18 16:50                     ` Linus Torvalds
  1 sibling, 2 replies; 66+ messages in thread
From: Joan Daemen @ 2018-09-18 15:18 UTC (permalink / raw)
  To: Dan Shumow, Johannes Schindelin, brian m. carlson
  Cc: Junio C Hamano, Linus Torvalds, Edward Thomson, Jonathan Nieder,
	Git Mailing List, demerphq, Adam Langley

Dear all,

when going over my todo list I was confronted with the mail of Dan Shumow on the successor of SHA-1 for git. I know the decision was made and it is not my intention to change it, but please see below some comments on Dan's arguments.

In short, I argue below that SHA256 has no serious advantages when compared to KangarooTwelve. In that light, the fact that SHA2 was designed behind closed doors (like SHA-1 was) should be enough reason to skip it entirely in an undertaking that takes open-source seriously.

Kind regards,

Joan

PS: In my comments below I use "we" as I discussed them with the members of the Keccak team being Gilles Van Assche, Michaël Peeters, Guido Bertoni, Ronny Van Keer and Seth Hoffert, and we agree on all of them.

On 30/07/2018 22:01, Dan Shumow wrote:
> Hello all.   Johannes, thanks for adding me to this discussion.
>
> So, as one of the coauthors of the SHA-1 collision detection code, I just wanted to chime in and say I'm glad to see the move to a longer hash function.  Though, as a cryptographer, I have a few thoughts on the matter that I thought I would share.
>
> I think that moving to SHA256 is a fine change, and I support it.
>
> I'm not anywhere near the expert in this that Joan Daeman is. 

Note that the correct spelling is "Daemen". But anyway, it is not a matter of being expert, but a matter of taking the arguments at face value. 

> I am someone who has worked in this space more or less peripherally.  However, I agree with Adam Langley that basically all of the finalists for a hash function replacement are about the same for the security needs of Git.  I think that, for this community, other software engineering considerations should be more important to the selection process.

We are also with Adam on the engineering considerations. We see the parallelism that K12 can exploit adaptively (unlike SHA256) as an example of such a consideration.

> I think Joan's survey of cryptanalysis papers and the numbers that he gives are interesting, and I had never seen the comparison laid out like that.  So, I think that there is a good argument to be made that SHA3 has had more cryptanalysis than SHA2.  Though, Joan, are the papers that you surveyed only focused on SHA2?  I'm curious if you think that the design/construction of SHA2, as it can be seen as an iteration of MD5/SHA1, means that the cryptanalysis papers on those constructions can be considered to apply to SHA2?  

This argument works both ways, i.e., the knowledge and experience of the symmetric cryptography community in general has also contributed to our choices in Keccak and in K12 (including the experience gained by Rijndael/AES). But in the end, the only objective metric we have for comparing public scrutiny is the amount of cryptanalysis (and analysis) published, and there Keccak simply scores better.

> Again, I'm not an expert in this, but I do know that Marc Steven's techniques for constructing collisions also provided some small cryptanalytic improvements against the SHA2 family as well.  I also think that while the paper survey is a good way to look over all of this, the more time in the position of high profile visibility that SHA2 has can give us some confidence as well.

High profile visibility to implementers does not mean more cryptanalysis, since users and implementers are usually not cryptanalysts. Actually, one of the reasons that SHA2 attracted much less cryptanalysis than you would expect due to its age is that during the SHA3 competition all cryptanalysts pointed their arrows to SHA3 candidates.

> Also something worth pointing out is that the connection SHA2 has to SHA1 means that if Marc Steven's cryptanalysis of MD5/SHA-1 were ever successfully applied to SHA2, the SHA1 collision detection approach could be applied there as well, thus providing a drop in replacement in that situation.  That said, we don't know that there is not a similar way of addressing issues with the SHA3/Sponge design.  It's just that because we haven't seen any weaknesses of this sort in similar designs, we just don't know what a similar approach would be there yet.  I don't want to put too much stock in this argument, it's just saying "Well, we already know how SHA2 is likely to break, and we've had fixes for similar things in the past."  This is pragmatic but not inspiring or confidence building.
>
> So, I also want to state my biases in favor of SHA2 as an employee of Microsoft.  Microsoft, being a corporation headquartered in a America, with the US Gov't as a major customer definitely prefers to defer to the US Gov't NIST standardization process.  And from that perspective SHA2 or SHA3 would be good choices.  I, personally, think that the NIST process is the best we have.  It is relatively transparent, and NIST employs a fair number of very competent cryptographers.  Also, I am encouraged by the widespread international participation that the NIST competitions and selection processes attract.

Of course, NIST has done (and is still doing) a great job at organizing public competitions, where all submissions have to include a design rationale and where the final selection is based on extensive openly published cryptanalysis and comparisons done by the cryptographic community. This is obviously AES and SHA3. However, NIST also put forward NSA designs as standards, without design rationale or public cryptanalysis whatsoever, and in some cases even with built-in backdoors
(EC-DRBG as Dan probably remembers). Examples of this are DES, SHA(0), SHA-1 and, yes, SHA2. The former we would call open-source philosophy and
the latter closed-source.

> As such, and reflecting this bias, in the internal discussions that Johannes alluded to, SHA2 and SHA3 were the primary suggestions.  There was a slight preference for SHA2 because SHA3 is not exposed through the windows cryptographic APIs (though Git does not use those, so this is a nonissue for this discussion.)

We find it cynical to bring up a Microsoft-internal argument that is actually not relevant to Git.

> I also wanted to thank Johannes for keeping the cryptographers that he discussed this with anonymous.  After all, cryptographers are known for being private.  And I wanted to say that Johannes did, in fact, accurately represent our internal discussions on the matter.

Our experience is that in the cryptographic community there are many outspoken individuals that fearlessly ventilate their opinions (sometimes even controversial ones).

> I also wanted to comment on the discussion of the "internal state having the same size as the output."  Linus referred to this several times.  This is known as narrow-pipe vs wide-pipe in the hash function design literature.  Linus is correct that wide-pipe designs are more in favor currently, and IIRC, all of the serious SHA3 candidates employed this.  That said, it did seem that in the discussion this was being equated with "length extension attacks."  And that connection is just not accurate.  Length extension attacks are primarily a motivation of the HMAC liked nested hashing design for MACs, because of a potential forgery attack.  Again, this doesn't really matter because the decision has been made despite this discussion.  I just wanted to set the record straight about this, as to avoid doing the right thing for the wrong reason (T.S. Elliot's "greatest treason.")

Indeed, vulnerability to length extension attacks and size of the internal state (chaining value in compression function based designs, or capacity in sponge) are two different things. Still, we would like to make a few points here.

1) There were SHA3 submissions that were narrow-pipe, i.e., finalist Blake is narrow-pipe.

2) SHA2 and its predecessors are vulnerable to length extension, SHA3 (or any of the SHA3 finalists) isn't. Length extension is a problem when using the hash function for MAC computation but this can be fixed by putting a construction on top of it. That construction is HMAC, that comes with some fixed overhead (up to a factor 4 for short messages).

3) The relatively large state in the sponge construction increases the generic strength against attacks when the input contains redundancy or
has a certain form. For instance, if the input is restricted to be text in ASCII (such as source code), then the collision-resistance grows
higher than the nominal 2^{c/2}. Such an effect does not exist with narrow-pipe Merkle-Damgård. (This may be what Linus had intuitively in mind.)

> One other thing that I wanted to throw out there for the future is that in the crypto community there is currently a very large push to post quantum cryptography.  Whether the threat of quantum computers is real or imagined this is a hot area of research, with a NIST competition to select post quantum asymmetric cryptographic algorithms.  That is not directly of concern to the selection of a hash function.  However, if we take this threat as legitimate, quantum computers reduce the strength of symmetric crypto, both encryption and hash functions, by 1/2.  

This is not what the experts say. In [1] a quantum algorithm is given that reduces the effort to generate a hash collision to 2^{n/3} (instead of 2^{n/2} classically). So according to [1] the strength reduction is 2/3 rather than 1/2. Moreover, in [2], Dan Bernstein takes a more detailed look at the actual cost of that algorithm and argues that the quantum algorithm of [1] performs worse than classical ones and that there is no security reduction at all for collision-resistance.

[1] Gilles Brassard, Peter Høyer, Alain Tapp, Quantum cryptanalysis of hash and claw-free functions, in LATIN'98 proceedings (1998), 163–169.

[2] Daniel J. Bernstein, Cost analysis of hash collisions: Will quantum computers make SHARCS obsolete? Workshop Record of SHARCS'09.

> So, if this is the direction that the crypto community ultimately goes in, 512bit hashes will be seen as standard over the next decade or so.  I don't think that this should be involved in this discussion, presently.   I'm just saying that not unlike the time when SHA1 was selected, I think that the replacement of a 256bit hash is on the horizon as well.
>
> Thanks,
> Dan Shumow
>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Hash algorithm analysis
  2018-09-18 15:18                   ` Joan Daemen
@ 2018-09-18 15:32                     ` Jonathan Nieder
  2018-09-18 16:50                     ` Linus Torvalds
  1 sibling, 0 replies; 66+ messages in thread
From: Jonathan Nieder @ 2018-09-18 15:32 UTC (permalink / raw)
  To: Joan Daemen
  Cc: Dan Shumow, Johannes Schindelin, brian m. carlson, Junio C Hamano,
	Linus Torvalds, Edward Thomson, Git Mailing List, demerphq,
	Adam Langley

Hi,

A quick note.

Joan Daemen wrote:

> when going over my todo list I was confronted with the mail of Dan
> Shumow on the successor of SHA-1 for git. I know the decision was
> made and it is not my intention to change it, but please see below
> some comments on Dan's arguments.

When the time comes for the next hash change in Git, it will be useful
to be able to look back over this discussion.  Thanks for adding
details.

[...]
> On 30/07/2018 22:01, Dan Shumow wrote:

>> So, I also want to state my biases in favor of SHA2 as an employee
>> of Microsoft. [...] As such, and reflecting this bias, in the
>> internal discussions that Johannes alluded to, SHA2 and SHA3 were
>> the primary suggestions.  There was a slight preference for SHA2
>> because SHA3 is not exposed through the windows cryptographic APIs
>> (though Git does not use those, so this is a nonissue for this
>> discussion.)
>
> We find it cynical to bring up a Microsoft-internal argument that is
> actually not relevant to Git.

On the contrary, I am quite grateful that Dan was up front about where
his preference comes from, *especially* when the reasons are not
relevant to Git.  It is useful background for better understanding his
rationale and understanding the ramifications for some subset of
users.

In other words, consider someone active in the Git project that
disagrees with the decision to use SHA2.  This explanation by Dan can
help such a person understand where the disagreement is coming from
and whether we are making the decision for the wrong reasons (because
Git on Windows does not even use those APIs).

[...]
> 3) The relatively large state in the sponge construction increases
> the generic strength against attacks when the input contains
> redundancy or has a certain form. For instance, if the input is
> restricted to be text in ASCII (such as source code), then the
> collision-resistance grows higher than the nominal 2^{c/2}. Such an
> effect does not exist with narrow-pipe Merkle-Damgård. (This may be
> what Linus had intuitively in mind.)

Interesting.

[...]
> [2] Daniel J. Bernstein, Cost analysis of hash collisions: Will
> quantum computers make SHARCS obsolete? Workshop Record of
> SHARCS'09.

I remember that paper!  Thanks for the pointer.

Sincerely,
Jonathan

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: Hash algorithm analysis
  2018-09-18 15:18                   ` Joan Daemen
  2018-09-18 15:32                     ` Jonathan Nieder
@ 2018-09-18 16:50                     ` Linus Torvalds
  1 sibling, 0 replies; 66+ messages in thread
From: Linus Torvalds @ 2018-09-18 16:50 UTC (permalink / raw)
  To: jda
  Cc: Dan Shumow, Johannes Schindelin, brian m. carlson, Junio C Hamano,
	Edward Thomson, Jonathan Nieder, Git Mailing List, demerphq,
	Adam Langley

On Tue, Sep 18, 2018 at 8:18 AM Joan Daemen <jda@noekeon.org> wrote:
>
> 3) The relatively large state in the sponge construction increases the generic strength against attacks when the input contains redundancy or
> has a certain form. For instance, if the input is restricted to be text in ASCII (such as source code), then the collision-resistance grows
> higher than the nominal 2^{c/2}. Such an effect does not exist with narrow-pipe Merkle-Damgård. (This may be what Linus had intuitively in mind.)

Answering to just this part:

No, what I had in mind was literally just exactly the kind of attack
that SHA1 broke for - attacking the internal state vector directly,
and not paying any penalty for it, because the stat size is the same
as the final hash size.

The length extension attack is just the simplest and most trivial
version of that kind of attack - because the internal state vector
*is* the result, and you just continue using it.

But that trivial length extension thing not the real problem, it's
just the absolutely simplest symptom of the real problem.

I think that the model where the internal state of the hash is the
same width as the final result is simply broken. It was what broke
SHA1, and that problem is shared with SHA2.

"Length extension" is just the simplest way to say "broken by design", imho.

Because the length extension attack is just the most trivial attack,
but it isn't the fundamental problem. It was just the first and the
cheapest attack found, but it was also the most special-cased and
least interesting. You need to have a very special case (with that
secret at the beginning etc) to make the pure length extension attack
interesting. And git has no secrets, so in that sense "length
extension" by itself is totally immaterial. But the basic problem of
internal hash size obviously wasn't.

So I would say that length extension is a direct result of the _real_
problem, which is that the hash exposes _all_ of the internal data.

That is what makes length extension possible - because you can just
continue from a known state, and there is absolutely nothing hidden -
and yes, that's a really easy special case where you don't even need
to actually break the hash at all.

But I argue that it's _also_ one big part of what made SHAttered
practical, and I think the underlying problem is exactly the same.
When the internal state is the same size as the hash, you can attack
the internal state itself for basically the same cost as attacking the
whole hash.

So you can pick-and-choose the weakest point.

Which is basically exactly what SHAttered did. No, it wasn't the
trivial "just add to the end", but it used the exact same underlying
weakness as one part of the attack.

*This* is why I dislike SHA2. It has basically the exact same basic
weakness that we already know SHA1 fell for. The hashing details are
different, and hopefully that means that there aren't the same kind of
patterns that can be generated to do the "attack the internal hash
state" part, but I don't understand why people seem to ignore that
other fundamental issue.

Something like SHA-512/256 would have been better, but I think almost
nobody does that in hardware, which was one of the big advantages of
plain SHA2.

The main reason I think SHA2 is acceptable is simply that 256 bits is
a lot. So even if somebody comes up with a shortcut that weakens it by
tens of bits, nobody really cares. Plus I'm obviously not a
cryptographer, so I didn't feel like I was going to fight it a lot.

But yes, I'd have probably gone with any of the other alternatives,
because I think it's a bit silly that we're switching hashes to
another hash that has (at least in part) the *exact* same issue as the
one people call broken.

(And yes, the hashing details are different, so it's "exactly the
same" only wrt that internal state part - not the bitpattern finding
part that made the attack on the internal state much cheaper. Real
cryptographers obviously found that "figure out the weakness of the
hashing" to be the more interesting and novel part over the trivial
internal hash size part).

That said..

The real reason I think SHA2 is the right choice was simply that there
needs to be a decision, and none of the choices were *wrong*.
Sometimes just the _act_ of making a decision is more important than
_what_ the decision is.

And hey, it is also likely that the reason _I_ get hung up on just the
size of the internal state is that exactly because I am _not_ a
cryptographer, that kind of high-level stuff is the part I understand.
When you start talking about why the exact rules of Merkle–Damgård
constructions work, my eyes just glaze over.

So I'm probably - no, certainly - myopic and looking at only one part
of the issue to begin with.

The end result is that I argued for more bits in the internal state
(and apparently wide vs narrow is the technical term), and I would
have seen parallel algorithms as a bonus for the large-file case. None
of which argued for SHA2.

But see above on why I think SHA2 is if not *the* right choice, at
least *a* right choice.

                    Linus

^ permalink raw reply	[flat|nested] 66+ messages in thread

end of thread, other threads:[~2018-09-18 16:50 UTC | newest]

Thread overview: 66+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-09 20:56 State of NewHash work, future directions, and discussion brian m. carlson
2018-06-09 21:26 ` Ævar Arnfjörð Bjarmason
2018-06-09 22:49 ` Hash algorithm analysis brian m. carlson
2018-06-11 19:29   ` Jonathan Nieder
2018-06-11 20:20     ` Linus Torvalds
2018-06-11 23:27       ` Ævar Arnfjörð Bjarmason
2018-06-12  0:11         ` David Lang
2018-06-12  0:45         ` Linus Torvalds
2018-06-11 22:35     ` brian m. carlson
2018-06-12 16:21       ` Gilles Van Assche
2018-06-13 23:58         ` brian m. carlson
2018-06-15 10:33           ` Gilles Van Assche
2018-07-20 21:52     ` brian m. carlson
2018-07-21  0:31       ` Jonathan Nieder
2018-07-21 19:52       ` Ævar Arnfjörð Bjarmason
2018-07-21 20:25         ` brian m. carlson
2018-07-21 22:38       ` Johannes Schindelin
2018-07-21 23:09         ` Linus Torvalds
2018-07-21 23:59         ` brian m. carlson
2018-07-22  9:34           ` Eric Deplagne
2018-07-22 14:21             ` brian m. carlson
2018-07-22 14:55               ` Eric Deplagne
2018-07-26 10:05                 ` Johannes Schindelin
2018-07-22 15:23           ` Joan Daemen
2018-07-22 18:54             ` Adam Langley
2018-07-26 10:31             ` Johannes Schindelin
2018-07-23 12:40           ` demerphq
2018-07-23 12:48             ` Sitaram Chamarty
2018-07-23 12:55               ` demerphq
2018-07-23 18:23               ` Linus Torvalds
2018-07-23 17:57             ` Stefan Beller
2018-07-23 18:35             ` Jonathan Nieder
2018-07-24 19:01       ` Edward Thomson
2018-07-24 20:31         ` Linus Torvalds
2018-07-24 20:49           ` Jonathan Nieder
2018-07-24 21:13           ` Junio C Hamano
2018-07-24 22:10             ` brian m. carlson
2018-07-30  9:06               ` Johannes Schindelin
2018-07-30 20:01                 ` Dan Shumow
2018-08-03  2:57                   ` Jonathan Nieder
2018-09-18 15:18                   ` Joan Daemen
2018-09-18 15:32                     ` Jonathan Nieder
2018-09-18 16:50                     ` Linus Torvalds
2018-07-25  8:30             ` [PATCH 0/2] document that NewHash is now SHA-256 Ævar Arnfjörð Bjarmason
2018-07-25  8:30             ` [PATCH 1/2] doc hash-function-transition: note the lack of a changelog Ævar Arnfjörð Bjarmason
2018-07-25  8:30             ` [PATCH 2/2] doc hash-function-transition: pick SHA-256 as NewHash Ævar Arnfjörð Bjarmason
2018-07-25 16:45               ` Junio C Hamano
2018-07-25 17:25                 ` Jonathan Nieder
2018-07-25 21:32                   ` Junio C Hamano
2018-07-26 13:41                     ` [PATCH v2 " Ævar Arnfjörð Bjarmason
2018-08-03  7:20                       ` Jonathan Nieder
2018-08-03 16:40                         ` Junio C Hamano
2018-08-03 17:01                           ` Linus Torvalds
2018-08-03 16:42                         ` Linus Torvalds
2018-08-03 17:43                         ` Ævar Arnfjörð Bjarmason
2018-08-04  8:52                           ` Jonathan Nieder
2018-08-03 17:45                         ` brian m. carlson
2018-07-25 22:56                 ` [PATCH " brian m. carlson
2018-06-11 21:19   ` Hash algorithm analysis Ævar Arnfjörð Bjarmason
2018-06-21  8:20     ` Johannes Schindelin
2018-06-21 22:39     ` brian m. carlson
2018-06-11 18:09 ` State of NewHash work, future directions, and discussion Duy Nguyen
2018-06-12  1:28   ` brian m. carlson
2018-06-11 19:01 ` Jonathan Nieder
2018-06-12  2:28   ` brian m. carlson
2018-06-12  2:42     ` Jonathan Nieder

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).