* State of NewHash work, future directions, and discussion @ 2018-06-09 20:56 brian m. carlson
  2018-06-09 21:26 ` Ævar Arnfjörð Bjarmason
  ` (3 more replies)
  0 siblings, 4 replies; 66+ messages in thread

From: brian m. carlson @ 2018-06-09 20:56 UTC (permalink / raw)
To: git

[-- Attachment #1: Type: text/plain, Size: 4661 bytes --]

Since there have been a lot of questions recently about the state of the
NewHash work, I thought I'd send out a summary.

== Status

I have patches to make the entire codebase work, including passing all
tests, when Git is converted to use a 256-bit hash algorithm.
Obviously, such a Git is incompatible with the current version, but it
means that we've fixed essentially all of the hard-coded 20 and 40
constants (and therefore Git doesn't segfault).

I'm working on getting a 256-bit Git to work with SHA-1 being the
default. Currently, this involves doing things like writing transport
code, since in order to clone a repository, you need to be able to set
up the hash algorithm correctly. I know that this was a non-goal in the
transition plan, but since the testsuite doesn't pass without it, it's
become necessary.

Some of these patches will be making their way to the list soon.
They're hanging out in the normal places in the object-id-part14 branch
(which may be rebased).

== Future Design

The work I've done necessarily involves porting everything to use
the_hash_algo. Essentially, when the piece I'm currently working on is
complete, we'll have a transition stage 4 implementation (all NewHash).
Stages 2 and 3 will be implemented next.

My vision of how data is stored is that the .git directory is, except
for pack indices and the loose object lookup table, entirely in one
format. It will be all SHA-1 or all NewHash. This algorithm will be
stored in the_hash_algo.

I plan on introducing an array of hash algorithms into struct repository
(and wrapper macros) which stores, in order, the output hash and, if
used, the additional input hash.
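The design described above (a per-repository ordered array of hash
algorithms, with the_hash_algo selecting the single storage format) can
be sketched roughly as follows. This is an illustrative model, not
Git's actual C API: the class names, the `nh01` format identifier, and
the use of SHA-256 as a stand-in for the then-undecided NewHash are all
assumptions made for the example.

```python
import hashlib

class HashAlgo:
    """Descriptor for one hash algorithm, loosely modeled on
    struct git_hash_algo (names and fields are illustrative)."""
    def __init__(self, name, format_id, rawsz, new):
        self.name = name
        self.format_id = format_id  # 4-byte id for serialized formats
        self.rawsz = rawsz          # size of a binary object ID
        self.hexsz = 2 * rawsz      # size of a hex object ID
        self.new = new              # constructor for a hash context

SHA1 = HashAlgo("sha1", b"sha1", 20, hashlib.sha1)
NEWHASH = HashAlgo("newhash", b"nh01", 32, hashlib.sha256)  # stand-in

class Repository:
    """Stores, in order, the output hash and (if used) the
    additional input hash."""
    def __init__(self, output_algo, input_algo=None):
        self.hash_algos = [a for a in (output_algo, input_algo) if a]

    @property
    def the_hash_algo(self):
        # Everything under .git uses one storage algorithm.
        return self.hash_algos[0]

repo = Repository(NEWHASH, SHA1)
oid = repo.the_hash_algo.new(b"blob 0\0").hexdigest()
assert len(oid) == repo.the_hash_algo.hexsz  # 64 hex chars, 256 bits
```

New code would look up sizes through the descriptor (the GIT_MAX_*
pattern) rather than hard-coding 20 or 40, which is exactly the class of
constant the patches above had to eliminate.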
Functions like get_oid_hex and parse_oid_hex will acquire an internal
version, which knows about parsing things (like refs) in the internal
format, and one which knows about parsing in the UI formats. Similarly,
oid_to_hex will have an internal version that handles data in the .git
directory, and an external version that produces data in the output
format. Translation will take place at the outer edges of the program.

The transition plan anticipates a stage 1 where we accept only SHA-1 on
input and produce only SHA-1 on output, but store in NewHash. As I've
worked with our tests, I've realized such an implementation is not
entirely possible. We have various tools that expect to accept invalid
object IDs, and obviously there's no way to have those continue to
work. We'd have to either reject invalid data in such a case or combine
stages 1 and 2.

== Compatibility with this Work

If you're working on new features and you'd like to implement the best
possible compatibility with this work, here are some recommendations:

* Assume everything in the .git directory but pack indices and the
  loose object index will be in the same algorithm, and that that
  algorithm is the_hash_algo.
* For the moment, use the_hash_algo to look up the size of all
  hash-related constants. Use GIT_MAX_* for allocations.
* If you are writing a new data format, add a version number.
* If you need to serialize an algorithm identifier into your data
  format, use the format_id field of struct git_hash_algo. It's
  designed specifically for that purpose.
* You can safely assume that the_hash_algo will be suitably initialized
  to the correct algorithm for your repository.
* Keep using the object ID functions and struct object_id.
* Try not to use mmap'd structs for reading and writing formats on
  disk, since these are hard to make hash size agnostic.

== Discussion about an Actual NewHash

Since I'll be writing new code, I'll be writing tests for this code.
However, writing tests for creating and initializing repositories
requires that I be able to test that objects are being serialized
correctly, and therefore requires that I actually know what the hash
algorithm is going to be. I also can't submit code for multi-hash packs
when we officially only support one hash algorithm.

I know that we have long tried to avoid discussing the specific
algorithm to use, in part because the last discussion generated more
heat than light, and settled on referring to it as NewHash for the time
being. However, I think it's time to pick this topic back up, since I
can't really continue work in this direction without us picking a
NewHash.

If people are interested, I've done some analysis on availability of
implementations, performance, and other attributes described in the
transition plan and can send that to the list.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 867 bytes --]

^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: State of NewHash work, future directions, and discussion

From: Ævar Arnfjörð Bjarmason @ 2018-06-09 21:26 UTC (permalink / raw)
To: brian m. carlson; +Cc: git

On Sat, Jun 09 2018, brian m. carlson wrote:

> Since there have been a lot of questions recently about the state of
> the NewHash work, I thought I'd send out a summary.

Thanks for all your work on this.

> I know that we have long tried to avoid discussing the specific
> algorithm to use, in part because the last discussion generated more
> heat than light, and settled on referring to it as NewHash for the
> time being. However, I think it's time to pick this topic back up,
> since I can't really continue work in this direction without us
> picking a NewHash.
>
> If people are interested, I've done some analysis on availability of
> implementations, performance, and other attributes described in the
> transition plan and can send that to the list.

Let's see it!
* Hash algorithm analysis

From: brian m. carlson @ 2018-06-09 22:49 UTC (permalink / raw)
To: git

== Discussion of Candidates

I've implemented and tested the following algorithms, all of which are
256-bit (in alphabetical order):

* BLAKE2b (libb2)
* BLAKE2bp (libb2)
* KangarooTwelve (imported from the Keccak Code Package)
* SHA-256 (OpenSSL)
* SHA-512/256 (OpenSSL)
* SHA3-256 (OpenSSL)
* SHAKE128 (OpenSSL)

I also rejected some other candidates. I couldn't find any reference or
implementation of SHA256×16, so I didn't implement it. I didn't
consider SHAKE256 because it is nearly identical to SHA3-256 in almost
all characteristics (including performance).

I imported the optimized 64-bit implementation of KangarooTwelve. The
AVX2 implementation was not considered for licensing reasons (it's
partially generated from external code, which falls foul of the GPL's
"preferred form for modifications" rule).

=== BLAKE2b and BLAKE2bp

These are the non-parallelized and parallelized 64-bit variants of
BLAKE2.

Benefits:

* Both algorithms provide 256-bit preimage resistance.

Downsides:

* Some people are uncomfortable that the security margin has been
  decreased from the original SHA-3 submission, although it is still
  considered secure.
* BLAKE2bp, as implemented in libb2, uses OpenMP (and therefore
  multithreading) by default. It was no longer possible to run the
  testsuite with -j3 on my laptop in this configuration.
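For experimenting with the 256-bit parameterization discussed here
outside of Git, Python's hashlib exposes BLAKE2b with a selectable
digest size. This is only an illustration of the 256-bit variant, not
the libb2 C API that the benchmarks below actually use:

```python
import hashlib

# BLAKE2b is parameterized by digest size; the candidate under
# discussion is the 256-bit (32-byte) variant, not the default
# 512-bit one.
h = hashlib.blake2b(b"hello, world", digest_size=32)
digest = h.hexdigest()

assert h.digest_size == 32
assert len(digest) == 64  # 32 bytes = 64 hex characters

# The digest size is part of BLAKE2b's parameter block, so a
# different output length is effectively a different hash function:
h512 = hashlib.blake2b(b"hello, world")  # digest_size defaults to 64
assert h512.hexdigest()[:64] != digest
```

The same parameter-block behavior is why a 256-bit BLAKE2b in Git would
not simply be a truncation of the 512-bit digests OpenSSL currently
produces.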
=== Keccak-based Algorithms

SHA3-256 is the 256-bit Keccak algorithm with 24 rounds, processing 136
bytes at a time. SHAKE128 is an extendable output function with 24
rounds, processing 168 bytes at a time. KangarooTwelve is an extendable
output function with 12 rounds, processing 136 bytes at a time.

Benefits:

* SHA3-256 provides 256-bit preimage resistance.
* SHA3-256 has been heavily studied and is believed to have a large
  security margin.

I noted the following downsides:

* There's a lack of availability of KangarooTwelve in other
  implementations. It may be the least available option in terms of
  implementations.
* Some people are uncomfortable that the security margin of
  KangarooTwelve has been decreased, although it is still considered
  secure.
* SHAKE128 and KangarooTwelve provide only 128-bit preimage resistance.

=== SHA-256 and SHA-512/256

These are the 32-bit and 64-bit SHA-2 algorithms that are 256 bits in
size.

I noted the following benefits:

* Both algorithms are well known and heavily analyzed.
* Both algorithms provide 256-bit preimage resistance.

== Implementation Support

|===
| Implementation | OpenSSL | libb2 | NSS | ACC | gcrypt | Nettle | CL  |
| SHA-1          | 🗸      |       | 🗸  | 🗸  | 🗸     | 🗸     | {1} |
| BLAKE2b        | f       | 🗸    |     |     | 🗸     |        | {2} |
| BLAKE2bp       |         | 🗸    |     |     |        |        |     |
| KangarooTwelve |         |       |     |     |        |        |     |
| SHA-256        | 🗸      |       | 🗸  | 🗸  | 🗸     | 🗸     | {1} |
| SHA-512/256    | 🗸      |       |     |     |        | 🗸     | {3} |
| SHA3-256       | 🗸      |       |     |     | 🗸     | 🗸     | {4} |
| SHAKE128       | 🗸      |       |     |     | 🗸     |        | {5} |
|===

f: future version (expected 1.2.0)
ACC: Apple Common Crypto
CL: Command-line

:1: OpenSSL, coreutils, Perl Digest::SHA.
:2: OpenSSL, coreutils.
:3: OpenSSL.
:4: OpenSSL, Perl Digest::SHA3.
:5: Perl Digest::SHA3.

=== Performance Analysis

The test system used below is my personal laptop, a 2016 Lenovo
ThinkPad X1 Carbon with an Intel i7-6600U CPU (2.60 GHz) running Debian
unstable.

I implemented a test tool helper to compute speed much like OpenSSL
does. Below is a comparison of speeds.
The columns indicate the speed in KiB/s for chunks of the given size.
The runs are representative of multiple similar runs. 256 and 1024
bytes were chosen to represent common tree and commit object sizes, and
the 8 KiB size is an approximate average blob size.

Algorithms are sorted by performance on the 1 KiB column.

|===
| Implementation             | 256 B  | 1 KiB  | 8 KiB  | 16 KiB |
| SHA-1 (OpenSSL)            | 513963 | 685966 | 748993 | 754270 |
| BLAKE2b (libb2)            | 488123 | 552839 | 576246 | 579292 |
| SHA-512/256 (OpenSSL)      | 181177 | 349002 | 499113 | 495169 |
| BLAKE2bp (libb2)           | 139891 | 344786 | 488390 | 522575 |
| SHA-256 (OpenSSL)          | 264276 | 333560 | 357830 | 355761 |
| KangarooTwelve             | 239305 | 307300 | 355257 | 364261 |
| SHAKE128 (OpenSSL)         | 154775 | 253344 | 337811 | 346732 |
| SHA3-256 (OpenSSL)         | 128597 | 185381 | 198931 | 207365 |
| BLAKE2bp (libb2; threaded) | 12223  | 49306  | 132833 | 179616 |
|===

SUPERCOP (a crypto benchmarking tool;
https://bench.cr.yp.to/results-hash.html) has also benchmarked these
algorithms. Note that BLAKE2bp is not listed, KangarooTwelve is k12,
SHA-512/256 is equivalent to sha512, SHA3-256 is keccakc512, and
SHAKE128 is keccakc256.

Information is for kizomba, a Kaby Lake system.
Counts are in cycles per byte (smaller is better; sorted by the 1536 B
column):

|===
| Algorithm      | 576 B | 1536 B | 4096 B | long |
| BLAKE2b        | 3.51  | 3.10   | 3.08   | 3.07 |
| SHA-1          | 4.36  | 3.81   | 3.59   | 3.49 |
| KangarooTwelve | 4.99  | 4.57   | 4.13   | 3.86 |
| SHA-512/256    | 6.39  | 5.76   | 5.31   | 5.05 |
| SHAKE128       | 8.23  | 7.67   | 7.17   | 6.97 |
| SHA-256        | 8.90  | 8.08   | 7.77   | 7.59 |
| SHA3-256       | 10.26 | 9.15   | 8.84   | 8.57 |
|===

Numbers for genji262, an AMD Ryzen system, which has SHA acceleration:

|===
| Algorithm      | 576 B | 1536 B | 4096 B | long |
| SHA-1          | 1.87  | 1.69   | 1.60   | 1.54 |
| SHA-256        | 1.95  | 1.72   | 1.68   | 1.64 |
| BLAKE2b        | 2.94  | 2.59   | 2.59   | 2.59 |
| KangarooTwelve | 4.09  | 3.65   | 3.35   | 3.17 |
| SHA-512/256    | 5.54  | 5.08   | 4.71   | 4.48 |
| SHAKE128       | 6.95  | 6.23   | 5.71   | 5.49 |
| SHA3-256       | 8.29  | 7.35   | 7.04   | 6.81 |
|===

Note that no mid- to high-end Intel processors provide acceleration.
AMD Ryzen and some ARM64 processors do.

== Summary

The algorithms with the greatest implementation availability are
SHA-256, SHA3-256, BLAKE2b, and SHAKE128.

In terms of command-line availability, BLAKE2b, SHA-256, SHA-512/256,
and SHA3-256 should be available in the near future on a reasonably
small Debian, Ubuntu, or Fedora install.

As far as security goes, the most conservative choices appear to be
SHA-256, SHA-512/256, and SHA3-256.

The performance winners are BLAKE2b unaccelerated and SHA-256
accelerated.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204
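The throughput methodology in the first table (KiB/s for repeated
fixed-size chunks, in the spirit of `openssl speed`) can be
approximated with a short harness. This is an illustrative sketch using
Python's hashlib, not the actual C test-tool helper that produced the
numbers above, so absolute figures will differ:

```python
import hashlib
import time

def speed_kib_s(new_hash, chunk_size, duration=0.1):
    """Hash chunk_size-byte buffers repeatedly for ~duration seconds
    and report throughput in KiB/s."""
    buf = b"\x00" * chunk_size
    hashed = 0
    start = time.perf_counter()
    while time.perf_counter() - start < duration:
        new_hash(buf).digest()
        hashed += chunk_size
    elapsed = time.perf_counter() - start
    return hashed / elapsed / 1024

# One row per algorithm, one column per chunk size, as in the table.
for name, ctor in [("SHA-1", hashlib.sha1),
                   ("SHA-256", hashlib.sha256),
                   ("SHA3-256", hashlib.sha3_256)]:
    row = [speed_kib_s(ctor, size) for size in (256, 1024, 8192, 16384)]
    print(name, ["%.0f" % kib for kib in row])
```

Small chunk sizes mostly measure per-call setup and finalization
overhead, which is why the 256 B column above diverges so much from the
16 KiB column for the same algorithm.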
* Re: Hash algorithm analysis

From: Jonathan Nieder @ 2018-06-11 19:29 UTC (permalink / raw)
To: brian m. carlson
Cc: git, Johannes Schindelin, demerphq, Linus Torvalds, Adam Langley,
    The Keccak Team

Hi,

brian m. carlson wrote:

> == Discussion of Candidates
>
> I've implemented and tested the following algorithms, all of which
> are 256-bit (in alphabetical order):

Thanks for this. Where can I read your code?

[...]
> I also rejected some other candidates. I couldn't find any reference
> or implementation of SHA256×16, so I didn't implement it.

Reference: https://eprint.iacr.org/2012/476.pdf

If consensus turns toward it being the right hash function to use, then
we can pursue finding or writing a good high-quality implementation.
But I tend to suspect that the lack of wide implementation availability
is a reason to avoid it unless we find SHA-256 to be too slow.

[...]
> * BLAKE2bp, as implemented in libb2, uses OpenMP (and therefore
>   multithreading) by default. It was no longer possible to run the
>   testsuite with -j3 on my laptop in this configuration.

My understanding is that BLAKE2bp is better able to make use of SIMD
instructions than BLAKE2b. Is there a way to configure libb2 to take
advantage of that without multithreading? E.g.
https://github.com/sneves/blake2-avx2 looks promising for that.

[...]
> |===
> | Implementation             | 256 B  | 1 KiB  | 8 KiB  | 16 KiB |
> | SHA-1 (OpenSSL)            | 513963 | 685966 | 748993 | 754270 |
> | BLAKE2b (libb2)            | 488123 | 552839 | 576246 | 579292 |
> | SHA-512/256 (OpenSSL)      | 181177 | 349002 | 499113 | 495169 |
> | BLAKE2bp (libb2)           | 139891 | 344786 | 488390 | 522575 |
> | SHA-256 (OpenSSL)          | 264276 | 333560 | 357830 | 355761 |
> | KangarooTwelve             | 239305 | 307300 | 355257 | 364261 |
> | SHAKE128 (OpenSSL)         | 154775 | 253344 | 337811 | 346732 |
> | SHA3-256 (OpenSSL)         | 128597 | 185381 | 198931 | 207365 |
> | BLAKE2bp (libb2; threaded) | 12223  | 49306  | 132833 | 179616 |
> |===

That's a bit surprising, since my impression (e.g. in the SUPERCOP
benchmarks you cite) is that there are secure hash functions that allow
comparable or even faster performance than SHA-1 with large inputs on a
single core. In Git we also care about performance with small inputs,
creating a bit of a trade-off.

More on the subject of blake2b vs blake2bp:

- blake2b is faster for small inputs (under 1k, say). If this is
  important then we could set a threshold, e.g. 512 bytes, for
  switching to blake2bp.

- blake2b is supported in OpenSSL and likely to get x86-optimized
  versions in the future. blake2bp is not in OpenSSL.

[...]
> == Summary
>
> The algorithms with the greatest implementation availability are
> SHA-256, SHA3-256, BLAKE2b, and SHAKE128.
>
> In terms of command-line availability, BLAKE2b, SHA-256, SHA-512/256,
> and SHA3-256 should be available in the near future on a reasonably
> small Debian, Ubuntu, or Fedora install.
>
> As far as security, the most conservative choices appear to be
> SHA-256, SHA-512/256, and SHA3-256.

SHA-256x16 has the same security properties as SHA-256. Picking between
those two is a tradeoff between performance and implementation
availability.

My understanding is that all the algorithms we're discussing are
believed to be approximately equivalent in security. That's a strange
thing to say when e.g.
K12 uses fewer rounds than SHA3 of the same permutation, but it is my
understanding nonetheless. We don't know yet how these hash algorithms
will ultimately break.

My understanding of the discussion so far:

Keccak team encourages us[1] to consider a variant like K12 instead of
SHA3.

AGL explains[2] that the algorithms considered all seem like reasonable
choices and we should decide using factors like implementation ease and
performance.

If we choose a Keccak-based function, AGL also[3] encourages using a
variant like K12 instead of SHA3.

Dscho strongly prefers[4] SHA-256, because of

- wide implementation availability, including in future hardware
- having been widely analyzed
- being fast

Yves Orton and Linus Torvalds prefer[5] SHA3 over SHA2 because of how
it is constructed.

Thanks,
Jonathan

[1] https://public-inbox.org/git/91a34c5b-7844-3db2-cf29-411df5bcf886@noekeon.org/
[2] https://public-inbox.org/git/CAL9PXLzhPyE+geUdcLmd=pidT5P8eFEBbSgX_dS88knz2q_LSw@mail.gmail.com/
[3] https://www.imperialviolet.org/2017/05/31/skipsha3.html
[4] https://public-inbox.org/git/alpine.DEB.2.21.1.1706151122180.4200@virtualbox/
[5] https://public-inbox.org/git/CA+55aFwUn0KibpDQK2ZrxzXKOk8-aAub2nJZQqKCpq1ddhDcMQ@mail.gmail.com/
* Re: Hash algorithm analysis

From: Linus Torvalds @ 2018-06-11 20:20 UTC (permalink / raw)
To: Jonathan Nieder
Cc: brian m. carlson, Git Mailing List, Johannes Schindelin, demerphq,
    agl, keccak

On Mon, Jun 11, 2018 at 12:29 PM Jonathan Nieder <jrnieder@gmail.com> wrote:
>
> Yves Orton and Linus Torvalds prefer[5] SHA3 over SHA2 because of how
> it is constructed.

Yeah, I really think that it's a mistake to switch to something that
has the same problem SHA1 had.

That doesn't necessarily mean SHA3, but it does mean "bigger
intermediate hash state" (so no length extension attack), which could
be SHA3, but also SHA-512/256 or K12.

Honestly, git has effectively already moved from SHA1 to SHA1DC.

So the actual known attack and weakness of SHA1 should simply not be
part of the discussion for the next hash. You can basically say "we're
_already_ on the second hash, we just picked one that was so compatible
with SHA1 that nobody even really noticed".

The reasons to switch are

 (a) 160 bits may not be enough

 (b) maybe there are other weaknesses in SHA1 that SHA1DC doesn't catch

 (c) others?

Obviously all of the choices address (a).

But at least for me, (b) makes me go "well, SHA2 has the exact same
weak inter-block state attack, so if there are unknown weaknesses in
SHA1, then what about unknown weaknesses in SHA2"?

And no, I'm not a cryptographer. But honestly, length extension attacks
were how both md5 and sha1 were broken in practice, so I'm just going
"why would we go with a crypto choice that has that known weakness?
That's just crazy".

From a performance standpoint, I have to say (once more) that crypto
performance actually mattered a lot less than I originally thought it
would. Yes, there are phases that do care, but they are rare.
For example, I think SHA1 performance has probably mattered most for
the index and pack-file, where it's really only used as a fancy CRC.
For most individual object cases, it is almost never an issue.

From a performance angle, I think the whole "256-bit hashes are bigger"
issue is going to be the more noticeable one, just because things like
delta compression and zlib - both of which are very *real* and present
performance issues - will have more data that they need to work on. The
performance difference between different hashing functions is likely
not all that noticeable in most common cases as long as we're not
talking orders of magnitude.

And yes, I guess we're in the "approaching an order of magnitude"
performance difference, but we should actually compare not to OpenSSL
SHA1, but to SHA1DC. See above.

Personally, the fact that the Keccak people would suggest K12 makes me
think that should be a front-runner, but whatever. I don't think the
128-bit preimage case is an issue, since 128 bits is the brute-force
cost for any 256-bit hash.

But hey, I picked sha1 to begin with, so take any input from me with
that historical pinch of salt in mind ;)

              Linus
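The "bigger intermediate hash state" point can be made concrete by
comparing each candidate's internal chaining state against its output
size. The numbers below are standard, published properties of the
algorithms, tabulated here purely for illustration:

```python
# Internal chaining/state size vs. output size, in bits. A
# Merkle-Damgard hash whose chaining state equals its output (SHA-1,
# SHA-256) is "narrow-pipe": colliding the internal state is no harder
# than colliding the output, which is the property length-extension
# and SHAttered-style attacks exploit. Wide-pipe and sponge designs
# keep extra hidden state.
designs = {
    #  name          (state bits, output bits)
    "SHA-1":       (160, 160),
    "SHA-256":     (256, 256),
    "SHA-512/256": (512, 256),   # wide pipe: truncated SHA-512
    "SHA3-256":    (1600, 256),  # Keccak sponge, 512-bit capacity
    "K12":         (1600, 256),  # same permutation, fewer rounds
}

narrow_pipe = [name for name, (state, out) in designs.items()
               if state == out]
print(narrow_pipe)  # -> ['SHA-1', 'SHA-256']
```

This is exactly the split in the thread: SHA-256 shares SHA-1's
narrow-pipe structure, while SHA-512/256, SHA3-256, and K12 all keep an
internal state wider than their 256-bit output.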
* Re: Hash algorithm analysis

From: Ævar Arnfjörð Bjarmason @ 2018-06-11 23:27 UTC (permalink / raw)
To: Linus Torvalds
Cc: Jonathan Nieder, brian m. carlson, Git Mailing List,
    Johannes Schindelin, demerphq, agl, keccak

On Mon, Jun 11 2018, Linus Torvalds wrote:

> On Mon, Jun 11, 2018 at 12:29 PM Jonathan Nieder <jrnieder@gmail.com> wrote:
>>
>> Yves Orton and Linus Torvalds prefer[5] SHA3 over SHA2 because of
>> how it is constructed.
>
> Yeah, I really think that it's a mistake to switch to something that
> has the same problem SHA1 had.
>
> That doesn't necessarily mean SHA3, but it does mean "bigger
> intermediate hash state" (so no length extension attack), which could
> be SHA3, but also SHA-512/256 or K12.
>
> Honestly, git has effectively already moved from SHA1 to SHA1DC.
>
> So the actual known attack and weakness of SHA1 should simply not be
> part of the discussion for the next hash. You can basically say
> "we're _already_ on the second hash, we just picked one that was so
> compatible with SHA1 that nobody even really noticed".
>
> The reasons to switch are
>
>  (a) 160 bits may not be enough
>
>  (b) maybe there are other weaknesses in SHA1 that SHA1DC doesn't
>      catch
>
>  (c) others?
>
> Obviously all of the choices address (a).

FWIW I updated our docs 3 months ago to try to address some of this:
https://github.com/git/git/commit/5988eb631a

> But at least for me, (b) makes me go "well, SHA2 has the exact same
> weak inter-block state attack, so if there are unknown weaknesses in
> SHA1, then what about unknown weaknesses in SHA2"?
>
> And no, I'm not a cryptographer. But honestly, length extension
> attacks were how both md5 and sha1 were broken in practice, so I'm
> just going "why would we go with a crypto choice that has that known
> weakness? That's just crazy".
What do you think about Johannes's summary of this being a non-issue
for Git in
https://public-inbox.org/git/alpine.DEB.2.21.1.1706151122180.4200@virtualbox/
?

> From a performance standpoint, I have to say (once more) that crypto
> performance actually mattered a lot less than I originally thought it
> would. Yes, there are phases that do care, but they are rare.

One real-world case is rebasing[1]. As noted in that E-Mail of mine a
year ago, we can use SHA1DC v.s. OpenSSL as a stand-in for the sort of
performance difference we might expect between hash functions, although
as you note this doesn't account for the difference in length.

With our perf tests, in t/perf on linux.git:

    $ GIT_PERF_LARGE_REPO=~/g/linux GIT_PERF_REPEAT_COUNT=10 \
      GIT_PERF_MAKE_COMMAND='if pwd | grep -q $(git rev-parse origin/master); then make -j8 CFLAGS=-O3 DC_SHA1=Y; else make -j8 CFLAGS=-O3 OPENSSL_SHA1=Y; fi' \
      ./run origin/master~ origin/master -- p3400-rebase.sh

    Test                                                          origin/master~   origin/master
    --------------------------------------------------------------------------------------------------------
    3400.2: rebase on top of a lot of unrelated changes           1.38(1.19+0.11)  1.40(1.23+0.10) +1.4%
    3400.4: rebase a lot of unrelated changes without split-index 4.07(3.28+0.66)  4.62(3.71+0.76) +13.5%
    3400.6: rebase a lot of unrelated changes with split-index    3.41(2.94+0.38)  3.35(2.87+0.37) -1.8%

On a bigger monorepo I have here:

    Test                                                          origin/master~   origin/master
    -------------------------------------------------------------------------------------------------------
    3400.2: rebase on top of a lot of unrelated changes           1.39(1.19+0.17)  1.34(1.16+0.16) -3.6%
    3400.4: rebase a lot of unrelated changes without split-index 6.67(3.37+0.63)  6.95(3.90+0.62) +4.2%
    3400.6: rebase a lot of unrelated changes with split-index    3.70(2.85+0.45)  3.73(2.85+0.41) +0.8%

I didn't paste any numbers in that E-Mail a year ago, and maybe I
produced them differently, but this is clearly not that big of a
difference.
But this is one way to see the difference.

> For example, I think SHA1 performance has probably mattered most for
> the index and pack-file, where it's really only used as a fancy CRC.
> For most individual object cases, it is almost never an issue.

Yeah, there's lots of things we could optimize there, but we are going
to need to hash things to create the commit in e.g. the rebase case,
though much of that could probably be done more efficiently without
switching the hash.

> From a performance angle, I think the whole "256-bit hashes are
> bigger" issue is going to be the more noticeable one, just because
> things like delta compression and zlib - both of which are very
> *real* and present performance issues - will have more data that they
> need to work on. The performance difference between different hashing
> functions is likely not all that noticeable in most common cases as
> long as we're not talking orders of magnitude.
>
> And yes, I guess we're in the "approaching an order of magnitude"
> performance difference, but we should actually compare not to OpenSSL
> SHA1, but to SHA1DC. See above.
>
> Personally, the fact that the Keccak people would suggest K12 makes
> me think that should be a front-runner, but whatever. I don't think
> the 128-bit preimage case is an issue, since 128 bits is the
> brute-force cost for any 256-bit hash.
>
> But hey, I picked sha1 to begin with, so take any input from me with
> that historical pinch of salt in mind ;)

1. https://public-inbox.org/git/87tw3f8vez.fsf@gmail.com/
* Re: Hash algorithm analysis

From: David Lang @ 2018-06-12 0:11 UTC (permalink / raw)
To: Ævar Arnfjörð Bjarmason
Cc: Linus Torvalds, Jonathan Nieder, brian m. carlson,
    Git Mailing List, Johannes Schindelin, demerphq, agl, keccak

On Tue, 12 Jun 2018, Ævar Arnfjörð Bjarmason wrote:

>> From a performance standpoint, I have to say (once more) that crypto
>> performance actually mattered a lot less than I originally thought
>> it would. Yes, there are phases that do care, but they are rare.
>
> One real-world case is rebasing[1]. As noted in that E-Mail of mine a
> year ago we can use SHA1DC v.s. OpenSSL as a stand-in for the sort of
> performance difference we might expect between hash functions,
> although as you note this doesn't account for the difference in
> length.

When you are rebasing, how many hashes do you need to calculate? A few
dozen, a few hundred, a few thousand, a few hundred thousand?

If the common uses of rebasing are on the low end, then the fact that
the hash takes a bit longer won't matter much because the entire job is
so fast. And at the high end, I/O will probably dominate.

So where does it really make a human-visible difference?

David Lang
* Re: Hash algorithm analysis

From: Linus Torvalds @ 2018-06-12 0:45 UTC (permalink / raw)
To: Ævar Arnfjörð Bjarmason
Cc: Jonathan Nieder, brian m. carlson, Git Mailing List,
    Johannes Schindelin, demerphq, agl, keccak

On Mon, Jun 11, 2018 at 4:27 PM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
>> And no, I'm not a cryptographer. But honestly, length extension
>> attacks were how both md5 and sha1 were broken in practice, so I'm
>> just going "why would we go with a crypto choice that has that known
>> weakness? That's just crazy".
>
> What do you think about Johannes's summary of this being a non-issue
> for Git in
> https://public-inbox.org/git/alpine.DEB.2.21.1.1706151122180.4200@virtualbox/
> ?

I agree that the fact that git internal data is structured and all
meaningful (and doesn't really have ignored state) makes it *much*
harder to attack the basic git objects, since you not only have to
generate a good hash, the end result has to also *parse*, and there is
not really any hidden non-parsed data that you can use to hide the
attack.

And *if* you are using git for source code, the same is pretty much
true even for the blob objects - an attacking object will stand out
like a sore thumb in "diff" etc.

So I don't disagree with Johannes in that sense: I think git does
fundamentally tend to have some extra validation in place, and there's
a reason why the examples for both the md5 and the sha1 attacks were
pdf files.

That said, even if git internal ("metadata") objects like trees and
commits tend to not have opaque parts to them and are thus pretty hard
to attack, the blob objects are still an attack vector for projects
that use git for non-source-code (and even source projects do embed
binary files - including pdf files - even though they might not be "as
interesting" to attack).
So you do want to protect those too.

And hey, protecting the metadata objects is good just to protect
against annoyances. Sure, you should always sanity check the object at
receive time anyway, but even so, if somebody is able to generate a
blob object that hashes to the same hash as a metadata object (ie tree
or commit), that really could be pretty damn annoying.

And the whole "intermediate hashed state is the same size as the final
hash state" just _fundamentally_ means that if you find a weakness in
the hash, you can now attack that weakness without having to worry
about the attack being fundamentally more expensive.

That's essentially what SHAttered relied on. It didn't rely on a secret
and a hash and length extension, but it *did* rely on the same
mechanism that a length extension attack relies on, where you can
basically attack the state in the middle with no extra cost.

Maybe some people don't consider it a length extension attack for that
reason, but it boils down to much the same basic situation where you
can attack the internal hash state and cause a state collision. And you
can try to find the patterns that then cause that state collision when
you've found a weakness in the hash.

With SHA3 or k12, you can obviously _also_ try to attack the hash state
and cause a collision, but because the intermediate state is much
bigger than the final hash, you're just making things *way* harder for
yourself if you try that.

              Linus
* Re: Hash algorithm analysis

From: brian m. carlson @ 2018-06-11 22:35 UTC (permalink / raw)
To: Jonathan Nieder
Cc: git, Johannes Schindelin, demerphq, Linus Torvalds, Adam Langley,
    The Keccak Team

On Mon, Jun 11, 2018 at 12:29:42PM -0700, Jonathan Nieder wrote:
> brian m. carlson wrote:
>
> > == Discussion of Candidates
> >
> > I've implemented and tested the following algorithms, all of which
> > are 256-bit (in alphabetical order):
>
> Thanks for this. Where can I read your code?

https://github.com/bk2204/git.git, test-hashes branch. You will need
to have libb2 and OPENSSL_SHA1 set. It's a bit of a hack, so don't
look too hard.

> [...]
> > I also rejected some other candidates. I couldn't find any
> > reference or implementation of SHA256×16, so I didn't implement it.
>
> Reference: https://eprint.iacr.org/2012/476.pdf

Thanks for that reference.

> If consensus turns toward it being the right hash function to use,
> then we can pursue finding or writing a good high-quality
> implementation. But I tend to suspect that the lack of wide
> implementation availability is a reason to avoid it unless we find
> SHA-256 to be too slow.

I agree. Implementation availability is important. Whatever we
provide is likely going to be portable C code, which is going to be
slower than an optimized implementation.

> [...]
> > * BLAKE2bp, as implemented in libb2, uses OpenMP (and therefore
> >   multithreading) by default. It was no longer possible to run the
> >   testsuite with -j3 on my laptop in this configuration.
>
> My understanding is that BLAKE2bp is better able to make use of SIMD
> instructions than BLAKE2b. Is there a way to configure libb2 to take
> advantage of that without multithreading?
You'll notice below that I have both BLAKE2bp with and without threading. I recompiled libb2 to not use threading, and it still didn't perform as well. libb2 is written by the authors of BLAKE2, so it's the most favorable implementation we're likely to get. > [...]
> > |===
> > | Implementation | 256 B | 1 KiB | 8 KiB | 16 KiB |
> > | SHA-1 (OpenSSL) | 513963 | 685966 | 748993 | 754270 |
> > | BLAKE2b (libb2) | 488123 | 552839 | 576246 | 579292 |
> > | SHA-512/256 (OpenSSL) | 181177 | 349002 | 499113 | 495169 |
> > | BLAKE2bp (libb2) | 139891 | 344786 | 488390 | 522575 |
> > | SHA-256 (OpenSSL) | 264276 | 333560 | 357830 | 355761 |
> > | KangarooTwelve | 239305 | 307300 | 355257 | 364261 |
> > | SHAKE128 (OpenSSL) | 154775 | 253344 | 337811 | 346732 |
> > | SHA3-256 (OpenSSL) | 128597 | 185381 | 198931 | 207365 |
> > | BLAKE2bp (libb2; threaded) | 12223 | 49306 | 132833 | 179616 |
> > |===
> > That's a bit surprising, since my impression (e.g. in the SUPERCOP benchmarks you cite) is that there are secure hash functions that allow comparable or even faster performance than SHA-1 with large inputs on a single core. In Git we also care about performance with small inputs, creating a bit of a trade-off. > > More on the subject of blake2b vs blake2bp: > > - blake2b is faster for small inputs (under 1k, say). If this is important then we could set a threshold, e.g. 512 bytes, for switching to blake2bp. > > - blake2b is supported in OpenSSL and likely to get x86-optimized versions in the future. blake2bp is not in OpenSSL. Correct. BLAKE2b in OpenSSL is currently 512-bit only, but support for 256-bit digests is expected to be added soon. I think the benefit of sticking to a single hash function is significant, so we should pick one that has good all-around performance instead of trying to split between different ones. > My understanding is that all the algorithms we're discussing are believed to be approximately equivalent in security.
That's a strange > thing to say when e.g. K12 uses fewer rounds than SHA3 of the same > permutation, but it is my understanding nonetheless. We don't know > yet how these hash algorithms will ultimately break. With the exception of variations in preimage security, I expect that's correct. I think implementation availability and performance are the best candidates for consideration. > My understanding of the discussion so far: > > Keccak team encourages us[1] to consider a variant like K12 instead of > SHA3. While I think K12 is an interesting algorithm, I'm not sure we're going to get as good of performance out of it as we might want due to the lack of implementations. -- brian m. carlson: Houston, Texas, US OpenPGP: https://keybase.io/bk2204 [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 867 bytes --] ^ permalink raw reply [flat|nested] 66+ messages in thread
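[Editorial sketch: a rough way to run this kind of per-input-size comparison with Python's hashlib. This is not brian's actual benchmark harness, and absolute numbers depend heavily on the OpenSSL build backing hashlib; it only illustrates the shape of the measurement.]

```python
import hashlib
import time

# Candidates available in Python's hashlib (OpenSSL-backed where possible).
ALGOS = ["sha1", "sha256", "sha512", "sha3_256", "blake2b"]
SIZES = [256, 1024, 8192, 16384]

def throughput_kib_s(name, size, min_seconds=0.25):
    """Hash size-byte buffers in a loop; report KiB hashed per second."""
    data = b"\x00" * size
    ctor = getattr(hashlib, name)
    n = 0
    start = time.perf_counter()
    while time.perf_counter() - start < min_seconds:
        ctor(data).digest()
        n += 1
    elapsed = time.perf_counter() - start
    return n * size / 1024 / elapsed

print(f"{'algorithm':<10}" + "".join(f"{s:>10}" for s in SIZES))
for name in ALGOS:
    row = "".join(f"{throughput_kib_s(name, s):>10.0f}" for s in SIZES)
    print(f"{name:<10}{row}")
```

As in the table above, small inputs mostly measure per-call overhead while large inputs measure raw compression-function speed, which is why the two can rank algorithms differently.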
* Re: Hash algorithm analysis 2018-06-11 22:35 ` brian m. carlson @ 2018-06-12 16:21 ` Gilles Van Assche 2018-06-13 23:58 ` brian m. carlson 0 siblings, 1 reply; 66+ messages in thread From: Gilles Van Assche @ 2018-06-12 16:21 UTC (permalink / raw) To: brian m. carlson Cc: Jonathan Nieder, git, Johannes Schindelin, demerphq, Linus Torvalds, Adam Langley, Keccak Team Hi, On 10/06/18 00:49, brian m. carlson wrote: > I imported the optimized 64-bit implementation of KangarooTwelve. The > AVX2 implementation was not considered for licensing reasons (it's > partially generated from external code, which falls foul of the GPL's > "preferred form for modifications" rule). Indeed part of the AVX2 code in the Keccak code package is an extension of the implementation in OpenSSL (written by Andy Polyakov). The assembly code is generated by a Perl script, and we extended it to fit in the KCP's internal API. Would it solve this licensing problem if we remap our extensions to the Perl script, which would then become "the source"? On 12/06/18 00:35, brian m. carlson wrote: >> My understanding is that all the algorithms we're discussing are >> believed to be approximately equivalent in security. That's a strange >> thing to say when e.g. K12 uses fewer rounds than SHA3 of the same >> permutation, but it is my understanding nonetheless. We don't know >> yet how these hash algorithms will ultimately break. > > With the exception of variations in preimage security, I expect that's > correct. I think implementation availability and performance are the > best candidates for consideration. Note that we recently updated the paper on K12 (accepted at ACNS 2018), with more details on performance and security. https://eprint.iacr.org/2016/770 >> My understanding of the discussion so far: >> >> Keccak team encourages us[1] to consider a variant like K12 instead >> of SHA3. 
> > While I think K12 is an interesting algorithm, I'm not sure we're > going to get as good of performance out of it as we might want due to > the lack of implementations. Implementation availability is indeed important. The effort to transform an implementation of SHAKE128 into one of K12 is limited due to the reuse of their main components (round function, sponge construction). So the availability of SHA-3/Keccak implementations can benefit that of K12 if there is sufficient interest. E.g., the SHA-3/Keccak instructions in ARMv8.2 can speed up K12 as well. Kind regards, Gilles ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Hash algorithm analysis 2018-06-12 16:21 ` Gilles Van Assche @ 2018-06-13 23:58 ` brian m. carlson 2018-06-15 10:33 ` Gilles Van Assche 0 siblings, 1 reply; 66+ messages in thread From: brian m. carlson @ 2018-06-13 23:58 UTC (permalink / raw) To: Gilles Van Assche Cc: Jonathan Nieder, git, Johannes Schindelin, demerphq, Linus Torvalds, Adam Langley, Keccak Team [-- Attachment #1: Type: text/plain, Size: 2187 bytes --] On Tue, Jun 12, 2018 at 06:21:21PM +0200, Gilles Van Assche wrote: > Hi, > > On 10/06/18 00:49, brian m. carlson wrote: > > I imported the optimized 64-bit implementation of KangarooTwelve. The > > AVX2 implementation was not considered for licensing reasons (it's > > partially generated from external code, which falls foul of the GPL's > > "preferred form for modifications" rule). > > Indeed part of the AVX2 code in the Keccak code package is an extension > of the implementation in OpenSSL (written by Andy Polyakov). The > assembly code is generated by a Perl script, and we extended it to fit > in the KCP's internal API. > > Would it solve this licensing problem if we remap our extensions to the > Perl script, which would then become "the source"? The GPLv2 requires "the preferred form of the work for making modifications to it". If that form is the Perl script, then yes, that would be sufficient. If your code is dissimilar enough that editing it directly is better than editing the Perl script, then it might already meet the definition. I don't do assembly programming, so I don't know what forms one generally wants for editing assembly. Apparently OpenSSL wants a Perl script, but that is, I understand, less common. What would you use if you were going to improve it? > On 12/06/18 00:35, brian m. carlson wrote: > > While I think K12 is an interesting algorithm, I'm not sure we're > > going to get as good of performance out of it as we might want due to > > the lack of implementations. > > Implementation availability is indeed important. 
The effort to transform > an implementation of SHAKE128 into one of K12 is limited due to the > reuse of their main components (round function, sponge construction). So > the availability of SHA-3/Keccak implementations can benefit that of K12 > if there is sufficient interest. E.g., the SHA-3/Keccak instructions in > ARMv8.2 can speed up K12 as well. That's good to know. I wasn't aware that ARM was providing Keccak instructions, but it's good to see that new chips are providing them. -- brian m. carlson: Houston, Texas, US OpenPGP: https://keybase.io/bk2204 [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 867 bytes --] ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Hash algorithm analysis 2018-06-13 23:58 ` brian m. carlson @ 2018-06-15 10:33 ` Gilles Van Assche 0 siblings, 0 replies; 66+ messages in thread From: Gilles Van Assche @ 2018-06-15 10:33 UTC (permalink / raw) To: brian m. carlson Cc: Jonathan Nieder, git, Johannes Schindelin, demerphq, Linus Torvalds, Adam Langley, Keccak Team On 14/06/18 01:58, brian m. carlson wrote: >>> I imported the optimized 64-bit implementation of KangarooTwelve. >>> The AVX2 implementation was not considered for licensing reasons >>> (it's partially generated from external code, which falls foul of >>> the GPL's "preferred form for modifications" rule). >> >> Indeed part of the AVX2 code in the Keccak code package is an >> extension of the implementation in OpenSSL (written by Andy >> Polyakov). The assembly code is generated by a Perl script, and we >> extended it to fit in the KCP's internal API. >> >> Would it solve this licensing problem if we remap our extensions to >> the Perl script, which would then become "the source"? > > The GPLv2 requires "the preferred form of the work for making > modifications to it". If that form is the Perl script, then yes, that > would be sufficient. If your code is dissimilar enough that editing it > directly is better than editing the Perl script, then it might already > meet the definition. > > I don't do assembly programming, so I don't know what forms one > generally wants for editing assembly. Apparently OpenSSL wants a Perl > script, but that is, I understand, less common. What would you use if > you were going to improve it? The Perl script would be more flexible in case one needs to improve the implementation. It allows one to use meaningful symbolic names for the registers. My colleague Ronny, who did the extension, worked directly with physical register names and considered the output of the Perl script as his source. But this extension could probably be done also at the level of the Perl script. 
Kind regards, Gilles ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Hash algorithm analysis 2018-06-11 19:29 ` Jonathan Nieder 2018-06-11 20:20 ` Linus Torvalds 2018-06-11 22:35 ` brian m. carlson @ 2018-07-20 21:52 ` brian m. carlson 2018-07-21 0:31 ` Jonathan Nieder ` (3 more replies) 2 siblings, 4 replies; 66+ messages in thread From: brian m. carlson @ 2018-07-20 21:52 UTC (permalink / raw) To: Jonathan Nieder Cc: git, Johannes Schindelin, demerphq, Linus Torvalds, Adam Langley, The Keccak Team [-- Attachment #1: Type: text/plain, Size: 1334 bytes --] On Mon, Jun 11, 2018 at 12:29:42PM -0700, Jonathan Nieder wrote: > My understanding of the discussion so far: > > Keccak team encourages us[1] to consider a variant like K12 instead of > SHA3. > > AGL explains[2] that the algorithms considered all seem like > reasonable choices and we should decide using factors like > implementation ease and performance. > > If we choose a Keccak-based function, AGL also[3] encourages using a > variant like K12 instead of SHA3. > > Dscho strongly prefers[4] SHA-256, because of > - wide implementation availability, including in future hardware > - has been widely analyzed > - is fast > > Yves Orton and Linus Torvalds prefer[5] SHA3 over SHA2 because of how > it is constructed. I know this discussion has sort of petered out, but I'd like to see if we can revive it. I'm writing index v3 and having a decision would help me write tests for it. To summarize the discussion that's been had in addition to the above, Ævar has also stated a preference for SHA-256 and I would prefer BLAKE2b over SHA-256 over SHA3-256, although any of them would be fine. Are there other contributors who have a strong opinion? Are there things I can do to help us coalesce around an option? -- brian m. carlson: Houston, Texas, US OpenPGP: https://keybase.io/bk2204 [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 867 bytes --] ^ permalink raw reply [flat|nested] 66+ messages in thread
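[Editorial note: the three candidates brian ranks here all happen to be available in Python's hashlib, which makes it easy to check that each yields a 256-bit object ID; BLAKE2b is parameterizable, so a 256-bit output is selected via digest_size. A small sketch:]

```python
import hashlib

data = b"hello, world"

candidates = {
    "SHA-256": hashlib.sha256(data),
    "SHA3-256": hashlib.sha3_256(data),
    # BLAKE2b takes a digest_size parameter; 32 bytes gives a 256-bit output.
    "BLAKE2b-256": hashlib.blake2b(data, digest_size=32),
}

for name, h in candidates.items():
    hexdigest = h.hexdigest()
    assert len(hexdigest) == 64  # 256 bits = 64 hex characters
    print(f"{name:<12} {hexdigest}")
```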
* Re: Hash algorithm analysis 2018-07-20 21:52 ` brian m. carlson @ 2018-07-21 0:31 ` Jonathan Nieder 2018-07-21 19:52 ` Ævar Arnfjörð Bjarmason ` (2 subsequent siblings) 3 siblings, 0 replies; 66+ messages in thread From: Jonathan Nieder @ 2018-07-21 0:31 UTC (permalink / raw) To: brian m. carlson Cc: git, Johannes Schindelin, demerphq, Linus Torvalds, Adam Langley, The Keccak Team Hi, brian m. carlson wrote: > I know this discussion has sort of petered out, but I'd like to see if > we can revive it. I'm writing index v3 and having a decision would help > me write tests for it. Nice! That's awesome. > To summarize the discussion that's been had in addition to the above, > Ævar has also stated a preference for SHA-256 and I would prefer BLAKE2b > over SHA-256 over SHA3-256, although any of them would be fine. > > Are there other contributors who have a strong opinion? Are there > things I can do to help us coalesce around an option? My advice would be to go with BLAKE2b. If anyone objects, we can deal with their objection at that point (and I volunteer to help with cleaning up any mess in rewriting test cases that this advice causes). Full disclosure: my preference order (speaking for myself and no one else) is K12 > BLAKE2bp-256 > SHA-256x16 > BLAKE2b > SHA-256 > SHA-512/256 > SHA3-256 so I'm not just asking you to go with my favorite. ;-) Jonathan ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Hash algorithm analysis 2018-07-20 21:52 ` brian m. carlson 2018-07-21 0:31 ` Jonathan Nieder @ 2018-07-21 19:52 ` Ævar Arnfjörð Bjarmason 2018-07-21 20:25 ` brian m. carlson 2018-07-21 22:38 ` Johannes Schindelin 2018-07-24 19:01 ` Edward Thomson 3 siblings, 1 reply; 66+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2018-07-21 19:52 UTC (permalink / raw) To: brian m. carlson Cc: Jonathan Nieder, git, Johannes Schindelin, demerphq, Linus Torvalds, Adam Langley, The Keccak Team On Fri, Jul 20 2018, brian m. carlson wrote: > On Mon, Jun 11, 2018 at 12:29:42PM -0700, Jonathan Nieder wrote: >> My understanding of the discussion so far: >> >> Keccak team encourages us[1] to consider a variant like K12 instead of >> SHA3. >> >> AGL explains[2] that the algorithms considered all seem like >> reasonable choices and we should decide using factors like >> implementation ease and performance. >> >> If we choose a Keccak-based function, AGL also[3] encourages using a >> variant like K12 instead of SHA3. >> >> Dscho strongly prefers[4] SHA-256, because of >> - wide implementation availability, including in future hardware >> - has been widely analyzed >> - is fast >> >> Yves Orton and Linus Torvalds prefer[5] SHA3 over SHA2 because of how >> it is constructed. > > I know this discussion has sort of petered out, but I'd like to see if > we can revive it. I'm writing index v3 and having a decision would help > me write tests for it. > > To summarize the discussion that's been had in addition to the above, > Ævar has also stated a preference for SHA-256 and I would prefer BLAKE2b > over SHA-256 over SHA3-256, although any of them would be fine. > > Are there other contributors who have a strong opinion? Are there > things I can do to help us coalesce around an option? 
I have a vague recollection of suggesting something similar in the past, but can't find that E-Mail (and maybe it never happened), but for testing purposes isn't it simplest if we just have some "test SHA-1" algorithm where we pretend that all inputs like "STRING" are really "PREFIX-STRING" for the purposes of hashing, or fake shortening / lengthening the hash to test arbitrary lengths of N (just repeating the hash from the beginning is probably good enough...). That would make such patches easier to review, since we wouldn't need to carry hundreds/thousands of lines of dense hashing code, but a more trivial wrapper around SHA-1, and we could have some test mode where we could compile & run tests with an arbitrary hash length to make sure everything's future-proof even after we move to NewHash. ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Hash algorithm analysis 2018-07-21 19:52 ` Ævar Arnfjörð Bjarmason @ 2018-07-21 20:25 ` brian m. carlson 0 siblings, 0 replies; 66+ messages in thread From: brian m. carlson @ 2018-07-21 20:25 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: Jonathan Nieder, git, Johannes Schindelin, demerphq, Linus Torvalds, Adam Langley, The Keccak Team [-- Attachment #1: Type: text/plain, Size: 2925 bytes --] On Sat, Jul 21, 2018 at 09:52:05PM +0200, Ævar Arnfjörð Bjarmason wrote: > > On Fri, Jul 20 2018, brian m. carlson wrote: > > I know this discussion has sort of petered out, but I'd like to see if > > we can revive it. I'm writing index v3 and having a decision would help > > me write tests for it. > > > > To summarize the discussion that's been had in addition to the above, > > Ævar has also stated a preference for SHA-256 and I would prefer BLAKE2b > > over SHA-256 over SHA3-256, although any of them would be fine. > > > > Are there other contributors who have a strong opinion? Are there > > things I can do to help us coalesce around an option? > > I have a vague recollection of suggesting something similar in the past, > but can't find that E-Mail (and maybe it never happened), but for > testing purposes isn't it simplest if we just have some "test SHA-1" > algorithm where we pretend that all inputs like "STRING" are really > "PREFIX-STRING" for the purposes of hashing, or fake shortening / > lengthening the hash to test arbitrary lengths of N (just repeating > the hash from the beginning is probably good enough...). > > That would make such patches easier to review, since we wouldn't need to > carry hundreds/thousands of lines of dense hashing code, but a more > trivial wrapper around SHA-1, and we could have some test mode where we > could compile & run tests with an arbitrary hash length to make sure > everything's future-proof even after we move to NewHash. I think Stefan suggested this approach.
It is viable for testing some aspects of the code, but not others. It doesn't work for synthesizing partial collisions or the bisect tests (since bisect falls back to object ID as a disambiguator). I had tried this approach (using a single zero-byte as a prefix), but for whatever reason, it ended up producing inconsistent results when I hashed. I'm unclear what went wrong in that approach, but I finally discarded it after spending an hour or two staring at it. I'm not opposed to someone else providing it as an option, though. Also, after feedback from Eric Sunshine, I decided to adopt an approach for my hash-independent tests series that used the name of the hash within the tests so that we could support additional algorithms (such as a pseudo-SHA-1). That work necessarily involves having a name for the hash, which is why I haven't revisited it. As for arbitrary hash sizes, there is some code which necessarily needs to depend on a fixed hash size. A lot of our Perl code matches [0-9a-f]{40}, which needs to change. There's no reason we couldn't adopt such testing in the future, but it might end up being more complicated than we want. I have strived to reduce the dependence on fixed-size constants wherever possible, though. -- brian m. carlson: Houston, Texas, US OpenPGP: https://keybase.io/bk2204 [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 867 bytes --] ^ permalink raw reply [flat|nested] 66+ messages in thread
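[Editorial sketch of the `[0-9a-f]{40}` problem brian mentions: the usual fix is to parameterize the pattern on the hash's hex length instead of hard-coding 40. The real culprits are in git's Perl and shell tests; Python is used here only for illustration.]

```python
import re

# Hard-coded SHA-1 assumption -- never matches a 256-bit object ID:
SHA1_OID = re.compile(r"\b[0-9a-f]{40}\b")

def oid_re(hexsz):
    """Object-ID pattern parameterized on the hash's hex length
    (40 for SHA-1, 64 for a 256-bit NewHash)."""
    return re.compile(r"\b[0-9a-f]{%d}\b" % hexsz)

sha1_id = "e83c5163316f89bfbde7d9ab23ca2e25604af290"  # 40 hex chars
new_id = "a" * 64                                     # 64 hex chars

assert SHA1_OID.search(sha1_id)
assert SHA1_OID.search(new_id) is None  # the fixed pattern silently fails
assert oid_re(64).fullmatch(new_id)
```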
* Re: Hash algorithm analysis 2018-07-20 21:52 ` brian m. carlson 2018-07-21 0:31 ` Jonathan Nieder 2018-07-21 19:52 ` Ævar Arnfjörð Bjarmason @ 2018-07-21 22:38 ` Johannes Schindelin 2018-07-21 23:09 ` Linus Torvalds 2018-07-21 23:59 ` brian m. carlson 2018-07-24 19:01 ` Edward Thomson 3 siblings, 2 replies; 66+ messages in thread From: Johannes Schindelin @ 2018-07-21 22:38 UTC (permalink / raw) To: brian m. carlson Cc: Jonathan Nieder, git, demerphq, Linus Torvalds, Adam Langley, The Keccak Team [-- Attachment #1: Type: text/plain, Size: 1847 bytes --] Hi Brian, On Fri, 20 Jul 2018, brian m. carlson wrote: > On Mon, Jun 11, 2018 at 12:29:42PM -0700, Jonathan Nieder wrote: > > My understanding of the discussion so far: > > > > Keccak team encourages us[1] to consider a variant like K12 instead of > > SHA3. > > > > AGL explains[2] that the algorithms considered all seem like > > reasonable choices and we should decide using factors like > > implementation ease and performance. > > > > If we choose a Keccak-based function, AGL also[3] encourages using a > > variant like K12 instead of SHA3. > > > > Dscho strongly prefers[4] SHA-256, because of > > - wide implementation availability, including in future hardware > > - has been widely analyzed > > - is fast > > > > Yves Orton and Linus Torvalds prefer[5] SHA3 over SHA2 because of how > > it is constructed. > > I know this discussion has sort of petered out, but I'd like to see if > we can revive it. I'm writing index v3 and having a decision would help > me write tests for it. > > To summarize the discussion that's been had in addition to the above, > Ævar has also stated a preference for SHA-256 and I would prefer BLAKE2b > over SHA-256 over SHA3-256, although any of them would be fine. > > Are there other contributors who have a strong opinion? Are there > things I can do to help us coalesce around an option? Do you really want to value contributors' opinion more than cryptographers'? 
I mean, that's exactly what got us into this hard-coded SHA-1 mess in the first place. And to set the record straight: I do not have a strong preference regarding the hash algorithm. But cryptographers I have the incredible luck to have access to, by virtue of being a colleague, did mention their preference. I see no good reason to just blow their advice into the wind. Ciao, Dscho ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Hash algorithm analysis 2018-07-21 22:38 ` Johannes Schindelin @ 2018-07-21 23:09 ` Linus Torvalds 2018-07-21 23:59 ` brian m. carlson 1 sibling, 0 replies; 66+ messages in thread From: Linus Torvalds @ 2018-07-21 23:09 UTC (permalink / raw) To: Johannes Schindelin Cc: brian m. carlson, Jonathan Nieder, Git Mailing List, demerphq, Adam Langley, keccak On Sat, Jul 21, 2018 at 3:39 PM Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote: > > Do you really want to value contributors' opinion more than > cryptographers'? I mean, that's exactly what got us into this hard-coded > SHA-1 mess in the first place. Don't be silly. Other real cryptographers consider SHA256 to be a problem. Really. It's amenable to the same hack on the internal hash that made for the SHAttered break. So your argument that "cryptographers prefer SHA256" is simply not true. Your real argument is that you know at least one cryptographer that you work with that prefers it. Don't try to make that into some generic "cryptographers prefer it". It's not like cryptographers have issues with blake2b either, afaik. And blake2b really _does_ have very real advantages. If you can actually point to some "a large percentage of cryptographers prefer it", you'd have a point. But as it is, you don't have data, you have an anecdote, and you try to use that anecdote to put down other people's opinions. Intellectually dishonest, that is. Linus ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Hash algorithm analysis 2018-07-21 22:38 ` Johannes Schindelin 2018-07-21 23:09 ` Linus Torvalds @ 2018-07-21 23:59 ` brian m. carlson 2018-07-22 9:34 ` Eric Deplagne ` (2 more replies) 1 sibling, 3 replies; 66+ messages in thread From: brian m. carlson @ 2018-07-21 23:59 UTC (permalink / raw) To: Johannes Schindelin Cc: Jonathan Nieder, git, demerphq, Linus Torvalds, Adam Langley, The Keccak Team [-- Attachment #1: Type: text/plain, Size: 3089 bytes --] On Sun, Jul 22, 2018 at 12:38:41AM +0200, Johannes Schindelin wrote: > Do you really want to value contributors' opinion more than > cryptographers'? I mean, that's exactly what got us into this hard-coded > SHA-1 mess in the first place. I agree (believe me, of all people, I agree) that hard-coding SHA-1 was a bad choice in retrospect. But I've solicited contributors' opinions because the Git Project needs to make a decision *for this project* about the algorithm we're going to use going forward. > And to set the record straight: I do not have a strong preference of the > hash algorithm. But cryprographers I have the incredible luck to have > access to, by virtue of being a colleague, did mention their preference. I don't know your colleagues, and they haven't commented here. One person that has commented here is Adam Langley. It is my impression (and anyone is free to correct me if I'm incorrect) that he is indeed a cryptographer. To quote him[0]: I think this group can safely assume that SHA-256, SHA-512, BLAKE2, K12, etc are all secure to the extent that I don't believe that making comparisons between them on that axis is meaningful. Thus I think the question is primarily concerned with performance and implementation availability. […] So, overall, none of these choices should obviously be excluded. The considerations at this point are not cryptographic and the tradeoff between implementation ease and performance is one that the git community would have to make. 
I'm aware that cryptographers tend to prefer algorithms that have been studied longer over ones that have been studied less. They also prefer algorithms built in the open to ones developed behind closed doors. SHA-256 has the benefit that it has been studied for a long time, but it was also designed in secret by the NSA. SHA3-256 was created with significant study in the open, but is not as mature. BLAKE2b has been incorporated into standards like Argon2, but has been weakened slightly for performance. I'm not sure that there's a really obvious choice here. I'm at the point where to continue the work that I'm doing, I need to make a decision. I'm happy to follow the consensus if there is one, but it does not appear that there is. I will admit that I don't love making this decision by myself, because right now, whatever I pick, somebody is going to be unhappy. I want to state, unambiguously, that I'm trying to make a decision that is in the interests of the Git Project, the community, and our users. I'm happy to wait a few more days to see if a consensus develops; if so, I'll follow it. If we haven't come to one by, say, Wednesday, I'll make a decision and write my patches accordingly. The community is free, as always, to reject my patches if taking them is not in the interest of the project. [0] https://public-inbox.org/git/CAL9PXLzhPyE+geUdcLmd=pidT5P8eFEBbSgX_dS88knz2q_LSw@mail.gmail.com/ -- brian m. carlson: Houston, Texas, US OpenPGP: https://keybase.io/bk2204 [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 867 bytes --] ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Hash algorithm analysis 2018-07-21 23:59 ` brian m. carlson @ 2018-07-22 9:34 ` Eric Deplagne 2018-07-22 14:21 ` brian m. carlson 2018-07-22 15:23 ` Joan Daemen 2018-07-23 12:40 ` demerphq 2 siblings, 1 reply; 66+ messages in thread From: Eric Deplagne @ 2018-07-22 9:34 UTC (permalink / raw) To: brian m. carlson, Johannes Schindelin, Jonathan Nieder, git, demerphq, Linus Torvalds, Adam Langley, The Keccak Team [-- Attachment #1: Type: text/plain, Size: 3516 bytes --] On Sat, 21 Jul 2018 23:59:41 +0000, brian m. carlson wrote: > On Sun, Jul 22, 2018 at 12:38:41AM +0200, Johannes Schindelin wrote: > > Do you really want to value contributors' opinion more than > > cryptographers'? I mean, that's exactly what got us into this hard-coded > > SHA-1 mess in the first place. > > I agree (believe me, of all people, I agree) that hard-coding SHA-1 was > a bad choice in retrospect. But I've solicited contributors' opinions > because the Git Project needs to make a decision *for this project* > about the algorithm we're going to use going forward. > > > And to set the record straight: I do not have a strong preference of the > > hash algorithm. But cryprographers I have the incredible luck to have > > access to, by virtue of being a colleague, did mention their preference. > > I don't know your colleagues, and they haven't commented here. One > person that has commented here is Adam Langley. It is my impression > (and anyone is free to correct me if I'm incorrect) that he is indeed a > cryptographer. To quote him[0]: > > I think this group can safely assume that SHA-256, SHA-512, BLAKE2, > K12, etc are all secure to the extent that I don't believe that making > comparisons between them on that axis is meaningful. Thus I think the > question is primarily concerned with performance and implementation > availability. > > […] > > So, overall, none of these choices should obviously be excluded. 
The > considerations at this point are not cryptographic and the tradeoff > between implementation ease and performance is one that the git > community would have to make. Am I completely out of the game, or the statement that "the considerations at this point are not cryptographic" is just the wrongest ? I mean, if that was true, would we not be sticking to SHA1 ? > I'm aware that cryptographers tend to prefer algorithms that have been > studied longer over ones that have been studied less. They also prefer > algorithms built in the open to ones developed behind closed doors. > > SHA-256 has the benefit that it has been studied for a long time, but it > was also designed in secret by the NSA. SHA3-256 was created with > significant study in the open, but is not as mature. BLAKE2b has been > incorporated into standards like Argon2, but has been weakened slightly > for performance. > > I'm not sure that there's a really obvious choice here. > > I'm at the point where to continue the work that I'm doing, I need to > make a decision. I'm happy to follow the consensus if there is one, but > it does not appear that there is. > > I will admit that I don't love making this decision by myself, because > right now, whatever I pick, somebody is going to be unhappy. I want to > state, unambiguously, that I'm trying to make a decision that is in the > interests of the Git Project, the community, and our users. > > I'm happy to wait a few more days to see if a consensus develops; if so, > I'll follow it. If we haven't come to one by, say, Wednesday, I'll make > a decision and write my patches accordingly. The community is free, as > always, to reject my patches if taking them is not in the interest of > the project. > > [0] https://public-inbox.org/git/CAL9PXLzhPyE+geUdcLmd=pidT5P8eFEBbSgX_dS88knz2q_LSw@mail.gmail.com/ > -- > brian m. 
carlson: Houston, Texas, US > OpenPGP: https://keybase.io/bk2204 -- Eric Deplagne [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Hash algorithm analysis 2018-07-22 9:34 ` Eric Deplagne @ 2018-07-22 14:21 ` brian m. carlson 2018-07-22 14:55 ` Eric Deplagne 0 siblings, 1 reply; 66+ messages in thread From: brian m. carlson @ 2018-07-22 14:21 UTC (permalink / raw) To: Eric Deplagne Cc: Johannes Schindelin, Jonathan Nieder, git, demerphq, Linus Torvalds, Adam Langley, The Keccak Team [-- Attachment #1: Type: text/plain, Size: 1644 bytes --] On Sun, Jul 22, 2018 at 11:34:42AM +0200, Eric Deplagne wrote: > On Sat, 21 Jul 2018 23:59:41 +0000, brian m. carlson wrote: > > I don't know your colleagues, and they haven't commented here. One > > person that has commented here is Adam Langley. It is my impression > > (and anyone is free to correct me if I'm incorrect) that he is indeed a > > cryptographer. To quote him[0]: > > > > I think this group can safely assume that SHA-256, SHA-512, BLAKE2, > > K12, etc are all secure to the extent that I don't believe that making > > comparisons between them on that axis is meaningful. Thus I think the > > question is primarily concerned with performance and implementation > > availability. > > > > […] > > > > So, overall, none of these choices should obviously be excluded. The > > considerations at this point are not cryptographic and the tradeoff > > between implementation ease and performance is one that the git > > community would have to make. > > Am I completely out of the game, or the statement that > "the considerations at this point are not cryptographic" > is just the wrongest ? > > I mean, if that was true, would we not be sticking to SHA1 ? I snipped a portion of the context, but AGL was referring to the considerations involved in choosing from the proposed ones for NewHash. In context, he meant that the candidates for NewHash “are all secure” and are therefore a better choice than SHA-1. I think we can all agree that SHA-1 is weak and should be replaced. -- brian m. 
carlson: Houston, Texas, US OpenPGP: https://keybase.io/bk2204 [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 867 bytes --] ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Hash algorithm analysis 2018-07-22 14:21 ` brian m. carlson @ 2018-07-22 14:55 ` Eric Deplagne 2018-07-26 10:05 ` Johannes Schindelin 0 siblings, 1 reply; 66+ messages in thread From: Eric Deplagne @ 2018-07-22 14:55 UTC (permalink / raw) To: brian m. carlson, Eric Deplagne, Johannes Schindelin, Jonathan Nieder, git, demerphq, Linus Torvalds, Adam Langley, The Keccak Team [-- Attachment #1: Type: text/plain, Size: 1926 bytes --] On Sun, 22 Jul 2018 14:21:48 +0000, brian m. carlson wrote: > On Sun, Jul 22, 2018 at 11:34:42AM +0200, Eric Deplagne wrote: > > On Sat, 21 Jul 2018 23:59:41 +0000, brian m. carlson wrote: > > > I don't know your colleagues, and they haven't commented here. One > > > person that has commented here is Adam Langley. It is my impression > > > (and anyone is free to correct me if I'm incorrect) that he is indeed a > > > cryptographer. To quote him[0]: > > > > > > I think this group can safely assume that SHA-256, SHA-512, BLAKE2, > > > K12, etc are all secure to the extent that I don't believe that making > > > comparisons between them on that axis is meaningful. Thus I think the > > > question is primarily concerned with performance and implementation > > > availability. > > > > > > […] > > > > > > So, overall, none of these choices should obviously be excluded. The > > > considerations at this point are not cryptographic and the tradeoff > > > between implementation ease and performance is one that the git > > > community would have to make. > > > > Am I completely out of the game, or the statement that > > "the considerations at this point are not cryptographic" > > is just the wrongest ? > > > > I mean, if that was true, would we not be sticking to SHA1 ? > > I snipped a portion of the context, but AGL was referring to the > considerations involved in choosing from the proposed ones for NewHash. > In context, he meant that the candidates for NewHash “are all secure” > and are therefore a better choice than SHA-1. 
Maybe a little bit sensitive, but I really did read "we don't care if it's weak or strong, that's not the matter". > I think we can all agree that SHA-1 is weak and should be replaced. > -- > brian m. carlson: Houston, Texas, US > OpenPGP: https://keybase.io/bk2204 -- Eric Deplagne [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 189 bytes --] ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Hash algorithm analysis 2018-07-22 14:55 ` Eric Deplagne @ 2018-07-26 10:05 ` Johannes Schindelin 0 siblings, 0 replies; 66+ messages in thread From: Johannes Schindelin @ 2018-07-26 10:05 UTC (permalink / raw) To: Eric Deplagne Cc: brian m. carlson, Jonathan Nieder, git, demerphq, Linus Torvalds, Adam Langley, The Keccak Team [-- Attachment #1: Type: text/plain, Size: 3299 bytes --] Hi Eric, On Sun, 22 Jul 2018, Eric Deplagne wrote: > On Sun, 22 Jul 2018 14:21:48 +0000, brian m. carlson wrote: > > On Sun, Jul 22, 2018 at 11:34:42AM +0200, Eric Deplagne wrote: > > > On Sat, 21 Jul 2018 23:59:41 +0000, brian m. carlson wrote: > > > > I don't know your colleagues, and they haven't commented here. One > > > > person that has commented here is Adam Langley. It is my impression > > > > (and anyone is free to correct me if I'm incorrect) that he is indeed a > > > > cryptographer. To quote him[0]: > > > > > > > > I think this group can safely assume that SHA-256, SHA-512, BLAKE2, > > > > K12, etc are all secure to the extent that I don't believe that making > > > > comparisons between them on that axis is meaningful. Thus I think the > > > > question is primarily concerned with performance and implementation > > > > availability. > > > > > > > > […] > > > > > > > > So, overall, none of these choices should obviously be excluded. The > > > > considerations at this point are not cryptographic and the tradeoff > > > > between implementation ease and performance is one that the git > > > > community would have to make. > > > > > > Am I completely out of the game, or the statement that > > > "the considerations at this point are not cryptographic" > > > is just the wrongest ? > > > > > > I mean, if that was true, would we not be sticking to SHA1 ? > > > > I snipped a portion of the context, but AGL was referring to the > > considerations involved in choosing from the proposed ones for NewHash. 
> > In context, he meant that the candidates for NewHash “are all secure”
> > and are therefore a better choice than SHA-1.
>
> Maybe a little bit sensitive, but I really did read
> "we don't care if it's weak or strong, that's not the matter".

Thank you for your concern. I agree that we need to be careful in considering the security implications. We made that mistake before (IIRC there was a cryptographer who was essentially shouted off the list when he suggested *not* to hard-code SHA-1), and we should absolutely refrain from making that same mistake again.

> > I think we can all agree that SHA-1 is weak and should be replaced.

Indeed. So at this point, we have already excluded pretty much all the unsafe options (although it does concern me that BLAKE2b has been weakened purposefully; I understand the reasoning, but still). Which means that by now, considering the security implications of the cipher is no longer a criterion that helps us whittle down the candidates further.

So from my point of view, there are two criteria that can help us further:

- Which cipher is the least likely to be broken (or just weakened by new attacks)?

- As energy considerations are not only ecologically inspired, but also a matter of money for electricity: which cipher is most likely to get decent hardware support any time soon?

Even if my original degree (prime number theory) is closer to cryptanalysis than that of pretty much all other prolific core Git contributors, I do not want you to trust *my* word on answering those questions. Therefore, I will ask my colleagues to enter the hornet's nest that is this mailing list.

Ciao,
Dscho

^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Hash algorithm analysis 2018-07-21 23:59 ` brian m. carlson 2018-07-22 9:34 ` Eric Deplagne @ 2018-07-22 15:23 ` Joan Daemen 2018-07-22 18:54 ` Adam Langley 2018-07-26 10:31 ` Johannes Schindelin 2018-07-23 12:40 ` demerphq 2 siblings, 2 replies; 66+ messages in thread
From: Joan Daemen @ 2018-07-22 15:23 UTC (permalink / raw)
To: brian m. carlson, Johannes Schindelin, Jonathan Nieder, git, demerphq, Linus Torvalds, Adam Langley
Cc: Keccak Team

Dear all,

I wanted to react to some statements I read in this discussion. But first let me introduce myself. I'm Joan Daemen and I have been working in symmetric cryptography since 1988. Vincent Rijmen and I designed Rijndael, which was selected to become AES, and Guido Bertoni, Michael Peeters, Gilles Van Assche and I (the Keccak team, later extended with Ronny Van Keer) designed Keccak, which was selected to become SHA3. Of course, as a member of the Keccak team I'm biased in this discussion, but I'll try to keep it factual.

Adam Langley says:

I think this group can safely assume that SHA-256, SHA-512, BLAKE2, K12, etc are all secure to the extent that I don't believe that making comparisons between them on that axis is meaningful.

If cryptographic algorithms were never broken, this would be true. Actually, one can manage the risk by going for cryptographic algorithms with higher security assurance. In symmetric crypto one compares the security assurance of cryptographic algorithms by the amount of third-party cryptanalysis, and a good indication of that is the number of peer-reviewed papers published. People tend to believe that the SHA2 functions have received more third-party cryptanalysis than Keccak, but this is not supported by the facts.
We recently did a count of the number of cryptanalysis papers for different hash functions and found the following:

- Keccak: 35 third-party cryptanalysis papers dealing with the permutation underlying Keccak, most of them at venues with peer review (see https://keccak.team/third_party.html). This cryptanalysis carries over to K12 as it is a tree hashing mode built on top of a reduced-round Keccak variant.

- SHA-256 and SHA-512 together: we found 21 third-party cryptanalysis papers dealing with the compression functions of SHA-256 or SHA-512.

- BLAKE2: the BLAKE2 webpage blake2.net lists 4 third-party cryptanalysis papers. There are also a handful of cryptanalysis papers on its predecessor BLAKE, but these results do not necessarily carry over, as the two compression functions in the different BLAKE2 variants are different from the two compression functions in the different BLAKE variants.

I was not surprised by the relatively low number of SHA-2 cryptanalysis papers we found, as during the SHA-3 competition all cryptanalysts were focusing on SHA-3 candidates, and after the competition attention shifted to authenticated encryption. Anyway, these numbers support the opinion that the safety margins taken in K12 are better understood than those in SHA-256, SHA-512 and BLAKE2.

Adam Langley continues:

Thus I think the question is primarily concerned with performance and implementation availability

Table 2 in our ACNS paper on K12 (available at https://eprint.iacr.org/2016/770) shows that performance of K12 is quite competitive. Moreover, there is a lot of code available under CC0 license in the Keccak Code Package on github https://github.com/gvanas/KeccakCodePackage. If there is a shortage of code for some platforms in the short term, we will be happy to work on that. In the long term, it is likely that the relative advantage of K12 will increase as it has more potential for hardware acceleration, e.g., by instruction set extension.
This is thanks to the fact that it does not use addition, as opposed to so-called addition-xor-rotation (ARX) designs such as the SHA-2 and BLAKE2 families. This is already illustrated in our Table 2 I referred to above, in the transition from Skylake to SkylakeX. Maybe also interesting for this discussion are the two notes we (Keccak team) wrote on our choice to not go for ARX and the one on "open source crypto" at https://keccak.team/2017/not_arx.html and https://keccak.team/2017/open_source_crypto.html respectively. Kind regards, Joan Daemen On 22/07/2018 01:59, brian m. carlson wrote: > On Sun, Jul 22, 2018 at 12:38:41AM +0200, Johannes Schindelin wrote: >> Do you really want to value contributors' opinion more than >> cryptographers'? I mean, that's exactly what got us into this hard-coded >> SHA-1 mess in the first place. > I agree (believe me, of all people, I agree) that hard-coding SHA-1 was > a bad choice in retrospect. But I've solicited contributors' opinions > because the Git Project needs to make a decision *for this project* > about the algorithm we're going to use going forward. > >> And to set the record straight: I do not have a strong preference of the >> hash algorithm. But cryprographers I have the incredible luck to have >> access to, by virtue of being a colleague, did mention their preference. > I don't know your colleagues, and they haven't commented here. One > person that has commented here is Adam Langley. It is my impression > (and anyone is free to correct me if I'm incorrect) that he is indeed a > cryptographer. To quote him[0]: > > I think this group can safely assume that SHA-256, SHA-512, BLAKE2, > K12, etc are all secure to the extent that I don't believe that making > comparisons between them on that axis is meaningful. Thus I think the > question is primarily concerned with performance and implementation > availability. > > […] > > So, overall, none of these choices should obviously be excluded. 
The > considerations at this point are not cryptographic and the tradeoff > between implementation ease and performance is one that the git > community would have to make. > > I'm aware that cryptographers tend to prefer algorithms that have been > studied longer over ones that have been studied less. They also prefer > algorithms built in the open to ones developed behind closed doors. > > SHA-256 has the benefit that it has been studied for a long time, but it > was also designed in secret by the NSA. SHA3-256 was created with > significant study in the open, but is not as mature. BLAKE2b has been > incorporated into standards like Argon2, but has been weakened slightly > for performance. > > I'm not sure that there's a really obvious choice here. > > I'm at the point where to continue the work that I'm doing, I need to > make a decision. I'm happy to follow the consensus if there is one, but > it does not appear that there is. > > I will admit that I don't love making this decision by myself, because > right now, whatever I pick, somebody is going to be unhappy. I want to > state, unambiguously, that I'm trying to make a decision that is in the > interests of the Git Project, the community, and our users. > > I'm happy to wait a few more days to see if a consensus develops; if so, > I'll follow it. If we haven't come to one by, say, Wednesday, I'll make > a decision and write my patches accordingly. The community is free, as > always, to reject my patches if taking them is not in the interest of > the project. > > [0] https://public-inbox.org/git/CAL9PXLzhPyE+geUdcLmd=pidT5P8eFEBbSgX_dS88knz2q_LSw@mail.gmail.com/ ^ permalink raw reply [flat|nested] 66+ messages in thread
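The software-performance side of the discussion above can be spot-checked locally. The sketch below uses only Python's `hashlib`; K12 itself is not in the standard library, so SHA3-256 stands in as the nearest Keccak-based relative, and absolute numbers will vary with the machine and the OpenSSL build backing `hashlib` — treat it as a rough illustration, not a benchmark.

```python
import hashlib
import time

def throughput_mib_s(name: str, data: bytes, rounds: int = 8) -> float:
    """Return an approximate hashing rate in MiB/s for the named algorithm."""
    start = time.perf_counter()
    for _ in range(rounds):
        hashlib.new(name, data).digest()
    elapsed = time.perf_counter() - start
    return (len(data) * rounds) / (1024 * 1024) / elapsed

buf = b"\x00" * (1024 * 1024)  # 1 MiB of input per round

# All four names are in hashlib.algorithms_guaranteed on CPython >= 3.6.
for algo in ("sha256", "sha512", "sha3_256", "blake2b"):
    print(f"{algo:>9}: {throughput_mib_s(algo, buf):8.1f} MiB/s")
```

The relative ordering of the candidates is what the thread is debating; on x86 with SHA extensions or with AVX2-optimized BLAKE2, the picture shifts again, which is exactly the "implementation availability" point.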
* Re: Hash algorithm analysis 2018-07-22 15:23 ` Joan Daemen @ 2018-07-22 18:54 ` Adam Langley 2018-07-26 10:31 ` Johannes Schindelin 1 sibling, 0 replies; 66+ messages in thread
From: Adam Langley @ 2018-07-22 18:54 UTC (permalink / raw)
To: jda
Cc: brian m. carlson, Johannes Schindelin, Jonathan Nieder, Git Mailing List, demerphq, Linus Torvalds, all

Somewhere upthread, Brian refers to me as a cryptographer. That's flattering (thank you), but probably not really true even on a good day. And certainly not true next to Joan Daemen. I do have experience with crypto at scale and in ecosystems, though.

Joan's count of cryptanalysis papers is a reasonable way to try and bring some quantitative clarity to an otherwise subjective topic. But still, despite lacking any counterpoint to it, I find myself believing that practical concerns are a stronger differentiator here.

But the world is in a position where a new, common hash function might crystalise, and git could be the start of that. What that means for the ecosystem is that numerous libraries need to grow implementations optimised for 3+ platforms, and those platforms (esp Intel) often need multiple versions (e.g. for different vector widths) with code-size concerns pushing back at the same time. Intrinsics still don't cut it, so that means hand-assembly and thus dealing with gas vs Windows, CFI metadata, etc. Licensing differences mean that code-sharing doesn't work nearly as well as one might hope. Then complexity spreads upwards as testing matrices expand with the combination of each signature algorithm with the new hash function, options in numerous protocols, etc.

In short, picking just one would be lovely. For that reason, I've held back from SHA3 (which I consider distinct from K12) because I didn't feel that it relieved enough pressure: people who wanted more performance weren't going to be satisfied. Other than that, I don't have strong feelings and, to be clear, K12 seems like a fine option.
But it does seem that a) there is probably not any more information to discover that is going to alter your decision and b) waiting a short to medium amount of time is probably not going to bring any definitive developments either. Cheers AGL ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Hash algorithm analysis 2018-07-22 15:23 ` Joan Daemen 2018-07-22 18:54 ` Adam Langley @ 2018-07-26 10:31 ` Johannes Schindelin 1 sibling, 0 replies; 66+ messages in thread From: Johannes Schindelin @ 2018-07-26 10:31 UTC (permalink / raw) To: Joan Daemen Cc: brian m. carlson, Jonathan Nieder, git, demerphq, Linus Torvalds, Adam Langley, Keccak Team Hi Joan, On Sun, 22 Jul 2018, Joan Daemen wrote: > I wanted to react to some statements I read in this discussion. But > first let me introduce myself. I'm Joan Daemen and I'm working in > symmetric cryptography since 1988. Vincent Rijmen and I designed > Rijndael that was selected to become AES and Guido Bertoni, Michael > Peeters and Gilles Van Assche and I (the Keccak team, later extended > with Ronny Van Keer) designed Keccak that was selected to become SHA3. > Of course as a member of the Keccak team I'm biased in this discussion > but I'll try to keep it factual. Thank you *so* much for giving your valuable time and expertise on this subject. I really would hate for the decision to be made due to opinions of people who are overconfident in their abilities to judge cryptographic matters despite clearly being out of their league (which includes me, I want to add specifically). On a personal note: back in the day, I have been following the Keccak with a lot of interest, being intrigued by the deliberate deviation from the standard primitives, and I am pretty much giddy about the fact that I am talking to you right now. > [... interesting, and thorough background information ...] > > Anyway, these numbers support the opinion that the safety margins taken > in K12 are better understood than those in SHA-256, SHA-512 and BLAKE2. This is very, very useful information in my mind. 
> Adam Langley continues: > > Thus I think the question is primarily concerned with performance and implementation availability > > > Table 2 in our ACNS paper on K12 (available at > https://eprint.iacr.org/2016/770) shows that performance of K12 is quite > competitive. Moreover, there is a lot of code available under CC0 > license in the Keccak Code Package on github > https://github.com/gvanas/KeccakCodePackage. If there is shortage of > code for some platforms in the short term, we will be happy to work on that. > > In the long term, it is likely that the relative advantage of K12 will > increase as it has more potential for hardware acceleration, e.g., by > instruction set extension. This is thanks to the fact that it does not > use addition, as opposed to so-called addition-xor-rotation (ARX) > designs such as the SHA-2 and BLAKE2 families. This is already > illustrated in our Table 2 I referred to above, in the transition from > Skylake to SkylakeX. I *really* hope that more accessible hardware acceleration for this materializes at some stage. And by "more accessible", I mean commodity hardware such as ARM or AMD/Intel processors: big hosters could relatively easily develop appropriate FPGAs (we already do this for AI, after all). > Maybe also interesting for this discussion are the two notes we (Keccak > team) wrote on our choice to not go for ARX and the one on "open source > crypto" at https://keccak.team/2017/not_arx.html and > https://keccak.team/2017/open_source_crypto.html respectively. I had read those posts when they came out, and still find them insightful. Hopefully other readers of this mailing list will spend the time to read them, too. Again, thank you so much for a well-timed dose of domain expertise in this thread. Ciao, Dscho ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Hash algorithm analysis 2018-07-21 23:59 ` brian m. carlson 2018-07-22 9:34 ` Eric Deplagne 2018-07-22 15:23 ` Joan Daemen @ 2018-07-23 12:40 ` demerphq 2018-07-23 12:48 ` Sitaram Chamarty ` (2 more replies) 2 siblings, 3 replies; 66+ messages in thread
From: demerphq @ 2018-07-23 12:40 UTC (permalink / raw)
To: brian m. carlson, Johannes Schindelin, Jonathan Nieder, Git, Linus Torvalds, agl, keccak

On Sun, 22 Jul 2018 at 01:59, brian m. carlson <sandals@crustytoothpaste.net> wrote:
> I will admit that I don't love making this decision by myself, because
> right now, whatever I pick, somebody is going to be unhappy. I want to
> state, unambiguously, that I'm trying to make a decision that is in the
> interests of the Git Project, the community, and our users.
>
> I'm happy to wait a few more days to see if a consensus develops; if so,
> I'll follow it. If we haven't come to one by, say, Wednesday, I'll make
> a decision and write my patches accordingly. The community is free, as
> always, to reject my patches if taking them is not in the interest of
> the project.

Hi Brian.

I do not envy you this decision.

Personally I would aim towards pushing this decision out to the git user base and facilitating things so we can choose whatever hash function (and config) we wish, including ones not invented yet.

Failing that I would aim towards a hashing strategy which has the most flexibility. Keccak for instance has the interesting property that its security level is tunable, and that it can produce arbitrarily long hashes. Leaving aside other concerns raised elsewhere in this thread, these two features alone seem to make it a superior choice for an initial implementation. You can find bugs by selecting unusual hash sizes, including very long ones, and you can provide ways to tune the function to people's security and speed preferences. Someone really paranoid can specify an unusually large round count and a very long hash.
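The "arbitrarily long hashes" property described above is the extendable-output (XOF) behaviour of the Keccak family. K12 is not in Python's standard library, but SHAKE256 (standardized alongside SHA-3 and also Keccak-based) illustrates the idea; a minimal sketch:

```python
import hashlib

# An extendable-output function can be "squeezed" to any digest length.
xof = hashlib.shake_256(b"git object payload")

print(xof.hexdigest(20))  # 160 bits, SHA-1-sized
print(xof.hexdigest(32))  # 256 bits
print(xof.hexdigest(64))  # 512 bits

# For a fixed input, shorter outputs are prefixes of longer ones,
# since the XOF produces a single output stream.
assert xof.hexdigest(64).startswith(xof.hexdigest(20))
```

The prefix property is also why a fixed-size deployment can later be lengthened without changing the underlying function, which is part of the flexibility being argued for here.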
Also frankly I keep thinking that the ability to arbitrarily extend the hash size has to be useful /somewhere/ in git. cheers, Yves I am not a cryptographer. -- perl -Mre=debug -e "/just|another|perl|hacker/" ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Hash algorithm analysis 2018-07-23 12:40 ` demerphq @ 2018-07-23 12:48 ` Sitaram Chamarty 2018-07-23 12:55 ` demerphq 2018-07-23 18:23 ` Linus Torvalds 2018-07-23 17:57 ` Stefan Beller 2018-07-23 18:35 ` Jonathan Nieder 2 siblings, 2 replies; 66+ messages in thread From: Sitaram Chamarty @ 2018-07-23 12:48 UTC (permalink / raw) To: demerphq, brian m. carlson, Johannes Schindelin, Jonathan Nieder, Git, Linus Torvalds, agl, keccak On 07/23/2018 06:10 PM, demerphq wrote: > On Sun, 22 Jul 2018 at 01:59, brian m. carlson > <sandals@crustytoothpaste.net> wrote: >> I will admit that I don't love making this decision by myself, because >> right now, whatever I pick, somebody is going to be unhappy. I want to >> state, unambiguously, that I'm trying to make a decision that is in the >> interests of the Git Project, the community, and our users. >> >> I'm happy to wait a few more days to see if a consensus develops; if so, >> I'll follow it. If we haven't come to one by, say, Wednesday, I'll make >> a decision and write my patches accordingly. The community is free, as >> always, to reject my patches if taking them is not in the interest of >> the project. > > Hi Brian. > > I do not envy you this decision. > > Personally I would aim towards pushing this decision out to the git > user base and facilitating things so we can choose whatever hash > function (and config) we wish, including ones not invented yet. > > Failing that I would aim towards a hashing strategy which has the most > flexibility. Keccak for instance has the interesting property that its > security level is tunable, and that it can produce aribitrarily long > hashes. Leaving aside other concerns raised elsewhere in this thread, > these two features alone seem to make it a superior choice for an > initial implementation. You can find bugs by selecting unusual hash > sizes, including very long ones, and you can provide ways to tune the > function to peoples security and speed preferences. 
Someone really > paranoid can specify an unusually large round count and a very long > hash. > > Also frankly I keep thinking that the ability to arbitrarily extend > the hash size has to be useful /somewhere/ in git. I would not suggest arbitrarily long hashes. Not only would it complicate a lot of code, it is not clear that it has any real benefit. Plus, the code contortions required to support arbitrarily long hashes would be more susceptible to potential bugs and exploits, simply by being more complex code. Why take chances? I would suggest (a) hash size of 256 bits and (b) choice of any hash function that can produce such a hash. If people feel strongly that 256 bits may also turn out to be too small (really?) then a choice of 256 or 512, but not arbitrary sizes. Sitaram also not a cryptographer! ^ permalink raw reply [flat|nested] 66+ messages in thread
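Sitaram's option (b) above — fix the output size at 256 bits but allow a choice of function — can be illustrated with the 256-bit-capable functions already in Python's `hashlib`. The sample input below is a made-up git-style object payload, not anything git itself would be required to produce:

```python
import hashlib

data = b"blob 6\x00hello\n"  # a git-style object header plus payload

candidates = {
    "sha256":   hashlib.sha256(data),
    "sha3_256": hashlib.sha3_256(data),
    "blake2b":  hashlib.blake2b(data, digest_size=32),  # BLAKE2b tuned to 256 bits
}

for name, h in candidates.items():
    digest = h.hexdigest()
    assert len(digest) == 64  # 256 bits = 64 hex chars, regardless of function
    print(f"{name:>9}: {digest}")
```

Fixing the size while leaving the function open keeps on-disk formats and allocation logic uniform; only the algorithm identifier needs to vary.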
* Re: Hash algorithm analysis 2018-07-23 12:48 ` Sitaram Chamarty @ 2018-07-23 12:55 ` demerphq 2018-07-23 18:23 ` Linus Torvalds 1 sibling, 0 replies; 66+ messages in thread From: demerphq @ 2018-07-23 12:55 UTC (permalink / raw) To: Sitaram Chamarty Cc: brian m. carlson, Johannes Schindelin, Jonathan Nieder, Git, Linus Torvalds, agl, keccak On Mon, 23 Jul 2018 at 14:48, Sitaram Chamarty <sitaramc@gmail.com> wrote: > On 07/23/2018 06:10 PM, demerphq wrote: > > On Sun, 22 Jul 2018 at 01:59, brian m. carlson > > <sandals@crustytoothpaste.net> wrote: > >> I will admit that I don't love making this decision by myself, because > >> right now, whatever I pick, somebody is going to be unhappy. I want to > >> state, unambiguously, that I'm trying to make a decision that is in the > >> interests of the Git Project, the community, and our users. > >> > >> I'm happy to wait a few more days to see if a consensus develops; if so, > >> I'll follow it. If we haven't come to one by, say, Wednesday, I'll make > >> a decision and write my patches accordingly. The community is free, as > >> always, to reject my patches if taking them is not in the interest of > >> the project. > > > > Hi Brian. > > > > I do not envy you this decision. > > > > Personally I would aim towards pushing this decision out to the git > > user base and facilitating things so we can choose whatever hash > > function (and config) we wish, including ones not invented yet. > > > > Failing that I would aim towards a hashing strategy which has the most > > flexibility. Keccak for instance has the interesting property that its > > security level is tunable, and that it can produce aribitrarily long > > hashes. Leaving aside other concerns raised elsewhere in this thread, > > these two features alone seem to make it a superior choice for an > > initial implementation. 
You can find bugs by selecting unusual hash
> > sizes, including very long ones, and you can provide ways to tune the
> > function to people's security and speed preferences. Someone really
> > paranoid can specify an unusually large round count and a very long
> > hash.
> >
> > Also frankly I keep thinking that the ability to arbitrarily extend
> > the hash size has to be useful /somewhere/ in git.
>
> I would not suggest arbitrarily long hashes. Not only would it
> complicate a lot of code, it is not clear that it has any real benefit.

It has the benefit of armoring the code for the *next* hash change, and making it clear that such decisions are arbitrary and should not be depended on.

> Plus, the code contortions required to support arbitrarily long hashes
> would be more susceptible to potential bugs and exploits, simply by
> being more complex code. Why take chances?

I think the benefits would outweigh the risks.

> I would suggest (a) hash size of 256 bits and (b) choice of any hash
> function that can produce such a hash. If people feel strongly that 256
> bits may also turn out to be too small (really?) then a choice of 256 or
> 512, but not arbitrary sizes.

I am aware of too many systems that cannot change their size and are locked into woefully bad decisions that were made long ago to buy this. Making it a per-repo option would eliminate assumptions and make for a more secure and flexible tool.

Anyway, I am not going to do the work so my opinion is worth the price of the paper I sent it on. :-)

cheers,
Yves
--
perl -Mre=debug -e "/just|another|perl|hacker/"

^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Hash algorithm analysis 2018-07-23 12:48 ` Sitaram Chamarty 2018-07-23 12:55 ` demerphq @ 2018-07-23 18:23 ` Linus Torvalds 1 sibling, 0 replies; 66+ messages in thread
From: Linus Torvalds @ 2018-07-23 18:23 UTC (permalink / raw)
To: Sitaram Chamarty
Cc: demerphq, brian m. carlson, Johannes Schindelin, Jonathan Nieder, Git Mailing List, Adam Langley, keccak

On Mon, Jul 23, 2018 at 5:48 AM Sitaram Chamarty <sitaramc@gmail.com> wrote:
>
> I would suggest (a) hash size of 256 bits and (b) choice of any hash
> function that can produce such a hash. If people feel strongly that 256
> bits may also turn out to be too small (really?) then a choice of 256 or
> 512, but not arbitrary sizes.

Honestly, what's the expected point of 512-bit hashes? The _only_ point of a 512-bit hash is that it's going to grow objects in incompressible ways, and use more memory. Just don't do it.

If somebody can break a 256-bit hash, you have two choices:

(a) the hash function itself was broken, and 512 bits isn't the solution to it anyway, even if it can certainly hide the problem

(b) you had some "new math" kind of unexpected breakthrough, which means that 512 bits might not be much better either.

Honestly, the number of particles in the observable universe is on the order of 2**256. It's a really really big number.

Don't make the code base more complex than it needs to be. Make an informed technical decision, and say "256 bits is a *lot*".

The difference between engineering and theory is that engineering makes trade-offs. Good software is well *engineered*, not theorized.

Also, I would suggest that git default to "abbrev-commit=40", so that nobody actually *sees* the new bits by default. So the perl scripts etc that use "[0-9a-f]{40}" as a hash pattern would just silently continue to work.
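That compatibility claim can be made concrete: if the displayed ID is truncated to 40 hex characters, a legacy `[0-9a-f]{40}` pattern keeps matching. A minimal sketch, using a hypothetical 256-bit object ID (the commit line is made up for illustration):

```python
import hashlib
import re

# A hypothetical 256-bit object ID: 64 hex characters in full.
full_oid = hashlib.sha256(b"example object").hexdigest()
assert len(full_oid) == 64

# Display truncated to 40 characters, as suggested for default output.
shown = full_oid[:40]

# The kind of pattern legacy scripts use to spot SHA-1 object names.
legacy_pattern = re.compile(r"\b[0-9a-f]{40}\b")
line = f"commit {shown}"
match = legacy_pattern.search(line)
assert match and match.group(0) == shown
print("legacy 40-hex pattern still matches:", match.group(0))
```

The trade-off is that an abbreviated 40-hex name is no longer the full object ID, so any tool that round-trips the displayed name back into the object store has to handle abbreviation lookup.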
Because backwards compatibility is important (*) Linus (*) And 2**160 is still a big big number, and hasn't really been a practical problem, and SHA1DC is likely a good hash for the next decade or longer. ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Hash algorithm analysis 2018-07-23 12:40 ` demerphq 2018-07-23 12:48 ` Sitaram Chamarty @ 2018-07-23 17:57 ` Stefan Beller 2018-07-23 18:35 ` Jonathan Nieder 2 siblings, 0 replies; 66+ messages in thread From: Stefan Beller @ 2018-07-23 17:57 UTC (permalink / raw) To: demerphq Cc: brian m. carlson, Johannes Schindelin, Jonathan Nieder, git, Linus Torvalds, Adam Langley, keccak On Mon, Jul 23, 2018 at 5:41 AM demerphq <demerphq@gmail.com> wrote: > > On Sun, 22 Jul 2018 at 01:59, brian m. carlson > <sandals@crustytoothpaste.net> wrote: > > I will admit that I don't love making this decision by myself, because > > right now, whatever I pick, somebody is going to be unhappy. I want to > > state, unambiguously, that I'm trying to make a decision that is in the > > interests of the Git Project, the community, and our users. > > > > I'm happy to wait a few more days to see if a consensus develops; if so, > > I'll follow it. If we haven't come to one by, say, Wednesday, I'll make > > a decision and write my patches accordingly. The community is free, as > > always, to reject my patches if taking them is not in the interest of > > the project. > > Hi Brian. > > I do not envy you this decision. > > Personally I would aim towards pushing this decision out to the git > user base and facilitating things so we can choose whatever hash > function (and config) we wish, including ones not invented yet. By Git user base you actually mean millions of people? (And they'll have different opinions and needs) One of the goals of the hash transition is to pick a hash function such that git repositories are compatible. If users were to pick their own hashes, we would need to not just give a SHA-1 -> <newhash> transition plan, but we'd have to make sure the full matrix of possible hashes is interchangeable as we have no idea of what the user would think of "safer". 
For example one server operator might decide to settle on SHA2 and another would settle on blake2, whereas a user that uses both servers as remotes settles with k12. Then there would be a whole lot of conversion going on (you cannot talk natively to a remote with a different hash; checking pgp signatures is also harder as you have an abstraction layer in between).

I would rather just have the discussion now and then provide only one conversion tool; it might be easy to adapt, but after the majority has converted it can be left to bitrot rather than needing to support ongoing conversions. On the other hand, even if we'd provide a "different hashes are fine" solution, I would think the network effect would make sure that eventually most people end up with one hash.

One example of using different hashes successfully is transports, like TLS and SSH. The difference there is that it is a point-to-point communication, whereas a git repository needs to be read by many parties involved; also a communication over TLS/SSH is ephemeral, unlike objects in Git.

> Failing that I would aim towards a hashing strategy which has the most
> flexibility. Keccak for instance has the interesting property that its
> security level is tunable, and that it can produce arbitrarily long
> hashes. Leaving aside other concerns raised elsewhere in this thread,
> these two features alone seem to make it a superior choice for an
> initial implementation. You can find bugs by selecting unusual hash
> sizes, including very long ones, and you can provide ways to tune the
> function to people's security and speed preferences. Someone really
> paranoid can specify an unusually large round count and a very long
> hash.

I do not object to this in theory, but I would rather not burden the community with the need to write code for this.

> I am not a cryptographer.

Same here. My personal preference would be blake2b as that is the fastest IIRC.
Re-reading brian's initial mail, I think we should settle on SHA-256, as that is a conservative choice for security and the winner in HW accelerated setups, and not too shabby in a software implementation; it is also widely available. Stefan ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Hash algorithm analysis 2018-07-23 12:40 ` demerphq 2018-07-23 12:48 ` Sitaram Chamarty 2018-07-23 17:57 ` Stefan Beller @ 2018-07-23 18:35 ` Jonathan Nieder 2 siblings, 0 replies; 66+ messages in thread From: Jonathan Nieder @ 2018-07-23 18:35 UTC (permalink / raw) To: demerphq Cc: brian m. carlson, Johannes Schindelin, Git, Linus Torvalds, agl, keccak Hi Yves, demerphq wrote: > On Sun, 22 Jul 2018 at 01:59, brian m. carlson > <sandals@crustytoothpaste.net> wrote: >> I will admit that I don't love making this decision by myself, because >> right now, whatever I pick, somebody is going to be unhappy. [...] > I do not envy you this decision. > > Personally I would aim towards pushing this decision out to the git > user base and facilitating things so we can choose whatever hash > function (and config) we wish, including ones not invented yet. There are two separate pieces to this. One is configurability at compile time. So far that has definitely been a goal, because we want to be ready to start the transition to another hash, and quickly, as soon as the new hash is discovered to be weak. This also means that people can experiment with new hashes and in a controlled environment (where the users can afford to build from source), some users might prefer some bespoke hash for reasons only known to them. ;-) Another piece is configurability at run time. This is a harder sell because it has some negative effects in the ecosystem: - performance impact from users having to maintain a translation table between the different hash functions in use - security impact, in the form of downgrade attacks - dependency bloat, from Git having to be able to compute all hash functions permitted in that run-time configuration The security impact can be mitigated by keeping the list of supported hashes small (i.e. two or three instead of 10ish). Each additional hash function is a potential liability (just as in SSL), so they have to earn their keep. 
The performance impact is unavoidable if we encourage Git servers to pick their favorite hash function instead of making a decision in the project. This can in turn affect security, since it would increase the switching cost away from SHA-1, with the likely effect being that most users stay on SHA-1. I don't want to go there. So I would say, support for arbitrary hash functions at compile time and in file formats is important and I encourage you to hold us to that (when reviewing patches, etc). But in the standard Git build configuration that most people run, I believe it is best to support only SHA-1 + our chosen replacement hash. Thanks, Jonathan ^ permalink raw reply [flat|nested] 66+ messages in thread
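Jonathan's recommendation that support for arbitrary hash functions be kept "in file formats" can be sketched concretely. The snippet below is a minimal Python illustration, not git's actual implementation: it models a registry keyed by a four-byte serialized algorithm identifier, in the spirit of the format_id field of struct git_hash_algo mentioned earlier in the thread. The particular identifiers ("sha1", "s256"), header layout, and helper names here are assumptions of this sketch.

```python
import hashlib
import struct

# Illustrative registry mapping a 4-byte on-disk format identifier to a hash
# constructor. A reader can detect which algorithm wrote a file without any
# out-of-band configuration; unknown identifiers are rejected explicitly.
HASH_ALGOS = {
    b"sha1": hashlib.sha1,
    b"s256": hashlib.sha256,
}

def write_header(format_id: bytes, version: int) -> bytes:
    """Serialize a hypothetical file header: format_id + big-endian version."""
    if format_id not in HASH_ALGOS:
        raise ValueError("unknown hash algorithm: %r" % format_id)
    return format_id + struct.pack(">I", version)

def parse_header(header: bytes):
    """Parse the header back into (hash constructor, version)."""
    format_id, (version,) = header[:4], struct.unpack(">I", header[4:8])
    if format_id not in HASH_ALGOS:
        raise ValueError("unsupported hash algorithm: %r" % format_id)
    return HASH_ALGOS[format_id], version

# Round-trip: a file written under SHA-256 announces that fact in its header.
algo, version = parse_header(write_header(b"s256", 1))
assert algo is hashlib.sha256 and version == 1
```

The point of the sketch is Jonathan's: keeping the identifier in the format costs four bytes, while which algorithms a given build actually compiles in remains a separate, compile-time decision.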
* Re: Hash algorithm analysis 2018-07-20 21:52 ` brian m. carlson ` (2 preceding siblings ...) 2018-07-21 22:38 ` Johannes Schindelin @ 2018-07-24 19:01 ` Edward Thomson 2018-07-24 20:31 ` Linus Torvalds 3 siblings, 1 reply; 66+ messages in thread From: Edward Thomson @ 2018-07-24 19:01 UTC (permalink / raw) To: brian m. carlson, Jonathan Nieder, git, Johannes Schindelin, demerphq, Linus Torvalds, Adam Langley, The Keccak Team On Fri, Jul 20, 2018 at 09:52:20PM +0000, brian m. carlson wrote: > > To summarize the discussion that's been had in addition to the above, > Ævar has also stated a preference for SHA-256 and I would prefer BLAKE2b > over SHA-256 over SHA3-256, although any of them would be fine. > > Are there other contributors who have a strong opinion? Are there > things I can do to help us coalesce around an option? Overall, I prefer SHA-256. I mentioned this at the contributor summit - so this may have been captured in the notes. But if not, when I look at this from the perspective of my day job at Notorious Big Software Company, we would prefer SHA-256 due to its performance characteristics and the availability of hardware acceleration. We think about git object ids in a few different ways: Obviously we use git as a version control system - we have a significant investment in hosting repositories (for both internal Microsoft teams and our external customers). What may be less obvious is that often, git blob ids are used as fingerprints: on a typical Windows machine, you don't have the command-line hash functions (md5sum and friends), but every developer has git installed. So we end up calculating git object ids in places within the development pipeline that are beyond the scope of just version control. Not to dwell too much on implementation details, but this is especially advantageous for us in (say) labs where we can ensure that particular hardware is available to speed this up as necessary. 
Switching gears, if I look at this from the perspective of the libgit2 project, I would also prefer SHA-256 or SHA3 over blake2b. To support blake2b, we'd have to include - and support - that code ourselves. But to support SHA-256, we would simply use the system's crypto libraries that we already take a dependency on (OpenSSL, mbedTLS, CryptoNG, or SecureTransport). All of those support SHA-256 and none of them include support for blake2b. That means if there's a problem with (say) OpenSSL's SHA-256 implementation, then it will be fixed by their vendor. If there's a problem with libb2, then that's now my responsibility. This is not to suggest that one library is of higher or lower quality than another. And surely we would try to use the same blake2b library that git itself is using to minimize some of this risk (so that at least we're all in the same boat and can leverage each other's communications to users) but even then, there will be inevitable drift between our vendored dependencies and the upstream code. You can see this in action in xdiff: git's xdiff has deviated from upstream, and libgit2 has taken git's and ours has deviated from that. Cheers- -ed ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Hash algorithm analysis 2018-07-24 19:01 ` Edward Thomson @ 2018-07-24 20:31 ` Linus Torvalds 2018-07-24 20:49 ` Jonathan Nieder 2018-07-24 21:13 ` Junio C Hamano 0 siblings, 2 replies; 66+ messages in thread From: Linus Torvalds @ 2018-07-24 20:31 UTC (permalink / raw) To: Edward Thomson Cc: brian m. carlson, Jonathan Nieder, Git Mailing List, Johannes Schindelin, demerphq, Adam Langley, keccak On Tue, Jul 24, 2018 at 12:01 PM Edward Thomson <ethomson@edwardthomson.com> wrote: > > Switching gears, if I look at this from the perspective of the libgit2 > project, I would also prefer SHA-256 or SHA3 over blake2b. To support > blake2b, we'd have to include - and support - that code ourselves. But > to support SHA-256, we would simply use the system's crypto libraries > that we already take a dependency on (OpenSSL, mbedTLS, CryptoNG, or > SecureTransport). I think this is probably the single strongest argument for sha256. "It's just there". The hardware acceleration hasn't become nearly as ubiquitous as I would have hoped, and honestly, sha256 _needs_ hw acceleration more than some of the alternatives in the first place. But sha256 does have the big advantage of just having been around and existing in pretty much every single crypto library. So I'm not a huge fan of sha256, partly because of my disappointment in lack of hw acceleration in relevant markets (sure, it's fairly common in ARM, but nobody sane uses ARM for development because of _other_ reasons). And partly because I don't like how the internal data size is the same as the final hash. But that second issue is an annoyance with it, not a real issue - in the absence of weaknesses it's a non-issue, and any future weaknesses might affect any other choice too. So hey, if people are actually at the point where the lack of choice holds up development, we should just pick one. And despite what I've said in this discussion, sha256 would have been my first choice, just because it's the "obvious" choice. 
The exact same way that SHA1 was the obvious choice (for pretty much the same infrastructure reasons) back in 2005. And maybe the hw acceleration landscape will actually improve. I think AMD actually does do the SHA extensions in Zen/TR. So I think Junio should just pick one. And I'll stand up and say "Let's just pick one. And sha256 is certainly the safe choice in that it won't strike anybody as being the _wrong_ choice per se, even if not everybody will necessarily agree it's the _best_ choice". But in the end I think Junio should be the final arbiter. I think all of the discussed choices are perfectly fine in practice. Btw, the one thing I *would* suggest is that the git community just also says that the current hash is not SHA1, but SHA1DC. Support for "plain" SHA1 should be removed entirely. If we add a lot of new infrastructure to support a new more secure hash, we should not have the old fallback for the known-weak one. Just make SHA1DC the only one git can be built with. Linus ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Hash algorithm analysis 2018-07-24 20:31 ` Linus Torvalds @ 2018-07-24 20:49 ` Jonathan Nieder 2018-07-24 21:13 ` Junio C Hamano 1 sibling, 0 replies; 66+ messages in thread From: Jonathan Nieder @ 2018-07-24 20:49 UTC (permalink / raw) To: Linus Torvalds Cc: Edward Thomson, brian m. carlson, Git Mailing List, Johannes Schindelin, demerphq, Adam Langley, keccak Hi, Linus Torvalds wrote: > On Tue, Jul 24, 2018 at 12:01 PM Edward Thomson > <ethomson@edwardthomson.com> wrote: >> Switching gears, if I look at this from the perspective of the libgit2 >> project, I would also prefer SHA-256 or SHA3 over blake2b. To support >> blake2b, we'd have to include - and support - that code ourselves. But >> to support SHA-256, we would simply use the system's crypto libraries >> that we already take a dependency on (OpenSSL, mbedTLS, CryptoNG, or >> SecureTransport). Just to be clear, OpenSSL has built-in blake2b support. [...] > So I'm not a huge fan of sha256, partly because of my disappointment > in lack of hw acceleration in relevant markets (sure, it's fairly > common in ARM, but nobody sane uses ARM for development because of > _other_ reasons). And partly because I don't like how the internal > data size is the same as the final hash. But that second issue is an > annoyance with it, not a real issue - in the absence of weaknesses > it's a non-issue, and any future weaknesses might affect any other > choice too. Thanks for saying this. With this in mind, I think we have a clear way forward: we should use SHA-256. My main complaint about it is that it is not a tree hash, but the common availability in libraries trumps that (versus SHA-256x16, say). I was also excited about K12, both because I like a world where Keccak gets wide hardware acceleration (improving PRNGs and other applications) and because of the Keccak team's helpfulness throughout the evaluation process, and it's possible that some day in the future we may want to switch to something like it. 
But today, as mentioned in [1] and [2], there is value in settling on one standard and SHA2-256 is the obvious standard today. Thanks, Jonathan [1] https://public-inbox.org/git/CAL9PXLyNVLCCqV1ftRa3r4kuoamDZOF29HJEhv2JXrbHj1nirA@mail.gmail.com/ [2] https://public-inbox.org/git/20180723183523.GB9285@aiede.svl.corp.google.com/ ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Hash algorithm analysis 2018-07-24 20:31 ` Linus Torvalds 2018-07-24 20:49 ` Jonathan Nieder @ 2018-07-24 21:13 ` Junio C Hamano 2018-07-24 22:10 ` brian m. carlson ` (3 more replies) 1 sibling, 4 replies; 66+ messages in thread From: Junio C Hamano @ 2018-07-24 21:13 UTC (permalink / raw) To: Linus Torvalds Cc: Edward Thomson, brian m. carlson, Jonathan Nieder, Git Mailing List, Johannes Schindelin, demerphq, Adam Langley, keccak Linus Torvalds <torvalds@linux-foundation.org> writes: > On Tue, Jul 24, 2018 at 12:01 PM Edward Thomson > <ethomson@edwardthomson.com> wrote: >> >> Switching gears, if I look at this from the perspective of the libgit2 >> project, I would also prefer SHA-256 or SHA3 over blake2b. To support >> blake2b, we'd have to include - and support - that code ourselves. But >> to support SHA-256, we would simply use the system's crypto libraries >> that we already take a dependecy on (OpenSSL, mbedTLS, CryptoNG, or >> SecureTransport). > > I think this is probably the single strongest argument for sha256. > "It's just there". Yup. I actually was leaning toward saying "all of them are OK in practice, so the person who is actually spear-heading the work gets to choose", but if we picked SHA-256 now, that would not be a choice that Brian has to later justify for choosing against everybody else's wishes, which makes it the best choice ;-) ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Hash algorithm analysis 2018-07-24 21:13 ` Junio C Hamano @ 2018-07-24 22:10 ` brian m. carlson 2018-07-30 9:06 ` Johannes Schindelin 2018-07-25 8:30 ` [PATCH 0/2] document that NewHash is now SHA-256 Ævar Arnfjörð Bjarmason ` (2 subsequent siblings) 3 siblings, 1 reply; 66+ messages in thread From: brian m. carlson @ 2018-07-24 22:10 UTC (permalink / raw) To: Junio C Hamano Cc: Linus Torvalds, Edward Thomson, Jonathan Nieder, Git Mailing List, Johannes Schindelin, demerphq, Adam Langley, keccak [-- Attachment #1: Type: text/plain, Size: 1138 bytes --] On Tue, Jul 24, 2018 at 02:13:07PM -0700, Junio C Hamano wrote: > Yup. I actually was leaning toward saying "all of them are OK in > practice, so the person who is actually spear-heading the work gets > to choose", but if we picked SHA-256 now, that would not be a choice > that Brian has to later justify for choosing against everybody > else's wishes, which makes it the best choice ;-) This looks like a rough consensus. And fortunately, I was already going to pick SHA-256, and I implemented it over the weekend. Things I thought about in this regard: * When you compare against SHA1DC, most vectorized SHA-256 implementations are indeed faster, even without acceleration. * If we're doing signatures with OpenPGP (or even, I suppose, CMS), we're going to be using SHA-2, so it doesn't make sense to have our security depend on two separate algorithms, either of which alone could break the security, when we could just depend on one. I'll be sending out some patches, probably in a few days, with SHA-256 and some test fixes. -- brian m. carlson: Houston, Texas, US OpenPGP: https://keybase.io/bk2204 [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 867 bytes --] ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Hash algorithm analysis 2018-07-24 22:10 ` brian m. carlson @ 2018-07-30 9:06 ` Johannes Schindelin 2018-07-30 20:01 ` Dan Shumow 0 siblings, 1 reply; 66+ messages in thread From: Johannes Schindelin @ 2018-07-30 9:06 UTC (permalink / raw) To: brian m. carlson, Dan Shumow Cc: Junio C Hamano, Linus Torvalds, Edward Thomson, Jonathan Nieder, Git Mailing List, demerphq, Adam Langley, keccak Hi Brian, On Tue, 24 Jul 2018, brian m. carlson wrote: > On Tue, Jul 24, 2018 at 02:13:07PM -0700, Junio C Hamano wrote: > > Yup. I actually was leaning toward saying "all of them are OK in > > practice, so the person who is actually spear-heading the work gets to > > choose", but if we picked SHA-256 now, that would not be a choice that > > Brian has to later justify for choosing against everybody else's > > wishes, which makes it the best choice ;-) > > This looks like a rough consensus. As I grew really uncomfortable with having a decision that seems to be based on hunches by non-experts (we rejected the preference of the only cryptographer who weighed in, after all, precisely like we did over a decade ago), I asked whether I could loop in one of our in-house experts with this public discussion. Y'all should be quite familiar with his work: Dan Shumow. Dan, thank you for agreeing to chime in publicly. Ciao, Dscho ^ permalink raw reply [flat|nested] 66+ messages in thread
* RE: Hash algorithm analysis 2018-07-30 9:06 ` Johannes Schindelin @ 2018-07-30 20:01 ` Dan Shumow 2018-08-03 2:57 ` Jonathan Nieder 2018-09-18 15:18 ` Joan Daemen 0 siblings, 2 replies; 66+ messages in thread From: Dan Shumow @ 2018-07-30 20:01 UTC (permalink / raw) To: Johannes Schindelin, brian m. carlson Cc: Junio C Hamano, Linus Torvalds, Edward Thomson, Jonathan Nieder, Git Mailing List, demerphq, Adam Langley, keccak@noekeon.org Hello all. Johannes, thanks for adding me to this discussion. So, as one of the coauthors of the SHA-1 collision detection code, I just wanted to chime in and say I'm glad to see the move to a longer hash function. Though, as a cryptographer, I have a few thoughts on the matter that I thought I would share. I think that moving to SHA256 is a fine change, and I support it. I'm not anywhere near the expert in this that Joan Daeman is. I am someone who has worked in this space more or less peripherally. However, I agree with Adam Langley that basically all of the finalists for a hash function replacement are about the same for the security needs of Git. I think that, for this community, other software engineering considerations should be more important to the selection process. I think Joan's survey of cryptanalysis papers and the numbers that he gives are interesting, and I had never seen the comparison laid out like that. So, I think that there is a good argument to be made that SHA3 has had more cryptanalysis than SHA2. Though, Joan, are the papers that you surveyed only focused on SHA2? I'm curious if you think that the design/construction of SHA2, as it can be seen as an iteration of MD5/SHA1, means that the cryptanalysis papers on those constructions can be considered to apply to SHA2? Again, I'm not an expert in this, but I do know that Marc Stevens's techniques for constructing collisions also provided some small cryptanalytic improvements against the SHA2 family as well. 
I also think that while the paper survey is a good way to look over all of this, the longer time that SHA2 has spent in a position of high-profile visibility can give us some confidence as well. Also something worth pointing out is that the connection SHA2 has to SHA1 means that if Marc Stevens's cryptanalysis of MD5/SHA-1 were ever successfully applied to SHA2, the SHA1 collision detection approach could be applied there as well, thus providing a drop-in replacement in that situation. That said, we don't know that there is not a similar way of addressing issues with the SHA3/Sponge design. It's just that because we haven't seen any weaknesses of this sort in similar designs, we just don't know what a similar approach would be there yet. I don't want to put too much stock in this argument, it's just saying "Well, we already know how SHA2 is likely to break, and we've had fixes for similar things in the past." This is pragmatic but not inspiring or confidence building. So, I also want to state my biases in favor of SHA2 as an employee of Microsoft. Microsoft, being a corporation headquartered in America, with the US Gov't as a major customer, definitely prefers to defer to the US Gov't NIST standardization process. And from that perspective SHA2 or SHA3 would be good choices. I, personally, think that the NIST process is the best we have. It is relatively transparent, and NIST employs a fair number of very competent cryptographers. Also, I am encouraged by the widespread international participation that the NIST competitions and selection processes attract. As such, and reflecting this bias, in the internal discussions that Johannes alluded to, SHA2 and SHA3 were the primary suggestions. There was a slight preference for SHA2 because SHA3 is not exposed through the Windows cryptographic APIs (though Git does not use those, so this is a nonissue for this discussion.) I also wanted to thank Johannes for keeping the cryptographers that he discussed this with anonymous. 
After all, cryptographers are known for being private. And I wanted to say that Johannes did, in fact, accurately represent our internal discussions on the matter. I also wanted to comment on the discussion of the "internal state having the same size as the output." Linus referred to this several times. This is known as narrow-pipe vs wide-pipe in the hash function design literature. Linus is correct that wide-pipe designs are more in favor currently, and IIRC, all of the serious SHA3 candidates employed this. That said, it did seem that in the discussion this was being equated with "length extension attacks." And that connection is just not accurate. Length extension attacks are primarily a motivation for the HMAC-like nested hashing design for MACs, because of a potential forgery attack. Again, this doesn't really matter because the decision has been made despite this discussion. I just wanted to set the record straight about this, as to avoid doing the right thing for the wrong reason (T.S. Eliot's "greatest treason.") One other thing that I wanted to throw out there for the future is that in the crypto community there is currently a very large push toward post-quantum cryptography. Whether the threat of quantum computers is real or imagined, this is a hot area of research, with a NIST competition to select post-quantum asymmetric cryptographic algorithms. That is not directly of concern to the selection of a hash function. However, if we take this threat as legitimate, quantum computers reduce the strength of symmetric crypto, both encryption and hash functions, by 1/2. So, if this is the direction that the crypto community ultimately goes in, 512-bit hashes will be seen as standard over the next decade or so. I don't think that this should be involved in this discussion, presently. I'm just saying that not unlike the time when SHA1 was selected, I think that the replacement of a 256-bit hash is on the horizon as well. 
Thanks, Dan Shumow ^ permalink raw reply [flat|nested] 66+ messages in thread
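Dan's point about the HMAC-style nested construction can be made concrete. The sketch below, using only the Python standard library, builds HMAC-SHA-256 from the two nested hash calls he refers to and checks it against the stdlib hmac module; the key and message values are arbitrary stand-ins. The comment about overhead reflects the "up to a factor 4 for short messages" figure mentioned later in this thread.

```python
import hashlib
import hmac

def hmac_sha256(key: bytes, msg: bytes) -> bytes:
    """HMAC as the nested construction H((K ^ opad) || H((K ^ ipad) || msg)).

    The outer hash over a secret-derived prefix is what defeats length
    extension on Merkle-Damgard hashes like SHA-256: an attacker who can
    extend the inner hash still cannot produce the outer one.
    """
    block_size = 64  # SHA-256 processes 64-byte blocks
    if len(key) > block_size:
        key = hashlib.sha256(key).digest()  # long keys are hashed first
    key = key.ljust(block_size, b"\x00")    # then zero-padded to a full block
    ipad = bytes(b ^ 0x36 for b in key)
    opad = bytes(b ^ 0x5C for b in key)
    inner = hashlib.sha256(ipad + msg).digest()
    # For a short msg this costs about 4 compression calls (2 inner + 2 outer)
    # versus 1 for a bare hash -- the "factor 4" overhead on short messages.
    return hashlib.sha256(opad + inner).digest()

key, msg = b"secret-key", b"commit data"
assert hmac_sha256(key, msg) == hmac.new(key, msg, hashlib.sha256).digest()
```

This is the forgery-prevention construction Dan alludes to; it is orthogonal to the narrow-pipe vs wide-pipe question, which is about the size of the internal chaining state rather than how the hash is keyed.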
* Re: Hash algorithm analysis 2018-07-30 20:01 ` Dan Shumow @ 2018-08-03 2:57 ` Jonathan Nieder 2018-09-18 15:18 ` Joan Daemen 1 sibling, 0 replies; 66+ messages in thread From: Jonathan Nieder @ 2018-08-03 2:57 UTC (permalink / raw) To: Dan Shumow Cc: Johannes Schindelin, brian m. carlson, Junio C Hamano, Linus Torvalds, Edward Thomson, Git Mailing List, demerphq, Adam Langley, keccak@noekeon.org Hi Dan, Dan Shumow wrote: [replying out of order for convenience] > However, I agree with Adam Langley that basically all of the > finalists for a hash function replacement are about the same for the > security needs of Git. I think that, for this community, other > software engineering considerations should be more important to the > selection process. Thanks for this clarification, which provides some useful context to your opinion that was previously relayed by Dscho. [...] > So, as one of the coauthors of the SHA-1 collision detection code, I > just wanted to chime in and say I'm glad to see the move to a longer > hash function. Though, as a cryptographer, I have a few thoughts on > the matter that I thought I would share. > > I think that moving to SHA256 is a fine change, and I support it. More generally, thanks for weighing in and for explaining your rationale. Even (especially) having already made the decision, it's comforting to hear a qualified person endorsing that choice. Sincerely, Jonathan ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Hash algorithm analysis 2018-07-30 20:01 ` Dan Shumow 2018-08-03 2:57 ` Jonathan Nieder @ 2018-09-18 15:18 ` Joan Daemen 2018-09-18 15:32 ` Jonathan Nieder 2018-09-18 16:50 ` Linus Torvalds 1 sibling, 2 replies; 66+ messages in thread From: Joan Daemen @ 2018-09-18 15:18 UTC (permalink / raw) To: Dan Shumow, Johannes Schindelin, brian m. carlson Cc: Junio C Hamano, Linus Torvalds, Edward Thomson, Jonathan Nieder, Git Mailing List, demerphq, Adam Langley Dear all, when going over my todo list I was confronted with the mail of Dan Shumow on the successor of SHA-1 for git. I know the decision was made and it is not my intention to change it, but please see below some comments on Dan's arguments. In short, I argue below that SHA256 has no serious advantages when compared to KangarooTwelve. In that light, the fact that SHA2 was designed behind closed doors (like SHA-1 was) should be enough reason to skip it entirely in an undertaking that takes open-source seriously. Kind regards, Joan PS: In my comments below I use "we" as I discussed them with the members of the Keccak team being Gilles Van Assche, Michaël Peeters, Guido Bertoni, Ronny Van Keer and Seth Hoffert, and we agree on all of them. On 30/07/2018 22:01, Dan Shumow wrote: > Hello all. Johannes, thanks for adding me to this discussion. > > So, as one of the coauthors of the SHA-1 collision detection code, I just wanted to chime in and say I'm glad to see the move to a longer hash function. Though, as a cryptographer, I have a few thoughts on the matter that I thought I would share. > > I think that moving to SHA256 is a fine change, and I support it. > > I'm not anywhere near the expert in this that Joan Daeman is. Note that the correct spelling is "Daemen". But anyway, it is not a matter of being expert, but a matter of taking the arguments at face value. > I am someone who has worked in this space more or less peripherally. 
However, I agree with Adam Langley that basically all of the finalists for a hash function replacement are about the same for the security needs of Git. I think that, for this community, other software engineering considerations should be more important to the selection process. We are also with Adam on the engineering considerations. We see the parallelism that K12 can exploit adaptively (unlike SHA256) as an example of such a consideration. > I think Joan's survey of cryptanalysis papers and the numbers that he gives are interesting, and I had never seen the comparison laid out like that. So, I think that there is a good argument to be made that SHA3 has had more cryptanalysis than SHA2. Though, Joan, are the papers that you surveyed only focused on SHA2? I'm curious if you think that the design/construction of SHA2, as it can be seen as an iteration of MD5/SHA1, means that the cryptanalysis papers on those constructions can be considered to apply to SHA2? This argument works both ways, i.e., the knowledge and experience of the symmetric cryptography community in general has also contributed to our choices in Keccak and in K12 (including the experience gained by Rijndael/AES). But in the end, the only objective metric we have for comparing public scrutiny is the amount of cryptanalysis (and analysis) published, and there Keccak simply scores better. > Again, I'm not an expert in this, but I do know that Marc Steven's techniques for constructing collisions also provided some small cryptanalytic improvements against the SHA2 family as well. I also think that while the paper survey is a good way to look over all of this, the more time in the position of high profile visibility that SHA2 has can give us some confidence as well. High profile visibility to implementers does not mean more cryptanalysis, since users and implementers are usually not cryptanalysts. 
Actually, one of the reasons that SHA2 attracted much less cryptanalysis than you would expect due to its age is that during the SHA3 competition all cryptanalysts pointed their arrows to SHA3 candidates. > Also something worth pointing out is that the connection SHA2 has to SHA1 means that if Marc Steven's cryptanalysis of MD5/SHA-1 were ever successfully applied to SHA2, the SHA1 collision detection approach could be applied there as well, thus providing a drop in replacement in that situation. That said, we don't know that there is not a similar way of addressing issues with the SHA3/Sponge design. It's just that because we haven't seen any weaknesses of this sort in similar designs, we just don't know what a similar approach would be there yet. I don't want to put too much stock in this argument, it's just saying "Well, we already know how SHA2 is likely to break, and we've had fixes for similar things in the past." This is pragmatic but not inspiring or confidence building. > > So, I also want to state my biases in favor of SHA2 as an employee of Microsoft. Microsoft, being a corporation headquartered in a America, with the US Gov't as a major customer definitely prefers to defer to the US Gov't NIST standardization process. And from that perspective SHA2 or SHA3 would be good choices. I, personally, think that the NIST process is the best we have. It is relatively transparent, and NIST employs a fair number of very competent cryptographers. Also, I am encouraged by the widespread international participation that the NIST competitions and selection processes attract. Of course, NIST has done (and is still doing) a great job at organizing public competitions, where all submissions have to include a design rationale and where the final selection is based on extensive openly published cryptanalysis and comparisons done by the cryptographic community. This is obviously AES and SHA3. 
However, NIST also put forward NSA designs as standards, without design rationale or public cryptanalysis whatsoever, and in some cases even with built-in backdoors (EC-DRBG as Dan probably remembers). Examples of this are DES, SHA(0), SHA-1 and, yes, SHA2. The former we would call open-source philosophy and the latter closed-source. > As such, and reflecting this bias, in the internal discussions that Johannes alluded to, SHA2 and SHA3 were the primary suggestions. There was a slight preference for SHA2 because SHA3 is not exposed through the windows cryptographic APIs (though Git does not use those, so this is a nonissue for this discussion.) We find it cynical to bring up a Microsoft-internal argument that is actually not relevant to Git. > I also wanted to thank Johannes for keeping the cryptographers that he discussed this with anonymous. After all, cryptographers are known for being private. And I wanted to say that Johannes did, in fact, accurately represent our internal discussions on the matter. Our experience is that in the cryptographic community there are many outspoken individuals that fearlessly ventilate their opinions (sometimes even controversial ones). > I also wanted to comment on the discussion of the "internal state having the same size as the output." Linus referred to this several times. This is known as narrow-pipe vs wide-pipe in the hash function design literature. Linus is correct that wide-pipe designs are more in favor currently, and IIRC, all of the serious SHA3 candidates employed this. That said, it did seem that in the discussion this was being equated with "length extension attacks." And that connection is just not accurate. Length extension attacks are primarily a motivation of the HMAC liked nested hashing design for MACs, because of a potential forgery attack. Again, this doesn't really matter because the decision has been made despite this discussion. 
I just wanted to set the record straight about this, as to avoid doing the right thing for the wrong reason (T.S. Elliot's "greatest treason.") Indeed, vulnerability to length extension attacks and size of the internal state (chaining value in compression function based designs, or capacity in sponge) are two different things. Still, we would like to make a few points here. 1) There were SHA3 submissions that were narrow-pipe, i.e., finalist Blake is narrow-pipe. 2) SHA2 and its predecessors are vulnerable to length extension, SHA3 (or any of the SHA3 finalists) isn't. Length extension is a problem when using the hash function for MAC computation but this can be fixed by putting a construction on top of it. That construction is HMAC, that comes with some fixed overhead (up to a factor 4 for short messages). 3) The relatively large state in the sponge construction increases the generic strength against attacks when the input contains redundancy or has a certain form. For instance, if the input is restricted to be text in ASCII (such as source code), then the collision-resistance grows higher than the nominal 2^{c/2}. Such an effect does not exist with narrow-pipe Merkle-Damgård. (This may be what Linus had intuitively in mind.) > One other thing that I wanted to throw out there for the future is that in the crypto community there is currently a very large push to post quantum cryptography. Whether the threat of quantum computers is real or imagined this is a hot area of research, with a NIST competition to select post quantum asymmetric cryptographic algorithms. That is not directly of concern to the selection of a hash function. However, if we take this threat as legitimate, quantum computers reduce the strength of symmetric crypto, both encryption and hash functions, by 1/2. This is not what the experts say. In [1] a quantum algorithm is given that reduces the effort to generate a hash collision to 2^{n/3} (instead of 2^{n/2} classically). 
So according to [1] the strength reduction is a factor 2/3 (in the exponent) rather than 1/2. Moreover, in [2], Dan Bernstein takes a more detailed look at the actual cost of that algorithm and argues that the quantum algorithm of [1] performs worse than classical ones and that there is no security reduction at all for collision resistance.

[1] Gilles Brassard, Peter Høyer, Alain Tapp, Quantum cryptanalysis of hash and claw-free functions, in LATIN'98 proceedings (1998), 163–169.

[2] Daniel J. Bernstein, Cost analysis of hash collisions: Will quantum computers make SHARCS obsolete? Workshop Record of SHARCS'09.

> So, if this is the direction that the crypto community ultimately goes in, 512-bit hashes will be seen as standard over the next decade or so. I don't think that this should be involved in this discussion, presently. I'm just saying that, not unlike the time when SHA1 was selected, I think that the replacement of a 256-bit hash is on the horizon as well.
>
> Thanks,
> Dan Shumow

^ permalink raw reply	[flat|nested] 66+ messages in thread
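The exponents being debated above can be sanity-checked with a little arithmetic. This is a sketch of the idealized query counts only — the per-query cost, which Bernstein's paper [2] argues dominates in practice, is deliberately ignored here:

```python
# Idealized generic collision-search exponents for an n-bit hash:
# classical birthday bound ~2^(n/2) vs. the Brassard-Hoyer-Tapp
# quantum algorithm's ~2^(n/3), as cited in the thread.
def collision_exponents(n: int):
    return n / 2, n / 3

for n in (160, 256, 512):
    classical, quantum = collision_exponents(n)
    print(f"n={n}: classical ~2^{classical:.1f}, quantum ~2^{quantum:.1f}")
# n=160: classical ~2^80.0, quantum ~2^53.3
# n=256: classical ~2^128.0, quantum ~2^85.3
# n=512: classical ~2^256.0, quantum ~2^170.7
```

The exponent shrinks by a factor of (n/3)/(n/2) = 2/3, not 1/2 — which is exactly Joan's point that the "halved security" claim overstates the quantum speedup for collisions.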
* Re: Hash algorithm analysis 2018-09-18 15:18 ` Joan Daemen @ 2018-09-18 15:32 ` Jonathan Nieder 2018-09-18 16:50 ` Linus Torvalds 1 sibling, 0 replies; 66+ messages in thread From: Jonathan Nieder @ 2018-09-18 15:32 UTC (permalink / raw) To: Joan Daemen Cc: Dan Shumow, Johannes Schindelin, brian m. carlson, Junio C Hamano, Linus Torvalds, Edward Thomson, Git Mailing List, demerphq, Adam Langley Hi, A quick note. Joan Daemen wrote: > when going over my todo list I was confronted with the mail of Dan > Shumow on the successor of SHA-1 for git. I know the decision was > made and it is not my intention to change it, but please see below > some comments on Dan's arguments. When the time comes for the next hash change in Git, it will be useful to be able to look back over this discussion. Thanks for adding details. [...] > On 30/07/2018 22:01, Dan Shumow wrote: >> So, I also want to state my biases in favor of SHA2 as an employee >> of Microsoft. [...] As such, and reflecting this bias, in the >> internal discussions that Johannes alluded to, SHA2 and SHA3 were >> the primary suggestions. There was a slight preference for SHA2 >> because SHA3 is not exposed through the windows cryptographic APIs >> (though Git does not use those, so this is a nonissue for this >> discussion.) > > We find it cynical to bring up a Microsoft-internal argument that is > actually not relevant to Git. On the contrary, I am quite grateful that Dan was up front about where his preference comes from, *especially* when the reasons are not relevant to Git. It is useful background for better understanding his rationale and understanding the ramifications for some subset of users. In other words, consider someone active in the Git project that disagrees with the decision to use SHA2. This explanation by Dan can help such a person understand where the disagreement is coming from and whether we are making the decision for the wrong reasons (because Git on Windows does not even use those APIs). [...] 
> 3) The relatively large state in the sponge construction increases > the generic strength against attacks when the input contains > redundancy or has a certain form. For instance, if the input is > restricted to be text in ASCII (such as source code), then the > collision-resistance grows higher than the nominal 2^{c/2}. Such an > effect does not exist with narrow-pipe Merkle-Damgård. (This may be > what Linus had intuitively in mind.) Interesting. [...] > [2] Daniel J. Bernstein, Cost analysis of hash collisions: Will > quantum computers make SHARCS obsolete? Workshop Record of > SHARCS'09. I remember that paper! Thanks for the pointer. Sincerely, Jonathan ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Hash algorithm analysis 2018-09-18 15:18 ` Joan Daemen 2018-09-18 15:32 ` Jonathan Nieder @ 2018-09-18 16:50 ` Linus Torvalds 1 sibling, 0 replies; 66+ messages in thread From: Linus Torvalds @ 2018-09-18 16:50 UTC (permalink / raw) To: jda Cc: Dan Shumow, Johannes Schindelin, brian m. carlson, Junio C Hamano, Edward Thomson, Jonathan Nieder, Git Mailing List, demerphq, Adam Langley

On Tue, Sep 18, 2018 at 8:18 AM Joan Daemen <jda@noekeon.org> wrote:
>
> 3) The relatively large state in the sponge construction increases the generic strength against attacks when the input contains redundancy or
> has a certain form. For instance, if the input is restricted to be text in ASCII (such as source code), then the collision-resistance grows
> higher than the nominal 2^{c/2}. Such an effect does not exist with narrow-pipe Merkle-Damgård. (This may be what Linus had intuitively in mind.)

Answering to just this part:

No, what I had in mind was literally just exactly the kind of attack that SHA1 broke for - attacking the internal state vector directly, and not paying any penalty for it, because the state size is the same as the final hash size.

The length extension attack is just the simplest and most trivial version of that kind of attack - because the internal state vector *is* the result, and you just continue using it. But that trivial length extension thing is not the real problem, it's just the absolutely simplest symptom of the real problem.

I think that the model where the internal state of the hash is the same width as the final result is simply broken. It was what broke SHA1, and that problem is shared with SHA2. "Length extension" is just the simplest way to say "broken by design", imho.

Because the length extension attack is just the most trivial attack, but it isn't the fundamental problem. It was just the first and the cheapest attack found, but it was also the most special-cased and least interesting.
You need to have a very special case (with that secret at the beginning etc) to make the pure length extension attack interesting. And git has no secrets, so in that sense "length extension" by itself is totally immaterial. But the basic problem of internal hash size obviously wasn't.

So I would say that length extension is a direct result of the _real_ problem, which is that the hash exposes _all_ of the internal data. That is what makes length extension possible - because you can just continue from a known state, and there is absolutely nothing hidden - and yes, that's a really easy special case where you don't even need to actually break the hash at all.

But I argue that it's _also_ one big part of what made SHAttered practical, and I think the underlying problem is exactly the same. When the internal state is the same size as the hash, you can attack the internal state itself for basically the same cost as attacking the whole hash. So you can pick-and-choose the weakest point. Which is basically exactly what SHAttered did. No, it wasn't the trivial "just add to the end", but it used the exact same underlying weakness as one part of the attack.

*This* is why I dislike SHA2. It has basically the exact same basic weakness that we already know SHA1 fell for. The hashing details are different, and hopefully that means that there aren't the same kind of patterns that can be generated to do the "attack the internal hash state" part, but I don't understand why people seem to ignore that other fundamental issue.

Something like SHA-512/256 would have been better, but I think almost nobody does that in hardware, which was one of the big advantages of plain SHA2.

The main reason I think SHA2 is acceptable is simply that 256 bits is a lot. So even if somebody comes up with a shortcut that weakens it by tens of bits, nobody really cares. Plus I'm obviously not a cryptographer, so I didn't feel like I was going to fight it a lot.
But yes, I'd have probably gone with any of the other alternatives, because I think it's a bit silly that we're switching hashes to another hash that has (at least in part) the *exact* same issue as the one people call broken.

(And yes, the hashing details are different, so it's "exactly the same" only wrt that internal state part - not the bitpattern finding part that made the attack on the internal state much cheaper. Real cryptographers obviously found the "figure out the weakness of the hashing" part to be the more interesting and novel one, over the trivial internal hash size part.)

That said.. The real reason I think SHA2 is the right choice was simply that there needs to be a decision, and none of the choices were *wrong*. Sometimes just the _act_ of making a decision is more important than _what_ the decision is.

And hey, it is also likely that the reason _I_ get hung up on just the size of the internal state is that, exactly because I am _not_ a cryptographer, that kind of high-level stuff is the part I understand. When you start talking about why the exact rules of Merkle–Damgård constructions work, my eyes just glaze over. So I'm probably - no, certainly - myopic and looking at only one part of the issue to begin with.

The end result is that I argued for more bits in the internal state (and apparently wide vs narrow is the technical term), and I would have seen parallel algorithms as a bonus for the large-file case. None of which argued for SHA2. But see above on why I think SHA2 is, if not *the* right choice, at least *a* right choice.

Linus

^ permalink raw reply	[flat|nested] 66+ messages in thread
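Linus's point that a narrow-pipe design "exposes all of the internal data" can be illustrated with a toy Merkle-Damgård hash. This is a deliberately simplified sketch — an 8-byte state, a stand-in compression function built from truncated SHA-256, and simplified padding — not SHA-1 or SHA-2 itself, but it shows the mechanism: because the digest *is* the full chaining state, anyone can resume hashing from a published digest without knowing the original message:

```python
import hashlib

BLOCK = 8          # toy block size, in bytes
IV = b"\x00" * 8   # initial chaining value; digest size == state size (narrow pipe)

def compress(state: bytes, block: bytes) -> bytes:
    # Stand-in compression function: truncated SHA-256 of state||block.
    # Purely illustrative; not SHA-1's or SHA-2's real compression function.
    return hashlib.sha256(state + block).digest()[:8]

def pad(total_len: int) -> bytes:
    # Simplified Merkle-Damgard strengthening: 0x80, zero fill, 8-byte length.
    p = b"\x80"
    while (total_len + len(p) + 8) % BLOCK:
        p += b"\x00"
    return p + total_len.to_bytes(8, "big")

def toy_hash(msg: bytes, state: bytes = IV, prefix_len: int = 0) -> bytes:
    # prefix_len lets a caller resume from a known digest (the exposed state).
    data = msg + pad(prefix_len + len(msg))
    for i in range(0, len(data), BLOCK):
        state = compress(state, data[i:i + BLOCK])
    return state  # the digest IS the full internal state

# Forge H(m1 || glue || suffix) knowing only H(m1) and len(m1), not m1 itself:
m1 = b"original message"
digest = toy_hash(m1)
glue = pad(len(m1))
suffix = b"attacker data"
forged = toy_hash(suffix, state=digest, prefix_len=len(m1) + len(glue))
assert forged == toy_hash(m1 + glue + suffix)
```

A wide-pipe or sponge design truncates (or hides part of) the state before output, so this resumption is not possible there — which is the distinction between "length extension" and "internal state size" that Joan draws upstream, and the shared-cost attack surface Linus is objecting to.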
* [PATCH 0/2] document that NewHash is now SHA-256 2018-07-24 21:13 ` Junio C Hamano 2018-07-24 22:10 ` brian m. carlson @ 2018-07-25 8:30 ` Ævar Arnfjörð Bjarmason 2018-07-25 8:30 ` [PATCH 1/2] doc hash-function-transition: note the lack of a changelog Ævar Arnfjörð Bjarmason 2018-07-25 8:30 ` [PATCH 2/2] doc hash-function-transition: pick SHA-256 as NewHash Ævar Arnfjörð Bjarmason 3 siblings, 0 replies; 66+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2018-07-25 8:30 UTC (permalink / raw) To: git Cc: Junio C Hamano, Linus Torvalds, Edward Thomson, brian m . carlson, Jonathan Nieder, Johannes Schindelin, demerphq, Adam Langley, keccak, Ævar Arnfjörð Bjarmason

On Tue, Jul 24 2018, Junio C Hamano wrote:

> Linus Torvalds <torvalds@linux-foundation.org> writes:
>
>> On Tue, Jul 24, 2018 at 12:01 PM Edward Thomson
>> <ethomson@edwardthomson.com> wrote:
>>>
>>> Switching gears, if I look at this from the perspective of the libgit2
>>> project, I would also prefer SHA-256 or SHA3 over blake2b. To support
>>> blake2b, we'd have to include - and support - that code ourselves. But
>>> to support SHA-256, we would simply use the system's crypto libraries
>>> that we already take a dependency on (OpenSSL, mbedTLS, CryptoNG, or
>>> SecureTransport).
>>
>> I think this is probably the single strongest argument for sha256.
>> "It's just there".
>
> Yup. I actually was leaning toward saying "all of them are OK in
> practice, so the person who is actually spear-heading the work gets
> to choose", but if we picked SHA-256 now, that would not be a choice
> that Brian has to later justify for choosing against everybody
> else's wishes, which makes it the best choice ;-)

Looks like it's settled then. I thought I'd do the grunt work of updating the relevant documentation so we can officially move on from the years-long NewHash discussion.
Ævar Arnfjörð Bjarmason (2): doc hash-function-transition: note the lack of a changelog doc hash-function-transition: pick SHA-256 as NewHash .../technical/hash-function-transition.txt | 192 ++++++++++-------- 1 file changed, 102 insertions(+), 90 deletions(-) -- 2.17.0.290.gded63e768a ^ permalink raw reply [flat|nested] 66+ messages in thread
* [PATCH 1/2] doc hash-function-transition: note the lack of a changelog 2018-07-24 21:13 ` Junio C Hamano 2018-07-24 22:10 ` brian m. carlson 2018-07-25 8:30 ` [PATCH 0/2] document that NewHash is now SHA-256 Ævar Arnfjörð Bjarmason @ 2018-07-25 8:30 ` Ævar Arnfjörð Bjarmason 2018-07-25 8:30 ` [PATCH 2/2] doc hash-function-transition: pick SHA-256 as NewHash Ævar Arnfjörð Bjarmason 3 siblings, 0 replies; 66+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2018-07-25 8:30 UTC (permalink / raw) To: git Cc: Junio C Hamano, Linus Torvalds, Edward Thomson, brian m . carlson, Jonathan Nieder, Johannes Schindelin, demerphq, Adam Langley, keccak, Ævar Arnfjörð Bjarmason The changelog embedded in the document pre-dates the addition of the document to git.git (it used to be a Google Doc), so it only goes up to 752414ae43 ("technical doc: add a design doc for hash function transition", 2017-09-27). Since then I made some small edits to it, which would have been worthy of including in this changelog (but weren't). Instead of amending it to include these, just note that future changes will be noted in the log. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> --- Documentation/technical/hash-function-transition.txt | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/Documentation/technical/hash-function-transition.txt b/Documentation/technical/hash-function-transition.txt index 4ab6cd1012..5ee4754adb 100644 --- a/Documentation/technical/hash-function-transition.txt +++ b/Documentation/technical/hash-function-transition.txt @@ -814,6 +814,12 @@ Incorporated suggestions from jonathantanmy and sbeller: * avoid loose object overhead by packing more aggressively in "git gc --auto" +Later history: + + See the history of this file in git.git for the history of subsequent + edits. 
This document history is no longer being maintained as it + would now be superfluous to the commit log + [1] http://public-inbox.org/git/CA+55aFzJtejiCjV0e43+9oR3QuJK2PiFiLQemytoLpyJWe6P9w@mail.gmail.com/ [2] http://public-inbox.org/git/CA+55aFz+gkAsDZ24zmePQuEs1XPS9BP_s8O7Q4wQ7LV7X5-oDA@mail.gmail.com/ [3] http://public-inbox.org/git/20170306084353.nrns455dvkdsfgo5@sigill.intra.peff.net/ -- 2.17.0.290.gded63e768a ^ permalink raw reply related [flat|nested] 66+ messages in thread
* [PATCH 2/2] doc hash-function-transition: pick SHA-256 as NewHash 2018-07-24 21:13 ` Junio C Hamano ` (2 preceding siblings ...) 2018-07-25 8:30 ` [PATCH 1/2] doc hash-function-transition: note the lack of a changelog Ævar Arnfjörð Bjarmason @ 2018-07-25 8:30 ` Ævar Arnfjörð Bjarmason 2018-07-25 16:45 ` Junio C Hamano 3 siblings, 1 reply; 66+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2018-07-25 8:30 UTC (permalink / raw) To: git Cc: Junio C Hamano, Linus Torvalds, Edward Thomson, brian m . carlson, Jonathan Nieder, Johannes Schindelin, demerphq, Adam Langley, keccak, Ævar Arnfjörð Bjarmason

The consensus on the mailing list seems to be that SHA-256 should be picked as our NewHash; see the "Hash algorithm analysis" thread as of [1]. Linus has come around to this choice and suggested Junio make the final pick [2], and he's endorsed SHA-256 [3].

This follow-up change changes occurrences of "NewHash" to "SHA-256" (or "sha256", depending on the context). The "Selection of a New Hash" section has also been changed to note that historically we used the "NewHash" name while we didn't know what the new hash function would be.

This leaves no use of "NewHash" anywhere in git.git except in the aforementioned section (and as a variable name in t/t9700/test.pl, but that use from 2008 has nothing to do with this transition plan).

1. https://public-inbox.org/git/20180720215220.GB18502@genre.crustytoothpaste.net/
2. https://public-inbox.org/git/CA+55aFwSe9BF8e0hLk9pp3FVD5LaVY5GRdsV3fbNtgzekJadyA@mail.gmail.com/
3.
https://public-inbox.org/git/xmqqzhygwd5o.fsf@gitster-ct.c.googlers.com/ Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> --- .../technical/hash-function-transition.txt | 186 +++++++++--------- 1 file changed, 96 insertions(+), 90 deletions(-) diff --git a/Documentation/technical/hash-function-transition.txt b/Documentation/technical/hash-function-transition.txt index 5ee4754adb..8fb2d4b498 100644 --- a/Documentation/technical/hash-function-transition.txt +++ b/Documentation/technical/hash-function-transition.txt @@ -59,14 +59,11 @@ that are believed to be cryptographically secure. Goals ----- -Where NewHash is a strong 256-bit hash function to replace SHA-1 (see -"Selection of a New Hash", below): - -1. The transition to NewHash can be done one local repository at a time. +1. The transition to SHA-256 can be done one local repository at a time. a. Requiring no action by any other party. - b. A NewHash repository can communicate with SHA-1 Git servers + b. A SHA-256 repository can communicate with SHA-1 Git servers (push/fetch). - c. Users can use SHA-1 and NewHash identifiers for objects + c. Users can use SHA-1 and SHA-256 identifiers for objects interchangeably (see "Object names on the command line", below). d. New signed objects make use of a stronger hash function than SHA-1 for their security guarantees. @@ -79,7 +76,7 @@ Where NewHash is a strong 256-bit hash function to replace SHA-1 (see Non-Goals --------- -1. Add NewHash support to Git protocol. This is valuable and the +1. Add SHA-256 support to Git protocol. This is valuable and the logical next step but it is out of scope for this initial design. 2. Transparently improving the security of existing SHA-1 signed objects. @@ -87,26 +84,26 @@ Non-Goals repository. 4. Taking the opportunity to fix other bugs in Git's formats and protocols. -5. Shallow clones and fetches into a NewHash repository. (This will - change when we add NewHash support to Git protocol.) -6. 
Skip fetching some submodules of a project into a NewHash - repository. (This also depends on NewHash support in Git +5. Shallow clones and fetches into a SHA-256 repository. (This will + change when we add SHA-256 support to Git protocol.) +6. Skip fetching some submodules of a project into a SHA-256 + repository. (This also depends on SHA-256 support in Git protocol.) Overview -------- We introduce a new repository format extension. Repositories with this -extension enabled use NewHash instead of SHA-1 to name their objects. +extension enabled use SHA-256 instead of SHA-1 to name their objects. This affects both object names and object content --- both the names of objects and all references to other objects within an object are switched to the new hash function. -NewHash repositories cannot be read by older versions of Git. +SHA-256 repositories cannot be read by older versions of Git. -Alongside the packfile, a NewHash repository stores a bidirectional -mapping between NewHash and SHA-1 object names. The mapping is generated +Alongside the packfile, a SHA-256 repository stores a bidirectional +mapping between SHA-256 and SHA-1 object names. The mapping is generated locally and can be verified using "git fsck". Object lookups use this -mapping to allow naming objects using either their SHA-1 and NewHash names +mapping to allow naming objects using either their SHA-1 and SHA-256 names interchangeably. "git cat-file" and "git hash-object" gain options to display an object @@ -116,7 +113,7 @@ object database so that they can be named using the appropriate name (using the bidirectional hash mapping). Fetches from a SHA-1 based server convert the fetched objects into -NewHash form and record the mapping in the bidirectional mapping table +SHA-256 form and record the mapping in the bidirectional mapping table (see below for details). 
Pushes to a SHA-1 based server convert the objects being pushed into sha1 form so the server does not have to be aware of the hash function the client is using. @@ -125,19 +122,19 @@ Detailed Design --------------- Repository format extension ~~~~~~~~~~~~~~~~~~~~~~~~~~~ -A NewHash repository uses repository format version `1` (see +A SHA-256 repository uses repository format version `1` (see Documentation/technical/repository-version.txt) with extensions `objectFormat` and `compatObjectFormat`: [core] repositoryFormatVersion = 1 [extensions] - objectFormat = newhash + objectFormat = sha256 compatObjectFormat = sha1 The combination of setting `core.repositoryFormatVersion=1` and populating `extensions.*` ensures that all versions of Git later than -`v0.99.9l` will die instead of trying to operate on the NewHash +`v0.99.9l` will die instead of trying to operate on the SHA-256 repository, instead producing an error message. # Between v0.99.9l and v2.7.0 @@ -155,36 +152,36 @@ repository extensions. Object names ~~~~~~~~~~~~ Objects can be named by their 40 hexadecimal digit sha1-name or 64 -hexadecimal digit newhash-name, plus names derived from those (see +hexadecimal digit sha256-name, plus names derived from those (see gitrevisions(7)). The sha1-name of an object is the SHA-1 of the concatenation of its type, length, a nul byte, and the object's sha1-content. This is the traditional <sha1> used in Git to name objects. -The newhash-name of an object is the NewHash of the concatenation of its -type, length, a nul byte, and the object's newhash-content. +The sha256-name of an object is the SHA-256 of the concatenation of its +type, length, a nul byte, and the object's sha256-content. 
Object format ~~~~~~~~~~~~~ The content as a byte sequence of a tag, commit, or tree object named -by sha1 and newhash differ because an object named by newhash-name refers to -other objects by their newhash-names and an object named by sha1-name +by sha1 and sha256 differ because an object named by sha256-name refers to +other objects by their sha256-names and an object named by sha1-name refers to other objects by their sha1-names. -The newhash-content of an object is the same as its sha1-content, except -that objects referenced by the object are named using their newhash-names +The sha256-content of an object is the same as its sha1-content, except +that objects referenced by the object are named using their sha256-names instead of sha1-names. Because a blob object does not refer to any -other object, its sha1-content and newhash-content are the same. +other object, its sha1-content and sha256-content are the same. -The format allows round-trip conversion between newhash-content and +The format allows round-trip conversion between sha256-content and sha1-content. Object storage ~~~~~~~~~~~~~~ Loose objects use zlib compression and packed objects use the packed format described in Documentation/technical/pack-format.txt, just like -today. The content that is compressed and stored uses newhash-content +today. The content that is compressed and stored uses sha256-content instead of sha1-content. Pack index @@ -255,10 +252,10 @@ network byte order): up to and not including the table of CRC32 values. - Zero or more NUL bytes. - The trailer consists of the following: - - A copy of the 20-byte NewHash checksum at the end of the + - A copy of the 20-byte SHA-256 checksum at the end of the corresponding packfile. - - 20-byte NewHash checksum of all of the above. + - 20-byte SHA-256 checksum of all of the above. Loose object index ~~~~~~~~~~~~~~~~~~ @@ -266,7 +263,7 @@ A new file $GIT_OBJECT_DIR/loose-object-idx contains information about all loose objects. 
Its format is # loose-object-idx - (newhash-name SP sha1-name LF)* + (sha256-name SP sha1-name LF)* where the object names are in hexadecimal format. The file is not sorted. @@ -292,8 +289,8 @@ To remove entries (e.g. in "git pack-refs" or "git-prune"): Translation table ~~~~~~~~~~~~~~~~~ The index files support a bidirectional mapping between sha1-names -and newhash-names. The lookup proceeds similarly to ordinary object -lookups. For example, to convert a sha1-name to a newhash-name: +and sha256-names. The lookup proceeds similarly to ordinary object +lookups. For example, to convert a sha1-name to a sha256-name: 1. Look for the object in idx files. If a match is present in the idx's sorted list of truncated sha1-names, then: @@ -301,8 +298,8 @@ lookups. For example, to convert a sha1-name to a newhash-name: name order mapping. b. Read the corresponding entry in the full sha1-name table to verify we found the right object. If it is, then - c. Read the corresponding entry in the full newhash-name table. - That is the object's newhash-name. + c. Read the corresponding entry in the full sha256-name table. + That is the object's sha256-name. 2. Check for a loose object. Read lines from loose-object-idx until we find a match. @@ -318,25 +315,25 @@ for all objects in the object store. Reading an object's sha1-content ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -The sha1-content of an object can be read by converting all newhash-names -its newhash-content references to sha1-names using the translation table. +The sha1-content of an object can be read by converting all sha256-names +its sha256-content references to sha1-names using the translation table. Fetch ~~~~~ Fetching from a SHA-1 based server requires translating between SHA-1 -and NewHash based representations on the fly. +and SHA-256 based representations on the fly. 
SHA-1s named in the ref advertisement that are present on the client -can be translated to NewHash and looked up as local objects using the +can be translated to SHA-256 and looked up as local objects using the translation table. Negotiation proceeds as today. Any "have"s generated locally are converted to SHA-1 before being sent to the server, and SHA-1s -mentioned by the server are converted to NewHash when looking them up +mentioned by the server are converted to SHA-256 when looking them up locally. After negotiation, the server sends a packfile containing the -requested objects. We convert the packfile to NewHash format using +requested objects. We convert the packfile to SHA-256 format using the following steps: 1. index-pack: inflate each object in the packfile and compute its @@ -351,12 +348,12 @@ the following steps: (This list only contains objects reachable from the "wants". If the pack from the server contained additional extraneous objects, then they will be discarded.) -3. convert to newhash: open a new (newhash) packfile. Read the topologically +3. convert to sha256: open a new (sha256) packfile. Read the topologically sorted list just generated. For each object, inflate its - sha1-content, convert to newhash-content, and write it to the newhash - pack. Record the new sha1<->newhash mapping entry for use in the idx. + sha1-content, convert to sha256-content, and write it to the sha256 + pack. Record the new sha1<->sha256 mapping entry for use in the idx. 4. sort: reorder entries in the new pack to match the order of objects - in the pack the server generated and include blobs. Write a newhash idx + in the pack the server generated and include blobs. Write a sha256 idx file 5. clean up: remove the SHA-1 based pack file, index, and topologically sorted list obtained from the server in steps 1 @@ -388,16 +385,16 @@ send-pack. 
Signed Commits ~~~~~~~~~~~~~~ -We add a new field "gpgsig-newhash" to the commit object format to allow +We add a new field "gpgsig-sha256" to the commit object format to allow signing commits without relying on SHA-1. It is similar to the -existing "gpgsig" field. Its signed payload is the newhash-content of the -commit object with any "gpgsig" and "gpgsig-newhash" fields removed. +existing "gpgsig" field. Its signed payload is the sha256-content of the +commit object with any "gpgsig" and "gpgsig-sha256" fields removed. This means commits can be signed 1. using SHA-1 only, as in existing signed commit objects -2. using both SHA-1 and NewHash, by using both gpgsig-newhash and gpgsig +2. using both SHA-1 and SHA-256, by using both gpgsig-sha256 and gpgsig fields. -3. using only NewHash, by only using the gpgsig-newhash field. +3. using only SHA-256, by only using the gpgsig-sha256 field. Old versions of "git verify-commit" can verify the gpgsig signature in cases (1) and (2) without modifications and view case (3) as an @@ -405,24 +402,24 @@ ordinary unsigned commit. Signed Tags ~~~~~~~~~~~ -We add a new field "gpgsig-newhash" to the tag object format to allow +We add a new field "gpgsig-sha256" to the tag object format to allow signing tags without relying on SHA-1. Its signed payload is the -newhash-content of the tag with its gpgsig-newhash field and "-----BEGIN PGP +sha256-content of the tag with its gpgsig-sha256 field and "-----BEGIN PGP SIGNATURE-----" delimited in-body signature removed. This means tags can be signed 1. using SHA-1 only, as in existing signed tag objects -2. using both SHA-1 and NewHash, by using gpgsig-newhash and an in-body +2. using both SHA-1 and SHA-256, by using gpgsig-sha256 and an in-body signature. -3. using only NewHash, by only using the gpgsig-newhash field. +3. using only SHA-256, by only using the gpgsig-sha256 field. 
Mergetag embedding ~~~~~~~~~~~~~~~~~~ The mergetag field in the sha1-content of a commit contains the sha1-content of a tag that was merged by that commit. -The mergetag field in the newhash-content of the same commit contains the -newhash-content of the same tag. +The mergetag field in the sha256-content of the same commit contains the +sha256-content of the same tag. Submodules ~~~~~~~~~~ @@ -497,7 +494,7 @@ Caveats ------- Invalid objects ~~~~~~~~~~~~~~~ -The conversion from sha1-content to newhash-content retains any +The conversion from sha1-content to sha256-content retains any brokenness in the original object (e.g., tree entry modes encoded with leading 0, tree objects whose paths are not sorted correctly, and commit objects without an author or committer). This is a deliberate @@ -516,7 +513,7 @@ allow lifting this restriction. Alternates ~~~~~~~~~~ -For the same reason, a newhash repository cannot borrow objects from a +For the same reason, a sha256 repository cannot borrow objects from a sha1 repository using objects/info/alternates or $GIT_ALTERNATE_OBJECT_REPOSITORIES. @@ -524,20 +521,20 @@ git notes ~~~~~~~~~ The "git notes" tool annotates objects using their sha1-name as key. This design does not describe a way to migrate notes trees to use -newhash-names. That migration is expected to happen separately (for +sha256-names. That migration is expected to happen separately (for example using a file at the root of the notes tree to describe which hash it uses). Server-side cost ~~~~~~~~~~~~~~~~ -Until Git protocol gains NewHash support, using NewHash based storage +Until Git protocol gains SHA-256 support, using SHA-256 based storage on public-facing Git servers is strongly discouraged. 
Once Git -protocol gains NewHash support, NewHash based servers are likely not +protocol gains SHA-256 support, SHA-256 based servers are likely not to support SHA-1 compatibility, to avoid what may be a very expensive hash reencode during clone and to encourage peers to modernize. The design described here allows fetches by SHA-1 clients of a -personal NewHash repository because it's not much more difficult than +personal SHA-256 repository because it's not much more difficult than allowing pushes from that repository. This support needs to be guarded by a configuration option --- servers like git.kernel.org that serve a large number of clients would not be expected to bear that cost. @@ -547,7 +544,7 @@ Meaning of signatures The signed payload for signed commits and tags does not explicitly name the hash used to identify objects. If some day Git adopts a new hash function with the same length as the current SHA-1 (40 -hexadecimal digit) or NewHash (64 hexadecimal digit) objects then the +hexadecimal digit) or SHA-256 (64 hexadecimal digit) objects then the intent behind the PGP signed payload in an object signature is unclear: @@ -562,7 +559,7 @@ Does this mean Git v2.12.0 is the commit with sha1-name e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7 or the commit with new-40-digit-hash-name e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7? -Fortunately NewHash and SHA-1 have different lengths. If Git starts +Fortunately SHA-256 and SHA-1 have different lengths. If Git starts using another hash with the same length to name objects, then it will need to change the format of signed payloads using that hash to address this issue. @@ -574,24 +571,24 @@ supports four different modes of operation: 1. ("dark launch") Treat object names input by the user as SHA-1 and convert any object names written to output to SHA-1, but store - objects using NewHash. This allows users to test the code with no + objects using SHA-256. 
This allows users to test the code with no visible behavior change except for performance. This allows running even tests that assume the SHA-1 hash function, to sanity-check the behavior of the new mode. - 2. ("early transition") Allow both SHA-1 and NewHash object names in + 2. ("early transition") Allow both SHA-1 and SHA-256 object names in input. Any object names written to output use SHA-1. This allows users to continue to make use of SHA-1 to communicate with peers (e.g. by email) that have not migrated yet and prepares for mode 3. - 3. ("late transition") Allow both SHA-1 and NewHash object names in - input. Any object names written to output use NewHash. In this + 3. ("late transition") Allow both SHA-1 and SHA-256 object names in + input. Any object names written to output use SHA-256. In this mode, users are using a more secure object naming method by default. The disruption is minimal as long as most of their peers are in mode 2 or mode 3. 4. ("post-transition") Treat object names input by the user as - NewHash and write output using NewHash. This is safer than mode 3 + SHA-256 and write output using SHA-256. This is safer than mode 3 because there is less risk that input is incorrectly interpreted using the wrong hash function. @@ -601,7 +598,7 @@ The user can also explicitly specify which format to use for a particular revision specifier and for output, overriding the mode. For example: -git --output-format=sha1 log abac87a^{sha1}..f787cac^{newhash} +git --output-format=sha1 log abac87a^{sha1}..f787cac^{sha256} Selection of a New Hash ----------------------- @@ -611,6 +608,10 @@ collisions in 2^69 operations. In August they published details. Luckily, no practical demonstrations of a collision in full SHA-1 were published until 10 years later, in 2017. +It was decided that Git needed to transition to a new hash +function. Initially no decision was made as to what function this was, +the "NewHash" placeholder name was picked to describe it. 
+ The hash function NewHash to replace SHA-1 should be stronger than SHA-1 was: we would like it to be trustworthy and useful in practice for at least 10 years. @@ -630,14 +631,19 @@ Some other relevant properties: 4. As a tiebreaker, the hash should be fast to compute (fortunately many contenders are faster than SHA-1). -Some hashes under consideration are SHA-256, SHA-512/256, SHA-256x16, +Some hashes under consideration were SHA-256, SHA-512/256, SHA-256x16, K12, and BLAKE2bp-256. +Eventually in July 2018 SHA-256 was chosen to be the NewHash. See the +thread starting at <20180609224913.GC38834@genre.crustytoothpaste.net> +for the discussion +(https://public-inbox.org/git/20180609224913.GC38834@genre.crustytoothpaste.net/) + Transition plan --------------- Some initial steps can be implemented independently of one another: - adding a hash function API (vtable) -- teaching fsck to tolerate the gpgsig-newhash field +- teaching fsck to tolerate the gpgsig-sha256 field - excluding gpgsig-* from the fields copied by "git commit --amend" - annotating tests that depend on SHA-1 values with a SHA1 test prerequisite @@ -664,7 +670,7 @@ Next comes introduction of compatObjectFormat: - adding appropriate index entries when adding a new object to the object store - --output-format option -- ^{sha1} and ^{newhash} revision notation +- ^{sha1} and ^{sha256} revision notation - configuration to specify default input and output format (see "Object names on the command line" above) @@ -672,7 +678,7 @@ The next step is supporting fetches and pushes to SHA-1 repositories: - allow pushes to a repository using the compat format - generate a topologically sorted list of the SHA-1 names of fetched objects -- convert the fetched packfile to newhash format and generate an idx +- convert the fetched packfile to sha256 format and generate an idx file - re-sort to match the order of objects in the fetched packfile @@ -680,30 +686,30 @@ The infrastructure supporting fetch also allows 
converting an existing repository. In converted repositories and new clones, end users can gain support for the new hash function without any visible change in behavior (see "dark launch" in the "Object names on the command line" -section). In particular this allows users to verify NewHash signatures +section). In particular this allows users to verify SHA-256 signatures on objects in the repository, and it should ensure the transition code is stable in production in preparation for using it more widely. Over time projects would encourage their users to adopt the "early transition" and then "late transition" modes to take advantage of the -new, more futureproof NewHash object names. +new, more futureproof SHA-256 object names. When objectFormat and compatObjectFormat are both set, commands -generating signatures would generate both SHA-1 and NewHash signatures +generating signatures would generate both SHA-1 and SHA-256 signatures by default to support both new and old users. -In projects using NewHash heavily, users could be encouraged to adopt +In projects using SHA-256 heavily, users could be encouraged to adopt the "post-transition" mode to avoid accidentally making implicit use of SHA-1 object names. Once a critical mass of users have upgraded to a version of Git that -can verify NewHash signatures and have converted their existing +can verify SHA-256 signatures and have converted their existing repositories to support verifying them, we can add support for a -setting to generate only NewHash signatures. This is expected to be at +setting to generate only SHA-256 signatures. This is expected to be at least a year later. That is also a good moment to advertise the ability to convert -repositories to use NewHash only, stripping out all SHA-1 related +repositories to use SHA-256 only, stripping out all SHA-1 related metadata. 
This improves performance by eliminating translation overhead and security by avoiding the possibility of accidentally relying on the safety of SHA-1. @@ -742,16 +748,16 @@ using the old hash function. Signed objects with multiple hashes ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Instead of introducing the gpgsig-newhash field in commit and tag objects -for newhash-content based signatures, an earlier version of this design -added "hash newhash <newhash-name>" fields to strengthen the existing +Instead of introducing the gpgsig-sha256 field in commit and tag objects +for sha256-content based signatures, an earlier version of this design +added "hash sha256 <sha256-name>" fields to strengthen the existing sha1-content based signatures. In other words, a single signature was used to attest to the object content using both hash functions. This had some advantages: * Using one signature instead of two speeds up the signing process. * Having one signed payload with both hashes allows the signer to - attest to the sha1-name and newhash-name referring to the same object. + attest to the sha1-name and sha256-name referring to the same object. * All users consume the same signature. Broken signatures are likely to be detected quickly using current versions of git. @@ -760,11 +766,11 @@ However, it also came with disadvantages: objects it references, even after the transition is complete and translation table is no longer needed for anything else. To support this, the design added fields such as "hash sha1 tree <sha1-name>" - and "hash sha1 parent <sha1-name>" to the newhash-content of a signed + and "hash sha1 parent <sha1-name>" to the sha256-content of a signed commit, complicating the conversion process. 
* Allowing signed objects without a sha1 (for after the transition is complete) complicated the design further, requiring a "nohash sha1" - field to suppress including "hash sha1" fields in the newhash-content + field to suppress including "hash sha1" fields in the sha256-content and signed payload. Lazily populated translation table @@ -772,7 +778,7 @@ Lazily populated translation table Some of the work of building the translation table could be deferred to push time, but that would significantly complicate and slow down pushes. Calculating the sha1-name at object creation time at the same time it is -being streamed to disk and having its newhash-name calculated should be +being streamed to disk and having its sha256-name calculated should be an acceptable cost. Document History -- 2.17.0.290.gded63e768a ^ permalink raw reply related [flat|nested] 66+ messages in thread
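The sha1-name/sha256-name scheme the patch above converts between uses the same layout for both functions: hash a "<type> <length>\0" header followed by the object content. A minimal sketch with Python's hashlib (`object_name` is a hypothetical helper, not a Git API; note that for blobs the sha1-content and sha256-content are byte-identical, while trees and commits would first need their embedded object names translated via the mapping table):

```python
import hashlib

def object_name(obj_type: str, content: bytes, algo: str) -> str:
    # Git names an object by hashing "<type> <size>\0" plus the raw
    # content; only the hash function differs between the sha1-name
    # and the sha256-name of an object.
    header = f"{obj_type} {len(content)}".encode() + b"\x00"
    return hashlib.new(algo, header + content).hexdigest()

blob = b"hello\n"
print(object_name("blob", blob, "sha1"))
# -> ce013625030ba8dba906f756967f9e9ca394464a (40 hex digits)
print(object_name("blob", blob, "sha256"))
# -> 64 hex digits
```

The same bytes feed both hashes here only because blobs reference no other objects, as the design's "Object format" section spells out.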
* Re: [PATCH 2/2] doc hash-function-transition: pick SHA-256 as NewHash 2018-07-25 8:30 ` [PATCH 2/2] doc hash-function-transition: pick SHA-256 as NewHash Ævar Arnfjörð Bjarmason @ 2018-07-25 16:45 ` Junio C Hamano 2018-07-25 17:25 ` Jonathan Nieder 2018-07-25 22:56 ` [PATCH " brian m. carlson 0 siblings, 2 replies; 66+ messages in thread From: Junio C Hamano @ 2018-07-25 16:45 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: git, Linus Torvalds, Edward Thomson, brian m . carlson, Jonathan Nieder, Johannes Schindelin, demerphq, Adam Langley, keccak Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: > @@ -125,19 +122,19 @@ Detailed Design > --------------- > Repository format extension > ~~~~~~~~~~~~~~~~~~~~~~~~~~~ > -A NewHash repository uses repository format version `1` (see > +A SHA-256 repository uses repository format version `1` (see > Documentation/technical/repository-version.txt) with extensions > `objectFormat` and `compatObjectFormat`: > > [core] > repositoryFormatVersion = 1 > [extensions] > - objectFormat = newhash > + objectFormat = sha256 > compatObjectFormat = sha1 Whenever we said SHA1, somebody came and told us that the name of the hash is SHA-1 (with dash). Would we be nitpicker-prone in the same way with "sha256" here? > @@ -155,36 +152,36 @@ repository extensions. > Object names > ~~~~~~~~~~~~ > Objects can be named by their 40 hexadecimal digit sha1-name or 64 > -hexadecimal digit newhash-name, plus names derived from those (see > +hexadecimal digit sha256-name, plus names derived from those (see > gitrevisions(7)). Seeing this hunk makes me respond to the above question with another question: "having to write sha-256-name, sha-1-name, gpgsig-sha-256, and sha-256-content is sort of ugly, no?" I guess names with two dashes are not _too_ bad, so I dunno. > Selection of a New Hash > ----------------------- > @@ -611,6 +608,10 @@ collisions in 2^69 operations. In August they published details. 
> Luckily, no practical demonstrations of a collision in full SHA-1 were > published until 10 years later, in 2017. > > +It was decided that Git needed to transition to a new hash > +function. Initially no decision was made as to what function this was, > +the "NewHash" placeholder name was picked to describe it. > + > The hash function NewHash to replace SHA-1 should be stronger than > SHA-1 was: we would like it to be trustworthy and useful in practice > for at least 10 years. This sentence needs a bit of updating to match the new paragraph inserted above. "should be stronger" is something said by those who are still looking for one and/or trying to decide. Perhaps something like this? ... the "NewHash" placeholder name was used to describe it. We wanted to choose a hash function to replace SHA-1 that is stronger than SHA-1 was, and would like it to be trustworthy and useful in practice for at least 10 years. Some other relevant properties we wanted in NewHash are: > @@ -630,14 +631,19 @@ Some other relevant properties: > 4. As a tiebreaker, the hash should be fast to compute (fortunately > many contenders are faster than SHA-1). > > -Some hashes under consideration are SHA-256, SHA-512/256, SHA-256x16, > +Some hashes under consideration were SHA-256, SHA-512/256, SHA-256x16, > K12, and BLAKE2bp-256. > > +Eventually in July 2018 SHA-256 was chosen to be the NewHash. See the > +thread starting at <20180609224913.GC38834@genre.crustytoothpaste.net> > +for the discussion > +(https://public-inbox.org/git/20180609224913.GC38834@genre.crustytoothpaste.net/) > + ^ permalink raw reply [flat|nested] 66+ messages in thread
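The bidirectional sha1<->sha256 mapping discussed in the design (stored alongside pack indices and in the loose-object-idx) can be sketched as a toy in-memory table; real Git uses sorted on-disk idx tables, so the class below is illustrative only. Because full SHA-1 and SHA-256 names have different hex lengths (40 vs 64 digits), the direction of a lookup can be inferred from the input:

```python
class TranslationTable:
    """Toy bidirectional sha1-name <-> sha256-name mapping (illustrative;
    not how Git lays this out on disk)."""

    def __init__(self):
        self._sha1_to_sha256 = {}
        self._sha256_to_sha1 = {}

    def add(self, sha1_name: str, sha256_name: str) -> None:
        # Record both directions, as the idx files do for each object.
        self._sha1_to_sha256[sha1_name] = sha256_name
        self._sha256_to_sha1[sha256_name] = sha1_name

    def translate(self, name: str) -> str:
        # 40 hex digits can only be a sha1-name, 64 only a sha256-name.
        if len(name) == 40:
            return self._sha1_to_sha256[name]
        if len(name) == 64:
            return self._sha256_to_sha1[name]
        raise ValueError("not a full hexadecimal object name")
```

A fetch from a SHA-1 server, for instance, would call something like `add()` once per converted object and `translate()` when printing output in the compat format.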
* Re: [PATCH 2/2] doc hash-function-transition: pick SHA-256 as NewHash 2018-07-25 16:45 ` Junio C Hamano @ 2018-07-25 17:25 ` Jonathan Nieder 2018-07-25 21:32 ` Junio C Hamano 2018-07-25 22:56 ` [PATCH " brian m. carlson 1 sibling, 1 reply; 66+ messages in thread From: Jonathan Nieder @ 2018-07-25 17:25 UTC (permalink / raw) To: Junio C Hamano Cc: Ævar Arnfjörð Bjarmason, git, Linus Torvalds, Edward Thomson, brian m . carlson, Johannes Schindelin, demerphq, Adam Langley, keccak Hi, Junio C Hamano wrote: > Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: >> The consensus on the mailing list seems to be that SHA-256 should be >> picked as our NewHash, see the "Hash algorithm analysis" thread as of >> [1]. Linus has come around to this choice and suggested Junio make the >> final pick, and he's endorsed SHA-256 [3]. I think this commit message focuses too much on the development process, in a way that makes it not necessarily useful to the target audience that would be finding it with "git blame" or "git log". It's also not self-contained, which makes it less useful in the same way. In other words, the commit message should be speaking for the project, not speaking about the project. I would be tempted to say something as simple as hash-function-transition: pick SHA-256 as NewHash The project has decided. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> and let any Acked-bys on the message speak for themselves. Alternatively, the commit message could include a summary of the discussion: From a security perspective, it seems that SHA-256, BLAKE2, SHA3-256, K12, and so on are all believed to have similar security properties. All are good options from a security point of view. SHA-256 has a number of advantages: * It has been around for a while, is widely used, and is supported by just about every single crypto library (OpenSSL, mbedTLS, CryptoNG, SecureTransport, etc). 
* When you compare against SHA1DC, most vectorized SHA-256 implementations are indeed faster, even without acceleration. * If we're doing signatures with OpenPGP (or even, I suppose, CMS), we're going to be using SHA-2, so it doesn't make sense to have our security depend on two separate algorithms when either one of them alone could break the security when we could just depend on one. So SHA-256 it is. [...] >> @@ -125,19 +122,19 @@ Detailed Design >> --------------- >> Repository format extension >> ~~~~~~~~~~~~~~~~~~~~~~~~~~~ >> -A NewHash repository uses repository format version `1` (see >> +A SHA-256 repository uses repository format version `1` (see >> Documentation/technical/repository-version.txt) with extensions >> `objectFormat` and `compatObjectFormat`: >> >> [core] >> repositoryFormatVersion = 1 >> [extensions] >> - objectFormat = newhash >> + objectFormat = sha256 >> compatObjectFormat = sha1 > > Whenever we said SHA1, somebody came and told us that the name of > the hash is SHA-1 (with dash). Would we be nitpicker-prone in the > same way with "sha256" here? Regardless of how we spell it in prose, I think `sha256` as an identifier in configuration is the spelling people will expect. For example, gpg ("gpg --version") calls it SHA256. [...] >> Selection of a New Hash >> ----------------------- >> @@ -611,6 +608,10 @@ collisions in 2^69 operations. In August they published details. >> Luckily, no practical demonstrations of a collision in full SHA-1 were >> published until 10 years later, in 2017. >> >> +It was decided that Git needed to transition to a new hash >> +function. Initially no decision was made as to what function this was, >> +the "NewHash" placeholder name was picked to describe it. >> + >> The hash function NewHash to replace SHA-1 should be stronger than >> SHA-1 was: we would like it to be trustworthy and useful in practice >> for at least 10 years. > > This sentence needs a bit of updating to match the new paragraph > inserted above. 
"should be stronger" is something said by those > who are still looking for one and/or trying to decide. For what it's worth, I would be in favor of modifying the section more heavily. For example: Choice of Hash -------------- In early 2005, around the time that Git was written, Xiaoyun Wang, Yiqun Lisa Yin, and Hongbo Yu announced an attack finding SHA-1 collisions in 2^69 operations. In August they published details. Luckily, no practical demonstrations of a collision in full SHA-1 were published until 10 years later, in 2017. Git v2.13.0 and later subsequently moved to a hardened SHA-1 implementation by default that mitigates the SHAttered attack, but SHA-1 is still believed to be weak. The hash to replace this hardened SHA-1 should be stronger than SHA-1 was: we would like it to be trustworthy and useful in practice for at least 10 years. Some other relevant properties: 1. A 256-bit hash (long enough to match common security practice; not excessively long to hurt performance and disk usage). 2. High quality implementations should be widely available (e.g., in OpenSSL and Apple CommonCrypto). 3. The hash function's properties should match Git's needs (e.g. Git requires collision and 2nd preimage resistance and does not require length extension resistance). 4. As a tiebreaker, the hash should be fast to compute (fortunately many contenders are faster than SHA-1). We choose SHA-256. Changes: - retitled since the hash function has already been selected - added some notes about sha1dc - when discussing wide implementation availability, mentioned CommonCrypto too, as an example of a non-OpenSSL library that the libgit2 authors care about - named which function is chosen We could put the runners up in the "alternatives considered" section, but I don't think there's much to say about them here so I wouldn't. Thanks and hope that helps, Jonathan ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH 2/2] doc hash-function-transition: pick SHA-256 as NewHash 2018-07-25 17:25 ` Jonathan Nieder @ 2018-07-25 21:32 ` Junio C Hamano 2018-07-26 13:41 ` [PATCH v2 " Ævar Arnfjörð Bjarmason 0 siblings, 1 reply; 66+ messages in thread From: Junio C Hamano @ 2018-07-25 21:32 UTC (permalink / raw) To: Jonathan Nieder Cc: Ævar Arnfjörð Bjarmason, git, Linus Torvalds, Edward Thomson, brian m . carlson, Johannes Schindelin, demerphq, Adam Langley, keccak Jonathan Nieder <jrnieder@gmail.com> writes: > Regardless of how we spell it in prose, I think `sha256` as an > identifier in configuration is the spelling people will expect. For > example, gpg ("gpg --version") calls it SHA256. OK. > For what it's worth, I would be in favor of modifying the section > more heavily. For example: > ... > Changes: > > - retitled since the hash function has already been selected > - added some notes about sha1dc > - when discussing wide implementation availability, mentioned > CommonCrypto too, as an example of a non-OpenSSL library that the > libgit2 authors care about > - named which function is chosen > > We could put the runners up in the "alternatives considered" section, > but I don't think there's much to say about them here so I wouldn't. All interesting ideas and good suggestions. I'll leave 2/2 in the mail archive and take only 1/2 for now. I'd expect the final version, not too soon after mulling over the suggestions raised here, but not in too distant future to prevent us from forgetting ;-) Thanks. ^ permalink raw reply [flat|nested] 66+ messages in thread
* [PATCH v2 2/2] doc hash-function-transition: pick SHA-256 as NewHash 2018-07-25 21:32 ` Junio C Hamano @ 2018-07-26 13:41 ` Ævar Arnfjörð Bjarmason 2018-08-03 7:20 ` Jonathan Nieder 0 siblings, 1 reply; 66+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2018-07-26 13:41 UTC (permalink / raw) To: git Cc: Junio C Hamano, Linus Torvalds, Edward Thomson, brian m . carlson, Jonathan Nieder, Johannes Schindelin, demerphq, Adam Langley, keccak, Ævar Arnfjörð Bjarmason From a security perspective, it seems that SHA-256, BLAKE2, SHA3-256, K12, and so on are all believed to have similar security properties. All are good options from a security point of view. SHA-256 has a number of advantages: * It has been around for a while, is widely used, and is supported by just about every single crypto library (OpenSSL, mbedTLS, CryptoNG, SecureTransport, etc). * When you compare against SHA1DC, most vectorized SHA-256 implementations are indeed faster, even without acceleration. * If we're doing signatures with OpenPGP (or even, I suppose, CMS), we're going to be using SHA-2, so it doesn't make sense to have our security depend on two separate algorithms when either one of them alone could break the security when we could just depend on one. So SHA-256 it is. See the "Hash algorithm analysis" thread as of [1]. Linus has come around to this choice and suggested Junio make the final pick, and he's endorsed SHA-256 [3]. This follow-up change changes occurrences of "NewHash" to "SHA-256" (or "sha256", depending on the context). The "Selection of a New Hash" section has also been changed to note that historically we used the "NewHash" name while we didn't know what the new hash function would be. This leaves no use of "NewHash" anywhere in git.git except in the aforementioned section (and as a variable name in t/t9700/test.pl, but that use from 2008 has nothing to do with this transition plan). 1. https://public-inbox.org/git/20180720215220.GB18502@genre.crustytoothpaste.net/ 2. 
https://public-inbox.org/git/CA+55aFwSe9BF8e0hLk9pp3FVD5LaVY5GRdsV3fbNtgzekJadyA@mail.gmail.com/ 3. https://public-inbox.org/git/xmqqzhygwd5o.fsf@gitster-ct.c.googlers.com/ Helped-by: Jonathan Nieder <jrnieder@gmail.com> Helped-by: Junio C Hamano <gitster@pobox.com> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> --- On Wed, Jul 25 2018, Junio C Hamano wrote: > Jonathan Nieder <jrnieder@gmail.com> writes: > [...] > All interesting ideas and good suggestions. I'll leave 2/2 in the > mail archive and take only 1/2 for now. I'd expect the final > version, not too soon after mulling over the suggestions raised > here, but not in too distant future to prevent us from forgetting > ;-) Here's a v2 which uses the suggestions for both the commit message & documentation from Jonathan and yourself. Thanks! .../technical/hash-function-transition.txt | 196 +++++++++--------- 1 file changed, 99 insertions(+), 97 deletions(-) diff --git a/Documentation/technical/hash-function-transition.txt b/Documentation/technical/hash-function-transition.txt index 5ee4754adb..5041e57273 100644 --- a/Documentation/technical/hash-function-transition.txt +++ b/Documentation/technical/hash-function-transition.txt @@ -59,14 +59,11 @@ that are believed to be cryptographically secure. Goals ----- -Where NewHash is a strong 256-bit hash function to replace SHA-1 (see -"Selection of a New Hash", below): - -1. The transition to NewHash can be done one local repository at a time. +1. The transition to SHA-256 can be done one local repository at a time. a. Requiring no action by any other party. - b. A NewHash repository can communicate with SHA-1 Git servers + b. A SHA-256 repository can communicate with SHA-1 Git servers (push/fetch). - c. Users can use SHA-1 and NewHash identifiers for objects + c. Users can use SHA-1 and SHA-256 identifiers for objects interchangeably (see "Object names on the command line", below). d. 
New signed objects make use of a stronger hash function than SHA-1 for their security guarantees. @@ -79,7 +76,7 @@ Where NewHash is a strong 256-bit hash function to replace SHA-1 (see Non-Goals --------- -1. Add NewHash support to Git protocol. This is valuable and the +1. Add SHA-256 support to Git protocol. This is valuable and the logical next step but it is out of scope for this initial design. 2. Transparently improving the security of existing SHA-1 signed objects. @@ -87,26 +84,26 @@ Non-Goals repository. 4. Taking the opportunity to fix other bugs in Git's formats and protocols. -5. Shallow clones and fetches into a NewHash repository. (This will - change when we add NewHash support to Git protocol.) -6. Skip fetching some submodules of a project into a NewHash - repository. (This also depends on NewHash support in Git +5. Shallow clones and fetches into a SHA-256 repository. (This will + change when we add SHA-256 support to Git protocol.) +6. Skip fetching some submodules of a project into a SHA-256 + repository. (This also depends on SHA-256 support in Git protocol.) Overview -------- We introduce a new repository format extension. Repositories with this -extension enabled use NewHash instead of SHA-1 to name their objects. +extension enabled use SHA-256 instead of SHA-1 to name their objects. This affects both object names and object content --- both the names of objects and all references to other objects within an object are switched to the new hash function. -NewHash repositories cannot be read by older versions of Git. +SHA-256 repositories cannot be read by older versions of Git. -Alongside the packfile, a NewHash repository stores a bidirectional -mapping between NewHash and SHA-1 object names. The mapping is generated +Alongside the packfile, a SHA-256 repository stores a bidirectional +mapping between SHA-256 and SHA-1 object names. The mapping is generated locally and can be verified using "git fsck". 
Object lookups use this -mapping to allow naming objects using either their SHA-1 and NewHash names +mapping to allow naming objects using either their SHA-1 and SHA-256 names interchangeably. "git cat-file" and "git hash-object" gain options to display an object @@ -116,7 +113,7 @@ object database so that they can be named using the appropriate name (using the bidirectional hash mapping). Fetches from a SHA-1 based server convert the fetched objects into -NewHash form and record the mapping in the bidirectional mapping table +SHA-256 form and record the mapping in the bidirectional mapping table (see below for details). Pushes to a SHA-1 based server convert the objects being pushed into sha1 form so the server does not have to be aware of the hash function the client is using. @@ -125,19 +122,19 @@ Detailed Design --------------- Repository format extension ~~~~~~~~~~~~~~~~~~~~~~~~~~~ -A NewHash repository uses repository format version `1` (see +A SHA-256 repository uses repository format version `1` (see Documentation/technical/repository-version.txt) with extensions `objectFormat` and `compatObjectFormat`: [core] repositoryFormatVersion = 1 [extensions] - objectFormat = newhash + objectFormat = sha256 compatObjectFormat = sha1 The combination of setting `core.repositoryFormatVersion=1` and populating `extensions.*` ensures that all versions of Git later than -`v0.99.9l` will die instead of trying to operate on the NewHash +`v0.99.9l` will die instead of trying to operate on the SHA-256 repository, instead producing an error message. # Between v0.99.9l and v2.7.0 @@ -155,36 +152,36 @@ repository extensions. Object names ~~~~~~~~~~~~ Objects can be named by their 40 hexadecimal digit sha1-name or 64 -hexadecimal digit newhash-name, plus names derived from those (see +hexadecimal digit sha256-name, plus names derived from those (see gitrevisions(7)). 
The sha1-name of an object is the SHA-1 of the concatenation of its type, length, a nul byte, and the object's sha1-content. This is the traditional <sha1> used in Git to name objects. -The newhash-name of an object is the NewHash of the concatenation of its -type, length, a nul byte, and the object's newhash-content. +The sha256-name of an object is the SHA-256 of the concatenation of its +type, length, a nul byte, and the object's sha256-content. Object format ~~~~~~~~~~~~~ The content as a byte sequence of a tag, commit, or tree object named -by sha1 and newhash differ because an object named by newhash-name refers to -other objects by their newhash-names and an object named by sha1-name +by sha1 and sha256 differ because an object named by sha256-name refers to +other objects by their sha256-names and an object named by sha1-name refers to other objects by their sha1-names. -The newhash-content of an object is the same as its sha1-content, except -that objects referenced by the object are named using their newhash-names +The sha256-content of an object is the same as its sha1-content, except +that objects referenced by the object are named using their sha256-names instead of sha1-names. Because a blob object does not refer to any -other object, its sha1-content and newhash-content are the same. +other object, its sha1-content and sha256-content are the same. -The format allows round-trip conversion between newhash-content and +The format allows round-trip conversion between sha256-content and sha1-content. Object storage ~~~~~~~~~~~~~~ Loose objects use zlib compression and packed objects use the packed format described in Documentation/technical/pack-format.txt, just like -today. The content that is compressed and stored uses newhash-content +today. The content that is compressed and stored uses sha256-content instead of sha1-content. Pack index @@ -255,10 +252,10 @@ network byte order): up to and not including the table of CRC32 values. 
- Zero or more NUL bytes. - The trailer consists of the following: - - A copy of the 20-byte NewHash checksum at the end of the + - A copy of the 20-byte SHA-256 checksum at the end of the corresponding packfile. - - 20-byte NewHash checksum of all of the above. + - 20-byte SHA-256 checksum of all of the above. Loose object index ~~~~~~~~~~~~~~~~~~ @@ -266,7 +263,7 @@ A new file $GIT_OBJECT_DIR/loose-object-idx contains information about all loose objects. Its format is # loose-object-idx - (newhash-name SP sha1-name LF)* + (sha256-name SP sha1-name LF)* where the object names are in hexadecimal format. The file is not sorted. @@ -292,8 +289,8 @@ To remove entries (e.g. in "git pack-refs" or "git-prune"): Translation table ~~~~~~~~~~~~~~~~~ The index files support a bidirectional mapping between sha1-names -and newhash-names. The lookup proceeds similarly to ordinary object -lookups. For example, to convert a sha1-name to a newhash-name: +and sha256-names. The lookup proceeds similarly to ordinary object +lookups. For example, to convert a sha1-name to a sha256-name: 1. Look for the object in idx files. If a match is present in the idx's sorted list of truncated sha1-names, then: @@ -301,8 +298,8 @@ lookups. For example, to convert a sha1-name to a newhash-name: name order mapping. b. Read the corresponding entry in the full sha1-name table to verify we found the right object. If it is, then - c. Read the corresponding entry in the full newhash-name table. - That is the object's newhash-name. + c. Read the corresponding entry in the full sha256-name table. + That is the object's sha256-name. 2. Check for a loose object. Read lines from loose-object-idx until we find a match. @@ -318,25 +315,25 @@ for all objects in the object store. Reading an object's sha1-content ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -The sha1-content of an object can be read by converting all newhash-names -its newhash-content references to sha1-names using the translation table. 
+The sha1-content of an object can be read by converting all sha256-names +its sha256-content references to sha1-names using the translation table. Fetch ~~~~~ Fetching from a SHA-1 based server requires translating between SHA-1 -and NewHash based representations on the fly. +and SHA-256 based representations on the fly. SHA-1s named in the ref advertisement that are present on the client -can be translated to NewHash and looked up as local objects using the +can be translated to SHA-256 and looked up as local objects using the translation table. Negotiation proceeds as today. Any "have"s generated locally are converted to SHA-1 before being sent to the server, and SHA-1s -mentioned by the server are converted to NewHash when looking them up +mentioned by the server are converted to SHA-256 when looking them up locally. After negotiation, the server sends a packfile containing the -requested objects. We convert the packfile to NewHash format using +requested objects. We convert the packfile to SHA-256 format using the following steps: 1. index-pack: inflate each object in the packfile and compute its @@ -351,12 +348,12 @@ the following steps: (This list only contains objects reachable from the "wants". If the pack from the server contained additional extraneous objects, then they will be discarded.) -3. convert to newhash: open a new (newhash) packfile. Read the topologically +3. convert to sha256: open a new (sha256) packfile. Read the topologically sorted list just generated. For each object, inflate its - sha1-content, convert to newhash-content, and write it to the newhash - pack. Record the new sha1<->newhash mapping entry for use in the idx. + sha1-content, convert to sha256-content, and write it to the sha256 + pack. Record the new sha1<->sha256 mapping entry for use in the idx. 4. sort: reorder entries in the new pack to match the order of objects - in the pack the server generated and include blobs. 
Write a newhash idx + in the pack the server generated and include blobs. Write a sha256 idx file 5. clean up: remove the SHA-1 based pack file, index, and topologically sorted list obtained from the server in steps 1 @@ -388,16 +385,16 @@ send-pack. Signed Commits ~~~~~~~~~~~~~~ -We add a new field "gpgsig-newhash" to the commit object format to allow +We add a new field "gpgsig-sha256" to the commit object format to allow signing commits without relying on SHA-1. It is similar to the -existing "gpgsig" field. Its signed payload is the newhash-content of the -commit object with any "gpgsig" and "gpgsig-newhash" fields removed. +existing "gpgsig" field. Its signed payload is the sha256-content of the +commit object with any "gpgsig" and "gpgsig-sha256" fields removed. This means commits can be signed 1. using SHA-1 only, as in existing signed commit objects -2. using both SHA-1 and NewHash, by using both gpgsig-newhash and gpgsig +2. using both SHA-1 and SHA-256, by using both gpgsig-sha256 and gpgsig fields. -3. using only NewHash, by only using the gpgsig-newhash field. +3. using only SHA-256, by only using the gpgsig-sha256 field. Old versions of "git verify-commit" can verify the gpgsig signature in cases (1) and (2) without modifications and view case (3) as an @@ -405,24 +402,24 @@ ordinary unsigned commit. Signed Tags ~~~~~~~~~~~ -We add a new field "gpgsig-newhash" to the tag object format to allow +We add a new field "gpgsig-sha256" to the tag object format to allow signing tags without relying on SHA-1. Its signed payload is the -newhash-content of the tag with its gpgsig-newhash field and "-----BEGIN PGP +sha256-content of the tag with its gpgsig-sha256 field and "-----BEGIN PGP SIGNATURE-----" delimited in-body signature removed. This means tags can be signed 1. using SHA-1 only, as in existing signed tag objects -2. using both SHA-1 and NewHash, by using gpgsig-newhash and an in-body +2. 
using both SHA-1 and SHA-256, by using gpgsig-sha256 and an in-body signature. -3. using only NewHash, by only using the gpgsig-newhash field. +3. using only SHA-256, by only using the gpgsig-sha256 field. Mergetag embedding ~~~~~~~~~~~~~~~~~~ The mergetag field in the sha1-content of a commit contains the sha1-content of a tag that was merged by that commit. -The mergetag field in the newhash-content of the same commit contains the -newhash-content of the same tag. +The mergetag field in the sha256-content of the same commit contains the +sha256-content of the same tag. Submodules ~~~~~~~~~~ @@ -497,7 +494,7 @@ Caveats ------- Invalid objects ~~~~~~~~~~~~~~~ -The conversion from sha1-content to newhash-content retains any +The conversion from sha1-content to sha256-content retains any brokenness in the original object (e.g., tree entry modes encoded with leading 0, tree objects whose paths are not sorted correctly, and commit objects without an author or committer). This is a deliberate @@ -516,7 +513,7 @@ allow lifting this restriction. Alternates ~~~~~~~~~~ -For the same reason, a newhash repository cannot borrow objects from a +For the same reason, a sha256 repository cannot borrow objects from a sha1 repository using objects/info/alternates or $GIT_ALTERNATE_OBJECT_REPOSITORIES. @@ -524,20 +521,20 @@ git notes ~~~~~~~~~ The "git notes" tool annotates objects using their sha1-name as key. This design does not describe a way to migrate notes trees to use -newhash-names. That migration is expected to happen separately (for +sha256-names. That migration is expected to happen separately (for example using a file at the root of the notes tree to describe which hash it uses). Server-side cost ~~~~~~~~~~~~~~~~ -Until Git protocol gains NewHash support, using NewHash based storage +Until Git protocol gains SHA-256 support, using SHA-256 based storage on public-facing Git servers is strongly discouraged. 
Once Git -protocol gains NewHash support, NewHash based servers are likely not +protocol gains SHA-256 support, SHA-256 based servers are likely not to support SHA-1 compatibility, to avoid what may be a very expensive hash reencode during clone and to encourage peers to modernize. The design described here allows fetches by SHA-1 clients of a -personal NewHash repository because it's not much more difficult than +personal SHA-256 repository because it's not much more difficult than allowing pushes from that repository. This support needs to be guarded by a configuration option --- servers like git.kernel.org that serve a large number of clients would not be expected to bear that cost. @@ -547,7 +544,7 @@ Meaning of signatures The signed payload for signed commits and tags does not explicitly name the hash used to identify objects. If some day Git adopts a new hash function with the same length as the current SHA-1 (40 -hexadecimal digit) or NewHash (64 hexadecimal digit) objects then the +hexadecimal digit) or SHA-256 (64 hexadecimal digit) objects then the intent behind the PGP signed payload in an object signature is unclear: @@ -562,7 +559,7 @@ Does this mean Git v2.12.0 is the commit with sha1-name e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7 or the commit with new-40-digit-hash-name e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7? -Fortunately NewHash and SHA-1 have different lengths. If Git starts +Fortunately SHA-256 and SHA-1 have different lengths. If Git starts using another hash with the same length to name objects, then it will need to change the format of signed payloads using that hash to address this issue. @@ -574,24 +571,24 @@ supports four different modes of operation: 1. ("dark launch") Treat object names input by the user as SHA-1 and convert any object names written to output to SHA-1, but store - objects using NewHash. This allows users to test the code with no + objects using SHA-256. 
This allows users to test the code with no visible behavior change except for performance. This allows allows running even tests that assume the SHA-1 hash function, to sanity-check the behavior of the new mode. - 2. ("early transition") Allow both SHA-1 and NewHash object names in + 2. ("early transition") Allow both SHA-1 and SHA-256 object names in input. Any object names written to output use SHA-1. This allows users to continue to make use of SHA-1 to communicate with peers (e.g. by email) that have not migrated yet and prepares for mode 3. - 3. ("late transition") Allow both SHA-1 and NewHash object names in - input. Any object names written to output use NewHash. In this + 3. ("late transition") Allow both SHA-1 and SHA-256 object names in + input. Any object names written to output use SHA-256. In this mode, users are using a more secure object naming method by default. The disruption is minimal as long as most of their peers are in mode 2 or mode 3. 4. ("post-transition") Treat object names input by the user as - NewHash and write output using NewHash. This is safer than mode 3 + SHA-256 and write output using SHA-256. This is safer than mode 3 because there is less risk that input is incorrectly interpreted using the wrong hash function. @@ -601,18 +598,22 @@ The user can also explicitly specify which format to use for a particular revision specifier and for output, overriding the mode. For example: -git --output-format=sha1 log abac87a^{sha1}..f787cac^{newhash} +git --output-format=sha1 log abac87a^{sha1}..f787cac^{sha256} -Selection of a New Hash ------------------------ +Choice of Hash +-------------- In early 2005, around the time that Git was written, Xiaoyun Wang, Yiqun Lisa Yin, and Hongbo Yu announced an attack finding SHA-1 collisions in 2^69 operations. In August they published details. Luckily, no practical demonstrations of a collision in full SHA-1 were published until 10 years later, in 2017. 
-The hash function NewHash to replace SHA-1 should be stronger than -SHA-1 was: we would like it to be trustworthy and useful in practice +Git v2.13.0 and later subsequently moved to a hardened SHA-1 +implementation by default that mitigates the SHAttered attack, but +SHA-1 is still believed to be weak. + +The hash to replace this hardened SHA-1 should be stronger than SHA-1 +was: we would like it to be trustworthy and useful in practice for at least 10 years. Some other relevant properties: @@ -620,8 +621,8 @@ Some other relevant properties: 1. A 256-bit hash (long enough to match common security practice; not excessively long to hurt performance and disk usage). -2. High quality implementations should be widely available (e.g. in - OpenSSL). +2. High quality implementations should be widely available (e.g., in + OpenSSL and Apple CommonCrypto). 3. The hash function's properties should match Git's needs (e.g. Git requires collision and 2nd preimage resistance and does not require @@ -630,14 +631,15 @@ Some other relevant properties: 4. As a tiebreaker, the hash should be fast to compute (fortunately many contenders are faster than SHA-1). -Some hashes under consideration are SHA-256, SHA-512/256, SHA-256x16, -K12, and BLAKE2bp-256. +We choose SHA-256. 
See the thread starting at +<20180609224913.GC38834@genre.crustytoothpaste.net> for the discussion +(https://public-inbox.org/git/20180609224913.GC38834@genre.crustytoothpaste.net/) Transition plan --------------- Some initial steps can be implemented independently of one another: - adding a hash function API (vtable) -- teaching fsck to tolerate the gpgsig-newhash field +- teaching fsck to tolerate the gpgsig-sha256 field - excluding gpgsig-* from the fields copied by "git commit --amend" - annotating tests that depend on SHA-1 values with a SHA1 test prerequisite @@ -664,7 +666,7 @@ Next comes introduction of compatObjectFormat: - adding appropriate index entries when adding a new object to the object store - --output-format option -- ^{sha1} and ^{newhash} revision notation +- ^{sha1} and ^{sha256} revision notation - configuration to specify default input and output format (see "Object names on the command line" above) @@ -672,7 +674,7 @@ The next step is supporting fetches and pushes to SHA-1 repositories: - allow pushes to a repository using the compat format - generate a topologically sorted list of the SHA-1 names of fetched objects -- convert the fetched packfile to newhash format and generate an idx +- convert the fetched packfile to sha256 format and generate an idx file - re-sort to match the order of objects in the fetched packfile @@ -680,30 +682,30 @@ The infrastructure supporting fetch also allows converting an existing repository. In converted repositories and new clones, end users can gain support for the new hash function without any visible change in behavior (see "dark launch" in the "Object names on the command line" -section). In particular this allows users to verify NewHash signatures +section). In particular this allows users to verify SHA-256 signatures on objects in the repository, and it should ensure the transition code is stable in production in preparation for using it more widely. 
Over time projects would encourage their users to adopt the "early transition" and then "late transition" modes to take advantage of the -new, more futureproof NewHash object names. +new, more futureproof SHA-256 object names. When objectFormat and compatObjectFormat are both set, commands -generating signatures would generate both SHA-1 and NewHash signatures +generating signatures would generate both SHA-1 and SHA-256 signatures by default to support both new and old users. -In projects using NewHash heavily, users could be encouraged to adopt +In projects using SHA-256 heavily, users could be encouraged to adopt the "post-transition" mode to avoid accidentally making implicit use of SHA-1 object names. Once a critical mass of users have upgraded to a version of Git that -can verify NewHash signatures and have converted their existing +can verify SHA-256 signatures and have converted their existing repositories to support verifying them, we can add support for a -setting to generate only NewHash signatures. This is expected to be at +setting to generate only SHA-256 signatures. This is expected to be at least a year later. That is also a good moment to advertise the ability to convert -repositories to use NewHash only, stripping out all SHA-1 related +repositories to use SHA-256 only, stripping out all SHA-1 related metadata. This improves performance by eliminating translation overhead and security by avoiding the possibility of accidentally relying on the safety of SHA-1. @@ -742,16 +744,16 @@ using the old hash function. 
Signed objects with multiple hashes ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Instead of introducing the gpgsig-newhash field in commit and tag objects -for newhash-content based signatures, an earlier version of this design -added "hash newhash <newhash-name>" fields to strengthen the existing +Instead of introducing the gpgsig-sha256 field in commit and tag objects +for sha256-content based signatures, an earlier version of this design +added "hash sha256 <sha256-name>" fields to strengthen the existing sha1-content based signatures. In other words, a single signature was used to attest to the object content using both hash functions. This had some advantages: * Using one signature instead of two speeds up the signing process. * Having one signed payload with both hashes allows the signer to - attest to the sha1-name and newhash-name referring to the same object. + attest to the sha1-name and sha256-name referring to the same object. * All users consume the same signature. Broken signatures are likely to be detected quickly using current versions of git. @@ -760,11 +762,11 @@ However, it also came with disadvantages: objects it references, even after the transition is complete and translation table is no longer needed for anything else. To support this, the design added fields such as "hash sha1 tree <sha1-name>" - and "hash sha1 parent <sha1-name>" to the newhash-content of a signed + and "hash sha1 parent <sha1-name>" to the sha256-content of a signed commit, complicating the conversion process. * Allowing signed objects without a sha1 (for after the transition is complete) complicated the design further, requiring a "nohash sha1" - field to suppress including "hash sha1" fields in the newhash-content + field to suppress including "hash sha1" fields in the sha256-content and signed payload. 
Lazily populated translation table @@ -772,7 +774,7 @@ Lazily populated translation table Some of the work of building the translation table could be deferred to push time, but that would significantly complicate and slow down pushes. Calculating the sha1-name at object creation time at the same time it is -being streamed to disk and having its newhash-name calculated should be +being streamed to disk and having its sha256-name calculated should be an acceptable cost. Document History -- 2.18.0.345.g5c9ce644c3 ^ permalink raw reply related [flat|nested] 66+ messages in thread
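The loose-object-idx format and the bidirectional translation table it backs, described in the patch above, are simple enough to sketch. The following is an illustrative Python sketch only (git's real implementation is C inside git.git; the function name here is made up):

```python
# Illustrative sketch, not git.git code: the loose-object-idx file
# described above is "(sha256-name SP sha1-name LF)*", unsorted, and
# backs a bidirectional sha1 <-> sha256 translation table.

def parse_loose_object_idx(text):
    """Parse loose-object-idx contents into two lookup dicts."""
    sha256_to_sha1 = {}
    sha1_to_sha256 = {}
    for line in text.splitlines():
        if not line:
            continue
        sha256_name, sha1_name = line.split(" ")
        sha256_to_sha1[sha256_name] = sha1_name
        sha1_to_sha256[sha1_name] = sha256_name
    return sha256_to_sha1, sha1_to_sha256


# Synthetic object names with the right lengths: 64 hex digits for a
# sha256-name, 40 for a sha1-name.
idx = ("ab" * 32) + " " + ("cd" * 20) + "\n"
to_sha1, to_sha256 = parse_loose_object_idx(idx)
assert to_sha1["ab" * 32] == "cd" * 20
assert to_sha256["cd" * 20] == "ab" * 32
```

Converting in either direction is then a dictionary lookup, mirroring the "read lines from loose-object-idx until we find a match" step for loose objects.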
* Re: [PATCH v2 2/2] doc hash-function-transition: pick SHA-256 as NewHash 2018-07-26 13:41 ` [PATCH v2 " Ævar Arnfjörð Bjarmason @ 2018-08-03 7:20 ` Jonathan Nieder 2018-08-03 16:40 ` Junio C Hamano ` (3 more replies) 0 siblings, 4 replies; 66+ messages in thread From: Jonathan Nieder @ 2018-08-03 7:20 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: git, Junio C Hamano, Linus Torvalds, Edward Thomson, brian m . carlson, Johannes Schindelin, demerphq, Adam Langley, keccak Hi again, Sorry for the slow review. I finally got a chance to look this over again. My main nits are about the commit message: I think it still focuses too much on the process instead of the usual "knowing what I know now, here's the clearest explanation for why we need this patch" approach. I can send a patch illustrating what I mean tomorrow morning. Ævar Arnfjörð Bjarmason wrote: > From a security perspective, it seems that SHA-256, BLAKE2, SHA3-256, > K12, and so on are all believed to have similar security properties. > All are good options from a security point of view. > > SHA-256 has a number of advantages: > > * It has been around for a while, is widely used, and is supported by > just about every single crypto library (OpenSSL, mbedTLS, CryptoNG, > SecureTransport, etc). > > * When you compare against SHA1DC, most vectorized SHA-256 > implementations are indeed faster, even without acceleration. > > * If we're doing signatures with OpenPGP (or even, I suppose, CMS), > we're going to be using SHA-2, so it doesn't make sense to have our > security depend on two separate algorithms when either one of them > alone could break the security when we could just depend on one. > > So SHA-256 it is. The above is what I wrote, so of course I'd like it. ;-) > See the "Hash algorithm analysis" thread as of > [1]. Linus has come around to this choice and suggested Junio make the > final pick, and he's endorsed SHA-256 [3]. 
The above paragraph has the same problem as before of (1) not being self-contained and (2) focusing on the process that led to this patch instead of the benefit of the patch itself. I think we should omit it. > This follow-up change changes occurrences of "NewHash" to > "SHA-256" (or "sha256", depending on the context). The "Selection of a > New Hash" section has also been changed to note that historically we > used the the "NewHash" name while we didn't know what the new hash > function would be. nit: Commit messages are usually in the imperative tense. This is in the past tense, I think because it is a continuation of that discussion about process. For this part, I think we can let the patch speak for itself. > This leaves no use of "NewHash" anywhere in git.git except in the > aforementioned section (and as a variable name in t/t9700/test.pl, but > that use from 2008 has nothing to do with this transition plan). This part is helpful --- good. > 1. https://public-inbox.org/git/20180720215220.GB18502@genre.crustytoothpaste.net/ > 2. https://public-inbox.org/git/CA+55aFwSe9BF8e0hLk9pp3FVD5LaVY5GRdsV3fbNtgzekJadyA@mail.gmail.com/ > 3. https://public-inbox.org/git/xmqqzhygwd5o.fsf@gitster-ct.c.googlers.com/ Footnotes to the historical part --- I'd recommend removing these. > Helped-by: Jonathan Nieder <jrnieder@gmail.com> > Helped-by: Junio C Hamano <gitster@pobox.com> > Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Here I'd want to put a pile of acks, e.g.: Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: brian m. carlson <sandals@crustytoothpaste.net> Acked-by: Johannes Schindelin <Johannes.Schindelin@gmx.de> Acked-by: Dan Shumow <danshu@microsoft.com> Acked-by: Junio C Hamano <gitster@pobox.com> [...] > --- a/Documentation/technical/hash-function-transition.txt > +++ b/Documentation/technical/hash-function-transition.txt > @@ -59,14 +59,11 @@ that are believed to be cryptographically secure. 
> > Goals > ----- > -Where NewHash is a strong 256-bit hash function to replace SHA-1 (see > -"Selection of a New Hash", below): > - > -1. The transition to NewHash can be done one local repository at a time. > +1. The transition to SHA-256 can be done one local repository at a time. Yay! [...] > [extensions] > - objectFormat = newhash > + objectFormat = sha256 > compatObjectFormat = sha1 Yes, makes sense. [...] > @@ -155,36 +152,36 @@ repository extensions. > Object names > ~~~~~~~~~~~~ > Objects can be named by their 40 hexadecimal digit sha1-name or 64 > -hexadecimal digit newhash-name, plus names derived from those (see > +hexadecimal digit sha256-name, plus names derived from those (see > gitrevisions(7)). > > The sha1-name of an object is the SHA-1 of the concatenation of its > type, length, a nul byte, and the object's sha1-content. This is the > traditional <sha1> used in Git to name objects. > > -The newhash-name of an object is the NewHash of the concatenation of its > -type, length, a nul byte, and the object's newhash-content. > +The sha256-name of an object is the SHA-256 of the concatenation of its > +type, length, a nul byte, and the object's sha256-content. Sensible. [...] > > Object format > ~~~~~~~~~~~~~ > The content as a byte sequence of a tag, commit, or tree object named > -by sha1 and newhash differ because an object named by newhash-name refers to > +by sha1 and sha256 differ because an object named by sha256-name refers to Not about this patch: this should say SHA-1 and SHA-256, I think. Leaving it as is in this patch as you did is the right thing. [...] > @@ -255,10 +252,10 @@ network byte order): > up to and not including the table of CRC32 values. > - Zero or more NUL bytes. > - The trailer consists of the following: > - - A copy of the 20-byte NewHash checksum at the end of the > + - A copy of the 20-byte SHA-256 checksum at the end of the Not about this patch: a SHA-256 is 32 bytes. 
Leaving that for a separate patch as you did is the right thing, though. [...] > - - 20-byte NewHash checksum of all of the above. > + - 20-byte SHA-256 checksum of all of the above. Likewise. [...] > @@ -351,12 +348,12 @@ the following steps: > (This list only contains objects reachable from the "wants". If the > pack from the server contained additional extraneous objects, then > they will be discarded.) > -3. convert to newhash: open a new (newhash) packfile. Read the topologically > +3. convert to sha256: open a new (sha256) packfile. Read the topologically Not about this patch: this one's more murky, since it's talking about the object names instead of the hash function. I guess "sha256" instead of "SHA-256" for this could be right, but I worry it's going to take time for me to figure out the exact distinction. That seems like a reason to just call it SHA-256 (but in a separate patch). [...] > - sha1-content, convert to newhash-content, and write it to the newhash > - pack. Record the new sha1<->newhash mapping entry for use in the idx. > + sha1-content, convert to sha256-content, and write it to the sha256 > + pack. Record the new sha1<->sha256 mapping entry for use in the idx. > 4. sort: reorder entries in the new pack to match the order of objects > - in the pack the server generated and include blobs. Write a newhash idx > + in the pack the server generated and include blobs. Write a sha256 idx > file Likewise. [...] > @@ -388,16 +385,16 @@ send-pack. > > Signed Commits > ~~~~~~~~~~~~~~ > -We add a new field "gpgsig-newhash" to the commit object format to allow > +We add a new field "gpgsig-sha256" to the commit object format to allow > signing commits without relying on SHA-1. It is similar to the > -existing "gpgsig" field. Its signed payload is the newhash-content of the > -commit object with any "gpgsig" and "gpgsig-newhash" fields removed. > +existing "gpgsig" field. 
Its signed payload is the sha256-content of the > +commit object with any "gpgsig" and "gpgsig-sha256" fields removed. That reminds me --- we need to add support for stripping these out. [...] > @@ -601,18 +598,22 @@ The user can also explicitly specify which format to use for a > particular revision specifier and for output, overriding the mode. For > example: > > -git --output-format=sha1 log abac87a^{sha1}..f787cac^{newhash} > +git --output-format=sha1 log abac87a^{sha1}..f787cac^{sha256} > > -Selection of a New Hash > ------------------------ > +Choice of Hash > +-------------- Yay! [...] > -Some hashes under consideration are SHA-256, SHA-512/256, SHA-256x16, > -K12, and BLAKE2bp-256. > +We choose SHA-256. See the thread starting at > +<20180609224913.GC38834@genre.crustytoothpaste.net> for the discussion > +(https://public-inbox.org/git/20180609224913.GC38834@genre.crustytoothpaste.net/) Can this reference be moved to a footnote? It's not part of the design, but it's a good reference. Thanks again for getting this documented. Sincerely, Jonathan ^ permalink raw reply [flat|nested] 66+ messages in thread
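The signed-payload rule quoted above (the commit content with any "gpgsig" and "gpgsig-sha256" fields removed) can be sketched as follows. This is an illustrative Python sketch with a made-up function name and a synthetic commit, not git.git code:

```python
# Illustrative sketch, not git.git code: compute the signed payload of
# a commit by dropping the "gpgsig" and "gpgsig-sha256" header fields.
# In commit objects, a multi-line header value continues on lines that
# begin with a single space.

def strip_gpgsig_fields(commit: str) -> str:
    headers, sep, message = commit.partition("\n\n")
    kept = []
    skipping = False
    for line in headers.split("\n"):
        if line.startswith(" "):  # continuation of the previous field
            if not skipping:
                kept.append(line)
            continue
        skipping = line.startswith(("gpgsig ", "gpgsig-sha256 "))
        if not skipping:
            kept.append(line)
    return "\n".join(kept) + sep + message


commit = (
    "tree " + "a" * 40 + "\n"
    "author A U Thor <author@example.com> 1533279600 +0000\n"
    "committer C O Mitter <committer@example.com> 1533279600 +0000\n"
    "gpgsig -----BEGIN PGP SIGNATURE-----\n"
    " (signature lines)\n"
    " -----END PGP SIGNATURE-----\n"
    "gpgsig-sha256 -----BEGIN PGP SIGNATURE-----\n"
    " (signature lines)\n"
    " -----END PGP SIGNATURE-----\n"
    "\n"
    "Subject line\n"
)
payload = strip_gpgsig_fields(commit)
assert "gpgsig" not in payload
assert payload.startswith("tree ") and payload.endswith("\n\nSubject line\n")
```

The same stripping would apply whether one, both, or neither signature field is present, which is what lets old and new verifiers agree on the payload.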
* Re: [PATCH v2 2/2] doc hash-function-transition: pick SHA-256 as NewHash 2018-08-03 7:20 ` Jonathan Nieder @ 2018-08-03 16:40 ` Junio C Hamano 2018-08-03 17:01 ` Linus Torvalds 2018-08-03 16:42 ` Linus Torvalds ` (2 subsequent siblings) 3 siblings, 1 reply; 66+ messages in thread From: Junio C Hamano @ 2018-08-03 16:40 UTC (permalink / raw) To: Jonathan Nieder Cc: Ævar Arnfjörð Bjarmason, git, Linus Torvalds, Edward Thomson, brian m . carlson, Johannes Schindelin, demerphq, Adam Langley, keccak Jonathan Nieder <jrnieder@gmail.com> writes: > Sorry for the slow review. I finally got a chance to look this over > again. > > My main nits are about the commit message: I think it still focuses > too much on the process instead of the usual "knowing what I know now, > here's the clearest explanation for why we need this patch" approach. > I can send a patch illustrating what I mean tomorrow morning. I'll turn 'Will merge to next' to 'Hold' for now. >> Object format >> ~~~~~~~~~~~~~ >> The content as a byte sequence of a tag, commit, or tree object named >> -by sha1 and newhash differ because an object named by newhash-name refers to >> +by sha1 and sha256 differ because an object named by sha256-name refers to > > Not about this patch: this should say SHA-1 and SHA-256, I think. > Leaving it as is in this patch as you did is the right thing. Perhaps deserves a "NEEDSWORK" comment, though. > [...] >> @@ -255,10 +252,10 @@ network byte order): >> up to and not including the table of CRC32 values. >> - Zero or more NUL bytes. >> - The trailer consists of the following: >> - - A copy of the 20-byte NewHash checksum at the end of the >> + - A copy of the 20-byte SHA-256 checksum at the end of the > > Not about this patch: a SHA-256 is 32 bytes. Leaving that for a > separate patch as you did is the right thing, though. > > [...] >> - - 20-byte NewHash checksum of all of the above. >> + - 20-byte SHA-256 checksum of all of the above. > > Likewise. 
Hmph, I've always assumed since the NewHash plan was written that this part was not about tamper resistance but was about bit-flipping detection, and it was deliberate to stick to a 20-byte sum, truncating as necessary. It definitely is a good idea to leave it for a separate patch to update this part. Thanks. ^ permalink raw reply [flat|nested] 66+ messages in thread
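Junio's reading, that the trailer is a bit-flip detector rather than tamper protection, is compatible with truncating a longer hash to 20 bytes. A hedged sketch of that idea in Python (illustrative only; not how git actually writes idx trailers):

```python
import hashlib

# Illustrative sketch (not git's idx-writing code): a 20-byte trailer
# checksum obtained by truncating SHA-256, matching the reading that
# the trailer guards against bit flips, not tampering.

def trailer_checksum(payload: bytes, length: int = 20) -> bytes:
    return hashlib.sha256(payload).digest()[:length]

def verify_trailer(payload: bytes, trailer: bytes) -> bool:
    return trailer_checksum(payload, len(trailer)) == trailer

data = b"idx file contents up to, but not including, the trailer"
trailer = trailer_checksum(data)
assert len(trailer) == 20
assert verify_trailer(data, trailer)
assert not verify_trailer(data + b"\x00", trailer)  # corruption detected
```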
* Re: [PATCH v2 2/2] doc hash-function-transition: pick SHA-256 as NewHash 2018-08-03 16:40 ` Junio C Hamano @ 2018-08-03 17:01 ` Linus Torvalds 0 siblings, 0 replies; 66+ messages in thread From: Linus Torvalds @ 2018-08-03 17:01 UTC (permalink / raw) To: Junio C Hamano Cc: Jonathan Nieder, Ævar Arnfjörð Bjarmason, Git Mailing List, Edward Thomson, brian m. carlson, Johannes Schindelin, demerphq, Adam Langley, keccak On Fri, Aug 3, 2018 at 9:40 AM Junio C Hamano <gitster@pobox.com> wrote: > > > [...] > >> - - 20-byte NewHash checksum of all of the above. > >> + - 20-byte SHA-256 checksum of all of the above. > > > > Likewise. > > Hmph, I've always assumed since NewHash plan was written that this > part was not about tamper resistance but was about bit-flipping > detection and it was deliberate to stick to 20-byte sum, truncating > as necessary. Yeah, in fact, since this was one area where people actually had performance issues with the hash, it might be worth considering _weakening_ the hash. Things like the pack index files (and just the regular file index) are entirely local, and the SHA1 in those is really just a fancy CRC. It's really just "good protection against disk corruption" (it happens), and in fact you cannot use it as protection against active tampering, since there's no secret there and any active attacker that has access to your local filesystem could just recompute the hash anyway. And because they are local anyway and aren't really transported (modulo "shared repositories" where you use them across users or legacy rsync-like synchronization), they can be handled separately from any hashing changes. The pack and index file formats have in fact been changed before. It might make sense to either keep it as SHA1 (just to minimize any changes) or if there are still issues with index file performance it could even be made to use something fast-but-not-cryptographic like just making it use crc32(). 
Remember: one of the original core git design requirements was "corruption detection". For normal hashed objects, that came naturally, and the normal object store additionally has active tamper protection thanks to the interconnected nature of the hashes and the distribution of the objects. But for things like packfiles and the file index, it is just a separate checksum. There is no tamper protection anyway, since anybody who can modify them directly can just recompute the hash checksum. The fact that both of these things used SHA1 was more of a convenience issue than anything else, and the pack/index file checksum is fundamentally not cryptographic (but a cryptographic hash is obviously by definition also a very good corruption detector). Linus ^ permalink raw reply [flat|nested] 66+ messages in thread
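Linus's distinction, that a purely local file needs corruption detection rather than tamper resistance, so even a plain CRC would do, can be shown directly. An illustrative Python sketch:

```python
import hashlib
import zlib

# Illustrative sketch of the point above: for a purely local file the
# trailer is corruption detection, not tamper protection, so even a
# plain CRC does the job of catching accidental bit flips. Neither
# checksum stops an attacker with filesystem access, who can simply
# recompute and rewrite the trailer.

data = bytearray(b"packfile bytes as written to local disk")
crc = zlib.crc32(bytes(data))
digest = hashlib.sha256(bytes(data)).digest()

data[3] ^= 0x01  # simulate a single flipped bit (disk corruption)

assert zlib.crc32(bytes(data)) != crc                  # CRC catches it
assert hashlib.sha256(bytes(data)).digest() != digest  # so does SHA-256
```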
* Re: [PATCH v2 2/2] doc hash-function-transition: pick SHA-256 as NewHash 2018-08-03 7:20 ` Jonathan Nieder 2018-08-03 16:40 ` Junio C Hamano @ 2018-08-03 16:42 ` Linus Torvalds 2018-08-03 17:43 ` Ævar Arnfjörð Bjarmason 2018-08-03 17:45 ` brian m. carlson 3 siblings, 0 replies; 66+ messages in thread From: Linus Torvalds @ 2018-08-03 16:42 UTC (permalink / raw) To: Jonathan Nieder Cc: Ævar Arnfjörð Bjarmason, Git Mailing List, Junio C Hamano, Edward Thomson, brian m. carlson, Johannes Schindelin, demerphq, Adam Langley, keccak On Fri, Aug 3, 2018 at 12:20 AM Jonathan Nieder <jrnieder@gmail.com> wrote: > > > Here I'd want to put a pile of acks, e.g.: > > Acked-by: Linus Torvalds <torvalds@linux-foundation.org> > .. Sure, feel free to add my Ack for this. Linus ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH v2 2/2] doc hash-function-transition: pick SHA-256 as NewHash 2018-08-03 7:20 ` Jonathan Nieder 2018-08-03 16:40 ` Junio C Hamano 2018-08-03 16:42 ` Linus Torvalds @ 2018-08-03 17:43 ` Ævar Arnfjörð Bjarmason 2018-08-04 8:52 ` Jonathan Nieder 2018-08-03 17:45 ` brian m. carlson 3 siblings, 1 reply; 66+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2018-08-03 17:43 UTC (permalink / raw) To: Jonathan Nieder Cc: git, Junio C Hamano, Linus Torvalds, Edward Thomson, brian m . carlson, Johannes Schindelin, demerphq, Adam Langley, keccak On Fri, Aug 03 2018, Jonathan Nieder wrote: > Hi again, > > Sorry for the slow review. I finally got a chance to look this over > again. > > My main nits are about the commit message: I think it still focuses > too much on the process instead of the usual "knowing what I know now, > here's the clearest explanation for why we need this patch" approach. > I can send a patch illustrating what I mean tomorrow morning. I think it makes sense if you just take over 2/2 of this series (or even the whole thing), since the meat of it is already something I copy/pasted from you, and you've got more of an opinion / idea about how to proceed (which is good!); it's more efficient than me trying to fix various stuff you're pointing out at this point. I also think it makes sense that you change the "Author" line for that, since the rest of the changes will mainly be search-replace by me. Perhaps it's better for readability if those search-replace changes go into their own change, i.e. make it a three-part where 2/3 does the search-replace, and promises that 3/3 has the full rationale etc. > Ævar Arnfjörð Bjarmason wrote: > >> From a security perspective, it seems that SHA-256, BLAKE2, SHA3-256, >> K12, and so on are all believed to have similar security properties. >> All are good options from a security point of view.
>> >> SHA-256 has a number of advantages: >> >> * It has been around for a while, is widely used, and is supported by >> just about every single crypto library (OpenSSL, mbedTLS, CryptoNG, >> SecureTransport, etc). >> >> * When you compare against SHA1DC, most vectorized SHA-256 >> implementations are indeed faster, even without acceleration. >> >> * If we're doing signatures with OpenPGP (or even, I suppose, CMS), >> we're going to be using SHA-2, so it doesn't make sense to have our >> security depend on two separate algorithms when either one of them >> alone could break the security when we could just depend on one. >> >> So SHA-256 it is. > > The above is what I wrote, so of course I'd like it. ;-) > >> See the "Hash algorithm analysis" thread as of >> [1]. Linus has come around to this choice and suggested Junio make the >> final pick, and he's endorsed SHA-256 [3]. > > The above paragraph has the same problem as before of (1) not being > self-contained and (2) focusing on the process that led to this patch > instead of the benefit of the patch itself. I think we should omit it. > >> This follow-up change changes occurrences of "NewHash" to >> "SHA-256" (or "sha256", depending on the context). The "Selection of a >> New Hash" section has also been changed to note that historically we >> used the the "NewHash" name while we didn't know what the new hash >> function would be. > > nit: Commit messages are usually in the imperative tense. This is in > the past tense, I think because it is a continuation of that > discussion about process. > > For this part, I think we can let the patch speak for itself. > >> This leaves no use of "NewHash" anywhere in git.git except in the >> aforementioned section (and as a variable name in t/t9700/test.pl, but >> that use from 2008 has nothing to do with this transition plan). > > This part is helpful --- good. > >> 1. https://public-inbox.org/git/20180720215220.GB18502@genre.crustytoothpaste.net/ >> 2. 
https://public-inbox.org/git/CA+55aFwSe9BF8e0hLk9pp3FVD5LaVY5GRdsV3fbNtgzekJadyA@mail.gmail.com/ >> 3. https://public-inbox.org/git/xmqqzhygwd5o.fsf@gitster-ct.c.googlers.com/ > > Footnotes to the historical part --- I'd recommend removing these. > >> Helped-by: Jonathan Nieder <jrnieder@gmail.com> >> Helped-by: Junio C Hamano <gitster@pobox.com> >> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> > > Here I'd want to put a pile of acks, e.g.: > > Acked-by: Linus Torvalds <torvalds@linux-foundation.org> > Acked-by: brian m. carlson <sandals@crustytoothpaste.net> > Acked-by: Johannes Schindelin <Johannes.Schindelin@gmx.de> > Acked-by: Dan Shumow <danshu@microsoft.com> > Acked-by: Junio C Hamano <gitster@pobox.com> > > [...] >> --- a/Documentation/technical/hash-function-transition.txt >> +++ b/Documentation/technical/hash-function-transition.txt >> @@ -59,14 +59,11 @@ that are believed to be cryptographically secure. >> >> Goals >> ----- >> -Where NewHash is a strong 256-bit hash function to replace SHA-1 (see >> -"Selection of a New Hash", below): >> - >> -1. The transition to NewHash can be done one local repository at a time. >> +1. The transition to SHA-256 can be done one local repository at a time. > > Yay! > > [...] >> [extensions] >> - objectFormat = newhash >> + objectFormat = sha256 >> compatObjectFormat = sha1 > > Yes, makes sense. > > [...] >> @@ -155,36 +152,36 @@ repository extensions. >> Object names >> ~~~~~~~~~~~~ >> Objects can be named by their 40 hexadecimal digit sha1-name or 64 >> -hexadecimal digit newhash-name, plus names derived from those (see >> +hexadecimal digit sha256-name, plus names derived from those (see >> gitrevisions(7)). >> >> The sha1-name of an object is the SHA-1 of the concatenation of its >> type, length, a nul byte, and the object's sha1-content. This is the >> traditional <sha1> used in Git to name objects. 
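[Editorial note: the sha1-name/sha256-name construction quoted above uses the same framing either way — the hash of "<type> SP <length> NUL <content>". A minimal sketch with Python's hashlib standing in for Git's own hashing code, which also shows the 20-byte vs 32-byte digest sizes discussed further down:]

```python
import hashlib

def git_object_name(obj_type: str, content: bytes, algo: str = "sha1") -> str:
    """Name an object as the design describes: hash of
    "<type> <length>\\0" followed by the object content."""
    header = f"{obj_type} {len(content)}".encode() + b"\0"
    return hashlib.new(algo, header + content).hexdigest()

# The well-known empty-blob SHA-1 name, and its 64-hex-digit SHA-256 name.
print(git_object_name("blob", b""))   # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(git_object_name("blob", b"", "sha256"))
print(hashlib.sha1().digest_size, hashlib.sha256().digest_size)  # 20 32
```

Note that blobs are the only object type where this works on the sha1-content unchanged; for trees, commits, and tags the sha256-content first has its embedded object names rewritten, as the quoted text explains.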
>> >> -The newhash-name of an object is the NewHash of the concatenation of its >> -type, length, a nul byte, and the object's newhash-content. >> +The sha256-name of an object is the SHA-256 of the concatenation of its >> +type, length, a nul byte, and the object's sha256-content. > > Sensible. > > [...] >> >> Object format >> ~~~~~~~~~~~~~ >> The content as a byte sequence of a tag, commit, or tree object named >> -by sha1 and newhash differ because an object named by newhash-name refers to >> +by sha1 and sha256 differ because an object named by sha256-name refers to > > Not about this patch: this should say SHA-1 and SHA-256, I think. > Leaving it as is in this patch as you did is the right thing. > > [...] >> @@ -255,10 +252,10 @@ network byte order): >> up to and not including the table of CRC32 values. >> - Zero or more NUL bytes. >> - The trailer consists of the following: >> - - A copy of the 20-byte NewHash checksum at the end of the >> + - A copy of the 20-byte SHA-256 checksum at the end of the > > Not about this patch: a SHA-256 is 32 bytes. Leaving that for a > separate patch as you did is the right thing, though. > > [...] >> - - 20-byte NewHash checksum of all of the above. >> + - 20-byte SHA-256 checksum of all of the above. > > Likewise. > > [...] >> @@ -351,12 +348,12 @@ the following steps: >> (This list only contains objects reachable from the "wants". If the >> pack from the server contained additional extraneous objects, then >> they will be discarded.) >> -3. convert to newhash: open a new (newhash) packfile. Read the topologically >> +3. convert to sha256: open a new (sha256) packfile. Read the topologically > > Not about this patch: this one's more murky, since it's talking about > the object names instead of the hash function. I guess "sha256" > instead of "SHA-256" for this could be right, but I worry it's going > to take time for me to figure out the exact distinction. 
That seems > like a reason to just call it SHA-256 (but in a separate patch). > > [...] >> - sha1-content, convert to newhash-content, and write it to the newhash >> - pack. Record the new sha1<->newhash mapping entry for use in the idx. >> + sha1-content, convert to sha256-content, and write it to the sha256 >> + pack. Record the new sha1<->sha256 mapping entry for use in the idx. >> 4. sort: reorder entries in the new pack to match the order of objects >> - in the pack the server generated and include blobs. Write a newhash idx >> + in the pack the server generated and include blobs. Write a sha256 idx >> file > > Likewise. > > [...] >> @@ -388,16 +385,16 @@ send-pack. >> >> Signed Commits >> ~~~~~~~~~~~~~~ >> -We add a new field "gpgsig-newhash" to the commit object format to allow >> +We add a new field "gpgsig-sha256" to the commit object format to allow >> signing commits without relying on SHA-1. It is similar to the >> -existing "gpgsig" field. Its signed payload is the newhash-content of the >> -commit object with any "gpgsig" and "gpgsig-newhash" fields removed. >> +existing "gpgsig" field. Its signed payload is the sha256-content of the >> +commit object with any "gpgsig" and "gpgsig-sha256" fields removed. > > That reminds me --- we need to add support for stripping these out. > > [...] >> @@ -601,18 +598,22 @@ The user can also explicitly specify which format to use for a >> particular revision specifier and for output, overriding the mode. For >> example: >> >> -git --output-format=sha1 log abac87a^{sha1}..f787cac^{newhash} >> +git --output-format=sha1 log abac87a^{sha1}..f787cac^{sha256} >> >> -Selection of a New Hash >> ------------------------ >> +Choice of Hash >> +-------------- > > Yay! > > [...] >> -Some hashes under consideration are SHA-256, SHA-512/256, SHA-256x16, >> -K12, and BLAKE2bp-256. >> +We choose SHA-256. 
See the thread starting at >> +<20180609224913.GC38834@genre.crustytoothpaste.net> for the discussion >> +(https://public-inbox.org/git/20180609224913.GC38834@genre.crustytoothpaste.net/) > > Can this reference be moved to a footnote? It's not part of the > design, but it's a good reference. > > Thanks again for getting this documented. > > Sincerely, > Jonathan ^ permalink raw reply [flat|nested] 66+ messages in thread
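[Editorial note: Jonathan's aside above — "we need to add support for stripping these out" — refers to removing the "gpgsig"/"gpgsig-sha256" header fields when forming a commit's signed payload. A rough sketch of that stripping, illustrative only and not git's implementation; it assumes the commit-object convention that header continuation lines begin with a space:]

```python
def signed_payload(commit: bytes) -> bytes:
    """Drop gpgsig/gpgsig-sha256 header fields (and their continuation
    lines) from a commit object's content, per the quoted design."""
    headers, _, body = commit.partition(b"\n\n")
    kept, skipping = [], False
    for line in headers.split(b"\n"):
        if line.startswith((b"gpgsig ", b"gpgsig-sha256 ")):
            skipping = True      # drop the field itself
            continue
        if skipping and line.startswith(b" "):
            continue             # drop its continuation lines
        skipping = False
        kept.append(line)
    return b"\n".join(kept) + b"\n\n" + body

commit = (b"tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
          b"author A U Thor <a@example.com> 1234567890 +0000\n"
          b"committer A U Thor <a@example.com> 1234567890 +0000\n"
          b"gpgsig -----BEGIN PGP SIGNATURE-----\n"
          b" ...signature lines...\n"
          b" -----END PGP SIGNATURE-----\n"
          b"\n"
          b"commit message\n")
print(signed_payload(commit).decode())
```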
* Re: [PATCH v2 2/2] doc hash-function-transition: pick SHA-256 as NewHash 2018-08-03 17:43 ` Ævar Arnfjörð Bjarmason @ 2018-08-04 8:52 ` Jonathan Nieder 0 siblings, 0 replies; 66+ messages in thread From: Jonathan Nieder @ 2018-08-04 8:52 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: git, Junio C Hamano, Linus Torvalds, Edward Thomson, brian m . carlson, Johannes Schindelin, demerphq, Adam Langley, keccak Subject: doc hash-function-transition: pick SHA-256 as NewHash From a security perspective, it seems that SHA-256, BLAKE2, SHA3-256, K12, and so on are all believed to have similar security properties. All are good options from a security point of view. SHA-256 has a number of advantages: * It has been around for a while, is widely used, and is supported by just about every single crypto library (OpenSSL, mbedTLS, CryptoNG, SecureTransport, etc). * When you compare against SHA1DC, most vectorized SHA-256 implementations are indeed faster, even without acceleration. * If we're doing signatures with OpenPGP (or even, I suppose, CMS), we're going to be using SHA-2, so it doesn't make sense to have our security depend on two separate algorithms when either one of them alone could break the security when we could just depend on one. So SHA-256 it is. Update the hash-function-transition design doc to say so. After this patch, there are no remaining instances of the string "NewHash", except for an unrelated use from 2008 as a variable name in t/t9700/test.pl. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Acked-by: brian m. 
carlson <sandals@crustytoothpaste.net> Acked-by: Johannes Schindelin <Johannes.Schindelin@gmx.de> Acked-by: Dan Shumow <danshu@microsoft.com> Signed-off-by: Jonathan Nieder <jrnieder@gmail.com> --- Hi, Ævar Arnfjörð Bjarmason wrote: > I think it makes sense if you just take over 2/2 of this series (or even the > whole thing), since the meat of it is already something I copy/pasted > from you, and you've got more of an opinion / idea about how to proceed > (which is good!); it's more efficient than me trying to fix various > stuff you're pointing out at this point. I also think it makes sense > that you change the "Author" line for that, since the rest of the > changes will mainly be search-replace by me. Fair enough. Here's that updated patch 2/2. I'll try to make a more comprehensive set of proposed edits tomorrow, in a fresh thread (dealing with the cksum-trailer, etc). Brian, is your latest work in progress available somewhere (e.g. a branch on https://git.crustytoothpaste.net/git/bmc/git) so I can make sure any edits I make match well with it? Thanks, Jonathan .../technical/hash-function-transition.txt | 196 +++++++++--------- 1 file changed, 98 insertions(+), 98 deletions(-) diff --git a/Documentation/technical/hash-function-transition.txt b/Documentation/technical/hash-function-transition.txt index 5ee4754adb..bc2ace2a6e 100644 --- a/Documentation/technical/hash-function-transition.txt +++ b/Documentation/technical/hash-function-transition.txt @@ -59,14 +59,11 @@ that are believed to be cryptographically secure. Goals ----- -Where NewHash is a strong 256-bit hash function to replace SHA-1 (see -"Selection of a New Hash", below): - -1. The transition to NewHash can be done one local repository at a time. +1. The transition to SHA-256 can be done one local repository at a time. a. Requiring no action by any other party. - b. A NewHash repository can communicate with SHA-1 Git servers + b. A SHA-256 repository can communicate with SHA-1 Git servers (push/fetch). - c. 
Users can use SHA-1 and NewHash identifiers for objects + c. Users can use SHA-1 and SHA-256 identifiers for objects interchangeably (see "Object names on the command line", below). d. New signed objects make use of a stronger hash function than SHA-1 for their security guarantees. @@ -79,7 +76,7 @@ Where NewHash is a strong 256-bit hash function to replace SHA-1 (see Non-Goals --------- -1. Add NewHash support to Git protocol. This is valuable and the +1. Add SHA-256 support to Git protocol. This is valuable and the logical next step but it is out of scope for this initial design. 2. Transparently improving the security of existing SHA-1 signed objects. @@ -87,26 +84,26 @@ Non-Goals repository. 4. Taking the opportunity to fix other bugs in Git's formats and protocols. -5. Shallow clones and fetches into a NewHash repository. (This will - change when we add NewHash support to Git protocol.) -6. Skip fetching some submodules of a project into a NewHash - repository. (This also depends on NewHash support in Git +5. Shallow clones and fetches into a SHA-256 repository. (This will + change when we add SHA-256 support to Git protocol.) +6. Skip fetching some submodules of a project into a SHA-256 + repository. (This also depends on SHA-256 support in Git protocol.) Overview -------- We introduce a new repository format extension. Repositories with this -extension enabled use NewHash instead of SHA-1 to name their objects. +extension enabled use SHA-256 instead of SHA-1 to name their objects. This affects both object names and object content --- both the names of objects and all references to other objects within an object are switched to the new hash function. -NewHash repositories cannot be read by older versions of Git. +SHA-256 repositories cannot be read by older versions of Git. -Alongside the packfile, a NewHash repository stores a bidirectional -mapping between NewHash and SHA-1 object names. 
The mapping is generated +Alongside the packfile, a SHA-256 repository stores a bidirectional +mapping between SHA-256 and SHA-1 object names. The mapping is generated locally and can be verified using "git fsck". Object lookups use this -mapping to allow naming objects using either their SHA-1 and NewHash names +mapping to allow naming objects using either their SHA-1 and SHA-256 names interchangeably. "git cat-file" and "git hash-object" gain options to display an object @@ -116,7 +113,7 @@ object database so that they can be named using the appropriate name (using the bidirectional hash mapping). Fetches from a SHA-1 based server convert the fetched objects into -NewHash form and record the mapping in the bidirectional mapping table +SHA-256 form and record the mapping in the bidirectional mapping table (see below for details). Pushes to a SHA-1 based server convert the objects being pushed into sha1 form so the server does not have to be aware of the hash function the client is using. @@ -125,19 +122,19 @@ Detailed Design --------------- Repository format extension ~~~~~~~~~~~~~~~~~~~~~~~~~~~ -A NewHash repository uses repository format version `1` (see +A SHA-256 repository uses repository format version `1` (see Documentation/technical/repository-version.txt) with extensions `objectFormat` and `compatObjectFormat`: [core] repositoryFormatVersion = 1 [extensions] - objectFormat = newhash + objectFormat = sha256 compatObjectFormat = sha1 The combination of setting `core.repositoryFormatVersion=1` and populating `extensions.*` ensures that all versions of Git later than -`v0.99.9l` will die instead of trying to operate on the NewHash +`v0.99.9l` will die instead of trying to operate on the SHA-256 repository, instead producing an error message. # Between v0.99.9l and v2.7.0 @@ -155,36 +152,36 @@ repository extensions. 
Object names ~~~~~~~~~~~~ Objects can be named by their 40 hexadecimal digit sha1-name or 64 -hexadecimal digit newhash-name, plus names derived from those (see +hexadecimal digit sha256-name, plus names derived from those (see gitrevisions(7)). The sha1-name of an object is the SHA-1 of the concatenation of its type, length, a nul byte, and the object's sha1-content. This is the traditional <sha1> used in Git to name objects. -The newhash-name of an object is the NewHash of the concatenation of its -type, length, a nul byte, and the object's newhash-content. +The sha256-name of an object is the SHA-256 of the concatenation of its +type, length, a nul byte, and the object's sha256-content. Object format ~~~~~~~~~~~~~ The content as a byte sequence of a tag, commit, or tree object named -by sha1 and newhash differ because an object named by newhash-name refers to -other objects by their newhash-names and an object named by sha1-name +by sha1 and sha256 differ because an object named by sha256-name refers to +other objects by their sha256-names and an object named by sha1-name refers to other objects by their sha1-names. -The newhash-content of an object is the same as its sha1-content, except -that objects referenced by the object are named using their newhash-names +The sha256-content of an object is the same as its sha1-content, except +that objects referenced by the object are named using their sha256-names instead of sha1-names. Because a blob object does not refer to any -other object, its sha1-content and newhash-content are the same. +other object, its sha1-content and sha256-content are the same. -The format allows round-trip conversion between newhash-content and +The format allows round-trip conversion between sha256-content and sha1-content. Object storage ~~~~~~~~~~~~~~ Loose objects use zlib compression and packed objects use the packed format described in Documentation/technical/pack-format.txt, just like -today. 
The content that is compressed and stored uses newhash-content +today. The content that is compressed and stored uses sha256-content instead of sha1-content. Pack index @@ -255,10 +252,10 @@ network byte order): up to and not including the table of CRC32 values. - Zero or more NUL bytes. - The trailer consists of the following: - - A copy of the 20-byte NewHash checksum at the end of the + - A copy of the 20-byte SHA-256 checksum at the end of the corresponding packfile. - - 20-byte NewHash checksum of all of the above. + - 20-byte SHA-256 checksum of all of the above. Loose object index ~~~~~~~~~~~~~~~~~~ @@ -266,7 +263,7 @@ A new file $GIT_OBJECT_DIR/loose-object-idx contains information about all loose objects. Its format is # loose-object-idx - (newhash-name SP sha1-name LF)* + (sha256-name SP sha1-name LF)* where the object names are in hexadecimal format. The file is not sorted. @@ -292,8 +289,8 @@ To remove entries (e.g. in "git pack-refs" or "git-prune"): Translation table ~~~~~~~~~~~~~~~~~ The index files support a bidirectional mapping between sha1-names -and newhash-names. The lookup proceeds similarly to ordinary object -lookups. For example, to convert a sha1-name to a newhash-name: +and sha256-names. The lookup proceeds similarly to ordinary object +lookups. For example, to convert a sha1-name to a sha256-name: 1. Look for the object in idx files. If a match is present in the idx's sorted list of truncated sha1-names, then: @@ -301,8 +298,8 @@ lookups. For example, to convert a sha1-name to a newhash-name: name order mapping. b. Read the corresponding entry in the full sha1-name table to verify we found the right object. If it is, then - c. Read the corresponding entry in the full newhash-name table. - That is the object's newhash-name. + c. Read the corresponding entry in the full sha256-name table. + That is the object's sha256-name. 2. Check for a loose object. Read lines from loose-object-idx until we find a match. 
@@ -318,25 +315,25 @@ for all objects in the object store. Reading an object's sha1-content ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -The sha1-content of an object can be read by converting all newhash-names -its newhash-content references to sha1-names using the translation table. +The sha1-content of an object can be read by converting all sha256-names +its sha256-content references to sha1-names using the translation table. Fetch ~~~~~ Fetching from a SHA-1 based server requires translating between SHA-1 -and NewHash based representations on the fly. +and SHA-256 based representations on the fly. SHA-1s named in the ref advertisement that are present on the client -can be translated to NewHash and looked up as local objects using the +can be translated to SHA-256 and looked up as local objects using the translation table. Negotiation proceeds as today. Any "have"s generated locally are converted to SHA-1 before being sent to the server, and SHA-1s -mentioned by the server are converted to NewHash when looking them up +mentioned by the server are converted to SHA-256 when looking them up locally. After negotiation, the server sends a packfile containing the -requested objects. We convert the packfile to NewHash format using +requested objects. We convert the packfile to SHA-256 format using the following steps: 1. index-pack: inflate each object in the packfile and compute its @@ -351,12 +348,12 @@ the following steps: (This list only contains objects reachable from the "wants". If the pack from the server contained additional extraneous objects, then they will be discarded.) -3. convert to newhash: open a new (newhash) packfile. Read the topologically +3. convert to sha256: open a new (sha256) packfile. Read the topologically sorted list just generated. For each object, inflate its - sha1-content, convert to newhash-content, and write it to the newhash - pack. Record the new sha1<->newhash mapping entry for use in the idx. 
+ sha1-content, convert to sha256-content, and write it to the sha256 + pack. Record the new sha1<->sha256 mapping entry for use in the idx. 4. sort: reorder entries in the new pack to match the order of objects - in the pack the server generated and include blobs. Write a newhash idx + in the pack the server generated and include blobs. Write a sha256 idx file 5. clean up: remove the SHA-1 based pack file, index, and topologically sorted list obtained from the server in steps 1 @@ -388,16 +385,16 @@ send-pack. Signed Commits ~~~~~~~~~~~~~~ -We add a new field "gpgsig-newhash" to the commit object format to allow +We add a new field "gpgsig-sha256" to the commit object format to allow signing commits without relying on SHA-1. It is similar to the -existing "gpgsig" field. Its signed payload is the newhash-content of the -commit object with any "gpgsig" and "gpgsig-newhash" fields removed. +existing "gpgsig" field. Its signed payload is the sha256-content of the +commit object with any "gpgsig" and "gpgsig-sha256" fields removed. This means commits can be signed 1. using SHA-1 only, as in existing signed commit objects -2. using both SHA-1 and NewHash, by using both gpgsig-newhash and gpgsig +2. using both SHA-1 and SHA-256, by using both gpgsig-sha256 and gpgsig fields. -3. using only NewHash, by only using the gpgsig-newhash field. +3. using only SHA-256, by only using the gpgsig-sha256 field. Old versions of "git verify-commit" can verify the gpgsig signature in cases (1) and (2) without modifications and view case (3) as an @@ -405,24 +402,24 @@ ordinary unsigned commit. Signed Tags ~~~~~~~~~~~ -We add a new field "gpgsig-newhash" to the tag object format to allow +We add a new field "gpgsig-sha256" to the tag object format to allow signing tags without relying on SHA-1. 
Its signed payload is the -newhash-content of the tag with its gpgsig-newhash field and "-----BEGIN PGP +sha256-content of the tag with its gpgsig-sha256 field and "-----BEGIN PGP SIGNATURE-----" delimited in-body signature removed. This means tags can be signed 1. using SHA-1 only, as in existing signed tag objects -2. using both SHA-1 and NewHash, by using gpgsig-newhash and an in-body +2. using both SHA-1 and SHA-256, by using gpgsig-sha256 and an in-body signature. -3. using only NewHash, by only using the gpgsig-newhash field. +3. using only SHA-256, by only using the gpgsig-sha256 field. Mergetag embedding ~~~~~~~~~~~~~~~~~~ The mergetag field in the sha1-content of a commit contains the sha1-content of a tag that was merged by that commit. -The mergetag field in the newhash-content of the same commit contains the -newhash-content of the same tag. +The mergetag field in the sha256-content of the same commit contains the +sha256-content of the same tag. Submodules ~~~~~~~~~~ @@ -497,7 +494,7 @@ Caveats ------- Invalid objects ~~~~~~~~~~~~~~~ -The conversion from sha1-content to newhash-content retains any +The conversion from sha1-content to sha256-content retains any brokenness in the original object (e.g., tree entry modes encoded with leading 0, tree objects whose paths are not sorted correctly, and commit objects without an author or committer). This is a deliberate @@ -516,7 +513,7 @@ allow lifting this restriction. Alternates ~~~~~~~~~~ -For the same reason, a newhash repository cannot borrow objects from a +For the same reason, a sha256 repository cannot borrow objects from a sha1 repository using objects/info/alternates or $GIT_ALTERNATE_OBJECT_REPOSITORIES. @@ -524,20 +521,20 @@ git notes ~~~~~~~~~ The "git notes" tool annotates objects using their sha1-name as key. This design does not describe a way to migrate notes trees to use -newhash-names. That migration is expected to happen separately (for +sha256-names. 
That migration is expected to happen separately (for example using a file at the root of the notes tree to describe which hash it uses). Server-side cost ~~~~~~~~~~~~~~~~ -Until Git protocol gains NewHash support, using NewHash based storage +Until Git protocol gains SHA-256 support, using SHA-256 based storage on public-facing Git servers is strongly discouraged. Once Git -protocol gains NewHash support, NewHash based servers are likely not +protocol gains SHA-256 support, SHA-256 based servers are likely not to support SHA-1 compatibility, to avoid what may be a very expensive hash reencode during clone and to encourage peers to modernize. The design described here allows fetches by SHA-1 clients of a -personal NewHash repository because it's not much more difficult than +personal SHA-256 repository because it's not much more difficult than allowing pushes from that repository. This support needs to be guarded by a configuration option --- servers like git.kernel.org that serve a large number of clients would not be expected to bear that cost. @@ -547,7 +544,7 @@ Meaning of signatures The signed payload for signed commits and tags does not explicitly name the hash used to identify objects. If some day Git adopts a new hash function with the same length as the current SHA-1 (40 -hexadecimal digit) or NewHash (64 hexadecimal digit) objects then the +hexadecimal digit) or SHA-256 (64 hexadecimal digit) objects then the intent behind the PGP signed payload in an object signature is unclear: @@ -562,7 +559,7 @@ Does this mean Git v2.12.0 is the commit with sha1-name e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7 or the commit with new-40-digit-hash-name e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7? -Fortunately NewHash and SHA-1 have different lengths. If Git starts +Fortunately SHA-256 and SHA-1 have different lengths. 
If Git starts using another hash with the same length to name objects, then it will need to change the format of signed payloads using that hash to address this issue. @@ -574,24 +571,24 @@ supports four different modes of operation: 1. ("dark launch") Treat object names input by the user as SHA-1 and convert any object names written to output to SHA-1, but store - objects using NewHash. This allows users to test the code with no + objects using SHA-256. This allows users to test the code with no visible behavior change except for performance. This allows allows running even tests that assume the SHA-1 hash function, to sanity-check the behavior of the new mode. - 2. ("early transition") Allow both SHA-1 and NewHash object names in + 2. ("early transition") Allow both SHA-1 and SHA-256 object names in input. Any object names written to output use SHA-1. This allows users to continue to make use of SHA-1 to communicate with peers (e.g. by email) that have not migrated yet and prepares for mode 3. - 3. ("late transition") Allow both SHA-1 and NewHash object names in - input. Any object names written to output use NewHash. In this + 3. ("late transition") Allow both SHA-1 and SHA-256 object names in + input. Any object names written to output use SHA-256. In this mode, users are using a more secure object naming method by default. The disruption is minimal as long as most of their peers are in mode 2 or mode 3. 4. ("post-transition") Treat object names input by the user as - NewHash and write output using NewHash. This is safer than mode 3 + SHA-256 and write output using SHA-256. This is safer than mode 3 because there is less risk that input is incorrectly interpreted using the wrong hash function. @@ -601,27 +598,31 @@ The user can also explicitly specify which format to use for a particular revision specifier and for output, overriding the mode. 
For example: -git --output-format=sha1 log abac87a^{sha1}..f787cac^{newhash} +git --output-format=sha1 log abac87a^{sha1}..f787cac^{sha256} -Selection of a New Hash ------------------------ +Choice of Hash +-------------- In early 2005, around the time that Git was written, Xiaoyun Wang, Yiqun Lisa Yin, and Hongbo Yu announced an attack finding SHA-1 collisions in 2^69 operations. In August they published details. Luckily, no practical demonstrations of a collision in full SHA-1 were published until 10 years later, in 2017. -The hash function NewHash to replace SHA-1 should be stronger than -SHA-1 was: we would like it to be trustworthy and useful in practice -for at least 10 years. +Git v2.13.0 and later subsequently moved to a hardened SHA-1 +implementation by default that mitigates the SHAttered attack, but +SHA-1 is still believed to be weak. + +The hash to replace this hardened SHA-1 should be stronger than SHA-1 +was: we would like it to be trustworthy and useful in practice for at +least 10 years. Some other relevant properties: 1. A 256-bit hash (long enough to match common security practice; not excessively long to hurt performance and disk usage). -2. High quality implementations should be widely available (e.g. in - OpenSSL). +2. High quality implementations should be widely available (e.g., in + OpenSSL and Apple CommonCrypto). 3. The hash function's properties should match Git's needs (e.g. Git requires collision and 2nd preimage resistance and does not require @@ -630,14 +631,13 @@ Some other relevant properties: 4. As a tiebreaker, the hash should be fast to compute (fortunately many contenders are faster than SHA-1). -Some hashes under consideration are SHA-256, SHA-512/256, SHA-256x16, -K12, and BLAKE2bp-256. +We choose SHA-256. 
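[Editorial note: the "Meaning of signatures" caveat quoted earlier leans on the observation that SHA-1 and SHA-256 names have different lengths (40 vs 64 hex digits), and the transition modes above rely on being able to tell input names apart. A minimal sketch of that disambiguation:]

```python
import re

def classify_hex_name(name: str):
    """Tell a sha1-name from a sha256-name purely by length, per the
    quoted point that 40- and 64-hex-digit names cannot be confused.
    Returns "sha1", "sha256", or None."""
    if re.fullmatch(r"[0-9a-fA-F]{40}", name):
        return "sha1"
    if re.fullmatch(r"[0-9a-fA-F]{64}", name):
        return "sha256"
    return None

print(classify_hex_name("e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"))  # sha1
print(classify_hex_name("0" * 64))                                    # sha256
```

As the design notes, if Git ever adopts another hash with one of these two lengths, length alone stops being sufficient and signed payloads would need to name the hash explicitly.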
Transition plan --------------- Some initial steps can be implemented independently of one another: - adding a hash function API (vtable) -- teaching fsck to tolerate the gpgsig-newhash field +- teaching fsck to tolerate the gpgsig-sha256 field - excluding gpgsig-* from the fields copied by "git commit --amend" - annotating tests that depend on SHA-1 values with a SHA1 test prerequisite @@ -664,7 +664,7 @@ Next comes introduction of compatObjectFormat: - adding appropriate index entries when adding a new object to the object store - --output-format option -- ^{sha1} and ^{newhash} revision notation +- ^{sha1} and ^{sha256} revision notation - configuration to specify default input and output format (see "Object names on the command line" above) @@ -672,7 +672,7 @@ The next step is supporting fetches and pushes to SHA-1 repositories: - allow pushes to a repository using the compat format - generate a topologically sorted list of the SHA-1 names of fetched objects -- convert the fetched packfile to newhash format and generate an idx +- convert the fetched packfile to sha256 format and generate an idx file - re-sort to match the order of objects in the fetched packfile @@ -680,30 +680,30 @@ The infrastructure supporting fetch also allows converting an existing repository. In converted repositories and new clones, end users can gain support for the new hash function without any visible change in behavior (see "dark launch" in the "Object names on the command line" -section). In particular this allows users to verify NewHash signatures +section). In particular this allows users to verify SHA-256 signatures on objects in the repository, and it should ensure the transition code is stable in production in preparation for using it more widely. Over time projects would encourage their users to adopt the "early transition" and then "late transition" modes to take advantage of the -new, more futureproof NewHash object names. +new, more futureproof SHA-256 object names. 
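[Editorial note: the transition plan's first step above is "adding a hash function API (vtable)". In Python terms the idea looks roughly like this; the field names and format_id values are illustrative, not git's actual struct git_hash_algo:]

```python
import hashlib
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class HashAlgo:
    """Sketch of a hash-function vtable entry."""
    name: str          # e.g. the extensions.objectFormat value
    format_id: bytes   # identifier serialized into file formats
    rawsz: int         # digest size in bytes
    hexsz: int         # digest size in hex digits
    ctor: Callable     # incremental hash constructor

SHA1   = HashAlgo("sha1",   b"sha1", 20, 40, hashlib.sha1)
SHA256 = HashAlgo("sha256", b"s256", 32, 64, hashlib.sha256)

# A repository-wide "the_hash_algo" selects which algorithm is in effect,
# so callers never hard-code 20/40 or 32/64.
the_hash_algo = SHA256
h = the_hash_algo.ctor(b"blob 0\0")
assert len(h.hexdigest()) == the_hash_algo.hexsz
print(the_hash_algo.name, the_hash_algo.rawsz)
```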
When objectFormat and compatObjectFormat are both set, commands -generating signatures would generate both SHA-1 and NewHash signatures +generating signatures would generate both SHA-1 and SHA-256 signatures by default to support both new and old users. -In projects using NewHash heavily, users could be encouraged to adopt +In projects using SHA-256 heavily, users could be encouraged to adopt the "post-transition" mode to avoid accidentally making implicit use of SHA-1 object names. Once a critical mass of users have upgraded to a version of Git that -can verify NewHash signatures and have converted their existing +can verify SHA-256 signatures and have converted their existing repositories to support verifying them, we can add support for a -setting to generate only NewHash signatures. This is expected to be at +setting to generate only SHA-256 signatures. This is expected to be at least a year later. That is also a good moment to advertise the ability to convert -repositories to use NewHash only, stripping out all SHA-1 related +repositories to use SHA-256 only, stripping out all SHA-1 related metadata. This improves performance by eliminating translation overhead and security by avoiding the possibility of accidentally relying on the safety of SHA-1. @@ -742,16 +742,16 @@ using the old hash function. Signed objects with multiple hashes ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Instead of introducing the gpgsig-newhash field in commit and tag objects -for newhash-content based signatures, an earlier version of this design -added "hash newhash <newhash-name>" fields to strengthen the existing +Instead of introducing the gpgsig-sha256 field in commit and tag objects +for sha256-content based signatures, an earlier version of this design +added "hash sha256 <sha256-name>" fields to strengthen the existing sha1-content based signatures. In other words, a single signature was used to attest to the object content using both hash functions. 
This had some advantages: * Using one signature instead of two speeds up the signing process. * Having one signed payload with both hashes allows the signer to - attest to the sha1-name and newhash-name referring to the same object. + attest to the sha1-name and sha256-name referring to the same object. * All users consume the same signature. Broken signatures are likely to be detected quickly using current versions of git. @@ -760,11 +760,11 @@ However, it also came with disadvantages: objects it references, even after the transition is complete and translation table is no longer needed for anything else. To support this, the design added fields such as "hash sha1 tree <sha1-name>" - and "hash sha1 parent <sha1-name>" to the newhash-content of a signed + and "hash sha1 parent <sha1-name>" to the sha256-content of a signed commit, complicating the conversion process. * Allowing signed objects without a sha1 (for after the transition is complete) complicated the design further, requiring a "nohash sha1" - field to suppress including "hash sha1" fields in the newhash-content + field to suppress including "hash sha1" fields in the sha256-content and signed payload. Lazily populated translation table @@ -772,7 +772,7 @@ Lazily populated translation table Some of the work of building the translation table could be deferred to push time, but that would significantly complicate and slow down pushes. Calculating the sha1-name at object creation time at the same time it is -being streamed to disk and having its newhash-name calculated should be +being streamed to disk and having its sha256-name calculated should be an acceptable cost. Document History -- 2.18.0.597.ga71716f1ad ^ permalink raw reply related [flat|nested] 66+ messages in thread
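The transition steps above start with "adding a hash function API (vtable)", and the compatibility guidance elsewhere in the thread points at the format_id field of struct git_hash_algo. A rough Python sketch of that shape may help visualize it; the field names rawsz/hexsz and the b"s256" identifier bytes are illustrative assumptions here, not quoted from Git's code:

```python
import hashlib
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class HashAlgo:
    name: str         # name used in config, e.g. "sha256"
    format_id: bytes  # 4-byte identifier serialized into file formats
    rawsz: int        # size of a binary object name in bytes
    hexsz: int        # size of a hex object name in characters
    new: Callable     # constructor for an incremental hash context

SHA1 = HashAlgo("sha1", b"sha1", 20, 40, hashlib.sha1)
SHA256 = HashAlgo("sha256", b"s256", 32, 64, hashlib.sha256)

def hash_object(algo: HashAlgo, kind: bytes, data: bytes) -> str:
    """Hash data with Git's object framing: type, space, length, NUL, content."""
    ctx = algo.new()
    ctx.update(kind + b" " + str(len(data)).encode() + b"\0" + data)
    return ctx.hexdigest()

# The empty blob under both algorithms:
print(hash_object(SHA1, b"blob", b""))    # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(hash_object(SHA256, b"blob", b""))  # 64 hex digits
```

The point of the vtable is visible in hash_object: nothing in it hard-codes 20 or 40, so the same code serves both algorithms, which is exactly what replacing the hard-coded constants with the_hash_algo buys.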
* Re: [PATCH v2 2/2] doc hash-function-transition: pick SHA-256 as NewHash 2018-08-03 7:20 ` Jonathan Nieder ` (2 preceding siblings ...) 2018-08-03 17:43 ` Ævar Arnfjörð Bjarmason @ 2018-08-03 17:45 ` brian m. carlson 3 siblings, 0 replies; 66+ messages in thread From: brian m. carlson @ 2018-08-03 17:45 UTC (permalink / raw) To: Jonathan Nieder Cc: Ævar Arnfjörð Bjarmason, git, Junio C Hamano, Linus Torvalds, Edward Thomson, Johannes Schindelin, demerphq, Adam Langley, keccak [-- Attachment #1: Type: text/plain, Size: 2397 bytes --] On Fri, Aug 03, 2018 at 12:20:14AM -0700, Jonathan Nieder wrote: > Ævar Arnfjörð Bjarmason wrote: > > Object format > > ~~~~~~~~~~~~~ > > The content as a byte sequence of a tag, commit, or tree object named > > -by sha1 and newhash differ because an object named by newhash-name refers to > > +by sha1 and sha256 differ because an object named by sha256-name refers to > > Not about this patch: this should say SHA-1 and SHA-256, I think. > Leaving it as is in this patch as you did is the right thing. > > [...] > > @@ -255,10 +252,10 @@ network byte order): > > up to and not including the table of CRC32 values. > > - Zero or more NUL bytes. > > - The trailer consists of the following: > > - - A copy of the 20-byte NewHash checksum at the end of the > > + - A copy of the 20-byte SHA-256 checksum at the end of the > > Not about this patch: a SHA-256 is 32 bytes. Leaving that for a > separate patch as you did is the right thing, though. > > [...] > > - - 20-byte NewHash checksum of all of the above. > > + - 20-byte SHA-256 checksum of all of the above. > > Likewise. For the record, my code for these does use 32 bytes. I'm fine with this being a separate patch, though. > [...] > > @@ -351,12 +348,12 @@ the following steps: > > (This list only contains objects reachable from the "wants". If the > > pack from the server contained additional extraneous objects, then > > they will be discarded.) > > -3. 
convert to newhash: open a new (newhash) packfile. Read the topologically > > +3. convert to sha256: open a new (sha256) packfile. Read the topologically > > Not about this patch: this one's more murky, since it's talking about > the object names instead of the hash function. I guess "sha256" > instead of "SHA-256" for this could be right, but I worry it's going > to take time for me to figure out the exact distinction. That seems > like a reason to just call it SHA-256 (but in a separate patch). My assumption has been that when we are referring to the algorithm, we'll use SHA-1 and SHA-256, and when we're referring to the input to Git (in config files or in ^{sha256} notation), we use "sha1" and "sha256". I see this as analogous to "Git" and "git". Otherwise, I'm fine with this document as it is. -- brian m. carlson: Houston, Texas, US OpenPGP: https://keybase.io/bk2204 [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 867 bytes --] ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: [PATCH 2/2] doc hash-function-transition: pick SHA-256 as NewHash 2018-07-25 16:45 ` Junio C Hamano 2018-07-25 17:25 ` Jonathan Nieder @ 2018-07-25 22:56 ` brian m. carlson 1 sibling, 0 replies; 66+ messages in thread From: brian m. carlson @ 2018-07-25 22:56 UTC (permalink / raw) To: Junio C Hamano Cc: Ævar Arnfjörð Bjarmason, git, Linus Torvalds, Edward Thomson, Jonathan Nieder, Johannes Schindelin, demerphq, Adam Langley, keccak [-- Attachment #1: Type: text/plain, Size: 1158 bytes --] On Wed, Jul 25, 2018 at 09:45:52AM -0700, Junio C Hamano wrote: > Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes: > > > @@ -125,19 +122,19 @@ Detailed Design > > --------------- > > Repository format extension > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > -A NewHash repository uses repository format version `1` (see > > +A SHA-256 repository uses repository format version `1` (see > > Documentation/technical/repository-version.txt) with extensions > > `objectFormat` and `compatObjectFormat`: > > > > [core] > > repositoryFormatVersion = 1 > > [extensions] > > - objectFormat = newhash > > + objectFormat = sha256 > > compatObjectFormat = sha1 > > Whenever we said SHA1, somebody came and told us that the name of > the hash is SHA-1 (with dash). Would we be nitpicker-prone in the > same way with "sha256" here? I actually have a patch to make the names "sha1" and "sha256". My rationale is that it's shorter and easier to type. People can quibble about it when I send it to the list, but that's what I'm proposing at least. -- brian m. carlson: Houston, Texas, US OpenPGP: https://keybase.io/bk2204 [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 867 bytes --] ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Hash algorithm analysis 2018-06-09 22:49 ` Hash algorithm analysis brian m. carlson 2018-06-11 19:29 ` Jonathan Nieder @ 2018-06-11 21:19 ` Ævar Arnfjörð Bjarmason 2018-06-21 8:20 ` Johannes Schindelin 2018-06-21 22:39 ` brian m. carlson 1 sibling, 2 replies; 66+ messages in thread From: Ævar Arnfjörð Bjarmason @ 2018-06-11 21:19 UTC (permalink / raw) To: brian m. carlson Cc: git, Adam Langley, Jeff King, Mike Hommey, Brandon Williams, Linus Torvalds, Jonathan Nieder, Stefan Beller, Jonathan Tan, Junio Hamano, Johannes Schindelin On Sat, Jun 09 2018, brian m. carlson wrote: [Expanding the CC list to what we had in the last "what hash" thread[1] last year]. > == Discussion of Candidates > > I've implemented and tested the following algorithms, all of which are > 256-bit (in alphabetical order): > > * BLAKE2b (libb2) > * BLAKE2bp (libb2) > * KangarooTwelve (imported from the Keccak Code Package) > * SHA-256 (OpenSSL) > * SHA-512/256 (OpenSSL) > * SHA3-256 (OpenSSL) > * SHAKE128 (OpenSSL) > > I also rejected some other candidates. I couldn't find any reference or > implementation of SHA256×16, so I didn't implement it. I didn't > consider SHAKE256 because it is nearly identical to SHA3-256 in almost > all characteristics (including performance). > > I imported the optimized 64-bit implementation of KangarooTwelve. The > AVX2 implementation was not considered for licensing reasons (it's > partially generated from external code, which falls foul of the GPL's > "preferred form for modifications" rule). > > === BLAKE2b and BLAKE2bp > > These are the non-parallelized and parallelized 64-bit variants of > BLAKE2. > > Benefits: > * Both algorithms provide 256-bit preimage resistance. > > Downsides: > * Some people are uncomfortable that the security margin has been > decreased from the original SHA-3 submission, although it is still > considered secure. > * BLAKE2bp, as implemented in libb2, uses OpenMP (and therefore > multithreading) by default. 
It was no longer possible to run the > testsuite with -j3 on my laptop in this configuration. > > === Keccak-based Algorithms > > SHA3-256 is the 256-bit Keccak algorithm with 24 rounds, processing 136 > bytes at a time. SHAKE128 is an extendable output function with 24 > rounds, processing 168 bytes at a time. KangarooTwelve is an extendable > output function with 12 rounds, processing 136 bytes at a time. > > Benefits: > * SHA3-256 provides 256-bit preimage resistance. > * SHA3-256 has been heavily studied and is believed to have a large > security margin. > > I noted the following downsides: > * There's a lack of availability of KangarooTwelve in other > implementations. It may be the least available option in terms of > implementations. > * Some people are uncomfortable that the security margin of > KangarooTwelve has been decreased, although it is still considered > secure. > * SHAKE128 and KangarooTwelve provide only 128-bit preimage resistance. > > === SHA-256 and SHA-512/256 > > These are the 32-bit and 64-bit SHA-2 algorithms that are 256 bits in > size. > > I noted the following benefits: > * Both algorithms are well known and heavily analyzed. > * Both algorithms provide 256-bit preimage resistance. > > == Implementation Support > > |=== > | Implementation | OpenSSL | libb2 | NSS | ACC | gcrypt | Nettle| CL | > | SHA-1 | 🗸 | | 🗸 | 🗸 | 🗸 | 🗸 | {1} | > | BLAKE2b | f | 🗸 | | | 🗸 | | {2} | > | BLAKE2bp | | 🗸 | | | | | | > | KangarooTwelve | | | | | | | | > | SHA-256 | 🗸 | | 🗸 | 🗸 | 🗸 | 🗸 | {1} | > | SHA-512/256 | 🗸 | | | | | 🗸 | {3} | > | SHA3-256 | 🗸 | | | | 🗸 | 🗸 | {4} | > | SHAKE128 | 🗸 | | | | 🗸 | | {5} | > |=== > > f: future version (expected 1.2.0) > ACC: Apple Common Crypto > CL: Command-line > > :1: OpenSSL, coreutils, Perl Digest::SHA. > :2: OpenSSL, coreutils. > :3: OpenSSL > :4: OpenSSL, Perl Digest::SHA3. > :5: Perl Digest::SHA3. 
> > === Performance Analysis > > The test system used below is my personal laptop, a 2016 Lenovo ThinkPad > X1 Carbon with an Intel i7-6600U CPU (2.60 GHz) running Debian unstable. > > I implemented a test tool helper to compute speed much like OpenSSL > does. Below is a comparison of speeds. The columns indicate the speed > in KiB/s for chunks of the given size. The runs are representative of > multiple similar runs. > > 256 and 1024 bytes were chosen to represent common tree and commit > object sizes and the 8 KiB is an approximate average blob size. > > Algorithms are sorted by performance on the 1 KiB column. > > |=== > | Implementation | 256 B | 1 KiB | 8 KiB | 16 KiB | > | SHA-1 (OpenSSL) | 513963 | 685966 | 748993 | 754270 | > | BLAKE2b (libb2) | 488123 | 552839 | 576246 | 579292 | > | SHA-512/256 (OpenSSL) | 181177 | 349002 | 499113 | 495169 | > | BLAKE2bp (libb2) | 139891 | 344786 | 488390 | 522575 | > | SHA-256 (OpenSSL) | 264276 | 333560 | 357830 | 355761 | > | KangarooTwelve | 239305 | 307300 | 355257 | 364261 | > | SHAKE128 (OpenSSL) | 154775 | 253344 | 337811 | 346732 | > | SHA3-256 (OpenSSL) | 128597 | 185381 | 198931 | 207365 | > | BLAKE2bp (libb2; threaded) | 12223 | 49306 | 132833 | 179616 | > |=== > > SUPERCOP (a crypto benchmarking tool; > https://bench.cr.yp.to/results-hash.html) has also benchmarked these > algorithms. Note that BLAKE2bp is not listed, KangarooTwelve is k12, > SHA-512/256 is equivalent to sha512, SHA3-256 is keccakc512, and SHAKE128 is > keccakc256. > > Information is for kizomba, a Kaby Lake system. 
Counts are in cycles per byte (smaller is better; sorted by 1536 B column): > > |=== > | Algorithm | 576 B | 1536 B | 4096 B | long | > | BLAKE2b | 3.51 | 3.10 | 3.08 | 3.07 | > | SHA-1 | 4.36 | 3.81 | 3.59 | 3.49 | > | KangarooTwelve | 4.99 | 4.57 | 4.13 | 3.86 | > | SHA-512/256 | 6.39 | 5.76 | 5.31 | 5.05 | > | SHAKE128 | 8.23 | 7.67 | 7.17 | 6.97 | > | SHA-256 | 8.90 | 8.08 | 7.77 | 7.59 | > | SHA3-256 | 10.26 | 9.15 | 8.84 | 8.57 | > |=== > > Numbers for genji262, an AMD Ryzen System, which has SHA acceleration: > > |=== > | Algorithm | 576 B | 1536 B | 4096 B | long | > | SHA-1 | 1.87 | 1.69 | 1.60 | 1.54 | > | SHA-256 | 1.95 | 1.72 | 1.68 | 1.64 | > | BLAKE2b | 2.94 | 2.59 | 2.59 | 2.59 | > | KangarooTwelve | 4.09 | 3.65 | 3.35 | 3.17 | > | SHA-512/256 | 5.54 | 5.08 | 4.71 | 4.48 | > | SHAKE128 | 6.95 | 6.23 | 5.71 | 5.49 | > | SHA3-256 | 8.29 | 7.35 | 7.04 | 6.81 | > |=== > > Note that no mid- to high-end Intel processors provide acceleration. > AMD Ryzen and some ARM64 processors do. > > == Summary > > The algorithms with the greatest implementation availability are > SHA-256, SHA3-256, BLAKE2b, and SHAKE128. > > In terms of command-line availability, BLAKE2b, SHA-256, SHA-512/256, > and SHA3-256 should be available in the near future on a reasonably > small Debian, Ubuntu, or Fedora install. > > As far as security, the most conservative choices appear to be SHA-256, > SHA-512/256, and SHA3-256. > > The performance winners are BLAKE2b unaccelerated and SHA-256 accelerated. This is a great summary. Thanks. In case it's not apparent from what follows, I have a bias towards SHA-256. Reasons for that, to summarize some of the discussion the last time around[1], and to add more details: == Popularity Other things being equal we should be biased towards whatever's in the widest use & recommended for new projects. I fear that if e.g. 
git had used whatever was, at the time, to SHA-1 as BLAKE2b is to SHA-256 now, we might not even know that it's broken (or had the sha1collisiondetection work to fall back on), since researchers are less likely to look at algorithms that aren't in wide use. SHA-256 et al were published in 2001 and have ~20k results on Google Scholar, compared to ~150 for BLAKE2b[4], published in 2008 (but ~1.2K for "BLAKE2"). Between the websites of Intel, AMD & ARM there are thousands of results for SHA-256 (and existing in-silicon acceleration). There's exactly one result on all three for BLAKE2b (on amd.com, in the context of a laundry list of hash algorithms in some presentation). Since BLAKE2b lost the SHA-3 competition to Keccak it seems impossible that it'll ever get anywhere close to the same scrutiny or support in silicon as one of the SHA families. Which brings me to the next section... == Hardware acceleration The only widely deployed HW acceleration is for the SHA-1 and SHA-256 from the SHA-2 family[5], but notably nothing from the newer SHA-3 family (released in 2015). It seems implausible that anything except SHA-3 will get future HW acceleration given the narrow scope of current HW acceleration vs. existing hash algorithms. As noted in the thread from last year[1] most git users won't even notice if the hashing is faster, but it does matter for some big users (big monorepos), so having the option of purchasing hardware to make things faster today is great, and given how these new instruction set extensions get rolled out it seems inevitable that this'll be available in all consumer CPUs within 5-10 years. == Age Similar to "popularity" it seems better to bias things towards a hash that's been out there for a while, i.e. it would be too early to pick SHA-3. 
The hash transitioning plan, once implemented, also makes it easier to switch to something else in the future, so we shouldn't be in a rush to pick some newer hash because we'll need to keep it forever; we can always do another transition in another 10-15 years. == Conclusion For all the above reasons I think we should pick SHA-256. 1. https://public-inbox.org/git/87y3ss8n4h.fsf@gmail.com/#t 2. https://github.com/cr-marcstevens/sha1collisiondetection 3. https://scholar.google.nl/scholar?hl=en&as_sdt=0%2C5&q=SHA-256&btnG= 4. https://scholar.google.nl/scholar?hl=en&as_sdt=0%2C5&q=BLAKE2b&btnG= 5. https://en.wikipedia.org/wiki/Intel_SHA_extensions ^ permalink raw reply [flat|nested] 66+ messages in thread
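As a side note to the candidate list in the quoted analysis: most of these algorithms are exposed by Python's standard hashlib, which makes it easy to sanity-check that each can produce the 32-byte object names under discussion. This is only an illustration, not the benchmark tooling from the thread; SHA-512/256 and KangarooTwelve are omitted because they are not in the standard library:

```python
import hashlib

data = b"hello, world\n"

# BLAKE2b and SHAKE128 have variable-length output and are
# parameterized down to 32 bytes here.
digests = {
    "SHA-256":  hashlib.sha256(data).hexdigest(),
    "SHA3-256": hashlib.sha3_256(data).hexdigest(),
    "BLAKE2b":  hashlib.blake2b(data, digest_size=32).hexdigest(),
    "SHAKE128": hashlib.shake_128(data).hexdigest(32),  # request 32 bytes
}

for name, hexdigest in sorted(digests.items()):
    print(f"{name:9} {hexdigest}")
    assert len(hexdigest) == 64  # 32-byte names, i.e. 64 hex characters
```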
* Re: Hash algorithm analysis 2018-06-11 21:19 ` Hash algorithm analysis Ævar Arnfjörð Bjarmason @ 2018-06-21 8:20 ` Johannes Schindelin 2018-06-21 22:39 ` brian m. carlson 1 sibling, 0 replies; 66+ messages in thread From: Johannes Schindelin @ 2018-06-21 8:20 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: brian m. carlson, git, Adam Langley, Jeff King, Mike Hommey, Brandon Williams, Linus Torvalds, Jonathan Nieder, Stefan Beller, Jonathan Tan, Junio Hamano [-- Attachment #1: Type: text/plain, Size: 10948 bytes --] Hi Ævar, On Mon, 11 Jun 2018, Ævar Arnfjörð Bjarmason wrote: > On Sat, Jun 09 2018, brian m. carlson wrote: > > [Expanding the CC list to what we had in the last "what hash" thread[1] > last year]. > > > == Discussion of Candidates > > > > I've implemented and tested the following algorithms, all of which are > > 256-bit (in alphabetical order): > > > > * BLAKE2b (libb2) > > * BLAKE2bp (libb2) > > * KangarooTwelve (imported from the Keccak Code Package) > > * SHA-256 (OpenSSL) > > * SHA-512/256 (OpenSSL) > > * SHA3-256 (OpenSSL) > > * SHAKE128 (OpenSSL) > > > > I also rejected some other candidates. I couldn't find any reference or > > implementation of SHA256×16, so I didn't implement it. I didn't > > consider SHAKE256 because it is nearly identical to SHA3-256 in almost > > all characteristics (including performance). > > > > I imported the optimized 64-bit implementation of KangarooTwelve. The > > AVX2 implementation was not considered for licensing reasons (it's > > partially generated from external code, which falls foul of the GPL's > > "preferred form for modifications" rule). > > > > === BLAKE2b and BLAKE2bp > > > > These are the non-parallelized and parallelized 64-bit variants of > > BLAKE2. > > > > Benefits: > > * Both algorithms provide 256-bit preimage resistance. 
> > > > Downsides: > > * Some people are uncomfortable that the security margin has been > > decreased from the original SHA-3 submission, although it is still > > considered secure. > > * BLAKE2bp, as implemented in libb2, uses OpenMP (and therefore > > multithreading) by default. It was no longer possible to run the > > testsuite with -j3 on my laptop in this configuration. > > > > === Keccak-based Algorithms > > > > SHA3-256 is the 256-bit Keccak algorithm with 24 rounds, processing 136 > > bytes at a time. SHAKE128 is an extendable output function with 24 > > rounds, processing 168 bytes at a time. KangarooTwelve is an extendable > > output function with 12 rounds, processing 136 bytes at a time. > > > > Benefits: > > * SHA3-256 provides 256-bit preimage resistance. > > * SHA3-256 has been heavily studied and is believed to have a large > > security margin. > > > > I noted the following downsides: > > * There's a lack of availability of KangarooTwelve in other > > implementations. It may be the least available option in terms of > > implementations. > > * Some people are uncomfortable that the security margin of > > KangarooTwelve has been decreased, although it is still considered > > secure. > > * SHAKE128 and KangarooTwelve provide only 128-bit preimage resistance. > > > > === SHA-256 and SHA-512/256 > > > > These are the 32-bit and 64-bit SHA-2 algorithms that are 256 bits in > > size. > > > > I noted the following benefits: > > * Both algorithms are well known and heavily analyzed. > > * Both algorithms provide 256-bit preimage resistance. 
> > > > == Implementation Support > > > > |=== > > | Implementation | OpenSSL | libb2 | NSS | ACC | gcrypt | Nettle| CL | > > | SHA-1 | 🗸 | | 🗸 | 🗸 | 🗸 | 🗸 | {1} | > > | BLAKE2b | f | 🗸 | | | 🗸 | | {2} | > > | BLAKE2bp | | 🗸 | | | | | | > > | KangarooTwelve | | | | | | | | > > | SHA-256 | 🗸 | | 🗸 | 🗸 | 🗸 | 🗸 | {1} | > > | SHA-512/256 | 🗸 | | | | | 🗸 | {3} | > > | SHA3-256 | 🗸 | | | | 🗸 | 🗸 | {4} | > > | SHAKE128 | 🗸 | | | | 🗸 | | {5} | > > |=== > > > > f: future version (expected 1.2.0) > > ACC: Apple Common Crypto > > CL: Command-line > > > > :1: OpenSSL, coreutils, Perl Digest::SHA. > > :2: OpenSSL, coreutils. > > :3: OpenSSL > > :4: OpenSSL, Perl Digest::SHA3. > > :5: Perl Digest::SHA3. > > > > === Performance Analysis > > > > The test system used below is my personal laptop, a 2016 Lenovo ThinkPad > > X1 Carbon with an Intel i7-6600U CPU (2.60 GHz) running Debian unstable. > > > > I implemented a test tool helper to compute speed much like OpenSSL > > does. Below is a comparison of speeds. The columns indicate the speed > > in KiB/s for chunks of the given size. The runs are representative of > > multiple similar runs. > > > > 256 and 1024 bytes were chosen to represent common tree and commit > > object sizes and the 8 KiB is an approximate average blob size. > > > > Algorithms are sorted by performance on the 1 KiB column. 
> > > > |=== > > | Implementation | 256 B | 1 KiB | 8 KiB | 16 KiB | > > | SHA-1 (OpenSSL) | 513963 | 685966 | 748993 | 754270 | > > | BLAKE2b (libb2) | 488123 | 552839 | 576246 | 579292 | > > | SHA-512/256 (OpenSSL) | 181177 | 349002 | 499113 | 495169 | > > | BLAKE2bp (libb2) | 139891 | 344786 | 488390 | 522575 | > > | SHA-256 (OpenSSL) | 264276 | 333560 | 357830 | 355761 | > > | KangarooTwelve | 239305 | 307300 | 355257 | 364261 | > > | SHAKE128 (OpenSSL) | 154775 | 253344 | 337811 | 346732 | > > | SHA3-256 (OpenSSL) | 128597 | 185381 | 198931 | 207365 | > > | BLAKE2bp (libb2; threaded) | 12223 | 49306 | 132833 | 179616 | > > |=== > > > > SUPERCOP (a crypto benchmarking tool; > > https://bench.cr.yp.to/results-hash.html) has also benchmarked these > > algorithms. Note that BLAKE2bp is not listed, KangarooTwelve is k12, > > SHA-512/256 is equivalent to sha512, SHA3-256 is keccakc512, and SHAKE128 is > > keccakc256. > > > > Information is for kizomba, a Kaby Lake system. Counts are in cycles > > per byte (smaller is better; sorted by 1536 B column): > > > > |=== > > | Algorithm | 576 B | 1536 B | 4096 B | long | > > | BLAKE2b | 3.51 | 3.10 | 3.08 | 3.07 | > > | SHA-1 | 4.36 | 3.81 | 3.59 | 3.49 | > > | KangarooTwelve | 4.99 | 4.57 | 4.13 | 3.86 | > > | SHA-512/256 | 6.39 | 5.76 | 5.31 | 5.05 | > > | SHAKE128 | 8.23 | 7.67 | 7.17 | 6.97 | > > | SHA-256 | 8.90 | 8.08 | 7.77 | 7.59 | > > | SHA3-256 | 10.26 | 9.15 | 8.84 | 8.57 | > > |=== > > > > Numbers for genji262, an AMD Ryzen System, which has SHA acceleration: > > > > |=== > > | Algorithm | 576 B | 1536 B | 4096 B | long | > > | SHA-1 | 1.87 | 1.69 | 1.60 | 1.54 | > > | SHA-256 | 1.95 | 1.72 | 1.68 | 1.64 | > > | BLAKE2b | 2.94 | 2.59 | 2.59 | 2.59 | > > | KangarooTwelve | 4.09 | 3.65 | 3.35 | 3.17 | > > | SHA-512/256 | 5.54 | 5.08 | 4.71 | 4.48 | > > | SHAKE128 | 6.95 | 6.23 | 5.71 | 5.49 | > > | SHA3-256 | 8.29 | 7.35 | 7.04 | 6.81 | > > |=== > > > > Note that no mid- to high-end Intel processors provide 
acceleration. > > AMD Ryzen and some ARM64 processors do. > > > > == Summary > > > > The algorithms with the greatest implementation availability are > > SHA-256, SHA3-256, BLAKE2b, and SHAKE128. > > > > In terms of command-line availability, BLAKE2b, SHA-256, SHA-512/256, > > and SHA3-256 should be available in the near future on a reasonably > > small Debian, Ubuntu, or Fedora install. > > > > As far as security, the most conservative choices appear to be SHA-256, > > SHA-512/256, and SHA3-256. > > > > The performance winners are BLAKE2b unaccelerated and SHA-256 accelerated. > > This is a great summary. Thanks. > > In case it's not apparent from what follows, I have a bias towards > SHA-256. Reasons for that, to summarize some of the discussion the last > time around[1], and to add more details: > > == Popularity > > Other things being equal we should be biased towards whatever's in the > widest use & recommended for new projects. > > I fear that if e.g. git had used whatever was, at the time, to SHA-1 as > BLAKE2b is to SHA-256 now, we might not even know that it's broken (or > had the sha1collisiondetection work to fall back on), since researchers > are less likely to look at algorithms that aren't in wide use. > > SHA-256 et al were published in 2001 and have ~20k results on Google > Scholar, compared to ~150 for BLAKE2b[4], published in 2008 (but ~1.2K > for "BLAKE2"). > > Between the websites of Intel, AMD & ARM there are thousands of results > for SHA-256 (and existing in-silicon acceleration). There's exactly one > result on all three for BLAKE2b (on amd.com, in the context of a laundry > list of hash algorithms in some presentation). > > Since BLAKE2b lost the SHA-3 competition to Keccak it seems impossible > that it'll ever get anywhere close to the same scrutiny or support > in silicon as one of the SHA families. > > Which brings me to the next section... 
> > == Hardware acceleration > > The only widely deployed HW acceleration is for the SHA-1 and SHA-256 > from the SHA-2 family[5], but notably nothing from the newer SHA-3 > family (released in 2015). > > It seems implausible that anything except SHA-3 will get future HW > acceleration given the narrow scope of current HW acceleration > vs. existing hash algorithms. > > As noted in the thread from last year[1] most git users won't even > notice if the hashing is faster, but it does matter for some big users > (big monorepos), so having the option of purchasing hardware to make things > faster today is great, and given how these new instruction set > extensions get rolled out it seems inevitable that this'll be available > in all consumer CPUs within 5-10 years. > > == Age > > Similar to "popularity" it seems better to bias things towards a hash > that's been out there for a while, i.e. it would be too early to pick > SHA-3. > > The hash transitioning plan, once implemented, also makes it easier to > switch to something else in the future, so we shouldn't be in a rush to > pick some newer hash because we'll need to keep it forever; we can > always do another transition in another 10-15 years. > > == Conclusion > > For all the above reasons I think we should pick SHA-256. > > 1. https://public-inbox.org/git/87y3ss8n4h.fsf@gmail.com/#t > 2. https://github.com/cr-marcstevens/sha1collisiondetection > 3. https://scholar.google.nl/scholar?hl=en&as_sdt=0%2C5&q=SHA-256&btnG= > 4. https://scholar.google.nl/scholar?hl=en&as_sdt=0%2C5&q=BLAKE2b&btnG= > 5. https://en.wikipedia.org/wiki/Intel_SHA_extensions I agree with that reasoning. More importantly, my cryptography researcher colleagues agree with this assessment, and I do trust them quite a bit (you know one of them very well already as we, ahem, *might* be using his code for SHA-1 collision detection all the time now *cough, cough*). Ciao, Dscho ^ permalink raw reply [flat|nested] 66+ messages in thread
* Re: Hash algorithm analysis 2018-06-11 21:19 ` Hash algorithm analysis Ævar Arnfjörð Bjarmason 2018-06-21 8:20 ` Johannes Schindelin @ 2018-06-21 22:39 ` brian m. carlson 1 sibling, 0 replies; 66+ messages in thread From: brian m. carlson @ 2018-06-21 22:39 UTC (permalink / raw) To: Ævar Arnfjörð Bjarmason Cc: git, Adam Langley, Jeff King, Mike Hommey, Brandon Williams, Linus Torvalds, Jonathan Nieder, Stefan Beller, Jonathan Tan, Junio Hamano, Johannes Schindelin [-- Attachment #1: Type: text/plain, Size: 2700 bytes --] On Mon, Jun 11, 2018 at 11:19:10PM +0200, Ævar Arnfjörð Bjarmason wrote: > This is a great summary. Thanks. > > In case it's not apparent from what follows, I have a bias towards > SHA-256. Reasons for that, to summarize some of the discussion the last > time around[1], and to add more details: To summarize my view, I think my ordered preference of hashes is BLAKE2b, SHA-256, and SHA3-256. I agree with AGL that all three of these options are secure and will be for some time. I believe there's sufficient literature on all three of them and there will continue to be for some time. I've seen and read papers from the IACR archives on all three of them, and because all three are widely used, they'll continue to be interesting to cryptologists for a long time to come. I'm personally partial to having full preimage resistance, which I think makes SHAKE128 less appealing. SHAKE128 also has fewer crypto library implementations than the others. My rationale for this ordering is essentially performance. BLAKE2b is quite fast on all known hardware, and it is almost as fast as an accelerated SHA-256. The entire rationale for BLAKE2b is to give people a secure algorithm that is faster than MD5 and SHA-1, so there's no reason to use an insecure algorithm. It also greatly outperforms the other two even in pure C, which matters for the fallback implementation we'll need to ship. I tend to think SHA3-256 is the most conservative of these choices as far as security. 
It has had an open development process and has a large security margin. It has gained a lot of cryptanalysis and come out quite well, and the versatility of the Keccak sponge construction means that it's going to get a lot more attention. Pretty much the only downside is its performance relative to the other two. I placed SHA-256 in the middle because of its potential for acceleration on Intel hardware. I know such changes are coming, but they won't likely be here for another two years. While hashing performance isn't a huge concern for Git now, I had planned to implement an rsync-based delta algorithm for large files that could make storing some large files in a Git repository viable (of course, there will still be many cases where Git LFS and git-annex are better). The algorithm is extremely sensitive to hash performance and would simply not be viable with an unaccelerated SHA-256 or SHA3-256, although it would perform reasonably well with BLAKE2b. Having said that, I'd be happy with any of the three, and would support a consensus around any of them as well. -- brian m. carlson: Houston, Texas, US OpenPGP: https://keybase.io/bk2204 [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 867 bytes --] ^ permalink raw reply [flat|nested] 66+ messages in thread
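For readers who want to reproduce rough throughput numbers like the tables quoted earlier, here is a minimal sketch of a chunked speed test in the same spirit. It is a stand-in using Python's hashlib, not the C helper described in the thread, so absolute numbers will differ by machine and by binding overhead, but the relative ordering at each chunk size should be broadly comparable:

```python
import hashlib
import time

def speed_kib_s(ctor, chunk_size, total=16 * 1024 * 1024):
    """Hash `total` bytes in `chunk_size` pieces and return KiB/s."""
    chunk = b"\0" * chunk_size
    rounds = total // chunk_size
    start = time.perf_counter()
    for _ in range(rounds):
        ctor(chunk).digest()  # one fresh hash per chunk, as for small objects
    elapsed = time.perf_counter() - start
    return (rounds * chunk_size / 1024) / elapsed

for name, ctor in [("SHA-1", hashlib.sha1),
                   ("SHA-256", hashlib.sha256),
                   ("SHA3-256", hashlib.sha3_256)]:
    for size in (256, 1024, 8192):  # tree-, commit-, and blob-sized chunks
        print(f"{name:8} {size:5} B chunks: {speed_kib_s(ctor, size):10.0f} KiB/s")
```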
* Re: State of NewHash work, future directions, and discussion 2018-06-09 20:56 State of NewHash work, future directions, and discussion brian m. carlson 2018-06-09 21:26 ` Ævar Arnfjörð Bjarmason 2018-06-09 22:49 ` Hash algorithm analysis brian m. carlson @ 2018-06-11 18:09 ` Duy Nguyen 2018-06-12 1:28 ` brian m. carlson 2018-06-11 19:01 ` Jonathan Nieder 3 siblings, 1 reply; 66+ messages in thread From: Duy Nguyen @ 2018-06-11 18:09 UTC (permalink / raw) To: brian m. carlson, Git Mailing List On Sat, Jun 9, 2018 at 10:57 PM brian m. carlson <sandals@crustytoothpaste.net> wrote: > > Since there's been a lot of questions recently about the state of the > NewHash work, I thought I'd send out a summary. > > == Status > > I have patches to make the entire codebase work, including passing all > tests, when Git is converted to use a 256-bit hash algorithm. > Obviously, such a Git is incompatible with the current version, but it > means that we've fixed essentially all of the hard-coded 20 and 40 > constants (and therefore Git doesn't segfault). This is so cool! > == Future Design > > The work I've done necessarily involves porting everything to use > the_hash_algo. Essentially, when the piece I'm currently working on is > complete, we'll have a transition stage 4 implementation (all NewHash). > Stage 2 and 3 will be implemented next. > > My vision of how data is stored is that the .git directory is, except > for pack indices and the loose object lookup table, entirely in one > format. It will be all SHA-1 or all NewHash. This algorithm will be > stored in the_hash_algo. > > I plan on introducing an array of hash algorithms into struct repository > (and wrapper macros) which stores, in order, the output hash, and if > used, the additional input hash. I'm actually thinking that putting the_hash_algo inside struct repository is a mistake. 
We have code that's supposed to work without a repo, and that shows it does not really make sense to force the use of a partially-valid repo. Keeping the_hash_algo a separate variable sounds more elegant. > If people are interested, I've done some analysis on availability of > implementations, performance, and other attributes described in the > transition plan and can send that to the list. I quickly skimmed through that document. I have two more concerns that are less about any specific hash algorithm: - how does a larger hash size affect git (I guess you covered the CPU aspect, but what about cache-friendliness, disk usage, memory consumption)? - how does all the function redirection (from abstracting away SHA-1) affect git performance? E.g. hashcmp could be optimized and inlined by the compiler. It can probably still optimize the memcmp(,,20), but we stack another indirect function call on top. I guess I might just be paranoid and this is not a big deal after all. -- Duy
* Re: State of NewHash work, future directions, and discussion 2018-06-11 18:09 ` State of NewHash work, future directions, and discussion Duy Nguyen @ 2018-06-12 1:28 ` brian m. carlson 0 siblings, 0 replies; 66+ messages in thread From: brian m. carlson @ 2018-06-12 1:28 UTC (permalink / raw) To: Duy Nguyen; +Cc: Git Mailing List On Mon, Jun 11, 2018 at 08:09:47PM +0200, Duy Nguyen wrote: > I'm actually thinking that putting the_hash_algo inside struct > repository is a mistake. We have code that's supposed to work without > a repo, and that shows it does not really make sense to force the use > of a partially-valid repo. Keeping the_hash_algo a separate variable > sounds more elegant. It can fairly easily be moved out if we want. > I quickly skimmed through that document. I have two more concerns that > are less about any specific hash algorithm: > > - how does a larger hash size affect git (I guess you covered the CPU > aspect, but what about cache-friendliness, disk usage, memory > consumption)? > > - how does all the function redirection (from abstracting away SHA-1) > affect git performance? E.g. hashcmp could be optimized and inlined > by the compiler. It can probably still optimize the memcmp(,,20), > but we stack another indirect function call on top. I guess I might > just be paranoid and this is not a big deal after all. I would have to run some numbers on this. I probably won't get around to doing that until Friday or Saturday. -- brian m. carlson: Houston, Texas, US OpenPGP: https://keybase.io/bk2204
* Re: State of NewHash work, future directions, and discussion 2018-06-09 20:56 State of NewHash work, future directions, and discussion brian m. carlson ` (2 preceding siblings ...) 2018-06-11 18:09 ` State of NewHash work, future directions, and discussion Duy Nguyen @ 2018-06-11 19:01 ` Jonathan Nieder 2018-06-12 2:28 ` brian m. carlson 3 siblings, 1 reply; 66+ messages in thread From: Jonathan Nieder @ 2018-06-11 19:01 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Duy Nguyen Hi, brian m. carlson wrote: > Since there's been a lot of questions recently about the state of the > NewHash work, I thought I'd send out a summary. Yay! [...] > I plan on introducing an array of hash algorithms into struct repository > (and wrapper macros) which stores, in order, the output hash, and if > used, the additional input hash. Interesting. In principle the four following are separate things: 1. Hash to be used for command output to the terminal 2. Hash used in pack files 3. Additional hashes (beyond (2)) that we can look up using the translation table 4. Additional hashes (beyond (1)) accepted in input from the command line and stdin In principle, (1) and (4) would be globals, and (2) and (3) would be tied to the repository. I think this is always what Duy was hinting at. All that said, as long as there is some notion of (1) and (4), I'm excited. :) Details of how they are laid out in memory are less important. [...] > The transition plan anticipates a stage 1 where accept only SHA-1 on > input and produce only SHA-1 on output, but store in NewHash. As I've > worked with our tests, I've realized such an implementation is not > entirely possible. We have various tools that expect to accept invalid > object IDs, and obviously there's no way to have those continue to work. Can you give an example? Do you mean commands like "git mktree"? [...] 
> If you're working on new features and you'd like to implement the best > possible compatibility with this work, here are some recommendations: This list is great. Thanks for it. [...] > == Discussion about an Actual NewHash > > Since I'll be writing new code, I'll be writing tests for this code. > However, writing tests for creating and initializing repositories > requires that I be able to test that objects are being serialized > correctly, and therefore requires that I actually know what the hash > algorithm is going to be. I also can't submit code for multi-hash packs > when we officially only support one hash algorithm. Thanks for restarting this discussion as well. You can always use something like e.g. "doubled SHA-1" as a proof of concept, but I agree that it's nice to be able to avoid some churn by using an actual hash function that we're likely to switch to. Sincerely, Jonathan
* Re: State of NewHash work, future directions, and discussion 2018-06-11 19:01 ` Jonathan Nieder @ 2018-06-12 2:28 ` brian m. carlson 2018-06-12 2:42 ` Jonathan Nieder 0 siblings, 1 reply; 66+ messages in thread From: brian m. carlson @ 2018-06-12 2:28 UTC (permalink / raw) To: Jonathan Nieder; +Cc: git, Duy Nguyen On Mon, Jun 11, 2018 at 12:01:03PM -0700, Jonathan Nieder wrote: > Hi, > > brian m. carlson wrote: > > I plan on introducing an array of hash algorithms into struct repository > > (and wrapper macros) which stores, in order, the output hash, and if > > used, the additional input hash. > > Interesting. In principle the four following are separate things: > > 1. Hash to be used for command output to the terminal > 2. Hash used in pack files > 3. Additional hashes (beyond (2)) that we can look up using the > translation table > 4. Additional hashes (beyond (1)) accepted in input from the command > line and stdin > > In principle, (1) and (4) would be globals, and (2) and (3) would be > tied to the repository. I think this is always what Duy was hinting > at. > > All that said, as long as there is some notion of (1) and (4), I'm > excited. :) Details of how they are laid out in memory are less > important. I'm happy to hear suggestions on how this should or shouldn't work. I'm seeing these things in my head, but it can be helpful to have feedback about what people expect out of the code before I spend a bunch of time writing it. > [...] > > The transition plan anticipates a stage 1 where accept only SHA-1 on > > input and produce only SHA-1 on output, but store in NewHash. As I've > > worked with our tests, I've realized such an implementation is not > > entirely possible. We have various tools that expect to accept invalid > > object IDs, and obviously there's no way to have those continue to work. > > Can you give an example? Do you mean commands like "git mktree"?
I mean situations like git update-index. We allow the user to insert any old invalid value (and in fact check that the user can do this). t0000 does this, for example. > You can always use something like e.g. "doubled SHA-1" as a proof of > concept, but I agree that it's nice to be able to avoid some churn by > using an actual hash function that we're likely to switch to. I have a hash that I've been using, but redoing the work would be less enjoyable. I'd rather write the tests only once if I can help it. -- brian m. carlson: Houston, Texas, US OpenPGP: https://keybase.io/bk2204
* Re: State of NewHash work, future directions, and discussion 2018-06-12 2:28 ` brian m. carlson @ 2018-06-12 2:42 ` Jonathan Nieder 0 siblings, 0 replies; 66+ messages in thread From: Jonathan Nieder @ 2018-06-12 2:42 UTC (permalink / raw) To: brian m. carlson; +Cc: git, Duy Nguyen brian m. carlson wrote: > On Mon, Jun 11, 2018 at 12:01:03PM -0700, Jonathan Nieder wrote: >> 1. Hash to be used for command output to the terminal >> 2. Hash used in pack files >> 3. Additional hashes (beyond (2)) that we can look up using the >> translation table >> 4. Additional hashes (beyond (1)) accepted in input from the command >> line and stdin >> >> In principle, (1) and (4) would be globals, and (2) and (3) would be >> tied to the repository. I think this is always what Duy was hinting Here, by 'always' I meant 'also'. Sorry for the confusion. >> at. >> >> All that said, as long as there is some notion of (1) and (4), I'm >> excited. :) Details of how they are laid out in memory are less >> important. > > I'm happy to hear suggestions on how this should or shouldn't work. I'm > seeing these things in my head, but it can be helpful to have feedback > about what people expect out of the code before I spend a bunch of time > writing it. So far you're doing pretty well. :) I just noticed that I have some copy-edits for the hash-function-transition doc from last year that I hadn't sent out yet (oops). I'll send them tonight or tomorrow morning. [...] >> brian m. carlson wrote: >>> The transition plan anticipates a stage 1 where accept only SHA-1 on >>> input and produce only SHA-1 on output, but store in NewHash. As I've >>> worked with our tests, I've realized such an implementation is not >>> entirely possible. We have various tools that expect to accept invalid >>> object IDs, and obviously there's no way to have those continue to work. >> >> Can you give an example? Do you mean commands like "git mktree"? > > I mean situations like git update-index. 
We allow the user to insert > any old invalid value (and in fact check that the user can do this). > t0000 does this, for example. I think we can forbid this in the new mode (using a test prereq to ensure the relevant tests don't get run). Likewise for the similar functionality in "git mktree" and "git hash-object -w". >> You can always use something like e.g. "doubled SHA-1" as a proof of >> concept, but I agree that it's nice to be able to avoid some churn by >> using an actual hash function that we're likely to switch to. > > I have a hash that I've been using, but redoing the work would be less > enjoyable. I'd rather write the tests only once if I can help it. Thanks for the test fixes so far that make most of the test suite hash-agnostic! For t0000, yeah, there's no way around having to hard-code the new hash there. Thanks, Jonathan
end of thread, other threads:[~2018-09-18 16:50 UTC | newest] Thread overview: 66+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2018-06-09 20:56 State of NewHash work, future directions, and discussion brian m. carlson 2018-06-09 21:26 ` Ævar Arnfjörð Bjarmason 2018-06-09 22:49 ` Hash algorithm analysis brian m. carlson 2018-06-11 19:29 ` Jonathan Nieder 2018-06-11 20:20 ` Linus Torvalds 2018-06-11 23:27 ` Ævar Arnfjörð Bjarmason 2018-06-12 0:11 ` David Lang 2018-06-12 0:45 ` Linus Torvalds 2018-06-11 22:35 ` brian m. carlson 2018-06-12 16:21 ` Gilles Van Assche 2018-06-13 23:58 ` brian m. carlson 2018-06-15 10:33 ` Gilles Van Assche 2018-07-20 21:52 ` brian m. carlson 2018-07-21 0:31 ` Jonathan Nieder 2018-07-21 19:52 ` Ævar Arnfjörð Bjarmason 2018-07-21 20:25 ` brian m. carlson 2018-07-21 22:38 ` Johannes Schindelin 2018-07-21 23:09 ` Linus Torvalds 2018-07-21 23:59 ` brian m. carlson 2018-07-22 9:34 ` Eric Deplagne 2018-07-22 14:21 ` brian m. carlson 2018-07-22 14:55 ` Eric Deplagne 2018-07-26 10:05 ` Johannes Schindelin 2018-07-22 15:23 ` Joan Daemen 2018-07-22 18:54 ` Adam Langley 2018-07-26 10:31 ` Johannes Schindelin 2018-07-23 12:40 ` demerphq 2018-07-23 12:48 ` Sitaram Chamarty 2018-07-23 12:55 ` demerphq 2018-07-23 18:23 ` Linus Torvalds 2018-07-23 17:57 ` Stefan Beller 2018-07-23 18:35 ` Jonathan Nieder 2018-07-24 19:01 ` Edward Thomson 2018-07-24 20:31 ` Linus Torvalds 2018-07-24 20:49 ` Jonathan Nieder 2018-07-24 21:13 ` Junio C Hamano 2018-07-24 22:10 ` brian m. 
carlson 2018-07-30 9:06 ` Johannes Schindelin 2018-07-30 20:01 ` Dan Shumow 2018-08-03 2:57 ` Jonathan Nieder 2018-09-18 15:18 ` Joan Daemen 2018-09-18 15:32 ` Jonathan Nieder 2018-09-18 16:50 ` Linus Torvalds 2018-07-25 8:30 ` [PATCH 0/2] document that NewHash is now SHA-256 Ævar Arnfjörð Bjarmason 2018-07-25 8:30 ` [PATCH 1/2] doc hash-function-transition: note the lack of a changelog Ævar Arnfjörð Bjarmason 2018-07-25 8:30 ` [PATCH 2/2] doc hash-function-transition: pick SHA-256 as NewHash Ævar Arnfjörð Bjarmason 2018-07-25 16:45 ` Junio C Hamano 2018-07-25 17:25 ` Jonathan Nieder 2018-07-25 21:32 ` Junio C Hamano 2018-07-26 13:41 ` [PATCH v2 " Ævar Arnfjörð Bjarmason 2018-08-03 7:20 ` Jonathan Nieder 2018-08-03 16:40 ` Junio C Hamano 2018-08-03 17:01 ` Linus Torvalds 2018-08-03 16:42 ` Linus Torvalds 2018-08-03 17:43 ` Ævar Arnfjörð Bjarmason 2018-08-04 8:52 ` Jonathan Nieder 2018-08-03 17:45 ` brian m. carlson 2018-07-25 22:56 ` [PATCH " brian m. carlson 2018-06-11 21:19 ` Hash algorithm analysis Ævar Arnfjörð Bjarmason 2018-06-21 8:20 ` Johannes Schindelin 2018-06-21 22:39 ` brian m. carlson 2018-06-11 18:09 ` State of NewHash work, future directions, and discussion Duy Nguyen 2018-06-12 1:28 ` brian m. carlson 2018-06-11 19:01 ` Jonathan Nieder 2018-06-12 2:28 ` brian m. carlson 2018-06-12 2:42 ` Jonathan Nieder
Code repositories for project(s) associated with this public inbox https://80x24.org/mirrors/git.git