From: Junio C Hamano <email@example.com> To: Jonathan Nieder <firstname.lastname@example.org> Cc: Shawn Pearce <email@example.com>, Linus Torvalds <firstname.lastname@example.org>, Git Mailing List <email@example.com>, Stefan Beller <firstname.lastname@example.org>, email@example.com, Jonathan Tan <firstname.lastname@example.org>, Jeff King <email@example.com>, David Lang <firstname.lastname@example.org>, "brian m. carlson" <email@example.com>, Masaya Suzuki <firstname.lastname@example.org>, email@example.com, The Keccak Team <firstname.lastname@example.org>, Johannes Schindelin <Johannes.Schindelin@gmx.de> Subject: Re: [PATCH v4] technical doc: add a design doc for hash function transition Date: Tue, 03 Oct 2017 14:40:26 +0900 Message-ID: <email@example.com> (raw) In-Reply-To: <20170928044320.GA84719@aiede.mtv.corp.google.com> Jonathan Nieder <firstname.lastname@example.org> writes: > +Signed Tags > +~~~~~~~~~~~ > +We add a new field "gpgsig-newhash" to the tag object format to allow > +signing tags without relying on SHA-1. Its signed payload is the > +newhash-content of the tag with its gpgsig-newhash field and "-----BEGIN PGP > +SIGNATURE-----" delimited in-body signature removed. > + > +This means tags can be signed > +1. using SHA-1 only, as in existing signed tag objects > +2. using both SHA-1 and NewHash, by using gpgsig-newhash and an in-body > + signature. > +3. using only NewHash, by only using the gpgsig-newhash field. I have the same issue with signed commit. The signed parts for SHA-1 contents exclude the in-body signature (obviously) and all the headers including gpgsig-newhash that is not known to our old clients are included. The signed parts for NewHash contents exclude the in-body signature and gpgsig-newhash header, but all other headers. I somehow feel that we should just reserve gpgsig-* to prepare for the day when we introduce newhash2 and later and exclude all of them from the computation. Treat the difference between how SHA-1 contents excludes _only_ it knows about and how NewHash contents excludes _all_ possible signatures, just like the differece between where SHA-1 and NewHash contents has the signature. That is, yes, we didn't know better when we designed SHA-1 contents, but now we know better and are correcting the mistakes by moving the signature from in-body tail to a header, and by excluding anything gpgsig-*, not just the known ones. > +Mergetag embedding > +~~~~~~~~~~~~~~~~~~ > +The mergetag field in the sha1-content of a commit contains the > +sha1-content of a tag that was merged by that commit. > + > +The mergetag field in the newhash-content of the same commit contains the > +newhash-content of the same tag. OK. We do not have a tool that extracts them and creates a tag object, but if such a tool is invented in the future, it would only have to worry about newhash content, as it would be a local operation. Makes sense. > +Submodules > +~~~~~~~~~~ > +To convert recorded submodule pointers, you need to have the converted > +submodule repository in place. The translation table of the submodule > +can be used to look up the new hash. OK, I earlier commented on a paragraph that I couldn't tell what it was talking about, but this is a lot more understandable. Perhaps the earlier one can be removed? We saw earlier what happens during "fetch". This seems to hint that we would need to do a "recursive" fetch in the bottom-up direction, but without fetching the superproject, you wouldn't know what submodules are needed and from where, so there is a bit of chicken-and-egg problem we need to address, as we further make the design more detailed. > +Loose objects and unreachable objects > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > ... > +"git gc --auto" currently waits for there to be 50 packs present > +before combining packfiles. Packing loose objects more aggressively > +may cause the number of pack files to grow too quickly. This can be > +mitigated by using a strategy similar to Martin Fick's exponential > +rolling garbage collection script: > +https://gerrit-review.googlesource.com/c/gerrit/+/35215 Yes, concatenating into the latest pack that still is small may be a reasonable way, as there won't be many good chances to create good deltas anyway until you have blobs and trees at sufficiently numbers of different versions, to do a "quick GC whose only purpose is to keep the number of loose object down". > +To avoid a proliferation of UNREACHABLE_GARBAGE packs, they can be > +combined under certain circumstances. If "gc.garbageTtl" is set to > +greater than one day, then packs created within a single calendar day, > +UTC, can be coalesced together. The resulting packfile would have an > +mtime before midnight on that day, so this makes the effective maximum > +ttl the garbageTtl + 1 day. If "gc.garbageTtl" is less than one day, > +then we divide the calendar day into intervals one-third of that ttl > +in duration. Packs created within the same interval can be coalesced > +together. The resulting packfile would have an mtime before the end of > +the interval, so this makes the effective maximum ttl equal to the > +garbageTtl * 4/3. OK. Is the use of mtime essential, or because packs are "write once and from there access read-only", would a timestamp written somewhere in the header or the trailer of the file, if existed, work equally well? Not a strong objection, but a mild suggestion that not relying on mtime may be a good idea (it will keep an accidental / unintended "touch" from keeping garbage alive longer than you want). > +The UNREACHABLE_GARBAGE setting goes in the PSRC field of the pack > +index. More generally, that field indicates where a pack came from: > + > + - 1 (PACK_SOURCE_RECEIVE) for a pack received over the network > + - 2 (PACK_SOURCE_AUTO) for a pack created by a lightweight > + "gc --auto" operation > + - 3 (PACK_SOURCE_GC) for a pack created by a full gc > + - 4 (PACK_SOURCE_UNREACHABLE_GARBAGE) for potential garbage > + discovered by gc > + - 5 (PACK_SOURCE_INSERT) for locally created objects that were > + written directly to a pack file, e.g. from "git add ." > + > +This information can be useful for debugging and for "gc --auto" to > +make appropriate choices about which packs to coalesce. Would this be the direction we want to take to reduce the number of auxiliary files like *.keep, *.promised, etc., or we do not envision these to be useful for anything other than "gc"? > +Caveats > +------- > +Invalid objects > +... > +More profoundly broken objects (e.g., a commit with a truncated "tree" > +header line) cannot be converted but were not usable by current Git > +anyway. Fair enough. > +Shallow clone and submodules > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > +Because it requires all referenced objects to be available in the > +locally generated translation table, this design does not support > +shallow clone or unfetched submodules. Protocol improvements might > +allow lifting this restriction. OK, I think it is sensible to leave them outside the scope at the moment. All we need is a reliable way to learn the NewHash name of the objects immediately beyond the cut-off points, but it will have to become a huge discussion how to ensure that reliability, without trusting the remote too much. > +Alternates > +~~~~~~~~~~ > +For the same reason, a newhash repository cannot borrow objects from a > +sha1 repository using objects/info/alternates or > +$GIT_ALTERNATE_OBJECT_REPOSITORIES. Correct. In addition, if the alternate has already fully migrated away from SHA-1 compatiblity, we can only use it for local operation. ... goes back and thinks No, we cannot use such an alternate even for local operation. So a newhash repository cannot borrow objects from a SHA-1 repository, and from a newhash repository that lost SHA-1 compatiblity if it itself wants to retain SHA-1 compatiblity. Which again is "fair enough", I'd say. > +git notes > +~~~~~~~~~ > +The "git notes" tool annotates objects using their sha1-name as key. > +This design does not describe a way to migrate notes trees to use > +newhash-names. That migration is expected to happen separately (for > +example using a file at the root of the notes tree to describe which > +hash it uses). To be consistent with the remainder of the design, I think they should also be translated to NewHash, but punting it is OK to limit the scope of the initial migration. > +Server-side cost > +~~~~~~~~~~~~~~~~ > +Until Git protocol gains NewHash support, using NewHash based storage > +on public-facing Git servers is strongly discouraged. Once Git > +protocol gains NewHash support, NewHash based servers are likely not > +to support SHA-1 compatibility, to avoid what may be a very expensive > +hash reencode during clone and to encourage peers to modernize. I doubt that the first sentence is needed. We as git-core community will not help people to run Git service backed by NewHash storage that talks SHA-1 over the wire, by limiting the scope to "NewHash Git fetching from SHA-1 Git" and "NewHash Git pushing to SHA-1 Git" and not including the other two combinations. That may be worth saying here. Masochist server operators are still welcome to build and operate such a service and we don't really care. It's not our business. > +The design described here allows fetches by SHA-1 clients of a > +personal NewHash repository because it's not much more difficult than > +allowing pushes from that repository. Does the design described here really allow that? I thought what I read was "everybody talks SHA-1 over the wire, and those who want to use NewHash converts". So a user may be able to push from a personal NewHash repository to a personal SHA-1 repository (to simulate a fetch going in the reverse direction). In any case, I do not think I saw conversion issues discussed for a fetch from NewHash repository earlier in the document, where conversion considerations for other two modes (fetch to NewHash, and push from NewHash) were reasonably well described. If we are to allow this third mode, we'd need to make sure "because it's not much more difficult" is true. > This support needs to be guarded > +by a configuration option --- servers like git.kernel.org that serve a > +large number of clients would not be expected to bear that cost. Yes, of course. And if these 6 lines are not unintended leftover from earlier round of the design that we wanted to remove but forget to do so, then the first paragraph I doubted its validity of starts to make sense. > +Meaning of signatures > +~~~~~~~~~~~~~~~~~~~~~ > +The signed payload for signed commits and tags does not explicitly > +name the hash used to identify objects. If some day Git adopts a new > +hash function with the same length as the current SHA-1 (40 > +hexadecimal digit) or NewHash (64 hexadecimal digit) objects then the > +intent behind the PGP signed payload in an object signature is > +unclear: > + > + object e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7 > + type commit > + tag v2.12.0 > + tagger Junio C Hamano <email@example.com> 1487962205 -0800 > + > + Git 2.12 > + > +Does this mean Git v2.12.0 is the commit with sha1-name > +e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7 or the commit with > +new-40-digit-hash-name e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7? > + > +Fortunately NewHash and SHA-1 have different lengths. If Git starts > +using another hash with the same length to name objects, then it will > +need to change the format of signed payloads using that hash to > +address this issue. This is not just signatures, is it? The reference to parent commits and its tree in a commit object would also have ambiguity between SHA-1 and new-40-digit-hash. And the "no mixed repository" rule resolved that for us---isn't that sufficient for the signed tag (or commit), too? If such a signed-tag appears in a SHA-1 content of a tag, then the "object" reference is made with SHA-1. If the tag is in NewHash40 content, "object" reference is made with NewHash40, no? > +Object names on the command line > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > +To support the transition (see Transition plan below), this design > +supports four different modes of operation: > + > + 1. ("dark launch") Treat object names input by the user as SHA-1 and > + convert any object names written to output to SHA-1, but store > + objects using NewHash. This allows users to test the code with no > + visible behavior change except for performance. This allows > + allows running even tests that assume the SHA-1 hash function, to > + sanity-check the behavior of the new mode. Oooooh. That's ambitious. > + 2. ("early transition") Allow both SHA-1 and NewHash object names in > + input. Any object names written to output use SHA-1. This allows > + users to continue to make use of SHA-1 to communicate with peers > + (e.g. by email) that have not migrated yet and prepares for mode 3. This and others also make sense. > +Transition plan > +--------------- > +Some initial steps can be implemented independently of one another: > +... > +- introducing index v3 Just making sure; this is pack .idx v3? > +The infrastructure supporting fetch also allows converting an existing > +repository. In converted repositories and new clones, end users can > +gain support for the new hash function without any visible change in > +behavior (see "dark launch" in the "Object names on the command line" > +section). In particular this allows users to verify NewHash signatures > +on objects in the repository, and it should ensure the transition code > +is stable in production in preparation for using it more widely. > + > +Over time projects would encourage their users to adopt the "early > +transition" and then "late transition" modes to take advantage of the > +new, more futureproof NewHash object names. > + > +When objectFormat and compatObjectFormat are both set, commands > +generating signatures would generate both SHA-1 and NewHash signatures > +by default to support both new and old users. > + > +In projects using NewHash heavily, users could be encouraged to adopt > +the "post-transition" mode to avoid accidentally making implicit use > +of SHA-1 object names. > + > +Once a critical mass of users have upgraded to a version of Git that > +can verify NewHash signatures and have converted their existing > +repositories to support verifying them, we can add support for a > +setting to generate only NewHash signatures. This is expected to be at > +least a year later. > + > +That is also a good moment to advertise the ability to convert > +repositories to use NewHash only, stripping out all SHA-1 related > +metadata. This improves performance by eliminating translation > +overhead and security by avoiding the possibility of accidentally > +relying on the safety of SHA-1. > + > +Updating Git's protocols to allow a server to specify which hash > +functions it supports is also an important part of this transition. It > +is not discussed in detail in this document but this transition plan > +assumes it happens. :) All of the above sounds sensible to me. > +Alternatives considered > +----------------------- This message stops here...
next prev parent reply index Thread overview: 113+ messages / expand[flat|nested] mbox.gz Atom feed top 2017-03-04 1:12 RFC: Another proposed hash function transition plan Jonathan Nieder 2017-03-05 2:35 ` Linus Torvalds 2017-03-06 0:26 ` brian m. carlson 2017-03-06 18:24 ` Brandon Williams 2017-06-15 10:30 ` Which hash function to use, was " Johannes Schindelin 2017-06-15 11:05 ` Mike Hommey 2017-06-15 13:01 ` Jeff King 2017-06-15 16:30 ` Ævar Arnfjörð Bjarmason 2017-06-15 19:34 ` Johannes Schindelin 2017-06-15 21:59 ` Adam Langley 2017-06-15 22:41 ` brian m. carlson 2017-06-15 23:36 ` Ævar Arnfjörð Bjarmason 2017-06-16 0:17 ` brian m. carlson 2017-06-16 6:25 ` Ævar Arnfjörð Bjarmason 2017-06-16 13:24 ` Johannes Schindelin 2017-06-16 17:38 ` Adam Langley 2017-06-16 20:52 ` Junio C Hamano 2017-06-16 21:12 ` Junio C Hamano 2017-06-16 21:24 ` Jonathan Nieder 2017-06-16 21:39 ` Ævar Arnfjörð Bjarmason 2017-06-16 20:42 ` Jeff King 2017-06-19 9:26 ` Johannes Schindelin 2017-06-15 21:10 ` Mike Hommey 2017-06-16 4:30 ` Jeff King 2017-06-15 17:36 ` Brandon Williams 2017-06-15 19:20 ` Junio C Hamano 2017-06-15 19:13 ` Jonathan Nieder 2017-03-07 0:17 ` RFC v3: " Jonathan Nieder 2017-03-09 19:14 ` Shawn Pearce 2017-03-09 20:24 ` Jonathan Nieder 2017-03-10 19:38 ` Jeff King 2017-03-10 19:55 ` Jonathan Nieder 2017-09-28 4:43 ` [PATCH v4] technical doc: add a design doc for hash function transition Jonathan Nieder 2017-09-29 6:06 ` Junio C Hamano 2017-09-29 8:09 ` Junio C Hamano 2017-09-29 17:34 ` Jonathan Nieder 2017-10-02 8:25 ` Junio C Hamano 2017-10-02 19:41 ` Jason Cooper 2017-10-02 9:02 ` Junio C Hamano 2017-10-02 19:23 ` Jason Cooper 2017-10-03 5:40 ` Junio C Hamano [this message] 2017-10-03 13:08 ` Jason Cooper 2017-10-04 1:44 ` Junio C Hamano 2017-09-06 6:28 ` RFC v3: Another proposed hash function transition plan Junio C Hamano 2017-09-08 2:40 ` Junio C Hamano 2017-09-08 3:34 ` Jeff King 2017-09-11 18:59 ` Brandon Williams 2017-09-13 12:05 ` Johannes Schindelin 2017-09-13 13:43 ` demerphq 2017-09-13 22:51 ` Jonathan Nieder 2017-09-14 18:26 ` Johannes Schindelin 2017-09-14 18:40 ` Jonathan Nieder 2017-09-14 22:09 ` Johannes Schindelin 2017-09-13 23:30 ` Linus Torvalds 2017-09-14 18:45 ` Johannes Schindelin 2017-09-18 12:17 ` Gilles Van Assche 2017-09-18 22:16 ` Johannes Schindelin 2017-09-19 16:45 ` Gilles Van Assche 2017-09-29 13:17 ` Johannes Schindelin 2017-09-29 14:54 ` Joan Daemen 2017-09-29 22:33 ` Johannes Schindelin 2017-09-30 22:02 ` Joan Daemen 2017-10-02 14:26 ` Johannes Schindelin 2017-09-18 22:25 ` Jonathan Nieder 2017-09-26 17:05 ` Jason Cooper 2017-09-26 22:11 ` Johannes Schindelin 2017-09-26 22:25 ` [PATCH] technical doc: add a design doc for hash function transition Stefan Beller 2017-09-26 23:38 ` Jonathan Nieder 2017-09-26 23:51 ` RFC v3: Another proposed hash function transition plan Jonathan Nieder 2017-10-02 14:54 ` Jason Cooper 2017-10-02 16:50 ` Brandon Williams 2017-10-02 14:00 ` Jason Cooper 2017-10-02 17:18 ` Linus Torvalds 2017-10-02 19:37 ` Jeff King 2017-09-13 16:30 ` Jonathan Nieder 2017-09-13 21:52 ` Junio C Hamano 2017-09-13 22:07 ` Stefan Beller 2017-09-13 22:18 ` Jonathan Nieder 2017-09-14 2:13 ` Junio C Hamano 2017-09-14 15:23 ` Johannes Schindelin 2017-09-14 15:45 ` demerphq 2017-09-14 22:06 ` Johannes Schindelin 2017-09-13 22:15 ` Junio C Hamano 2017-09-13 22:27 ` Jonathan Nieder 2017-09-14 2:10 ` Junio C Hamano 2017-09-14 12:39 ` Johannes Schindelin 2017-09-14 16:36 ` Brandon Williams 2017-09-14 18:49 ` Jonathan Nieder 2017-09-15 20:42 ` Philip Oakley 2017-03-05 11:02 ` RFC: " David Lang [not found] ` <CA+dhYEXHbQfJ6KUB1tWS9u1MLEOJL81fTYkbxu4XO-i+379LPw@mail.gmail.com> 2017-03-06 9:43 ` Jeff King 2017-03-06 23:40 ` Jonathan Nieder 2017-03-07 0:03 ` Mike Hommey 2017-03-06 8:43 ` Jeff King 2017-03-06 18:39 ` Jonathan Tan 2017-03-06 19:22 ` Linus Torvalds 2017-03-06 19:59 ` Brandon Williams 2017-03-06 21:53 ` Junio C Hamano 2017-03-07 8:59 ` Jeff King 2017-03-06 18:43 ` Junio C Hamano 2017-03-07 18:57 ` Ian Jackson 2017-03-07 19:15 ` Linus Torvalds 2017-03-08 11:20 ` Ian Jackson 2017-03-08 15:37 ` Johannes Schindelin 2017-03-08 15:40 ` Johannes Schindelin 2017-03-20 5:21 ` Use base32? Jason Hennessey 2017-03-20 5:58 ` Michael Steuer 2017-03-20 8:05 ` Jacob Keller 2017-03-21 3:07 ` Michael Steuer 2017-03-13 9:24 ` RFC: Another proposed hash function transition plan The Keccak Team 2017-03-13 17:48 ` Jonathan Nieder 2017-03-13 18:34 ` ankostis 2017-03-17 11:07 ` Johannes Schindelin
Reply instructions: You may reply publically to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style List information: http://vger.kernel.org/majordomo-info.html * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --firstname.lastname@example.org \ --email@example.com \ --cc=Johannes.Schindelin@gmx.de \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ --email@example.com \ --firstname.lastname@example.org \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
email@example.com mailing list mirror (one of many) Archives are clonable: git clone --mirror https://public-inbox.org/git git clone --mirror http://ou63pmih66umazou.onion/git git clone --mirror http://czquwvybam4bgbro.onion/git git clone --mirror http://hjrcffqmbrq6wope.onion/git Newsgroups are available over NNTP: nntp://news.public-inbox.org/inbox.comp.version-control.git nntp://ou63pmih66umazou.onion/inbox.comp.version-control.git nntp://czquwvybam4bgbro.onion/inbox.comp.version-control.git nntp://hjrcffqmbrq6wope.onion/inbox.comp.version-control.git nntp://news.gmane.org/gmane.comp.version-control.git note: .onion URLs require Tor: https://www.torproject.org/ or Tor2web: https://www.tor2web.org/ AGPL code for this site: git clone https://public-inbox.org/ public-inbox