* RFC: Separate commit identification from Merkle hashing
@ 2019-05-21  1:32 Eric S. Raymond
  2019-05-21  1:57 ` Jonathan Nieder
  2019-05-23 19:09 ` Jakub Narebski
  0 siblings, 2 replies; 13+ messages in thread
From: Eric S. Raymond @ 2019-05-21  1:32 UTC (permalink / raw)
  To: git

I have been thinking hard about the problems raised during my
request for unique timestamps. I think I've found a better way
to bust the box I was trying to break out of. I am therefore
withdrawing that proposal and replacing it with this one.

It's time to separate commit identification from Merkle hashing.

One reason I am sure of this is the SHA-1 to whatever transition.
We can't count on the successor hash to survive attack forever.
Accordingly, git's design needs to be stable against the possibility
of having to accommodate multiple future hash algorithms in the
future.

Here's how to do it:

1. Commit IDs and Merkle-tree hashes become separate commit
   properties in the git filesystem.

2. The data structure representing a Merkle-tree hash becomes
   a pair consisting of a value and a hash-algorithm tag. An
   empty tag is interpreted as SHA-1. I will call this entity the
   "verification hash" and avoid unqualified use of "hash" in the
   rest of this proposal.

3. The initial value of a commit's ID in a live repository is a copy
   of its verification hash, except in one important case.

4. When a repository is exported to a stream, the commit-id is dumped
   with other commit metadata. Thus, anything that can read a stream
   can resolve commit references in its change comments.

5. When a stream is imported, if a commit has a commit-id field it
   overrides the default assignment of the generated verification hash
   to that field.

6. Commit IDs are free-format and not interpreted by git except
   as lookup keys. When git changes verification-hash functions,
   commit IDs do not change.

Notice several important properties of this design.

A. Git becomes absolutely future-proofed against hash-algorithm
   changes.
   It can even support the use of multiple hash types over
   the lifetime of one repo.

B. All SHA-1 commit references will resolve forever even after git
   stops generating them. All future hash-based commit references will
   also be good forever.

C. The id/verification split will be invisible from clients at start,
   because initially they coincide and will continue to do so unless
   an explicit decision changes either the verification-hash algorithm
   or the way commit-IDs are initialized.

D. My wish for forward-portable unique commit IDs is granted.
   They're not by default eyeball-friendly, but I can live with that.
   Furthermore, because they're preserved in streams they can be
   eternally stable even as hash algorithms and preferred ID
   formats change.

E. There is now a unique total order on the repo, modulo highly
   unlikely (and in principle completely avoidable) commit-ID
   collisions. It's commit date tie-broken by commit-ID sort order.
   It too survives hash-function changes.

F. There's no need for timestamp uniqueness any more.

G. When a repository is imported from (say) Subversion, the Subversion
   IDs *don't have to break*! They can be used to initialize the
   commit-ID fields. Many users migrating from other VCSes will be
   deeply, deeply grateful for this feature.

I believe this solves every problem I walked in with except timestamp
truncation.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>

Probably fewer than 2% of handguns and well under 1% of all guns
will ever be involved in a violent crime. Thus, the problem of
criminal gun violence is concentrated within a very small subset of
gun owners, indicating that gun control aimed at the general
population faces a serious needle-in-the-haystack problem.
	-- Gary Kleck, "Point Blank: Handgun Violence In America"
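For readers who want something concrete, the data model in points 1-6
and the ordering in property E of the message above can be sketched in
a few lines of Python. This is an illustration only, not code from git
or from this thread: the names VerificationHash, Commit, declared_id
and canonical_order are invented here, and the sketch assumes commit
dates are integer seconds.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class VerificationHash:
    """A (value, hash-algorithm tag) pair; an empty tag means SHA-1."""
    value: str
    tag: str = ""

    @property
    def algorithm(self) -> str:
        # Point 2: an empty tag is interpreted as SHA-1.
        return self.tag or "sha1"


@dataclass(frozen=True)
class Commit:
    verification_hash: VerificationHash
    commit_date: int                    # assumed: seconds since the epoch
    declared_id: Optional[str] = None   # hypothetical: ID carried in a stream

    @property
    def commit_id(self) -> str:
        # Points 3 and 5: the ID defaults to the verification hash value
        # unless an import stream supplied an explicit commit-id.
        if self.declared_id is not None:
            return self.declared_id
        return self.verification_hash.value


def canonical_order(commits: Iterable[Commit]) -> List[Commit]:
    """Property E: commit date, tie-broken by commit-ID sort order."""
    return sorted(commits, key=lambda c: (c.commit_date, c.commit_id))


a = Commit(VerificationHash("bbbb"), 100)
b = Commit(VerificationHash("aaaa"), 100)
c = Commit(VerificationHash("cccc", "sha256"), 50, declared_id="r1234")
ordered_ids = [x.commit_id for x in canonical_order([a, b, c])]
print(ordered_ids)
```

Because the sort key never looks at the algorithm tag, the order is
unaffected by a change of verification-hash function, which is the
point of property E. The sketch deliberately ignores commit-ID
collisions.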
* Re: RFC: Separate commit identification from Merkle hashing
From: Jonathan Nieder @ 2019-05-21  1:57 UTC
  To: Eric S. Raymond; +Cc: git

Hi!

Eric S. Raymond wrote:

> One reason I am sure of this is the SHA-1 to whatever transition.
> We can't count on the successor hash to survive attack forever.
> Accordingly, git's design needs to be stable against the possibility
> of having to accommodate multiple future hash algorithms in the
> future.

Have you read through Documentation/technical/hash-function-transition? It
takes the case where the new hash function is found to be weak into account.

Hope that helps,
Jonathan
* Re: RFC: Separate commit identification from Merkle hashing
From: Eric S. Raymond @ 2019-05-21  2:38 UTC
  To: Jonathan Nieder; +Cc: git

Jonathan Nieder <jrnieder@gmail.com>:
> Hi!
>
> Eric S. Raymond wrote:
>
> > One reason I am sure of this is the SHA-1 to whatever transition.
> > We can't count on the successor hash to survive attack forever.
> > Accordingly, git's design needs to be stable against the possibility
> > of having to accommodate multiple future hash algorithms in the
> > future.
>
> Have you read through Documentation/technical/hash-function-transition? It
> takes the case where the new hash function is found to be weak into account.
>
> Hope that helps,
> Jonathan

Reading now...

At first sight I think it looks pretty compatible with what I am
proposing. The goals, anyway; some of the implementation tactics
would change a bit.

I think it's a weakness, though, that most of it is written as though it
assumes only one hash transition will be necessary. (This is me thinking
on long timescales again.)

Instead of having a gpgsig-sha256 field, I would change the code so all
hash cookies have an optional delimited prefix giving the hash-algorithm
type, with an absent prefix interpreted as SHA-1.

I think the idea of mapping future hashes to SHA-1s, which are then
used as fs lookup keys, is sound. The same technique (probably the
same code!) could be used to map the otherwise uninterpreted
commit-IDs I'm proposing to lookup keys.

I should have said in my previous mail that I'm prepared to put my
coding fingers into making all this happen. I am pretty sure my
gramty manager will approve.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>
* Re: RFC: Separate commit identification from Merkle hashing
From: Jonathan Nieder @ 2019-05-21  2:58 UTC
  To: Eric S. Raymond; +Cc: git

Hi,

Eric S. Raymond wrote:
> Jonathan Nieder <jrnieder@gmail.com>:
>> Eric S. Raymond wrote:

>>> One reason I am sure of this is the SHA-1 to whatever transition.
>>> We can't count on the successor hash to survive attack forever.
[...]
>> Have you read through Documentation/technical/hash-function-transition? It
>> takes the case where the new hash function is found to be weak into account.
>>
>> Hope that helps,
>> Jonathan
>
> Reading now...

Take your time. :)

[...]
> I think it's a weakness, though, that most of it is written as though it
> assumes only one hash transition will be necessary. (This is me thinking
> on long timescales again.)

Hm, can you point to what part of the doc suggested that? Best to make
the text clearer, to avoid confusing the next person.

On the contrary, the design is very careful to be able to support the
next transition.

[...]
> The same technique (probably the
> same code!) could be used to map the otherwise uninterpreted
> commit-IDs I'm proposing to lookup keys.

No, since Git relies on commit IDs for integrity checking. The hash
function transition described in that document relies on
round-tripping ability for the duration of the transition.

Jonathan
* Re: RFC: Separate commit identification from Merkle hashing
From: Eric S. Raymond @ 2019-05-21  3:31 UTC
  To: Jonathan Nieder; +Cc: git

Jonathan Nieder <jrnieder@gmail.com>:
> > I think it's a weakness, though, that most of it is written as though it
> > assumes only one hash transition will be necessary. (This is me thinking
> > on long timescales again.)
>
> Hm, can you point to what part of the doc suggested that? Best to make
> the text clearer, to avoid confusing the next person.

I will reread it with an editorial eye and try to come up with
concrete suggestions, perhaps a patch. My relative ignorance should
actually be helpful here.

> > The same technique (probably the
> > same code!) could be used to map the otherwise uninterpreted
> > commit-IDs I'm proposing to lookup keys.
>
> No, since Git relies on commit IDs for integrity checking. The hash
> function transition described in that document relies on
> round-tripping ability for the duration of the transition.

I do not quite understand this comment yet. But I don't think it
matters that I don't, and I will by the time I write any code. I
expect the worst case is that the separated IDs require a different
lookup table from the hashes, but will resolve at the same speed.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>
* Re: RFC: Separate commit identification from Merkle hashing
From: Jakub Narebski @ 2019-05-23 19:09 UTC
  To: Eric S. Raymond; +Cc: git

esr@thyrsus.com (Eric S. Raymond) writes:

> I have been thinking hard about the problems raised during my
> request for unique timestamps. I think I've found a better way
> to bust the box I was trying to break out of. I am therefore
> withdrawing that proposal and replacing it with this one.
>
> It's time to separate commit identification from Merkle hashing.

Documentation/technical/hash-function-transition.txt identifies a
similar problem, namely that existing signatures in signed tags,
signed commits and merges of signed tags are signatures of their
SHA-1 form. We want to be able to verify those signatures, even if
this verification may be considered less secure now.

You want both more (stable IDs for all commits, not only those signed)
and less (you don't need verification down the tree using IDs used for
commit ID).

> One reason I am sure of this is the SHA-1 to whatever transition.
> We can't count on the successor hash to survive attack forever.
> Accordingly, git's design needs to be stable against the possibility
> of having to accommodate multiple future hash algorithms in the
> future.
>
> Here's how to do it:
>
> 1. Commit IDs and Merkle-tree hashes become separate commit
>    properties in the git filesystem.

The issue you need to consider is that for signatures to be secure
they must be over the verification-hash Merkle tree.

It is not only commits that are identified by hashes, but also trees,
blobs and tags.
Commits reference other commits ("parent" lines) and a tree ("tree");
trees reference other trees, blobs and possibly commits (if submodules
are used). Tags can reference any object, but most commonly reference
commits. Blobs, i.e. file contents, do not reference any other
objects.

For security, all those references should use the strongest hash
function. Changing the referencing hash (e.g. "parent" uses SHA-256
instead of SHA-1) means that the contents of the object change, and
thus its hash. Documentation/technical/hash-function-transition.txt
therefore talks about SHA-256 and SHA-1 forms and SHA-256 and SHA-1
object names.

  "The sha1-name of an object is the SHA-1 of the concatenation of its
  type, length, a nul byte, and the object's sha1-content. This is the
  traditional <sha1> used in Git to name objects.

  The sha256-name of an object is the SHA-256 of the concatenation of
  its type, length, a nul byte, and the object's sha256-content."

> 2. The data structure representing a Merkle-tree hash becomes
>    a pair consisting of a value and a hash-algorithm tag. An
>    empty tag is interpreted as SHA-1. I will call this entity the
>    "verification hash" and avoid unqualified use of "hash" in the
>    rest of this proposal.

Currently Git makes use of the fact that SHA-1 and SHA-256 identifiers
are of different lengths to distinguish them (see section "Meaning of
signatures") in Documentation/technical/hash-function-transition.txt

There might be, I think, the problem for "tree" objects. As opposed to
all other places, "tree" objects use binary representation of hash, and
not hexadecimal textual representation (some consider that a design
mistake).

>
> 3. The initial value of a commit's ID in a live repository is a copy
>    of its verification hash, except in one important case.
>
> 4. When a repository is exported to a stream, the commit-id is dumped
>    with other commit metadata. Thus, anything that can read a stream
>    can resolve commit references in its change comments.
>
> 5. When a stream is imported, if a commit has a commit-id field it
>    overrides the default assignment of the generated verification hash
>    to that field.

I think Documentation/technical/hash-function-transition.txt misses
considerations for fast-import format (it talks about problem with
submodules, shallow clones, and currently not solved problem of
translating notes; it does not talk about git-replace, either).

>
> 6. Commit IDs are free-format and not interpreted by git except
>    as lookup keys. When git changes verification-hash functions,
>    commit IDs do not change.

All right. Looks sensible on first glance.

For security, all references in Merkle-tree of hashes must use strong
verification hash. This means that you need to be able to refer to any
object, including commit, by its verification hash name of its
verification hash form (where all references inside object, like
"parent" and "tree" headers in commit objects, use verification hashes).

You need to store this commit ID somewhere. Current proposal for
transitional period in Documentation/technical/hash-function-transition.txt
talks about loose object index ($GIT_OBJECT_DIR/loose-object-idx) with
the following format:

  # loose-object-idx
  (sha256-name SP sha1-name LF)*

The packfile index contains separate SHA-1 indices and SHA-256 indices
into packfile, providing fast mapping from SHA-1 name or SHA-256 name to
position (index) of object in the packfile.

Something similar might have been needed for commit IDs mapping.

One problem is that neither loose object index, nor the packfile index
are transported alongside with the objects. So we may need to put
commit ID elsewhere...

Note that we cannot put X-hash identifier into X-hash object form, that
is you cannot add "id" header to object (though you might add "other-id"
header, assuming that if ID is hash based it is on the other-id form
without other-id header).

  id <sha-1 identifier of this object>
  tree 0fa044a4d161254a3eae0bd06c0452d79e489593
  parent 6505413ad94ddfc01f9e2f5c1b79ea6b8ffbabbb
  author A U Thor <author@example.com> 1558619302 +0200
  committer C O Mitter <committer@example.com> 1558628753 -0500

  fixes

> Notice several important properties of this design.
>
> A. Git becomes absolutely future-proofed against hash-algorithm
>    changes. It can even support the use of multiple hash types over
>    the lifetime of one repo.
>
> B. All SHA-1 commit references will resolve forever even after git
>    stops generating them. All future hash-based commit references will
>    also be good forever.

We might need to be able to distinguish commit IDs from hash-based
object identifier of commit on command line, perhaps with something like

  <commit-id>^{id}

This is similar to proposed

  git --output-format=sha1 log abac87a^{sha1}..f787cac^{sha256}

> C. The id/verification split will be invisible from clients at start,
>    because initially they coincide and will continue to do so unless
>    an explicit decision changes either the verification-hash algorithm
>    or the way commit-IDs are initialized.

The problem may be with reusing command output for input (to refer to
objects and commits).

>
> D. My wish for forward-portable unique commit IDs is granted.
>    They're not by default eyeball-friendly, but I can live with that.
>    Furthermore, because they're preserved in streams they can be
>    eternally stable even as hash algorithms and preferred ID
>    formats change.

Good.

>
> E. There is now a unique total order on the repo, modulo highly
>    unlikely (and in principle completely avoidable) commit-ID
>    collisions. It's commit date tie-broken by commit-ID sort order.
>    It too survives hash-function changes.

Nice.

>
> F. There's no need for timestamp uniqueness any more.
>
> G. When a repository is imported from (say) Subversion, the Subversion
>    IDs *don't have to break*! They can be used to initialize the
>    commit-ID fields. Many users migrating from other VCSes will be
>    deeply, deeply grateful for this feature.

There would also need to be some support to retrieve commits using their
"commit ID" stable identifiers. It may not need to be very fast.

>
> I believe this solves every problem I walked in with except timestamp
> truncation.

Best,
-- 
Jakub Narębski
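The object-name definition quoted from hash-function-transition.txt in
the message above can be tried out directly. The sketch below is an
illustration, not git's implementation; object_name is a name made up
here. It relies on the on-disk header being the type, a space, the
decimal length and a nul byte, which the quoted sentence abbreviates as
"type, length, a nul byte". For a blob the sha1-content and
sha256-content are the same bytes; for commits, trees and tags they
differ, because the embedded references are rewritten, and the sketch
does not model that.

```python
import hashlib


def object_name(obj_type: str, content: bytes, algorithm: str = "sha1") -> str:
    """Hex digest of "<type> <length>" + NUL + content."""
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\0"
    return hashlib.new(algorithm, header + content).hexdigest()


blob = b"hello\n"
sha1_name = object_name("blob", blob)
sha256_name = object_name("blob", blob, "sha256")
empty_sha1_name = object_name("blob", b"")

print(sha1_name)
print(len(sha1_name), len(sha256_name))  # 40 and 64 hex digits
```

The two names differ in length, which is the property the transition
document currently uses to tell SHA-1 and SHA-256 identifiers apart.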
* Re: RFC: Separate commit identification from Merkle hashing
From: Jonathan Nieder @ 2019-05-23 20:09 UTC
  To: Jakub Narebski; +Cc: Eric S. Raymond, git

Hi,

Jakub Narebski wrote:

> I think Documentation/technical/hash-function-transition.txt misses
> considerations for fast-import format (it talks about problem with
> submodules, shallow clones, and currently not solved problem of
> translating notes; it does not talk about git-replace, either).

Hm, can you say more? I think fast-import is not significantly
different from other tools that want to pick an appropriate object
format for input and an appropriate object format for output.

Do you mean that the fast-import file should have a field for
explicitly specifying the input object format, and that that doc ought
to call it out?

[...]
> For security, all references in Merkle-tree of hashes must use strong
> verification hash. This means that you need to be able to refer to any
> object, including commit, by its verification hash name of its
> verification hash form (where all references inside object, like
> "parent" and "tree" headers in commit objects, use verification hashes).

This kind of crypto agility weakens any guarantees that rely on
strength of a hash function. The security level would be that of the
weakest of the supported hash functions.

In other words, usually the benefit of supporting multiple hash
functions as a reader is that you want the strength of the strongest
of those hash functions and you need a migration path to get there.
If you don't have a way to eventually drop support for the weaker
hashes, then what benefit do you get from supporting multiple hash
functions?

Jonathan
* Re: RFC: Separate commit identification from Merkle hashing
From: Eric S. Raymond @ 2019-05-23 20:53 UTC
  To: Jonathan Nieder; +Cc: Jakub Narebski, git

Jonathan Nieder <jrnieder@gmail.com>:
> In other words, usually the benefit of supporting multiple hash
> functions as a reader is that you want the strength of the strongest
> of those hash functions and you need a migration path to get there.
> If you don't have a way to eventually drop support for the weaker
> hashes, then what benefit do you get from supporting multiple hash
> functions?

Not losing the capability to verify old parts of histories up to the
strength of the old hash algorithm. Not perfect, but better than
nothing.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>
* Re: RFC: Separate commit identification from Merkle hashing
From: Eric S. Raymond @ 2019-05-23 20:50 UTC
  To: Jakub Narebski; +Cc: git

Jakub Narebski <jnareb@gmail.com>:
> You want both more (stable IDs for all commits, not only those signed)
> and less (you don't need verification down the tree using IDs used for
> commit ID).

That's right. My assumption is that future VCSes will do their own
hash chaining in ways we don't really want to try to anticipate or
constrain.

> Currently Git makes use of the fact that SHA-1 and SHA-256 identifiers
> are of different lengths to distinguish them (see section "Meaning of
> signatures") in Documentation/technical/hash-function-transition.txt

That's the obvious hack. As a future-proofing issue, though, I think
it would be unwise to count on all future hashes being of
distinguishable lengths. Explicit algorithm tagging is better, at
least internally.

> There might be, I think, the problem for "tree" objects. As opposed to
> all other places, "tree" objects use binary representation of hash, and
> not hexadecimal textual representation (some consider that a design
> mistake).

I'm inclined to agree that it was a mistake. But whether it gets
replaced by a binary struct holding an {algorithm-tag,value} pair or
a textual representation of same is not something I care about a lot.

> I think Documentation/technical/hash-function-transition.txt misses
> considerations for fast-import format

You can count on me to stay on top of that; fast-import format is
utterly critical to how reposurgeon works, so I have a strong
incentive to make sure it stays healthy.

(Some of you may not know - reposurgeon solves the thorny problems of
editing repositories by sidestepping to the textual serialized
representation of them.
It's basically a structure editor for fast-import streams that fools
the outside world into thinking it edits live repositories by having
importers and exporters at either end of its data flow.)

> All right. Looks sensible on first glance.

I am very relieved to hear that. My view of git is outside-in; I was
quite worried I might have missed some crucial issue.

> For security, all references in Merkle-tree of hashes must use strong
> verification hash. This means that you need to be able to refer to any
> object, including commit, by its verification hash name of its
> verification hash form (where all references inside object, like
> "parent" and "tree" headers in commit objects, use verification hashes).

Fair enough.

One minor way in which my thinking has evolved since I wrote the RFC
is that I now think it might be fruitful not to throw away the idea of
the verification hash as naming a commit, but rather to think of the
separated commit-ID as an alias for the verification hash.

This reframing won't make any difference to the code, but it clarifies
what to do if, for example, an import stream declares the same commit
ID for multiple commits, or fails to declare a commit ID at all. In
both cases the commit is still uniquely named by its verification
hash. Commit-ID namespace-management failures become annoying but not
critical.

> You need to store this commit ID somewhere. Current proposal for
> transitional period in Documentation/technical/hash-function-transition.txt
> talks about loose object index ($GIT_OBJECT_DIR/loose-object-idx) with
> the following format:
>
>   # loose-object-idx
>   (sha256-name SP sha1-name LF)*
>
> The packfile index contains separate SHA-1 indices and SHA-256 indices
> into packfile, providing fast mapping from SHA-1 name or SHA-256 name to
> position (index) of object in the packfile.

I would generalize this to something like

   (hash-algorithm-tag:value SP sha1-name LF)

> Something similar might have been needed for commit IDs mapping.

I think so, yes.

> One problem is that neither loose object index, nor the packfile index
> are transported alongside with the objects. So we may need to put
> commit ID elsewhere...
>
> Note that we cannot put X-hash identifier into X-hash object form, that
> is you cannot add "id" header to object (though you might add "other-id"
> header, assuming that if ID is hash based it is on the other-id form
> without other-id header).
>
>   id <sha-1 identifier of this object>
>   tree 0fa044a4d161254a3eae0bd06c0452d79e489593
>   parent 6505413ad94ddfc01f9e2f5c1b79ea6b8ffbabbb
>   author A U Thor <author@example.com> 1558619302 +0200
>   committer C O Mitter <committer@example.com> 1558628753 -0500
>
>   fixes

Implementation details. Let's get the design right and properly
specified before worrying too hard about this level of the problem.

I may do another RFC about how to avoid having this problem ever
again. In truth, I think git objects should have open property lists,
like bzr, with a property namespace reserved for system expansion.
That way, when you need objects to have new semantics, you can do it
without having an object-format flag day.

> > Notice several important properties of this design.
> >
> > A. Git becomes absolutely future-proofed against hash-algorithm
> >    changes. It can even support the use of multiple hash types over
> >    the lifetime of one repo.
> >
> > B. All SHA-1 commit references will resolve forever even after git
> >    stops generating them. All future hash-based commit references will
> >    also be good forever.
>
> We might need to be able to distinguish commit IDs from hash-based
> object identifier of commit on command line, perhaps with something like
>
>   <commit-id>^{id}
>
> This is similar to proposed
>
>   git --output-format=sha1 log abac87a^{sha1}..f787cac^{sha256}

Reasonable.

> > C. The id/verification split will be invisible from clients at start,
> >    because initially they coincide and will continue to do so unless
> >    an explicit decision changes either the verification-hash algorithm
> >    or the way commit-IDs are initialized.
>
> The problem may be with reusing command output for input (to refer to
> objects and commits).

Solvable, I think.

> > D. My wish for forward-portable unique commit IDs is granted.
> >    They're not by default eyeball-friendly, but I can live with that.
> >    Furthermore, because they're preserved in streams they can be
> >    eternally stable even as hash algorithms and preferred ID
> >    formats change.
>
> Good.

Oh, man, you have no idea how good yet. You won't until you've done a
few repo conversions yourself.

/me needs a cross-eyed emoji here

> > E. There is now a unique total order on the repo, modulo highly
> >    unlikely (and in principle completely avoidable) commit-ID
> >    collisions. It's commit date tie-broken by commit-ID sort order.
> >    It too survives hash-function changes.
>
> Nice.

One thing I will commit to do if we get this far is write the
fast-export code that does canonical order. I need this badly for
reposurgeon tests.

> > F. There's no need for timestamp uniqueness any more.
> >
> > G. When a repository is imported from (say) Subversion, the Subversion
> >    IDs *don't have to break*! They can be used to initialize the
> >    commit-ID fields. Many users migrating from other VCSes will be
> >    deeply, deeply grateful for this feature.
>
> There would also need to be some support to retrieve commits using their
> "commit ID" stable identifiers. It may not need to be very fast.

Agreed.

OK, what do we do next? Who needs to sign off on this? Should I
prepare an edit for the hash-function-transition.txt describing the
splitting off of commit IDs?
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>
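The generalized index line suggested in the message above,
"(hash-algorithm-tag:value SP sha1-name LF)", is easy to prototype. The
following is a rough sketch of one way to read such lines, not anything
git implements: parse_index_line and build_lookup are names invented
here, the hex values are made up, and treating an entry with no tag
prefix as SHA-1 follows the "empty tag means SHA-1" rule from the RFC.

```python
from typing import Dict, Iterable, Tuple

Key = Tuple[str, str]  # (algorithm, value)


def parse_index_line(line: str) -> Tuple[Key, str]:
    """Parse "tag:value SP sha1-name LF"; a missing tag means SHA-1."""
    tagged, sha1_name = line.rstrip("\n").split(" ")
    tag, sep, value = tagged.partition(":")
    if not sep:
        # No "tag:" prefix at all, so the whole field is the value.
        tag, value = "", tagged
    return (tag or "sha1", value), sha1_name


def build_lookup(lines: Iterable[str]) -> Dict[Key, str]:
    """Map each (algorithm, value) key to the sha1-name used for lookup."""
    return dict(parse_index_line(line) for line in lines)


# Made-up values, only to show the shape of the table.
lines = [
    "sha256:" + "ab" * 32 + " " + "cd" * 20 + "\n",
    "ef" * 20 + " " + "ef" * 20 + "\n",
]
table = build_lookup(lines)
print(table[("sha256", "ab" * 32)])
```

A table for free-format commit IDs would need more care than this:
an ID containing a space or a colon (a plausible outcome for IDs
imported from another VCS) would have to be escaped somehow, and the
thread does not settle how.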
* Re: RFC: Separate commit identification from Merkle hashing
From: Jonathan Nieder @ 2019-05-23 20:54 UTC
  To: Eric S. Raymond; +Cc: Jakub Narebski, git

Eric S. Raymond wrote:
> Jakub Narebski <jnareb@gmail.com>:

>> Currently Git makes use of the fact that SHA-1 and SHA-256 identifiers
>> are of different lengths to distinguish them (see section "Meaning of
>> signatures") in Documentation/technical/hash-function-transition.txt
>
> That's the obvious hack. As a future-proofing issue, though, I think
> it would be unwise to count on all future hashes being of distinguishable
> lengths.

We're not counting on that. As discussed in that section, future
hashes can change the format.

[...]
>> All right. Looks sensible on first glance.
>
> I am very relieved to hear that. My view of git is outside-in; I was quite
> worried I might have missed some crucial issue.

Honestly, I do think you have missed some fundamental issues.
https://public-inbox.org/git/ab3222ab-9121-9534-1472-fac790bf08a4@gmail.com/
discusses this further.

Regards,
Jonathan
* Re: RFC: Separate commit identification from Merkle hashing
From: Eric S. Raymond @ 2019-05-23 21:19 UTC
  To: Jonathan Nieder; +Cc: Jakub Narebski, git

Jonathan Nieder <jrnieder@gmail.com>:
> Honestly, I do think you have missed some fundamental issues.
> https://public-inbox.org/git/ab3222ab-9121-9534-1472-fac790bf08a4@gmail.com/
> discusses this further.

Have re-read. That was a different pair of proposals.

I have abandoned the idea of forcing timestamp uniqueness entirely -
that was a hack to define a canonical commit order, and my new RFC
describes a better way to get this.

I still think finer-grained timestamps would be a good idea, but that
is much less important than the different set of properties we can
guarantee via the new RFC.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>
* RE: RFC: Separate commit identification from Merkle hashing
From: Randall S. Becker @ 2019-05-23 21:39 UTC
  To: esr, 'Jonathan Nieder'; +Cc: 'Jakub Narebski', git

On May 23, 2019 17:19, Eric S. Raymond wrote:
> Jonathan Nieder <jrnieder@gmail.com>:
> > Honestly, I do think you have missed some fundamental issues.
> > https://public-inbox.org/git/ab3222ab-9121-9534-1472-fac790bf08a4@gmail.com/
> > discusses this further.
>
> Have re-read. That was a different pair of proposals.
>
> I have abandoned the idea of forcing timestamp uniqueness entirely - that
> was a hack to define a canonical commit order, and my new RFC describes a
> better way to get this.
>
> I still think finer-grained timestamps would be a good idea, but that is much
> less important than the different set of properties we can guarantee via the
> new RFC.

I don't think finer-grained timestamps will help long-term. The faster
systems get, the more resolution we need. At this point, I can easily
get two commits within the same microsecond. The weird part is that if
the commits are done from two different CPUs on my platform, it is
theoretically possible (although highly unlikely) that the second
commit could be one microsecond earlier than the first commit, on the
same file system, if an inter-CPU clock-sync had not been done for the
past few seconds. On a broader scale, that is somewhat obvious and
assumes global time synchronisation is maintained. It also makes me
wonder what happens when git runs on a quantum computer and a commit
goes to the wrong universe (joke).

Just my $0.014
Randall
* Re: RFC: Separate commit identification from Merkle hashing
From: Ævar Arnfjörð Bjarmason @ 2019-05-23 21:50 UTC
To: Jonathan Nieder; +Cc: Eric S. Raymond, Jakub Narebski, git

On Thu, May 23 2019, Jonathan Nieder wrote:

> Eric S. Raymond wrote:
>> Jakub Narebski <jnareb@gmail.com>:
>
>>> Currently Git makes use of the fact that SHA-1 and SHA-256 identifiers
>>> are of different lengths to distinguish them (see section "Meaning of
>>> signatures") in Documentation/technical/hash-function-transition.txt
>>
>> That's the obvious hack. As a future-proofing issue, though, I think
>> it would be unwise to count on all future hashes being of distinguishable
>> lengths.
>
> We're not counting on that. As discussed in that section, future
> hashes can change the format.

I think both of you are also missing something that's implicit (but
unfortunately not very explicitly talked about) in that document, which
is that such hash transitions are assumed to have an out-of-band
temporal component to them.

I.e. let's assume that instead of SHA-256 we're switching to SHA-X,
which like SHA-1 is also a 20-byte hash function, so they're the same
length.

You'd then get a git.git with SHA-1 today, next year you'd have a
SHA-1<->SHA-X mapping table, but the year after that we'd be fully on
SHA-X for new content.

So even though we carry code and a lookup table for looking up the old
SHA-1 values, we're not going to continue to pointlessly generate that
bidirectional mapping forever.

We'll have some sort of gravestone marker where we say "past this point
it's SHA-X only". That's not implemented or specified yet, but could
e.g. be a magic ref of some sort advertised by the server, and the
client would enforce that such a marker could only be made with the
stronger hash function.

Thus, a couple of years after the SHA-1 -> SHA-X transition, the
scenario of someone generating a colliding tag, where a new good SHA-X
tag *could* point to bad SHA-1 content, won't be exploitable in
practice. At that point clients won't be downloading SHA-1'd content or
generating the mapping table anymore.

So I don't see why a format change for the tags is needed; it would
only matter *if* we have a full collision *and* the hashes are the same
length (which we have no plan for), *and* if we assume we don't have
some other mitigations in play.
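The transition scheme described in this message can be sketched roughly as below. This is a toy illustration under loud assumptions, not git's implementation: "SHA-X" is hypothetical, so SHA-256 truncated to 20 bytes stands in for it; the `TransitionTable` class and its `sealed` flag (standing in for the "gravestone marker") are invented names; and raw bytes are hashed directly, whereas git hashes an object header plus content.

```python
import hashlib


def sha_x(data: bytes) -> str:
    # Stand-in for the hypothetical 20-byte "SHA-X":
    # SHA-256 truncated to 20 bytes (an assumption for illustration only).
    return hashlib.sha256(data).digest()[:20].hex()


class TransitionTable:
    """Toy SHA-1 <-> SHA-X mapping that stops growing after a cutoff."""

    def __init__(self):
        self.old_to_new = {}
        self.new_to_old = {}
        self.known_new = set()
        self.sealed = False  # hypothetical "gravestone marker"

    def add(self, data: bytes) -> str:
        new_id = sha_x(data)
        self.known_new.add(new_id)
        if not self.sealed:
            # Only content seen before the cutoff gets a SHA-1 name.
            old_id = hashlib.sha1(data).hexdigest()
            self.old_to_new[old_id] = new_id
            self.new_to_old[new_id] = old_id
        return new_id

    def seal(self):
        """Past this point it's SHA-X only."""
        self.sealed = True

    def resolve(self, name: str):
        """Return the SHA-X name for an identifier, or None if unknown."""
        if name in self.known_new:
            return name
        return self.old_to_new.get(name)


table = TransitionTable()
early = b"content written during the transition"
late = b"content written after the cutoff"
table.add(early)
table.seal()
table.add(late)
```

Both kinds of identifier are 40 hex characters here, so length alone cannot tell them apart. What disambiguates is the table plus the cutoff: old SHA-1 names from before the cutoff still resolve, while the SHA-1 of post-cutoff content resolves to nothing.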
end of thread

Thread overview: 13+ messages
2019-05-21  1:32 RFC: Separate commit identification from Merkle hashing Eric S. Raymond
2019-05-21  1:57 ` Jonathan Nieder
2019-05-21  2:38 ` Eric S. Raymond
2019-05-21  2:58 ` Jonathan Nieder
2019-05-21  3:31 ` Eric S. Raymond
2019-05-23 19:09 ` Jakub Narebski
2019-05-23 20:09 ` Jonathan Nieder
2019-05-23 20:53 ` Eric S. Raymond
2019-05-23 20:50 ` Eric S. Raymond
2019-05-23 20:54 ` Jonathan Nieder
2019-05-23 21:19 ` Eric S. Raymond
2019-05-23 21:39 ` Randall S. Becker
2019-05-23 21:50 ` Ævar Arnfjörð Bjarmason