RFC: Separate commit identification from Merkle hashing

git@vger.kernel.org mailing list mirror (one of many)

* RFC: Separate commit identification from Merkle hashing
@ 2019-05-21  1:32 Eric S. Raymond
  2019-05-21  1:57 ` Jonathan Nieder
  2019-05-23 19:09 ` Jakub Narebski
  0 siblings, 2 replies; 13+ messages in thread
From: Eric S. Raymond @ 2019-05-21  1:32 UTC
  To: git

I have been thinking hard about the problems raised during my
request for unique timestamps.  I think I've found a better way
to bust the box I was trying to break out of.  I am therefore
withdrawing that proposal and replacing it with this one.

It's time to separate commit identification from Merkle hashing.

One reason I am sure of this is the SHA-1 to whatever transition.
We can't count on the successor hash to survive attack forever.
Accordingly, git's design needs to be stable against the possibility
of having to accommodate multiple future hash algorithms in the
future.

Here's how to do it:

1. Commit IDs and Merkle-tree hashes become separate commit
   properties in the git filesystem.

2. The data structure representing a Merkle-tree hash becomes
   a pair consisting of a value and a hash-algorithm tag. An
   empty tag is interpreted as SHA-1. I will call this entity the
   "verification hash" and avoid unqualified use of "hash" in the
   rest of this proposal.

3. The initial value of a commit's ID in a live repository is a copy
   of its verification hash, except in one important case.

4. When a repository is exported to a stream, the commit-id is dumped
   with other commit metadata.  Thus, anything that can read a stream
   can resolve commit references in its change comments.

5. When a stream is imported, if a commit has a commit-id field it
   overrides the default assignment of the generated verification hash
   to that field.

6. Commit IDs are free-format and not interpreted by git except
   as lookup keys. When git changes verification-hash functions,
   commit IDs do not change.
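
Points 1-6 can be sketched roughly as follows.  This is only an
illustration under stated assumptions, not git code: the names
VerificationHash, Commit and make_commit are hypothetical, and the
payload is hashed directly rather than in git's real object format.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VerificationHash:
    """Point 2: a (value, tag) pair; an empty tag is read as SHA-1."""
    value: str
    tag: str = ""

    @property
    def algorithm(self) -> str:
        return self.tag or "sha1"


@dataclass
class Commit:
    payload: bytes                       # stand-in for the real commit content
    verification_hash: VerificationHash  # integrity property (point 1)
    commit_id: str                       # free-format lookup key (point 6)


def make_commit(payload: bytes, tag: str = "",
                imported_id: Optional[str] = None) -> Commit:
    """Create a commit; an ID read from a stream overrides the default."""
    digest = hashlib.new(tag or "sha1", payload).hexdigest()
    vhash = VerificationHash(digest, tag)
    # Point 3: the ID defaults to a copy of the verification hash.
    # Point 5: a commit-id field found on import takes precedence.
    commit_id = imported_id if imported_id is not None else digest
    return Commit(payload, vhash, commit_id)
```

Changing the tag changes the verification hash but leaves an already
assigned commit_id untouched, which is the property point 6 asks for.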

Notice several important properties of this design.

A. Git becomes absolutely future-proofed against hash-algorithm
   changes. It can even support the use of multiple hash types over
   the lifetime of one repo.

B. All SHA-1 commit references will resolve forever even after git
   stops generating them.  All future hash-based commit references will
   also be good forever.

C. The id/verification split will be invisible from clients at start,
   because initially they coincide and will continue to do so unless
   an explicit decision changes either the verification-hash algorithm
   or the way commit-IDs are initialized.

D. My wish for forward-portable unique commit IDs is granted.
   They're not by default eyeball-friendly, but I can live with that.
   Furthermore, because they're preserved in streams they can be
   eternally stable even as hash algorithms and preferred ID
   formats change.

E. There is now a unique total order on the repo, modulo highly
   unlikely (and in principle completely avoidable) commit-ID
   collisions. It's commit date tie-broken by commit-ID sort order.
   It too survives hash-function changes.

F. There's no need for timestamp uniqueness any more.

G. When a repository is imported from (say) Subversion, the Subversion
   IDs *don't have to break*!  They can be used to initialize the
   commit-ID fields. Many users migrating from other VCSes will be
   deeply, deeply grateful for this feature.
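
The ordering in property E amounts to a two-part sort key.  A minimal
sketch, assuming each commit is reduced to a (commit date, commit ID)
pair; CommitInfo and canonical_order are hypothetical names used only
for illustration.

```python
from typing import Iterable, List, NamedTuple


class CommitInfo(NamedTuple):
    commit_date: int  # committer timestamp, seconds since the epoch
    commit_id: str    # stable, free-format commit ID


def canonical_order(commits: Iterable[CommitInfo]) -> List[CommitInfo]:
    """Sort by commit date, breaking ties by commit-ID sort order."""
    return sorted(commits, key=lambda c: (c.commit_date, c.commit_id))
```

Two commits compare equal only if both the date and the ID coincide,
which is the collision case that property E sets aside.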

I believe this solves every problem I walked in with except timestamp
truncation.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>



* Re: RFC: Separate commit identification from Merkle hashing
  2019-05-21  1:32 RFC: Separate commit identification from Merkle hashing Eric S. Raymond
@ 2019-05-21  1:57 ` Jonathan Nieder
  2019-05-21  2:38   ` Eric S. Raymond
  2019-05-23 19:09 ` Jakub Narebski
  1 sibling, 1 reply; 13+ messages in thread
From: Jonathan Nieder @ 2019-05-21  1:57 UTC
  To: Eric S. Raymond; +Cc: git

Hi!

Eric S. Raymond wrote:

> One reason I am sure of this is the SHA-1 to whatever transition.
> We can't count on the successor hash to survive attack forever.
> Accordingly, git's design needs to be stable against the possibility
> of having to accommodate multiple future hash algorithms in the
> future.

Have you read through Documentation/technical/hash-function-transition?  It
takes the case where the new hash function is found to be weak into account.

Hope that helps,
Jonathan


* Re: RFC: Separate commit identification from Merkle hashing
  2019-05-21  1:57 ` Jonathan Nieder
@ 2019-05-21  2:38   ` Eric S. Raymond
  2019-05-21  2:58     ` Jonathan Nieder
  0 siblings, 1 reply; 13+ messages in thread
From: Eric S. Raymond @ 2019-05-21  2:38 UTC
  To: Jonathan Nieder; +Cc: git

Jonathan Nieder <jrnieder@gmail.com>:
> Hi!
> 
> Eric S. Raymond wrote:
> 
> > One reason I am sure of this is the SHA-1 to whatever transition.
> > We can't count on the successor hash to survive attack forever.
> > Accordingly, git's design needs to be stable against the possibility
> > of having to accommodate multiple future hash algorithms in the
> > future.
> 
> Have you read through Documentation/technical/hash-function-transition?  It
> takes the case where the new hash function is found to be weak into account.
> 
> Hope that helps,
> Jonathan

Reading now...

At first sight I think it looks pretty compatible with what I am proposing.
The goals, anyway; some of the implementation tactics would change a bit.

I think it's a weakness, though, that most of it is written as though it
assumes only one hash transition will be necessary.  (This is me thinking
on long timescales again.)

Instead of having a gpgsig-sha256 field, I would change the code so all
hash cookies have an optional delimited prefix giving the hash-algorithm
type, with an absent prefix interpreted as SHA-1.
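
A minimal sketch of reading such a cookie.  The ":" delimiter and the
function name parse_hash_cookie are assumptions made for illustration;
nothing here is an existing git format.

```python
from typing import Tuple


def parse_hash_cookie(cookie: str) -> Tuple[str, str]:
    """Split 'tag:value' into (tag, value); no prefix means SHA-1."""
    tag, sep, value = cookie.partition(":")
    if not sep:
        # No delimiter at all: a legacy cookie, interpreted as SHA-1.
        return ("sha1", cookie)
    if not tag or not value:
        raise ValueError(f"malformed hash cookie: {cookie!r}")
    return (tag, value)
```

For example, parse_hash_cookie("sha256:abcd") yields ("sha256", "abcd"),
while a bare 40-character hex string comes back tagged "sha1".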

I think the idea of mapping future hashes to SHA-1s, which are then
used as fs lookup keys, is sound.  The same technique (probably the
same code!) could be used to map the otherwise uninterpreted
commit-IDs I'm proposing to lookup keys.

I should have said in my previous mail that I'm prepared to put
my coding fingers into making all this happen. I am pretty sure my
gramty manager will approve.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>


* Re: RFC: Separate commit identification from Merkle hashing
  2019-05-21  2:38   ` Eric S. Raymond
@ 2019-05-21  2:58     ` Jonathan Nieder
  2019-05-21  3:31       ` Eric S. Raymond
  0 siblings, 1 reply; 13+ messages in thread
From: Jonathan Nieder @ 2019-05-21  2:58 UTC
  To: Eric S. Raymond; +Cc: git

Hi,

Eric S. Raymond wrote:
> Jonathan Nieder <jrnieder@gmail.com>:
>> Eric S. Raymond wrote:

>>> One reason I am sure of this is the SHA-1 to whatever transition.
>>> We can't count on the successor hash to survive attack forever.
[...]
>> Have you read through Documentation/technical/hash-function-transition?  It
>> takes the case where the new hash function is found to be weak into account.
>>
>> Hope that helps,
>> Jonathan
>
> Reading now...

Take your time. :)

[...]
> I think it's a weakness, though, that most of it is written as though it
> assumes only one hash transition will be necessary.  (This is me thinking
> on long timescales again.)

Hm, can you point to what part of the doc suggested that?  Best to make
the text clearer, to avoid confusing the next person.

On the contrary, the design is very careful to be able to support the
next transition.

[...]
>                                    The same technique (probably the
> same code!) could be used to map the otherwise uninterpreted
> commit-IDs I'm proposing to lookup keys.

No, since Git relies on commit IDs for integrity checking.  The hash
function transition described in that document relies on
round-tripping ability for the duration of the transition.

Jonathan


* Re: RFC: Separate commit identification from Merkle hashing
  2019-05-21  2:58     ` Jonathan Nieder
@ 2019-05-21  3:31       ` Eric S. Raymond
  0 siblings, 0 replies; 13+ messages in thread
From: Eric S. Raymond @ 2019-05-21  3:31 UTC
  To: Jonathan Nieder; +Cc: git

Jonathan Nieder <jrnieder@gmail.com>:
> > I think it's a weakness, though, that most of it is written as though it
> > assumes only one hash transition will be necessary.  (This is me thinking
> > on long timescales again.)
> 
> Hm, can you point to what part of the doc suggested that?  Best to make
> the text clearer, to avoid confusing the next person.

I will reread it with an editorial eye and try to come up with
concrete suggestions, perhaps a patch. My relative ignorance
should actually be helpful here.

> >                                    The same technique (probably the
> > same code!) could be used to map the otherwise uninterpreted
> > commit-IDs I'm proposing to lookup keys.
> 
> No, since Git relies on commit IDs for integrity checking.  The hash
> function transition described in that document relies on
> round-tripping ability for the duration of the transition.

I do not quite understand this comment yet. But I don't think it
matters that I don't, and I will by the time I write any code.  I
expect the worst case is that the separated IDs require a different
lookup table from the hashes, but will resolve at the same speed.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>




* Re: RFC: Separate commit identification from Merkle hashing
  2019-05-21  1:32 RFC: Separate commit identification from Merkle hashing Eric S. Raymond
  2019-05-21  1:57 ` Jonathan Nieder
@ 2019-05-23 19:09 ` Jakub Narebski
  2019-05-23 20:09   ` Jonathan Nieder
  2019-05-23 20:50   ` Eric S. Raymond
  1 sibling, 2 replies; 13+ messages in thread
From: Jakub Narebski @ 2019-05-23 19:09 UTC
  To: Eric S. Raymond; +Cc: git

esr@thyrsus.com (Eric S. Raymond) writes:

> I have been thinking hard about the problems raised during my
> request for unique timestamps.  I think I've found a better way
> to bust the box I was trying to break out of.  I am therefore
> withdrawing that proposal and replacing it with this one.
>
> It's time to separate commit identification from Merkle hashing.

Documentation/technical/hash-function-transition.txt identifies a similar
problem, namely that existing signatures in signed tags, signed commits
and merges of signed tags are signatures of their SHA-1 form.  We want
to be able to verify those signatures, even if this verification may be
considered less secure now.

You want both more (stable IDs for all commits, not only those signed)
and less (you don't need verification down the tree using IDs used for
commit ID).

> One reason I am sure of this is the SHA-1 to whatever transition.
> We can't count on the successor hash to survive attack forever.
> Accordingly, git's design needs to be stable against the possibility
> of having to accommodate multiple future hash algorithms in the
> future.
>
> Here's how to do it:
>
> 1. Commit IDs and Merkle-tree hashes become separate commit
>    properties in the git filesystem.

The issue you need to consider is that for signatures to be secure they
must be made over the verification-hash Merkle tree.  It is not only
commits that are identified by hashes, but also trees, blobs and tags.

Commits reference other commits ("parent" lines) and a tree ("tree");
trees reference other trees, blobs and possibly commits (if submodules
are used).  Tags can reference any object, but most commonly reference
commits.  Blobs, i.e. file contents, do not reference any other
objects.  For security, all those references should use the strongest
hash function.

Changing the referencing hash (e.g. "parent" uses SHA-256 instead of
SHA-1) means that the contents of the object change, and thus its hash.
Documentation/technical/hash-function-transition.txt therefore talks
about SHA-256 and SHA-1 forms and SHA-256 and SHA-1 object names.

 "The sha1-name of an object is the SHA-1 of the concatenation of its
  type, length, a nul byte, and the object's sha1-content. This is the
  traditional <sha1> used in Git to name objects.

  The sha256-name of an object is the SHA-256 of the concatenation of its
  type, length, a nul byte, and the object's sha256-content."
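
The naming rule quoted above can be sketched in a few lines.  The
function name object_name is hypothetical; the header layout (type,
a space, the decimal length, a NUL byte) is the one git uses today.
For commits, trees and tags the sha1-content and sha256-content differ
because the references embedded in them differ; for a blob the content
is the same under both.

```python
import hashlib


def object_name(obj_type: str, content: bytes, algo: str = "sha1") -> str:
    """Hash the header (type, space, length, NUL) followed by the content."""
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\0"
    return hashlib.new(algo, header + content).hexdigest()


# The well-known SHA-1 name of the empty blob.
print(object_name("blob", b""))
# e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```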

> 2. The data structure representing a Merkle-tree hash becomes
>    a pair consisting of a value and a hash-algorithm tag. An
>    empty tag is interpreted as SHA-1. I will call this entity the
>    "verification hash" and avoid unqualified use of "hash" in the
>    rest of this proposal.

Currently Git makes use of the fact that SHA-1 and SHA-256 identifiers
are of different lengths to distinguish them (see the section "Meaning of
signatures" in Documentation/technical/hash-function-transition.txt).
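
A minimal sketch of length-based detection, assuming full-length
hexadecimal names only.  The table and the function name
guess_algorithm are illustrative, not git's actual code.

```python
import string

# Length in hex digits of a full object name, per algorithm.
HEX_LENGTH_TO_ALGO = {40: "sha1", 64: "sha256"}


def guess_algorithm(hex_name: str) -> str:
    """Infer the hash algorithm of a full-length hex object name."""
    if not hex_name or any(c not in string.hexdigits for c in hex_name):
        raise ValueError("not a hexadecimal object name")
    algo = HEX_LENGTH_TO_ALGO.get(len(hex_name))
    if algo is None:
        raise ValueError(f"cannot infer algorithm from length {len(hex_name)}")
    return algo
```

The approach breaks down for abbreviated names, and it would need an
additional marker if two supported hashes ever shared a digest length.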

There might be, I think, a problem for "tree" objects.  As opposed to
all other places, "tree" objects use the binary representation of the
hash, and not the hexadecimal textual representation (some consider that
a design mistake).
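
To make the concern concrete: in today's SHA-1 repositories each tree
entry is the octal mode, a space, the file name, a NUL byte and then the
20 raw hash bytes, with no room for an algorithm tag.  The sketch below
parses that layout; the 32-byte length for SHA-256 is an assumption
based on the digest size, and parse_tree is a hypothetical name.

```python
from typing import Iterator, Tuple

# Raw digest sizes in bytes; the reader must know the algorithm up front.
RAW_HASH_LEN = {"sha1": 20, "sha256": 32}


def parse_tree(data: bytes, algo: str = "sha1") -> Iterator[Tuple[str, str, str]]:
    """Yield (mode, name, hex hash) for each entry of a raw tree object."""
    hash_len = RAW_HASH_LEN[algo]
    pos = 0
    while pos < len(data):
        space = data.index(b" ", pos)    # end of the octal mode
        nul = data.index(b"\0", space)   # end of the file name
        mode = data[pos:space].decode("ascii")
        name = data[space + 1:nul].decode("utf-8")
        raw = data[nul + 1:nul + 1 + hash_len]
        if len(raw) != hash_len:
            raise ValueError("truncated tree entry")
        yield mode, name, raw.hex()
        pos = nul + 1 + hash_len
```

Because the hash bytes are not delimited, parsing with the wrong
hash_len silently misreads every entry after the first.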

>
> 3. The initial value of a commit's ID in a live repository is a copy
>    of its verification hash, except in one important case.
>
> 4. When a repository is exported to a stream, the commit-id is dumped
>    with other commit metadata.  Thus, anything that can read a stream
>    can resolve commit references in its change comments.
>
> 5. When a stream is imported, if a commit has a commit-id field it
>    overrides the default assignment of the generated verification hash
>    to that field.

I think Documentation/technical/hash-function-transition.txt misses
considerations for the fast-import format (it talks about the problems
with submodules, shallow clones, and the currently unsolved problem of
translating notes; it does not talk about git-replace, either).

>
> 6. Commit IDs are free-format and not interpreted by git except
>    as lookup keys. When git changes verification-hash functions,
>    commit IDs do not change.

All right.  Looks sensible on first glance.

For security, all references in the Merkle tree of hashes must use a
strong verification hash.  This means that you need to be able to refer
to any object, including a commit, by the verification-hash name of its
verification-hash form (where all references inside the object, like
"parent" and "tree" headers in commit objects, use verification hashes).

You need to store this commit ID somewhere.  The current proposal for the
transitional period in Documentation/technical/hash-function-transition.txt
talks about a loose object index ($GIT_OBJECT_DIR/loose-object-idx) with
the following format:

  # loose-object-idx
  (sha256-name SP sha1-name LF)*

The packfile index contains separate SHA-1 and SHA-256 indices into the
packfile, providing a fast mapping from a SHA-1 name or SHA-256 name to
the position (index) of the object in the packfile.

Something similar might be needed for the commit-ID mapping.
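
For illustration, a reader for the loose-object-idx format shown above
could look like the sketch below.  The function name is hypothetical,
and a commit-ID table taking the same shape is only a possibility, not
something the transition document specifies.

```python
from typing import Dict, Tuple


def parse_loose_object_idx(text: str) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Parse '(sha256-name SP sha1-name LF)*' into two lookup tables."""
    sha256_to_sha1: Dict[str, str] = {}
    sha1_to_sha256: Dict[str, str] = {}
    for line in text.split("\n"):
        if not line or line.startswith("#"):
            continue  # skip the '# loose-object-idx' header and blank lines
        sha256_name, sep, sha1_name = line.partition(" ")
        if not sep or not sha1_name or " " in sha1_name:
            raise ValueError(f"malformed index line: {line!r}")
        sha256_to_sha1[sha256_name] = sha1_name
        sha1_to_sha256[sha1_name] = sha256_name
    return sha256_to_sha1, sha1_to_sha256
```

Keeping both directions in memory gives constant-time translation either
way, at the cost of storing every name twice.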

One problem is that neither the loose object index nor the packfile index
is transported along with the objects.  So we may need to put the
commit ID elsewhere...

Note that we cannot put an object's X-hash identifier into its X-hash
object form; that is, you cannot add an "id" header to the object,
because the identifier would then depend on itself (though you might add
an "other-id" header, assuming that a hash-based ID is computed over the
other-id form without the other-id header).

  id <sha-1 identifier of this object>
  tree 0fa044a4d161254a3eae0bd06c0452d79e489593
  parent 6505413ad94ddfc01f9e2f5c1b79ea6b8ffbabbb
  author A U Thor <author@example.com> 1558619302 +0200
  committer C O Mitter <committer@example.com> 1558628753 -0500

  fixes

> Notice several important properties of this design.
>
> A. Git becomes absolutely future-proofed against hash-algorithm
>    changes. It can even support the use of multiple hash types over
>    the lifetime of one repo.
>
> B. All SHA-1 commit references will resolve forever even after git
>    stops generating them.  All future hash-based commit references will
>    also be good forever.

We might need to be able to distinguish commit IDs from the hash-based
object identifier of a commit on the command line, perhaps with
something like

  <commit-id>^{id}

This is similar to the proposed

  git --output-format=sha1 log abac87a^{sha1}..f787cac^{sha256}

> C. The id/verification split will be invisible from clients at start,
>    because initially they coincide and will continue to do so unless
>    an explicit decision changes either the verification-hash algorithm
>    or the way commit-IDs are initialized.

The problem may be with reusing command output for input (to refer to
objects and commits).

>
> D. My wish for forward-portable unique commit IDs is granted.
>    They're not by default eyeball-friendly, but I can live with that.
>    Furthermore, because they're preserved in streams they can be
>    eternally stable even as hash algorithms and preferred ID
>    formats change.

Good.

>
> E. There is now a unique total order on the repo, modulo highly
>    unlikely (and in principle completely avoidable) commit-ID
>    collisions. It's commit date tie-broken by commit-ID sort order.
>    It too survives hash-function changes.

Nice.

>
> F. There's no need for timestamp uniqueness any more.
>
> G. When a repository is imported from (say) Subversion, the Subversion
>    IDs *don't have to break*!  They can be used to initialize the
>    commit-ID fields. Many users migrating from other VCSes will be
>    deeply, deeply grateful for this feature.

There would also need to be some support to retrieve commits using their
"commit ID" stable identifiers.  It may not need to be very fast.

>
> I believe this solves every problem I walked in with except timestamp
> truncation.

Best,
--
Jakub Narębski


* Re: RFC: Separate commit identification from Merkle hashing
  2019-05-23 19:09 ` Jakub Narebski
@ 2019-05-23 20:09   ` Jonathan Nieder
  2019-05-23 20:53     ` Eric S. Raymond
  2019-05-23 20:50   ` Eric S. Raymond
  1 sibling, 1 reply; 13+ messages in thread
From: Jonathan Nieder @ 2019-05-23 20:09 UTC
  To: Jakub Narebski; +Cc: Eric S. Raymond, git

Hi,

Jakub Narebski wrote:

> I think Documentation/technical/hash-function-transition.txt misses
> considerations for fast-import format (it talks about problem with
> submodules, shallow clones, and currently not solved problem of
> translating notes; it does not talk about git-replace, either).

Hm, can you say more?  I think fast-import is not significantly
different from other tools that want to pick an appropriate object
format for input and an appropriate object format for output.

Do you mean that the fast-import file should have a field for
explicitly specifying the input object format, and that that doc
ought to call it out?

[...]
> For security, all references in Merkle-tree of hashes must use strong
> verification hash.  This means that you need to be able to refer to any
> object, including commit, by its verification hash name of its
> verification hash form (where all references inside object, like
> "parent" and "tree" headers in commit objects, use verification hashes).

This kind of crypto agility weakens any guarantees that rely on
strength of a hash function.  The security level would be that of the
weakest of the supported hash functions.

In other words, usually the benefit of supporting multiple hash
functions as a reader is that you want the strength of the strongest
of those hash functions and you need a migration path to get there.
If you don't have a way to eventually drop support for the weaker
hashes, then what benefit do you get from supporting multiple hash
functions?

Jonathan


* Re: RFC: Separate commit identification from Merkle hashing
  2019-05-23 20:09   ` Jonathan Nieder
@ 2019-05-23 20:53     ` Eric S. Raymond
  0 siblings, 0 replies; 13+ messages in thread
From: Eric S. Raymond @ 2019-05-23 20:53 UTC
  To: Jonathan Nieder; +Cc: Jakub Narebski, git

Jonathan Nieder <jrnieder@gmail.com>:
> In other words, usually the benefit of supporting multiple hash
> functions as a reader is that you want the strength of the strongest
> of those hash functions and you need a migration path to get there.
> If you don't have a way to eventually drop support for the weaker
> hashes, then what benefit do you get from supporting multiple hash
> functions?

Not losing the capability to verify old parts of histories up to the
strength of the old hash algorithm.  Not perfect, but better than nothing.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>




* Re: RFC: Separate commit identification from Merkle hashing
  2019-05-23 19:09 ` Jakub Narebski
  2019-05-23 20:09   ` Jonathan Nieder
@ 2019-05-23 20:50   ` Eric S. Raymond
  2019-05-23 20:54     ` Jonathan Nieder
  1 sibling, 1 reply; 13+ messages in thread
From: Eric S. Raymond @ 2019-05-23 20:50 UTC
  To: Jakub Narebski; +Cc: git

Jakub Narebski <jnareb@gmail.com>:
> You want both more (stable IDs for all commits, not only those signed)
> and less (you don't need verification down the tree using IDs used for
> commit ID).

That's right.  My assumption is that future VCSes will do their own
hash chaining in ways we don't really want to try to anticipate or
constrain.

> Currently Git makes use of the fact that SHA-1 and SHA-256 identifiers
> are of different lengths to distinguish them (see section "Meaning of
> signatures") in Documentation/technical/hash-function-transition.txt

That's the obvious hack.  As a future-proofing issue, though, I think
it would be unwise to count on all future hashes being of distinguishable
lengths. Explicit algorithm tagging is better, at least internally.

> There might be, I think, the problem for "tree" objects.  As opposed to
> all other places, "tree" objects use binary representation of hash, and
> not hexadecimal textual representation (some consider that a design
> mistake).

I'm inclined to agree that it was a mistake.  But whether it gets
replaced by a binary struct holding an {algorithm-tag,value} pair or a
textual representation of same is not something I care about a lot.

> I think Documentation/technical/hash-function-transition.txt misses
> considerations for fast-import format

You can count on me to stay on top of that; fast-import format is utterly
critical to how reposurgeon works, so I have a strong incentive to make
sure it stays healthy.

(Some of you may not know - reposurgeon solves the thorny problems of
editing repositories by sidestepping to the textual serialized
representation of them.  It's basically a structure editor for
fast-import streams that fools the outside world into thinking it
edits live repositories by having importers and exporters at either
end of its data flow.)

> All right.  Looks sensible on first glance.

I am very relieved to hear that. My view of git is outside-in; I was quite
worried I might have missed some crucial issue.

> For security, all references in Merkle-tree of hashes must use strong
> verification hash.  This means that you need to be able to refer to any
> object, including commit, by its verification hash name of its
> verification hash form (where all references inside object, like
> "parent" and "tree" headers in commit objects, use verification hashes).

Fair enough. One minor way in which my thinking has evolved since
I wrote the RFC is that I now think it might be fruitful not to throw away
the idea of the verification hash as naming a commit, but rather to think
of the separated commit-ID as an alias for the verification hash.

This reframing won't make any difference to the code, but it clarifies
what to do if, for example, an import stream declares the same commit
ID for multiple commits, or fails to declare a commit ID at all.  In both
cases the commit is still uniquely named by its verification hash. Commit-ID
namespace-management failures become annoying but not critical.
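
A rough sketch of the alias view, with all names (AliasTable, add,
resolve) hypothetical.  A commit is always stored under its verification
hash; the commit ID is an optional extra key that may turn out to be
missing or ambiguous.

```python
from collections import defaultdict
from typing import DefaultDict, List, Optional


class AliasTable:
    """Maps free-format commit IDs to verification hashes."""

    def __init__(self) -> None:
        self._by_alias: DefaultDict[str, List[str]] = defaultdict(list)

    def add(self, verification_hash: str, commit_id: Optional[str]) -> None:
        # A commit with no declared ID is simply not aliased; it is
        # still reachable by its verification hash.
        if commit_id is not None:
            self._by_alias[commit_id].append(verification_hash)

    def resolve(self, commit_id: str) -> List[str]:
        # Zero results: unknown ID.  One: unambiguous.  Several: the
        # stream declared the same ID for multiple commits.
        return list(self._by_alias.get(commit_id, []))
```

A caller that gets more than one hash back can report the ambiguity and
ask for a verification hash instead, which is the "annoying but not
critical" failure mode described above.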

> You need to store this commit ID somewhere.  Current proposal for
> transitional period in Documentation/technical/hash-function-transition.txt
> talks about loose object index ($GIT_OBJECT_DIR/loose-object-idx) with
> the following format:
> 
>   # loose-object-idx
>   (sha256-name SP sha1-name LF)*
> 
> In packfile index contains separate SHA-1 indices and SHA-256 indices
> into packfile, providing fast mapping from SHA-1 name or SHA-256 name to
> position (index) of object in the packfile.

I would generalize this to something like

(hash-algorithm-tag:value SP sha1-name LF)

> Something similar might have been needed for commit IDs mapping.

I think so, yes.

> One problem is that neither loose object index, not the packfile index
> are transported alongside with the objects.  So we may need to put
> commit ID elsewhere...
> 
> Note that we cannot put X-hash identifier into X-hash object form, that
> is you cannot add "id" header to object (though you might add "other-id"
> header, assuming that if ID is hash based it is on the other-id form
> without other-id header).
> 
>   id <sha-1 identifier of this object>
>   tree 0fa044a4d161254a3eae0bd06c0452d79e489593
>   parent 6505413ad94ddfc01f9e2f5c1b79ea6b8ffbabbb
>   author A U Thor <author@example.com> 1558619302 +0200
>   committer C O Mitter <committer@example.com> 1558628753 -0500
> 
>   fixes

Implementation details. Let's get the design right and properly specified
before worrying too hard about this level of the problem.

I may do another RFC about how to avoid having this problem ever
again.  In truth, I think git objects should have open property lists,
like bzr, with a property namespace reserved for system
expansion. That way, when you need objects to have new semantics, you
can do it without having an object-format flag day.

> > Notice several important properties of this design.
> >
> > A. Git becomes absolutely future-proofed against hash-algorithm
> >    changes. It can even support the use of multiple hash types over
> >    the lifetime of one repo.
> >
> > B. All SHA-1 commit references will resolve forever even after git
> >    stops generating them.  All future hash-based commit references will
> >    also be good forever.
> 
> We might need to be able to distinguish commit IDs from hash-based
> object identifier of commit on command line, perhaps with something like
> 
>   <commit-id>^{id}
> 
> This is similar to proposed
> 
>   git --output-format=sha1 log abac87a^{sha1}..f787cac^{sha256}

Reasonable.

> > C. The id/verification split will be invisible from clients at start,
> >    because initially they coincide and will continue to do so unless
> >    an explicit decision changes either the verification-hash algorithm
> >    or the way commit-IDs are initialized.
> 
> The problem may be with reusing command output for input (to refer to
> objects and commits).

Solvable, I think.

> > D. My wish for forward-portable unique commit IDs is granted.
> >    They're not by default eyeball-friendly, but I can live with that.
> >    Furthermore, because they're preserved in streams they can be
> >    eternally stable even as hash algorithms and preferred ID
> >    formats change.
> 
> Good.

Oh, man, you have no idea how good yet.  You won't until you've done a
few repo conversions yourself.

/me needs a cross-eyed emoji here

> > E. There is now a unique total order on the repo, modulo highly
> >    unlikely (and in principle completely avoidable) commit-ID
> >    collisions. It's commit date tie-broken by commit-ID sort order.
> >    It too survives hash-function changes.
> 
> Nice.

One thing I will commit to do if we get this far is write the fast-export
code that does canonical order.  I need this badly for reposurgeon tests.

> > F. There's no need for timestamp uniqueness any more.
> >
> > G. When a repository is imported from (say) Subversion, the Subversion
> >    IDs *don't have to break*!  They can be used to initialize the
> >    commit-ID fields. Many users migrating from other VCSes will be
> >    deeply, deeply grateful for this feature.
> 
> There would also need to be some support to retrieve commits using their
> "commit ID" stable identifiers.  It may not need to be very fast.

Agreed.

OK, what do we do next?  Who needs to sign off on this?  Should I prepare
an edit for the hash-function-transition.txt describing the splitting off
of commit IDs?
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>


* Re: RFC: Separate commit identification from Merkle hashing
  2019-05-23 20:50   ` Eric S. Raymond
@ 2019-05-23 20:54     ` Jonathan Nieder
  2019-05-23 21:19       ` Eric S. Raymond
  2019-05-23 21:50       ` Ævar Arnfjörð Bjarmason
  0 siblings, 2 replies; 13+ messages in thread
From: Jonathan Nieder @ 2019-05-23 20:54 UTC
  To: Eric S. Raymond; +Cc: Jakub Narebski, git

Eric S. Raymond wrote:
> Jakub Narebski <jnareb@gmail.com>:

>> Currently Git makes use of the fact that SHA-1 and SHA-256 identifiers
>> are of different lengths to distinguish them (see section "Meaning of
>> signatures") in Documentation/technical/hash-function-transition.txt
>
> That's the obvious hack.  As a future-proofing issue, though, I think
> it would be unwise to count on all future hashes being of distinguishable
> lengths.

We're not counting on that.  As discussed in that section, future
hashes can change the format.

[...]
>> All right.  Looks sensible on first glance.
>
> I am very relieved to hear that. My view of git is outside-in; I was quite
> worried I might have missed some crucial issue.

Honestly, I do think you have missed some fundamental issues.
https://public-inbox.org/git/ab3222ab-9121-9534-1472-fac790bf08a4@gmail.com/
discusses this further.

Regards,
Jonathan


* Re: RFC: Separate commit identification from Merkle hashing
  2019-05-23 20:54     ` Jonathan Nieder
@ 2019-05-23 21:19       ` Eric S. Raymond
  2019-05-23 21:39         ` Randall S. Becker
  2019-05-23 21:50       ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 13+ messages in thread
From: Eric S. Raymond @ 2019-05-23 21:19 UTC
  To: Jonathan Nieder; +Cc: Jakub Narebski, git

Jonathan Nieder <jrnieder@gmail.com>:
> Honestly, I do think you have missed some fundamental issues.
> https://public-inbox.org/git/ab3222ab-9121-9534-1472-fac790bf08a4@gmail.com/
> discusses this further.

Have re-read.  That was a different pair of proposals.

I have abandoned the idea of forcing timestamp uniqueness entirely - that was
a hack to define a canonical commit order, and my new RFC describes a better
way to get this.

I still think finer-grained timestamps would be a good idea, but that is
much less important than the different set of properties we can guarantee
via the new RFC.
-- 
		<a href="http://www.catb.org/~esr/">Eric S. Raymond</a>




* RE: RFC: Separate commit identification from Merkle hashing
  2019-05-23 21:19       ` Eric S. Raymond
@ 2019-05-23 21:39         ` Randall S. Becker
  0 siblings, 0 replies; 13+ messages in thread
From: Randall S. Becker @ 2019-05-23 21:39 UTC (permalink / raw)
  To: esr, 'Jonathan Nieder'; +Cc: 'Jakub Narebski', git

On May 23, 2019 17:19, Eric S. Raymond wrote:
> Jonathan Nieder <jrnieder@gmail.com>:
> > Honestly, I do think you have missed some fundamental issues.
> > https://public-inbox.org/git/ab3222ab-9121-9534-1472-fac790bf08a4@gmail.com/
> > discusses this further.
> 
> Have re-read.  That was a different pair of proposals.
> 
> I have abandoned the idea of forcing timestamp uniqueness entirely - that
> was a hack to define a canonical commit order, and my new RFC describes a
> better way to get this.
> 
> I still think finer-grained timestamps would be a good idea, but that is
> much less important than the different set of properties we can guarantee
> via the new RFC.

I don't think finer-grained timestamps will help long-term. The faster
systems get, the more resolution we need. At this point, I can easily get
two commits within the same microsecond. The weird part is that if the
commits are done from two different CPUs on my platform, it is theoretically
possible (although highly unlikely) that the second commit could be one
microsecond earlier than the first commit, on the same file system, if an
inter-CPU clock-sync had not been done for the past few seconds. On a
broader scale, that is somewhat obvious and assumes global time
synchronisation is maintained. It also makes me wonder what happens when git
runs on a quantum computer and a commit goes to the wrong universe (joke).
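
The two-CPU case can be sketched in a few lines. This is a toy model under stated assumptions, not a measurement: the `Commit` class, the `true_time_us` field (the real order of events, which nothing can observe in practice), and the 2-microsecond drift are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    name: str
    true_time_us: int   # actual order of events (hypothetical, unobservable)
    cpu_offset_us: int  # assumed drift of the recording CPU's clock

    @property
    def recorded_us(self):
        """Timestamp as the committing CPU's clock would report it."""
        return self.true_time_us + self.cpu_offset_us


# The second commit really happens 1 microsecond after the first, but on a
# CPU whose clock lags by 2 microseconds.
first = Commit("first", true_time_us=1_000_000, cpu_offset_us=0)
second = Commit("second", true_time_us=1_000_001, cpu_offset_us=-2)

by_timestamp = sorted([first, second], key=lambda c: c.recorded_us)
print([c.name for c in by_timestamp])  # ['second', 'first']
```

Adding more digits to the timestamp does not change the outcome here; the ordering error comes from the drift, not from the resolution.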

Just my $0.014

Randall


* Re: RFC: Separate commit identification from Merkle hashing
  2019-05-23 20:54     ` Jonathan Nieder
  2019-05-23 21:19       ` Eric S. Raymond
@ 2019-05-23 21:50       ` Ævar Arnfjörð Bjarmason
  1 sibling, 0 replies; 13+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-05-23 21:50 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Eric S. Raymond, Jakub Narebski, git

On Thu, May 23 2019, Jonathan Nieder wrote:

> Eric S. Raymond wrote:
>> Jakub Narebski <jnareb@gmail.com>:
>
>>> Currently Git makes use of the fact that SHA-1 and SHA-256 identifiers
>>> are of different lengths to distinguish them (see section "Meaning of
>>> signatures") in Documentation/technical/hash-function-transition.txt
>>
>> That's the obvious hack.  As a future-proofing issue, though, I think
>> it would be unwise to count on all future hashes being of distinguishable
>> lengths.
>
> We're not counting on that.  As discussed in that section, future
> hashes can change the format.

I think both of you are also missing something that's implicit (but
unfortunately not very explicitly talked about) in that document, which
is that such hash transitions are assumed to have an out-of-band
temporal component to them.

I.e. let's assume that instead of SHA-256 we're switching to SHA-X,
which, like SHA-1, is also a 20-byte hash function, so they're the same
length.

You'd then get a git.git with SHA-1 today, next year you'd have a
SHA-1<->SHA-X mapping table, but the year after that we'd be fully on
SHA-X for new content.

So even though we carry code and a lookup table for looking up the old
SHA-1 values we're not going to continue to pointlessly generate that
bidirectional mapping forever. We'll have some sort of gravestone marker
where we say "past this point it's SHA-X only".

That's not implemented or specified yet, but could e.g. be a magic ref
of some sort advertised by the server, and the client would enforce that
such a marker could only be made with the stronger hash function.
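
As a rough illustration of that lifecycle, here is a toy model only. As noted above, none of this is implemented or specified, so the `TransitionTable` class, its method names, and the placeholder object names are all hypothetical. The table grows during the transition, is sealed at the marker, and afterwards serves lookups for old names without generating new entries.

```python
class TransitionTable:
    """Toy bidirectional SHA-1 <-> SHA-X table that stops growing at a marker."""

    def __init__(self):
        self.old_to_new = {}
        self.new_to_old = {}
        self.sealed = False

    def record(self, old_name, new_name):
        """Add a mapping; refuse once the table has been sealed."""
        if self.sealed:
            return False
        self.old_to_new[old_name] = new_name
        self.new_to_old[new_name] = old_name
        return True

    def seal(self):
        """The 'past this point it's SHA-X only' marker."""
        self.sealed = True

    def new_name_for(self, old_name):
        """Look up an old name; old entries stay resolvable after sealing."""
        return self.old_to_new.get(old_name)


table = TransitionTable()
table.record("old-aaaa", "new-1111")          # during the transition
table.seal()                                  # marker is set
added = table.record("old-bbbb", "new-2222")  # content created afterwards

print(table.new_name_for("old-aaaa"))  # new-1111
print(added)                           # False
print(table.new_name_for("old-bbbb"))  # None
```

Content created after the marker never gets an old-hash name at all, which is what keeps an old-hash collision from being usable against it.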

Thus a couple of years after the SHA-1 -> SHA-X transition, someone
generating a colliding tag where a new good SHA-X tag *could* point to
bad SHA-1 content won't be exploitable in practice. At that point
clients won't be downloading SHA-1'd content or generating the mapping
table anymore.

So I don't see why a format change for the tags is needed; it would only
matter *if* we have a full collision *and* the hashes are the same
length (which we have no plan for), *and* if we assume we don't have
some other mitigations in play.


end of thread

Thread overview: 13+ messages
2019-05-21  1:32 RFC: Separate commit identification from Merkle hashing Eric S. Raymond
2019-05-21  1:57 ` Jonathan Nieder
2019-05-21  2:38   ` Eric S. Raymond
2019-05-21  2:58     ` Jonathan Nieder
2019-05-21  3:31       ` Eric S. Raymond
2019-05-23 19:09 ` Jakub Narebski
2019-05-23 20:09   ` Jonathan Nieder
2019-05-23 20:53     ` Eric S. Raymond
2019-05-23 20:50   ` Eric S. Raymond
2019-05-23 20:54     ` Jonathan Nieder
2019-05-23 21:19       ` Eric S. Raymond
2019-05-23 21:39         ` Randall S. Becker
2019-05-23 21:50       ` Ævar Arnfjörð Bjarmason
