git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Ian Jackson <ijackson@chiark.greenend.org.uk>
To: Jeff King <peff@peff.net>
Cc: Brandon Williams <bmwill@google.com>,
	Jason Cooper <git@lakedaemon.net>,
	Junio C Hamano <gitster@pobox.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Joey Hess <id@joeyh.name>, Git Mailing List <git@vger.kernel.org>
Subject: Re: SHA1 collisions found
Date: Fri, 3 Mar 2017 14:54:56 +0000	[thread overview]
Message-ID: <22713.33728.502854.338516@chiark.greenend.org.uk> (raw)
In-Reply-To: <20170303111347.6uzuhvmpdwr27qjw@sigill.intra.peff.net>

Jeff King writes ("Re: SHA1 collisions found"):
> I think you've read more into my "conversion" than I intended. The old
> history won't get rewritten. It will just be grafted onto the bottom of
> the commit history you've got, and the new trees will all be written
> with the new hash.
> 
> So you still have those old objects hanging around that refer to things
> by their sha1 (not to mention bug trackers, commit messages, etc, which
> all use commit ids). And you want to be able to quickly resolve those
> references.
> 
> What _does_ get rewritten is what's in your ref files, your pack .idx,
> etc. Those are all sha256 (or whatever), and work as sha1's do now.

This all sounds very similar to my proposal.

> Looking up a sha1 reference from an old object just goes through the
> extra level of indirection.

I don't understand why this is a level of indirection, rather than
simply a retention of the existing SHA-1 object database (in parallel,
but deprecated).

Perhaps I have misunderstood what you mean by "graft".  I assume you
don't mean info/grafts, because that is not conveyed by transfer
protocols.


Stepping back a bit, the key question is what the data structure will
look like after the transition.

Specifically, the parent reference in the first post-transition commit
has to refer to something.  What does it refer to ?  The possibilities
seem to be:

 1a. It names the SHA1 hash of an old commit object
 1b. It names the BLAKE[1] hash of an old commit object, which
    object of course refers to its parents by SHA1.

 2. It names the BLAKE hash of a translated version of the
    old commit object.

 3. It doesn't name the parent, and the old history is not
    automatically transferred by clone and not readily accessible.

(1a) and (1b) are different variants of something like my mixed hash
proposal.  Old and new hashes live side by side.

(2) involves rewriting all of the old history, to recursively generate
new objects (with BLAKE names, and which in turn refer to other
rewritten old objects by BLAKE names).  The first time a particular
tree needs to look up an object by a BLAKE name, it needs to run a
conversion its own entire existing history.

For (2) there would have to be some kind of mapping table in every
tree, which allowed object names to be maped in both directions.  The
object content translation would have to be reversible, so that the
actual pre-translation objects would not need to be stored; rather
they would be reconstructed from the post-translation objects, when
someone asks for a pre-translation object.  In principle it would be
possible to convert future BLAKE commits to SHA-1 ones, again by
recursive rewriting.

I don't think anyone is seriously suggesting (3).


So there is a choice between:

(1) a unified hash tree containing a mixture of different hashes at
different reference points, where each object has one identity and one
name.

(2) parallel hash tree structures, each using only a single hash, with
at least every old object present in both tree structures.

I think (1) is preferable because it provides, to callers of git, the
existing object naming semantics.  Callers need to be taught to accept
an extension to the object name format.  Existing object names stored
elsewhere than in git remain valid.

Conversely, (2) requires many object names stored elsewhere than in
git to be updated.  It's possible with (2) to do ad-hoc lookups on
object names in mailing list messages or commit messages and so on.
Even if it is possible for the new git to answer questions like "does
this new branch with BLAKE hash X' contain the commit with SHA1 hash
Y" by implicitly looking up the corresponding BLAKE commit Y' and
answering the question with reference to Y', this isn't going to help
if external code does things like "have git compute the merge base of
X and Y' and check that it is equal to Z".  Either the external
database's record of Z would have to be changed to refer to Z', or the
external code would have to be taught to apply an object name
normalisation operation to turn Z into Z' each time.

Also, (2) has trouble with shallow clones.  This is because it's not
possible to translate old objects to new ones without descending to
the roots of the object tree and recursively translating everything
(or looking up the answer of a previous translation).


Then there is the question of naming syntax.

The distinction between (1) single unified mixed hash tree, and
(2) multiple parallel homogenous hash trees, is mostly orthogonal to
the command-line (and in-commit-object etc.) representation of new
hashes.

The main thing here is that, regardless of the choice between (1) or
(2), we need to choose whether object names specified on the git
command line, and printed by normal git commands, explicitly identify
the hash function.

I think there are roughly three options:

 (A) Decorate all new hashes with a hash function indication
     (sha256:<hex> or blake_<hex> or H<hex>)

 (B) Infer the hash function from the object name length
     (and do some kind of bodge for abbreviated object names).

 (C) Hash function is implicit from context.  (This is compatible with
     (2) only, because (1) requires any object to be able to refer to
     any hash function.)

I think (A) is best because it means everything is unambiguous, and it
allows future hash function changes without further difficulty.

(B) is a reasonable possibility although the abbreviated object name
bodge would be quite ugly and probably involve thinking about several
annoying edge cases.

I think (C) is really bad, because it instantly makes all existing
application code which calls git to be buggy.  Such application code
would need to be adjusted to know for itself which of the object names
it has recorded are what hash function, and explicitly specify this to
its git operations somehow.

All of these options involve updating many callers of git.  In any
case any git caller which explicitly checks the object name length
will need to be changed.  For (a), many git callers which match object
names using something like [0-9a-f]+ rather than \w+ will need to be
changed - but at least it's a simple change with little semantic
import.

(A) has the additional advantage that it becomes possible to make
object names syntactically distinguishable from ref names.


The final argument I would make is this:

We don't know what hash function research will look like in 10-20
years.  We would like to not have a bunch of pain again.  So ideally
we would deploy a framework now that would let us switch hash function
again without further history-rewriting.

(1)(A) and perhaps (1)(B) are the only options which support this
well.


Ian.

[1] I'm going to keep assuming that the bikeshed will be blue, because
I think BLAKE2b has is a better choice.  It has probably had more
serious people looking at it than SHA-3, at least, and it has good
performance.  The web page has an impressive adoption list - probably
wider than SHA-3.

-- 
Ian Jackson <ijackson@chiark.greenend.org.uk>   These opinions are my own.

If I emailed you from an address @fyvzl.net or @evade.org.uk, that is
a private address which bypasses my fierce spamfilter.

  reply	other threads:[~2017-03-03 15:21 UTC|newest]

Thread overview: 136+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-02-23 16:43 SHA1 collisions found Joey Hess
2017-02-23 17:00 ` David Lang
2017-02-23 17:02 ` Junio C Hamano
2017-02-23 17:12   ` David Lang
2017-02-23 20:49     ` Jakub Narębski
2017-02-23 20:57       ` Jeff King
2017-02-23 17:18   ` Junio C Hamano
2017-02-23 17:35   ` Joey Hess
2017-02-23 17:52     ` Linus Torvalds
2017-02-23 18:21       ` Joey Hess
2017-02-23 18:31         ` Joey Hess
2017-02-23 19:13           ` Morten Welinder
2017-02-24 15:52             ` Geert Uytterhoeven
2017-02-23 18:40         ` Linus Torvalds
2017-02-23 18:46           ` Jeff King
2017-02-23 19:09             ` Linus Torvalds
2017-02-23 19:32               ` Jeff King
2017-02-23 19:47                 ` Linus Torvalds
2017-02-23 19:57                   ` Jeff King
     [not found]                     ` <alpine.LFD.2.20.1702231428540.30435@i7.lan>
2017-02-23 22:43                       ` Jeff King
2017-02-23 22:50                         ` Linus Torvalds
2017-02-23 23:05                         ` Jeff King
2017-02-23 23:05                           ` [PATCH 1/3] add collision-detecting sha1 implementation Jeff King
2017-02-23 23:15                             ` Stefan Beller
2017-02-24  0:01                               ` Jeff King
2017-02-24  0:12                                 ` Linus Torvalds
2017-02-24  0:16                                   ` Jeff King
2017-02-23 23:05                           ` [PATCH 2/3] sha1dc: adjust header includes for git Jeff King
2017-02-23 23:06                           ` [PATCH 3/3] Makefile: add USE_SHA1DC knob Jeff King
2017-02-24 18:36                             ` HW42
2017-02-24 18:57                               ` Jeff King
2017-02-23 23:14                           ` SHA1 collisions found Linus Torvalds
2017-02-28 18:41                           ` Junio C Hamano
2017-02-28 19:07                             ` Junio C Hamano
2017-02-28 19:20                               ` Jeff King
2017-03-01  8:57                                 ` Dan Shumow
2017-02-28 19:34                               ` Linus Torvalds
2017-02-28 19:52                                 ` Shawn Pearce
2017-02-28 22:56                                   ` Linus Torvalds
2017-02-28 21:22                                 ` Dan Shumow
2017-02-28 22:50                                   ` Marc Stevens
2017-02-28 23:11                                     ` Linus Torvalds
2017-03-01 19:05                                       ` Jeff King
2017-02-23 20:47               ` Øyvind A. Holm
2017-02-23 20:46             ` Joey Hess
2017-02-23 18:42         ` Jeff King
2017-02-23 17:52     ` David Lang
2017-02-23 19:20   ` David Lang
2017-02-23 17:19 ` Linus Torvalds
2017-02-23 17:29   ` Linus Torvalds
2017-02-23 18:10   ` Joey Hess
2017-02-23 18:29     ` Linus Torvalds
2017-02-23 18:38     ` Junio C Hamano
2017-02-24  9:42 ` Duy Nguyen
2017-02-25 19:04   ` brian m. carlson
2017-02-27 13:29     ` René Scharfe
2017-02-28 13:25       ` brian m. carlson
2017-02-24 15:13 ` Ian Jackson
2017-02-24 17:04   ` ankostis
2017-02-24 17:23   ` Jason Cooper
2017-02-25 23:22     ` ankostis
2017-02-24 17:32   ` Junio C Hamano
2017-02-24 17:45     ` David Lang
2017-02-24 18:14       ` Junio C Hamano
2017-02-24 18:58         ` Stefan Beller
2017-02-24 19:20           ` Junio C Hamano
2017-02-24 20:05             ` ankostis
2017-02-24 20:32               ` Junio C Hamano
2017-02-25  0:31                 ` ankostis
2017-02-26  0:16                   ` Jason Cooper
2017-02-26 17:38                     ` brian m. carlson
2017-02-26 19:11                       ` Linus Torvalds
2017-02-26 21:38                         ` Ævar Arnfjörð Bjarmason
2017-02-26 21:52                           ` Jeff King
2017-02-27 13:00                             ` Transition plan for git to move to a new hash function Ian Jackson
2017-02-27 14:37                               ` Why BLAKE2? Markus Trippelsdorf
2017-02-27 15:42                                 ` Ian Jackson
2017-02-27 19:26                               ` Transition plan for git to move to a new hash function Tony Finch
2017-02-28 21:47                               ` brian m. carlson
2017-03-02 18:13                                 ` Ian Jackson
2017-03-04 22:49                                   ` brian m. carlson
2017-03-05 13:45                                     ` Ian Jackson
2017-03-05 23:45                                       ` brian m. carlson
2017-02-24 20:05             ` SHA1 collisions found Junio C Hamano
2017-02-24 20:33           ` Philip Oakley
2017-02-24 23:39     ` Jeff King
2017-02-25  0:39       ` Linus Torvalds
2017-02-25  0:54         ` Linus Torvalds
2017-02-25  1:16         ` Jeff King
2017-02-26 18:55           ` Junio C Hamano
2017-02-25  6:10         ` Junio C Hamano
2017-02-26  1:13           ` Jason Cooper
2017-02-26  5:18             ` Jeff King
2017-02-26 18:30               ` brian m. carlson
2017-03-02 21:46               ` Brandon Williams
2017-03-03 11:13                 ` Jeff King
2017-03-03 14:54                   ` Ian Jackson [this message]
2017-03-03 22:18                     ` Jeff King
2017-03-02 19:55         ` Linus Torvalds
2017-03-02 20:43           ` Junio C Hamano
2017-03-02 21:21             ` Linus Torvalds
2017-03-02 21:54               ` Joey Hess
2017-03-02 22:27                 ` Linus Torvalds
2017-03-03  1:50                   ` Mike Hommey
2017-03-03  2:19                     ` Linus Torvalds
2017-03-03 11:04           ` Jeff King
2017-03-03 21:47           ` Stefan Beller
2017-02-25  1:00       ` David Lang
2017-02-25  1:15         ` Stefan Beller
2017-02-25  1:21         ` Jeff King
2017-02-25  1:39           ` David Lang
2017-02-25  1:47             ` Jeff King
2017-02-25  1:56               ` David Lang
2017-02-25  2:28             ` Jacob Keller
2017-02-25  2:26           ` Jacob Keller
2017-02-25  5:39             ` grarpamp
2017-02-24 23:43     ` Ian Jackson
2017-02-25  0:06       ` Ian Jackson
2017-02-25 18:50     ` brian m. carlson
2017-02-25 19:26       ` Jeff King
2017-02-25 22:09         ` Mike Hommey
2017-02-26 17:38           ` brian m. carlson
2017-02-24 22:47 ` Jakub Narębski
2017-02-24 22:53   ` Santiago Torres
2017-02-24 23:05     ` Jakub Narębski
2017-02-24 23:24       ` Øyvind A. Holm
2017-02-24 23:06   ` Jeff King
2017-02-24 23:35     ` Jakub Narębski
2017-02-25 22:35     ` Lars Schneider
2017-02-26  0:46       ` Jeff King
2017-02-26 18:22         ` Junio C Hamano
2017-02-26 18:57     ` Thomas Braun
2017-02-26 21:30       ` Jeff King
2017-02-27  9:57         ` Geert Uytterhoeven
2017-02-27 10:43           ` Jeff King
2017-02-27 12:39             ` Morten Welinder

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=22713.33728.502854.338516@chiark.greenend.org.uk \
    --to=ijackson@chiark.greenend.org.uk \
    --cc=bmwill@google.com \
    --cc=git@lakedaemon.net \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=id@joeyh.name \
    --cc=peff@peff.net \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).