git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Linus Torvalds <torvalds@linux-foundation.org>
To: Jeff King <peff@peff.net>
Cc: Junio C Hamano <gitster@pobox.com>,
	Ian Jackson <ijackson@chiark.greenend.org.uk>,
	Joey Hess <id@joeyh.name>, Git Mailing List <git@vger.kernel.org>
Subject: Re: SHA1 collisions found
Date: Thu, 2 Mar 2017 11:55:45 -0800	[thread overview]
Message-ID: <CA+55aFwXaSAMF41Dz3u3nS+2S24umdUFv0+k+s18UyPoj+v31g@mail.gmail.com> (raw)
In-Reply-To: <CA+55aFw6BLjPK-F0RGd9LT7X5xosKOXOxuhmKX65ZHn09r1xow@mail.gmail.com>

On Fri, Feb 24, 2017 at 4:39 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Honestly, I think that a primary goal for a new hash implementation
> absolutely needs to be to minimize mixing.
>
> Not for security issues, but because of combinatorics. You want to
> have a model that basically reads old data, but that very aggressively
> approaches "new data only" in order to avoid the situation where you
> have basically the exact same tree state, just _represented_
> differently.
>
> For example, what I would suggest the rules be is something like this:

Hmm. Having looked at this a fair amount, and in particularly having
looked at the code as part of the hash typesafety patch I did, I am
actually starting to think that it would not be too painful at all to
have a totally different approach, which might be a lot easier to do.

So bear with me, let me try to explain my thinking:

 (a) if we want to be backwards compatible and not force people to
convert their trees on some flag day, we're going to be stuck with
having to have the SHA1 code around and all the existing object
parsing for basically forever

Now, this basically means that the _easy_ solution would be that we
just do the flag day, switch to sha-256, extend everything to 32-byte
hashes, and just have a "git2 fast-import" that makes it easy to
convert stuff.

But it really would be a completely different version of git, with a
new pack-file format and no real compatibility. Such a flag-day
approach would certainly have advantages: it would allow for just
re-architecting some bad choices:

 - make the hashing be something that can be threaded (ie maybe we can
just block it up in 4MB chunks that you can hash in parallel, and make
the git object hash be the hash of hashes)

 - replace zlib with something like zstd

 - get rid of old pack formats etc.

but  on the whole, I still think that the compatibility would be worth
much more than the possible technical advantages of a clean slate
restart.

 (b) the SHA1 hash is actually still quite strong, and the collision
detection code effectively means that we don't really have to worry
about collisions in the immediate future.

In other words, the mitigation of the current attack is actually
really easy technically (modulo perhaps the performance concerns), and
there's still nothing fundamentally wrong with using SHA1 as a content
hash. It's still a great hash.

Now, my initial reaction (since it's been discussed for so long
anyway) was obviously "pick a different hash". That was everybody's
initial reaction, I think.

But I'm starting to think that maybe that initial obvious reaction was wrong.

The thing that makes collision attacks so nasty is that our reaction
to a collision is so deadly.  But that's not necessarily fundamental:
we certainly uses hashes with collisions every day, and they work
fine. And they work fine because the code that uses those hashes is
designed to simply deal gracefully - although very possibly with a
performance degradation - with two different things hashing to the
same bucket.

So what if the solution to "SHA1 has potential collisions" is "any
hash can have collisions in theory, let's just make sure we handle
them gracefully"?

Because I'm looking at our data structures that have hashes in them,
and many of them are actually of the type where I go

  "Hmm..  Replacing the hash entirely is really really painful - but
it wouldn't necessarily be all that nasty to extend the format to have
additional version information".

and the advantage of that approach is that it actually makes the
compatibility part trivial. No more "have to pick a different hash and
two different formats", and more of a "just the same format with
extended information that might not be there for old objects".

So we have a few different types of formats:

 - the purely local data structures: the pack index file, the file
index, our refs etc

   These we could in change completely, and it wouldn't even be all
that painful. The pack index has already gone through versions, and it
doesn't affect anything else.

 - the actual objects.

   These are fairly painful to change, particularly things like the
"tree" object which is probably the worst designed of the lot. Making
it contain a fixed-size binary thing was really a nasty mistake. My
bad.

 - the pack format and the protocol to exchange "I have this" information

   This is *really* painful to change, because it contains not just
the raw object data, but it obviously ends up being the wire format
for remote accesses.

and it turns out that *all* of these formats look like they would be
fairly easy to extend to having extra object version information. Some
of that extra object version information we already have and don't
use, in fact.

Even the tree format, with the annoying fixed-size binary blob. Yes,
it has that fixed size binary blob, but it has it in _addition_ to the
ASCII textual form that would be really easy to just extend upon. We
have that "tree entry type" that we've already made extensions with by
using it for submodules. It would be quite easy to just say that a
tree entry also has a "file version" field, so that you can have
multiple objects that just hash to the same SHA1, and git wouldn't
even *care*.

The transfer protocol is the same: yes, we transfer hashes around, but
it would not be all that hard to extend it to "transfer hash and
object version".

And the difference is that then the "backwards compatibility" part
just means interacting with somebody who didn't know to transfer the
object version. So suddenly being backwards compatible isn't a whole
different object parsing thing, it's just a small extension.

IOW, we could make it so that the SHA1 is just a hash into a list of
objects. Even the pack index format wouldn't need to change - right
now we assume that an index hit gives us the direct pointer into the
pack file, but we *could* just make it mean that it gives us a direct
pointer to the first object in the pack file with that SHA1 hash.
Exactly like you'd normally use a hash table with linear probing.

Linear probing is usually considered a horrible approach to hash
tables,. but it's actually a really useful one for the case where
collisions are very rare.

Anyway, I do have a suggestion for what the "object version" would be,
but I'm not even going to mention it, because I want people to first
think about the _concept_ and not the implementation.

So: What do you think about the concept?

               Linus

  parent reply	other threads:[~2017-03-02 20:04 UTC|newest]

Thread overview: 136+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-02-23 16:43 SHA1 collisions found Joey Hess
2017-02-23 17:00 ` David Lang
2017-02-23 17:02 ` Junio C Hamano
2017-02-23 17:12   ` David Lang
2017-02-23 20:49     ` Jakub Narębski
2017-02-23 20:57       ` Jeff King
2017-02-23 17:18   ` Junio C Hamano
2017-02-23 17:35   ` Joey Hess
2017-02-23 17:52     ` Linus Torvalds
2017-02-23 18:21       ` Joey Hess
2017-02-23 18:31         ` Joey Hess
2017-02-23 19:13           ` Morten Welinder
2017-02-24 15:52             ` Geert Uytterhoeven
2017-02-23 18:40         ` Linus Torvalds
2017-02-23 18:46           ` Jeff King
2017-02-23 19:09             ` Linus Torvalds
2017-02-23 19:32               ` Jeff King
2017-02-23 19:47                 ` Linus Torvalds
2017-02-23 19:57                   ` Jeff King
     [not found]                     ` <alpine.LFD.2.20.1702231428540.30435@i7.lan>
2017-02-23 22:43                       ` Jeff King
2017-02-23 22:50                         ` Linus Torvalds
2017-02-23 23:05                         ` Jeff King
2017-02-23 23:05                           ` [PATCH 1/3] add collision-detecting sha1 implementation Jeff King
2017-02-23 23:15                             ` Stefan Beller
2017-02-24  0:01                               ` Jeff King
2017-02-24  0:12                                 ` Linus Torvalds
2017-02-24  0:16                                   ` Jeff King
2017-02-23 23:05                           ` [PATCH 2/3] sha1dc: adjust header includes for git Jeff King
2017-02-23 23:06                           ` [PATCH 3/3] Makefile: add USE_SHA1DC knob Jeff King
2017-02-24 18:36                             ` HW42
2017-02-24 18:57                               ` Jeff King
2017-02-23 23:14                           ` SHA1 collisions found Linus Torvalds
2017-02-28 18:41                           ` Junio C Hamano
2017-02-28 19:07                             ` Junio C Hamano
2017-02-28 19:20                               ` Jeff King
2017-03-01  8:57                                 ` Dan Shumow
2017-02-28 19:34                               ` Linus Torvalds
2017-02-28 19:52                                 ` Shawn Pearce
2017-02-28 22:56                                   ` Linus Torvalds
2017-02-28 21:22                                 ` Dan Shumow
2017-02-28 22:50                                   ` Marc Stevens
2017-02-28 23:11                                     ` Linus Torvalds
2017-03-01 19:05                                       ` Jeff King
2017-02-23 20:47               ` Øyvind A. Holm
2017-02-23 20:46             ` Joey Hess
2017-02-23 18:42         ` Jeff King
2017-02-23 17:52     ` David Lang
2017-02-23 19:20   ` David Lang
2017-02-23 17:19 ` Linus Torvalds
2017-02-23 17:29   ` Linus Torvalds
2017-02-23 18:10   ` Joey Hess
2017-02-23 18:29     ` Linus Torvalds
2017-02-23 18:38     ` Junio C Hamano
2017-02-24  9:42 ` Duy Nguyen
2017-02-25 19:04   ` brian m. carlson
2017-02-27 13:29     ` René Scharfe
2017-02-28 13:25       ` brian m. carlson
2017-02-24 15:13 ` Ian Jackson
2017-02-24 17:04   ` ankostis
2017-02-24 17:23   ` Jason Cooper
2017-02-25 23:22     ` ankostis
2017-02-24 17:32   ` Junio C Hamano
2017-02-24 17:45     ` David Lang
2017-02-24 18:14       ` Junio C Hamano
2017-02-24 18:58         ` Stefan Beller
2017-02-24 19:20           ` Junio C Hamano
2017-02-24 20:05             ` ankostis
2017-02-24 20:32               ` Junio C Hamano
2017-02-25  0:31                 ` ankostis
2017-02-26  0:16                   ` Jason Cooper
2017-02-26 17:38                     ` brian m. carlson
2017-02-26 19:11                       ` Linus Torvalds
2017-02-26 21:38                         ` Ævar Arnfjörð Bjarmason
2017-02-26 21:52                           ` Jeff King
2017-02-27 13:00                             ` Transition plan for git to move to a new hash function Ian Jackson
2017-02-27 14:37                               ` Why BLAKE2? Markus Trippelsdorf
2017-02-27 15:42                                 ` Ian Jackson
2017-02-27 19:26                               ` Transition plan for git to move to a new hash function Tony Finch
2017-02-28 21:47                               ` brian m. carlson
2017-03-02 18:13                                 ` Ian Jackson
2017-03-04 22:49                                   ` brian m. carlson
2017-03-05 13:45                                     ` Ian Jackson
2017-03-05 23:45                                       ` brian m. carlson
2017-02-24 20:05             ` SHA1 collisions found Junio C Hamano
2017-02-24 20:33           ` Philip Oakley
2017-02-24 23:39     ` Jeff King
2017-02-25  0:39       ` Linus Torvalds
2017-02-25  0:54         ` Linus Torvalds
2017-02-25  1:16         ` Jeff King
2017-02-26 18:55           ` Junio C Hamano
2017-02-25  6:10         ` Junio C Hamano
2017-02-26  1:13           ` Jason Cooper
2017-02-26  5:18             ` Jeff King
2017-02-26 18:30               ` brian m. carlson
2017-03-02 21:46               ` Brandon Williams
2017-03-03 11:13                 ` Jeff King
2017-03-03 14:54                   ` Ian Jackson
2017-03-03 22:18                     ` Jeff King
2017-03-02 19:55         ` Linus Torvalds [this message]
2017-03-02 20:43           ` Junio C Hamano
2017-03-02 21:21             ` Linus Torvalds
2017-03-02 21:54               ` Joey Hess
2017-03-02 22:27                 ` Linus Torvalds
2017-03-03  1:50                   ` Mike Hommey
2017-03-03  2:19                     ` Linus Torvalds
2017-03-03 11:04           ` Jeff King
2017-03-03 21:47           ` Stefan Beller
2017-02-25  1:00       ` David Lang
2017-02-25  1:15         ` Stefan Beller
2017-02-25  1:21         ` Jeff King
2017-02-25  1:39           ` David Lang
2017-02-25  1:47             ` Jeff King
2017-02-25  1:56               ` David Lang
2017-02-25  2:28             ` Jacob Keller
2017-02-25  2:26           ` Jacob Keller
2017-02-25  5:39             ` grarpamp
2017-02-24 23:43     ` Ian Jackson
2017-02-25  0:06       ` Ian Jackson
2017-02-25 18:50     ` brian m. carlson
2017-02-25 19:26       ` Jeff King
2017-02-25 22:09         ` Mike Hommey
2017-02-26 17:38           ` brian m. carlson
2017-02-24 22:47 ` Jakub Narębski
2017-02-24 22:53   ` Santiago Torres
2017-02-24 23:05     ` Jakub Narębski
2017-02-24 23:24       ` Øyvind A. Holm
2017-02-24 23:06   ` Jeff King
2017-02-24 23:35     ` Jakub Narębski
2017-02-25 22:35     ` Lars Schneider
2017-02-26  0:46       ` Jeff King
2017-02-26 18:22         ` Junio C Hamano
2017-02-26 18:57     ` Thomas Braun
2017-02-26 21:30       ` Jeff King
2017-02-27  9:57         ` Geert Uytterhoeven
2017-02-27 10:43           ` Jeff King
2017-02-27 12:39             ` Morten Welinder

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CA+55aFwXaSAMF41Dz3u3nS+2S24umdUFv0+k+s18UyPoj+v31g@mail.gmail.com \
    --to=torvalds@linux-foundation.org \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=id@joeyh.name \
    --cc=ijackson@chiark.greenend.org.uk \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).