Val Henson's critique of hash-based content storage systems

git@vger.kernel.org mailing list mirror (one of many)

* Val Henson's critique of hash-based content storage systems
@ 2005-04-29  0:06 Rob Jellinghaus
  2005-04-29 19:45 ` Linus Torvalds
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Rob Jellinghaus @ 2005-04-29  0:06 UTC (permalink / raw)
  To: git

I assume most people here have read this, but just in case:

http://www.usenix.org/events/hotos03/tech/full_papers/henson/henson.pdf

Is git vulnerable to attacks in the event that SHA-1 is broken?

If an attacker used an SHA-1 attack to create a blob that matched the hash of
some well-known git object (say, the tree for Linux 2.7-rc1), and spammed public
git repositories with it ahead of Linus's release, what would be the potential
for mischief, and what would the recovery process be?

It seems that git is optimized to support networks of trust, so provided you
accept only signed commits from people you trust, it's likely that corruption
and mischief can be mostly avoided.  But probably not completely; there is still
a window of vulnerability.

It seems that git repositories could (at great expense) be regenerated to use a
new hash algorithm.  Is that the plan if SHA-1 is compromised (or comes so close
to compromise as to make Linus nervous ;-)?

Cheers,
Rob


* Re: Val Henson's critique of hash-based content storage systems
  2005-04-29  0:06 Val Henson's critique of hash-based content storage systems Rob Jellinghaus
@ 2005-04-29 19:45 ` Linus Torvalds
  2005-04-29 19:52   ` Tom Lord
  2005-04-29 20:14 ` H. Peter Anvin
  2005-04-29 20:47 ` Morten Welinder
  2 siblings, 1 reply; 8+ messages in thread
From: Linus Torvalds @ 2005-04-29 19:45 UTC (permalink / raw)
  To: Rob Jellinghaus; +Cc: Git Mailing List

On Fri, 29 Apr 2005, Rob Jellinghaus wrote:
> 
> If an attacker used an SHA-1 attack to create a blob that matched the hash of
> some well-known git object (say, the tree for Linux 2.7-rc1), and spammed public
> git repositories with it ahead of Linus's release, what would be the potential
> for mischief, and what would the recovery process be?

I really think people should not consider the sha1 the "security". 

The real security is in distribution. 

With the distributed setup, developers don't use public trees. They use 
their own _private_ trees, and the public ones are just staging areas for 
synchronization.

So in order to actually replace a blob, let's say that you can create an 
object with the right sha1 trivially. What then?

You now have to break into _every_ repository that has that object, and 
replace it silently. Because if you don't, the good one will still be 
around.

That's just not going to happen.

So let's say that you break into kernel.org, and replace one of the blobs
in my repository.  What happens?

First off, I'll never notice, because it's not actually my repository, so 
I won't even have the corrupt copy. So what _will_ happen?

What will happen is that people who download new stuff from kernel.org
will get the "evil" object. Not all of them, though - just the ones that
hadn't downloaded the proper one. So first off, in order to be really
_effective_, the attack really has to not just replace an object, it
really wants to replace a pretty _recent_ object, because replacing an old
one just doesn't do a whole lot.

So they get the evil object. What happens? NOTHING. Absolutely nada.  
Either they use that evil object, or they don't. Not using it might be
because it's not even top-of-tree any more, and you really just replaced
some old version of a file. Or it might be because it's an object for a
driver that you don't have, so you'd never see it.

So let's ignore that case, and say that the attacker has successfully
replaced an object that is (a) recent enough to matter and (b) actually
used.

What now? You'll get a compile error. Big deal. People will notice that
something is wrong, complain about it, we'll think they have disk
corruption for a while, and then we'll figure it out, and replace the
object. Done.

Why? Because even if you successfully find an object with the same SHA1, 
the likelihood that that object actually makes _sense_ in that context is 
pretty damn near zero. 

Think about it. We've had this before: people whose files got flipped
around due to driver bugs or just hardware problems, and even just a
single bit error most of the time results in real honest-to-God compiler
errors.

And because we found the bad one, and we have the good one somewhere else, 
who cares? The security industry will be all atwitter about somebody 
finding a matching SHA1 object, and it will be _huge_ news, but did it 
actually hurt the kernel integrity? No.

So let's say that somebody breaks in to _my_ personal machine. I'm behind 
a few firewalls and a NAT setup, and I don't accept even incoming ssh, but 
hey, they could crowbar my door and break in that way. 

ONLY A TOTAL IDIOT would then replace an object in my database with
something else. That would be _stupid_. He'd just guarantee that all the 
same problems as above were true, except now we'd have to find the 
good object in some _other_ database than mine.

So if you actually wanted to corrupt the kernel tree, you'd do it by just
fooling me into accepting a crap patch. Hey, it happens all the time.  
People send me buggy stuff. We figure out the bugs. What's so different
here?

In other words, the security isn't in the hash. The hash is an added level 
to make it much harder to fool, but it's not "the security". 

And if we are really really unlucky, and a meteorite hits us, and we get
an object collision that has the same sha1 for _real_, and actually makes
sense, then hey, shit happens. We can fix it by "poisoning" that sha1, and
modifying both files trivially so that they don't match any more, and then
we add a list of "illegal" sha1's to fsck, and we'll make that list be ten
entries long, just in case the meteorite strikes ten times, but the fact
is it's simply not going to happen.
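
The "poisoning" idea above can be sketched roughly as follows. This is an
illustration only, not git's fsck: POISONED_SHA1S, git_blob_id and
check_object are hypothetical names, and the id is computed the way
current git hashes a blob (SHA-1 over a "blob <size>\0" header followed by
the content), which is an assumption relative to the code of the time.

```python
import hashlib

# Hypothetical list of object ids known to be involved in a collision.
# An fsck-like checker would refuse any object whose id appears here.
POISONED_SHA1S = set()


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 id of a blob: sha1(b"blob <size>\\0" + content)."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def check_object(content: bytes) -> str:
    """Return the object id, or raise if the id is on the poisoned list."""
    oid = git_blob_id(content)
    if oid in POISONED_SHA1S:
        raise ValueError(f"object {oid} is on the poisoned list")
    return oid
```

Once both colliding files have been modified trivially, their new ids no
longer match the poisoned one, so only stale copies trip the check.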

(It's going to be very very obvious, very very quickly, btw: the person
who actually created the object that happened to collide will not write
the new SHA1 out, because he already "had" the same object, so next time
somebody updates the tree, the file that matches will now have the "old
contents" from some other colliding file, and the new code simply won't do
what it was supposed to. So don't worry about it - collisions, even if
they happen, will be noticed as quite obvious _bugs_ in the end result,
the same way we find the common source of bugs - bad programming).
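
The behaviour described in the parenthesis can be reproduced with a toy
content-addressed store. This is a sketch under loud assumptions: ToyStore,
weak_id and find_collision are hypothetical names, and the hash is
deliberately truncated to 8 bits so that a collision can be found by brute
force (nothing here attacks real SHA-1).

```python
import hashlib


def weak_id(content: bytes) -> str:
    """Deliberately weak id: only the first 2 hex digits of SHA-1 (8 bits)."""
    return hashlib.sha1(content).hexdigest()[:2]


class ToyStore:
    """Content-addressed store that never rewrites an id it already has."""

    def __init__(self):
        self.objects = {}

    def write(self, content: bytes) -> str:
        oid = weak_id(content)
        # If the id already exists, the new content is silently not written.
        self.objects.setdefault(oid, content)
        return oid

    def read(self, oid: str) -> bytes:
        return self.objects[oid]


def find_collision(original: bytes) -> bytes:
    """Brute-force a different byte string with the same weak id."""
    target = weak_id(original)
    i = 0
    while True:
        candidate = b"new code %d" % i
        if candidate != original and weak_id(candidate) == target:
            return candidate
        i += 1


store = ToyStore()
old = b"old contents"
oid = store.write(old)
new = find_collision(old)
same_oid = store.write(new)   # collides, so the store keeps the old bytes
```

After the second write, reading the id back returns the old contents, which
is exactly the visible "bug" the message predicts.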

In other words: don't depend on hashes if you only have one copy of the
data. But if you have backups of old versions (which essentially the
distribution guarantees as long as we have "stupid" mirrors that just look
at the filename) having a hash collision doesn't mean that you lost any
real data.

So anybody who thinks that a hash collision is a fundamental problem just
hasn't thought things through. It's an _annoyance_, nothing more. But we
have tons of much more pressing annoyances, and pretty much all of them
are a hell of a lot more likely than a collision, whether intentional or
unintentional.

			Linus


* Re: Val Henson's critique of hash-based content storage systems
  2005-04-29 19:45 ` Linus Torvalds
@ 2005-04-29 19:52   ` Tom Lord
  2005-04-29 20:17     ` C. Scott Ananian
  0 siblings, 1 reply; 8+ messages in thread
From: Tom Lord @ 2005-04-29 19:52 UTC (permalink / raw)
  To: git; +Cc: robj

I wouldn't expect outright successful attacks like forged replacements
for arbitrary files.

I would expect someone to have on hand a small number of blobs that are
different but have the same hash and, eventually, to drop said files
into a blob-based infrastructure to wreak havoc.

So: a way to locally mark a given checksum as "controversial" seems 
prudent, to me (hence, support for such in my blob-db code/spec).

-t


* Re: Val Henson's critique of hash-based content storage systems
  2005-04-29  0:06 Val Henson's critique of hash-based content storage systems Rob Jellinghaus
  2005-04-29 19:45 ` Linus Torvalds
@ 2005-04-29 20:14 ` H. Peter Anvin
  2005-04-29 20:47 ` Morten Welinder
  2 siblings, 0 replies; 8+ messages in thread
From: H. Peter Anvin @ 2005-04-29 20:14 UTC (permalink / raw)
  To: Rob Jellinghaus; +Cc: git

Rob Jellinghaus wrote:
> I assume most people here have read this, but just in case:
> 
> http://www.usenix.org/events/hotos03/tech/full_papers/henson/henson.pdf
> 

I have to pull out the big flamethrower, especially against someone I 
consider a friend, but that paper is a classic example of how many 
people don't understand probability.

The *only* valid criticism in it is that we may not know enough about 
the future validity of cryptographic hash functions; however, she barely 
analyzes the failure scenarios applicable to those kinds of failures 
at all.

In the end, the whole paper centers around "this makes me feel nervous", 
without really justifying it in any reasonable way.

It is just one of many papers on cryptanalysis written by someone with 
no real background in the field.  It really saddens me to see someone 
like Val fall into that particular trap.

	-hpa


* Re: Val Henson's critique of hash-based content storage systems
  2005-04-29 19:52   ` Tom Lord
@ 2005-04-29 20:17     ` C. Scott Ananian
  2005-04-29 20:37       ` Tom Lord
  0 siblings, 1 reply; 8+ messages in thread
From: C. Scott Ananian @ 2005-04-29 20:17 UTC (permalink / raw)
  To: Tom Lord; +Cc: git, robj

On Fri, 29 Apr 2005, Tom Lord wrote:

> I would expect someone to have on hand a small number of blobs that are
> different but have the same hash and, eventually, to drop said files
> into a blob-based infrastructure to wreak havoc.

This is just ridiculous.  The number of known collisions in SHA1 is 
*exactly zero* at this point in time --- not guaranteed to stay that way, 
of course, but generating collisions is likely to remain relatively 
expensive for some time.  The collisions are highly structured; they are 
not just arbitrary blobs.  And after doing your 2^69 work or so to 
generate a real honest-to-goodness SHA-1 collision, you think an 
attacker would "DROP THEM IN A REPOSITORY TO CREATE HAVOC"?  You'd have to 
break into the repository, etc, and then you'd find that *NOTHING 
REFERENCED THEM* and so *ABSOLUTELY NOTHING WOULD HAPPEN*.
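
The "nothing referenced them" point can be illustrated with a small
reachability walk. This is a sketch, not git code: the object graph is a
plain dict from a hypothetical object id to the ids it references, and
reachable() is an illustrative helper.

```python
def reachable(roots, objects):
    """Return the set of object ids reachable from the given root ids.

    objects maps an object id to the list of ids it references
    (commit -> tree, tree -> blobs, and so on).
    """
    seen = set()
    stack = list(roots)
    while stack:
        oid = stack.pop()
        if oid in seen or oid not in objects:
            continue
        seen.add(oid)
        stack.extend(objects[oid])
    return seen


# Hypothetical database: one commit, its tree, one blob, and a stray object
# that was dropped into the store but that nothing points at.
objects = {
    "commit1": ["tree1"],
    "tree1": ["blobA"],
    "blobA": [],
    "dropped": [],
}
live = reachable(["commit1"], objects)
```

The dropped object sits in the store but is never visited, so a checkout
starting from the commit never touches it.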

It's far more likely that SHA1 collisions will be used to generate forged 
X509 certificates, for a number of highly technical reasons.

Git's highly constrained and derided 'brittle' file formats also serve
to protect against the collision attacks against SHA-1 which are beginning 
to look possible.

> So: a way to locally mark a given checksum as "controversial" seems
> prudent, to me (hence, support for such in my blob-db code/spec).

Arguably that's what *upgrades* to the spec might be for -- git has a 
solid philosophy of not creating 'features' unless it is sure that they 
are needed/will be used, and I think this is always the wise route in 
software development.  Of much specification comes no code.

And, if you actually create a 'flexible' blob-db spec with 'room for 
expansion' -- congratulations, you've just made yourself more vulnerable 
to collision attacks.
  --scott

terrorist MI5 SKILLET hack AMLASH security KMPLEBE KUFIRE SCRANTON 
D5 SLBM LINCOLN KUDESK SMOTH Kojarena Moscow HTAUTOMAT WSBURNT Chechnya
                          ( http://cscott.net/ )


* Re: Val Henson's critique of hash-based content storage systems
  2005-04-29 20:17     ` C. Scott Ananian
@ 2005-04-29 20:37       ` Tom Lord
  2005-04-29 20:41         ` C. Scott Ananian
  0 siblings, 1 reply; 8+ messages in thread
From: Tom Lord @ 2005-04-29 20:37 UTC (permalink / raw)
  To: cscott; +Cc: git, robj

  lord:

  > I would expect someone to have on hand a small number of blobs that are
  > different but have the same hash and, eventually, to drop said files
  > into a blob-based infrastructure to wreak havoc.

  cscott:

  This is just ridiculous.  The number of known collisions in SHA1 is 
  *exactly zero* at this point in time --- not guaranteed to stay that way, 
  of course, but generating collisions is likely to remain relatively 
  expensive for some time.

Blob-dbs and the low-level object system (trees, file-contents, and
changesets) are pretty fundamental things.  It is likely (and
desirable) -- not guaranteed but likely (and desirable) -- that people
will invest heavily in building infrastructure that operates solely at
that level of abstraction.  Arguably, that is already happening.

Simultaneously, it is very desirable that some mathematician somewhere
will discover two bitstrings which are different but have the same SHA1
checksum, and then tell everyone in the world about their discovery.

My point is simply that blob-db implementations should assume that the
mathematicians will succeed and take the small steps necessary to make
sure that those bitstrings can't be used to crash a distributed
blob-db infrastructure.

-t


* Re: Val Henson's critique of hash-based content storage systems
  2005-04-29 20:37       ` Tom Lord
@ 2005-04-29 20:41         ` C. Scott Ananian
  0 siblings, 0 replies; 8+ messages in thread
From: C. Scott Ananian @ 2005-04-29 20:41 UTC (permalink / raw)
  To: Tom Lord; +Cc: git, robj

On Fri, 29 Apr 2005, Tom Lord wrote:

> My point is simply that blob-db implementations should assume that the
> mathematicians will succeed and take the small steps necessary to make
> sure that those bitstrings can't be used to crash a distributed
> blob-db infrastructure.

And my point is that you haven't *begun* to describe how one might use an 
arbitrary hash collision to "crash a distributed blob-db infrastructure".

Remember, first you've got to get some reference to your collision into 
the db...  (and if you can do that, why are you mucking around with hash 
collisions?)
   --scott

Philadelphia PBPRIME STANDEL for Dummies milita Richard Tomlinson 
ESSENCE SUMAC Nader KUCLUB WSHOOFS QKENCHANT AK-47 AMQUACK supercomputer
                          ( http://cscott.net/ )


* Re: Val Henson's critique of hash-based content storage systems
  2005-04-29  0:06 Val Henson's critique of hash-based content storage systems Rob Jellinghaus
  2005-04-29 19:45 ` Linus Torvalds
  2005-04-29 20:14 ` H. Peter Anvin
@ 2005-04-29 20:47 ` Morten Welinder
  2 siblings, 0 replies; 8+ messages in thread
From: Morten Welinder @ 2005-04-29 20:47 UTC (permalink / raw)
  To: Rob Jellinghaus; +Cc: git

On 4/28/05, Rob Jellinghaus <robj@unrealities.com> wrote:
> I assume most people here have read this, but just in case:
> 
> http://www.usenix.org/events/hotos03/tech/full_papers/henson/henson.pdf

The math in section 3 is bogus.  1-(1-2^-b)^n  isn't hard to compute and
even if it were, it is the wrong formula.  (Set n==2^b; you obviously should
get probability 1 for collision.)

The right formula is 1-B!/B^n/(B-n)! where B=2^b.  For n=2^80 and b=160
you get about 39%.
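
The 39% figure can be checked numerically. The sketch below is not from the
paper or the thread: the function names are illustrative, B stands for the
number of possible hash values (2^b), and the large case uses the standard
approximation 1 - exp(-n(n-1)/(2B)) because the exact product cannot be
iterated 2^80 times.

```python
import math


def collision_exact(n: int, B: int) -> float:
    """Exact birthday probability 1 - B!/(B^n (B-n)!), for small n only."""
    if n > B:
        return 1.0
    p_distinct = 1.0
    for i in range(n):
        p_distinct *= (B - i) / B
    return 1.0 - p_distinct


def collision_approx(n: int, B: int) -> float:
    """Approximation 1 - exp(-n(n-1)/(2B)), valid when n is far below B."""
    return -math.expm1(-n * (n - 1) / (2 * B))


def other_expression(n: int, B: int) -> float:
    """The expression 1-(1-1/B)^n quoted above, evaluated stably."""
    return -math.expm1(n * math.log1p(-1.0 / B))


p_sha1 = collision_approx(2**80, 2**160)      # about 0.39
p_check = other_expression(2**160, 2**160)    # about 0.63, not 1
```

With n == 2^b the other expression gives roughly 1 - 1/e, about 63%, which
is the inconsistency pointed out above.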

Morten

