git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* Re: Hash collision count
@ 2005-04-24 23:16 linux
  0 siblings, 0 replies; 14+ messages in thread
From: linux @ 2005-04-24 23:16 UTC (permalink / raw)
  To: jgarzik; +Cc: git

*Sigh*.

> A collision -will- occur eventually, and it is trivial to avoid this 
> problem:

Yes, it will occur *eventually*.  Let's put some numbers on this
"eventually" business.

The earth's sun will run out of hydrogen and become a red giant in about
6 billion years, 1.3 * 2^57 seconds.  (Many people say 5 billion. but
I'll round up for safety.)  Suppose we add 1 file per second to our
repository until we are interrupted by the sun's photosphere engulfing
the planet and melting all our computers.

That's a total of n*(n-1)/2 = 1.7 * 2^113 pairs of files which can
possibly collide.  I'll round up to 2^114 for simplicity.

Assuming we have a good uniform hash function, the chance that any given
pair of different file versions produces an identical hash is 1/2^160.

The chance that there are no collsions is (1-1/2^160)^(2^114).  What is
this numerically?  Well, (1-a)*(1-b) = 1 - a - b + a*b.  When multiplying
probabilities very close to 1 (a and b small), the probability is
a little bit larger than (1-(a+b)).

Likewise, when exponentiating proabilities very close to 1, (1-a)^n is
a bit larger than 1-n*a.

Thus, the probability of no collisions is a bit larger than (1 - 2^114/2^160),
or (1 - 2^-46).  The probability of one or more collisions is a bit *less*
than 2^-46, 1 in 70 trillion.


With odds like that, I submit that it's not worth fixing; the odds
of introducing a bug are far higher than that.


If you can't find a hard drive large enough to store 183 quadrillion
file versions, the probability of a collision decreases as the square of
the fraction of that number you do store.  For example, if you only have
400 billion versions, the chance of a collision is around 2^-84.

We have a fair bit of safety margin in our "good uniform hash" assumption.

Running out of 32-bit inode number space is a *far* more urgent problem.

Not to mention that over the next 1000 years of kernel maintainers, one
of them may find something they like better than git.

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Hash collision count
@ 2005-04-23 20:27 Jeff Garzik
  2005-04-23 20:33 ` Jeff Garzik
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Jeff Garzik @ 2005-04-23 20:27 UTC (permalink / raw)
  To: Git Mailing List, Linus Torvalds


Ideally a hash + collision-count pair would make the best key, rather 
than just hash alone.

A collision -will- occur eventually, and it is trivial to avoid this 
problem:

	$n = 0
	attempt to store as $hash-$n
	if $hash-$n exists (unlikely)
		$n++
		goto restart
	key = $hash-$n

Tangent-as-the-reason-I-bring-this-up:

One of my long-term projects is an archive service, somewhat like 
Plan9's venti:  a multi-server key-value database, with sha1 hash as the 
key.

However, as the database grows into the terabyte (possibly petabyte) 
range, the likelihood of a collision transitions rapidly from unlikely 
-> possible -> likely.

Since it is -so- simple to guarantee that you avoid collisions, I'm 
hoping git will do so before the key structure is too ingrained.

	Jeff




^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2005-04-25 23:55 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-04-24 23:16 Hash collision count linux
  -- strict thread matches above, loose matches on Subject: below --
2005-04-23 20:27 Jeff Garzik
2005-04-23 20:33 ` Jeff Garzik
2005-04-23 23:00 ` Ray Heasman
2005-04-23 23:20   ` Jeff Garzik
2005-04-23 23:46     ` Petr Baudis
2005-04-24  0:35       ` Jeff Garzik
2005-04-24  0:40         ` Petr Baudis
2005-04-24  0:43           ` Jeff Garzik
2005-04-24 21:24             ` Imre Simon
2005-04-25 23:50       ` Tom Lord
2005-04-26  0:00         ` Petr Baudis
2005-04-24  1:01     ` Ray Heasman
2005-04-24  7:56 ` David Lang

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).