git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Ray Heasman <lists@mythral.org>
To: Jeff Garzik <jgarzik@pobox.com>
Cc: Git Mailing List <git@vger.kernel.org>
Subject: Re: Hash collision count
Date: Sat, 23 Apr 2005 18:01:08 -0700	[thread overview]
Message-ID: <1114304468.10043.45.camel@maze.mythral.org> (raw)
In-Reply-To: <426AD835.5070404@pobox.com>

On Sat, 2005-04-23 at 19:20 -0400, Jeff Garzik wrote:
> Ray Heasman wrote:
> > On Sat, 2005-04-23 at 16:27 -0400, Jeff Garzik wrote:
> > 
> >>Ideally a hash + collision-count pair would make the best key, rather 
> >>than just hash alone.
> >>
> >>A collision -will- occur eventually, and it is trivial to avoid this 
> >>problem:
> >>
> >>	$n = 0
> >>	attempt to store as $hash-$n
> >>	if $hash-$n exists (unlikely)
> >>		$n++
> >>		goto restart
> >>	key = $hash-$n
> >>
> > 
> > 
> > Great. So what have you done here? Suppose you have 32 bits of counter
> > for n. Whoopee, you just added 32 bits to your hash, using a two stage
> > algorithm. So, you have a 192 bit hash assuming you started with the 160
> > bit SHA. And, one day your 32 bit counter won't be enough. Then what?
> 
> First, there is no 32-bit limit.  git stores keys (aka hashes) as 
> strings.  As it should.

Oh great, now we have variable length id strings too. And we'll have to
pretend the OS can store infinite length file names.

> Second, in your scenario, it's highly unlikely you would get 4 billion 
> sha1 hash collisions, even if you had the disk space to store such a git 
> database.

Er. So your so-unlikely-the-sun-will-burn-out-first scenario beats my
so-unlikely-the-sun-will-burn-out-first scenario? Why am I not worried?

> > You aren't solving anything. You're just putting it off, and doing it in
> > a way that breaks all the wonderful semantics possible by just assuming
> > that the hash is unique. All of a sudden we are doing checks of data
> > that we never did before, and we have to do the check trillions of times
> > before the CPU time spent pays off.
> 
> First, the hash is NOT unique.

Nooooo. Really?

Why not just use a 8192 bit hash for each 1KiB of data? We could store a
zero length file and store all the data in the filename. Guaranteed, no
hash collisions that way.

We make an assumption that we know is right most of the time, and we use
it because we know our computer will crash from random quantum
fluctuations before we have a chance of bumping into the problem. You do
know that metastability means that every logic gate in your computer
hardware is guaranteed to fail every "x" operations, where x is defined
by process size, voltage and temperature? Sure, any failures in git
would be data dependent rather than random, but that just means we don't
get to store carefully crafted blocks invented by hypothetical
cryptographers that have completely broken SHA.

> Second, you lose data if you pretend it is unique.  I don't like losing 
> data.

You lose data either way. Just we get to burn out a few extra suns
before yours dies, and I can burn out whole galaxies before mine dies by
using a 256 bit hash, anyway.

> Third, a data check only occurs in the highly unlikely case that a hash 
> already exists -- a collision.  Rather than "trillions of times", more 
> like "one in a trillion chance."

Heh. I calculate it has a 50% probability of happening after you have
seen 10^24 input blocks. So, you are off by a factor of a trillion or
so.

Assuming we store 1 KiB blocks with a 160-bit hash, we would be able to
store 1000 Trillion Terabytes before the chance of hitting a collision
goes to 50%. To use marketing units, that is around 10 Trillion
Libraries of Congress. Every 2 bits we add to the hash doubles the
amount of data we can store before we hit a 50% probability of
collision.

I'm not sure how I could convince you that we're arguing about the
number of angels that could dance on a pin.

Ciao,
Ray


  parent reply	other threads:[~2005-04-24  0:56 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2005-04-23 20:27 Hash collision count Jeff Garzik
2005-04-23 20:33 ` Jeff Garzik
2005-04-23 23:00 ` Ray Heasman
2005-04-23 23:20   ` Jeff Garzik
2005-04-23 23:46     ` Petr Baudis
2005-04-24  0:35       ` Jeff Garzik
2005-04-24  0:40         ` Petr Baudis
2005-04-24  0:43           ` Jeff Garzik
2005-04-24 21:24             ` Imre Simon
2005-04-24 22:25               ` Whales falling on houses - was: " Jon Seymour
2005-04-25 23:50       ` Tom Lord
2005-04-26  0:00         ` Petr Baudis
2005-04-24  1:01     ` Ray Heasman [this message]
2005-04-24  7:56 ` David Lang
  -- strict thread matches above, loose matches on Subject: below --
2005-04-24 23:16 linux

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1114304468.10043.45.camel@maze.mythral.org \
    --to=lists@mythral.org \
    --cc=git@vger.kernel.org \
    --cc=jgarzik@pobox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).