From: Ray Heasman
To: Jeff Garzik
Cc: Git Mailing List
Subject: Re: Hash collision count
Date: Sat, 23 Apr 2005 18:01:08 -0700
Message-ID: <1114304468.10043.45.camel@maze.mythral.org>
In-Reply-To: <426AD835.5070404@pobox.com>
References: <426AAFC3.800@pobox.com>
 <1114297231.10264.12.camel@maze.mythral.org>
 <426AD835.5070404@pobox.com>

On Sat, 2005-04-23 at 19:20 -0400, Jeff Garzik wrote:
> Ray Heasman wrote:
> > On Sat, 2005-04-23 at 16:27 -0400, Jeff Garzik wrote:
> >
> >> Ideally a hash + collision-count pair would make the best key, rather
> >> than just the hash alone.
> >>
> >> A collision -will- occur eventually, and it is trivial to avoid this
> >> problem:
> >>
> >>     $n = 0
> >>     restart:
> >>     attempt to store as $hash-$n
> >>     if $hash-$n exists (unlikely)
> >>         $n++
> >>         goto restart
> >>     key = $hash-$n
> >
> > Great. So what have you done here? Suppose you have 32 bits of counter
> > for n. Whoopee, you just added 32 bits to your hash, using a two-stage
> > algorithm. So you have a 192-bit hash, assuming you started with the
> > 160-bit SHA-1. And one day your 32-bit counter won't be enough. Then
> > what?
>
> First, there is no 32-bit limit. git stores keys (aka hashes) as
> strings. As it should.

Oh great, now we have variable-length id strings too. And we'll have to
pretend the OS can store infinitely long file names.

> Second, in your scenario, it's highly unlikely you would get 4 billion
> sha1 hash collisions, even if you had the disk space to store such a
> git database.

Er. So your so-unlikely-the-sun-will-burn-out-first scenario beats my
so-unlikely-the-sun-will-burn-out-first scenario? Why am I not worried?

> > You aren't solving anything. You're just putting it off, and doing it
> > in a way that breaks all the wonderful semantics made possible by just
> > assuming that the hash is unique. All of a sudden we are doing checks
> > of data that we never did before, and we would have to do the check
> > trillions of times before the CPU time spent pays off.
>
> First, the hash is NOT unique.

Nooooo. Really? Why not just use an 8192-bit hash for each 1 KiB of
data? We could store a zero-length file and keep all the data in the
filename. Guaranteed, no hash collisions that way.

We make an assumption that we know is right most of the time, and we use
it because we know our computer will crash from random quantum
fluctuations before we have a chance of bumping into the problem.
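To be concrete about what the counter scheme amounts to, here is a
sketch in Python (my illustration, not git code; a plain dict stands in
for the object database, and put() is a made-up name):

    import hashlib

    store = {}  # stand-in for the object database: key string -> bytes

    def put(content):
        # Key is "<sha1>-<n>"; n only grows on a genuine SHA-1 collision.
        h = hashlib.sha1(content).hexdigest()
        n = 0
        while True:
            key = "%s-%d" % (h, n)
            existing = store.get(key)
            if existing is None:
                store[key] = content      # new object
                return key
            if existing == content:       # same bytes already stored
                return key
            n += 1                        # real collision: bump the counter

Note the extra full-content compare on every store whose hash already
exists -- that is exactly the check we never had to do before.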
You do know that metastability means every logic gate in your computer
hardware is guaranteed to fail once every "x" operations, where x is
defined by process size, voltage and temperature? Sure, any failures in
git would be data-dependent rather than random, but that just means we
don't get to store carefully crafted blocks invented by hypothetical
cryptographers who have completely broken SHA.

> Second, you lose data if you pretend it is unique. I don't like losing
> data.

You lose data either way. It's just that we get to burn out a few extra
suns before yours dies, and I can burn out whole galaxies before mine
dies by using a 256-bit hash anyway.

> Third, a data check only occurs in the highly unlikely case that a hash
> already exists -- a collision. Rather than "trillions of times", more
> like "one in a trillion chance."

Heh. I calculate a 50% probability of a collision once you have seen
10^24 input blocks, so you are off by a factor of a trillion or so.
Assuming we store 1 KiB blocks with a 160-bit hash, we would be able to
store 1000 trillion terabytes before the chance of hitting a collision
reaches 50%. To use marketing units, that is around 10 trillion
Libraries of Congress. And every 2 bits we add to the hash doubles the
amount of data we can store before we hit that 50% probability.

I'm not sure how I could convince you that we're arguing about the
number of angels that can dance on the head of a pin.

Ciao,
Ray
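P.S. If anyone wants to check the arithmetic, the standard birthday
approximation does it in a few lines of Python (a sketch; the constants
are the only git-relevant part):

    import math

    bits = 160
    # P(collision) ~= 1 - exp(-k^2 / 2^(bits+1)) for k random hashes.
    # Solving for P = 0.5 gives k = sqrt(2 * ln 2) * 2^(bits / 2).
    k = math.sqrt(2 * math.log(2)) * 2.0 ** (bits / 2.0)
    print("blocks for 50%% collision odds: %.2e" % k)      # ~1.4e24
    print("bytes at 1 KiB per block: %.2e" % (k * 1024))   # ~1.5e27

That 1.5e27 bytes is the 1000 trillion terabytes above, near enough.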