On Thu, Apr 26, 2018 at 05:46:28PM +0200, Duy Nguyen wrote:
> On Wed, Apr 25, 2018 at 8:49 PM, Martin Ågren wrote:
> > Once that is accomplished, I sort of suspect that this code will
> > want to be updated to not always blindly use the_hash_algo, but to
> > always work with SHA-1 sizes. Or rather, this would turn into more
> > generic code to handle both "v2 with SHA-1" and "v3 with some hash
> > function(s)". This commit might be a good first step in that
> > direction.
>
> I also have an uneasy feeling when things this close to on-disk file
> format get hash-agnostic treatment. I think we would need to start
> adding new file formats soon, from bottom up with simple things like
> loose object files (cat-file and hash-object should be enough to test
> blobs...), then moving up to pack files and more. This is when we can
> really decide where to use the new hash and whether we should keep
> some hashes as sha-1.

I agree that this is work which needs to be done soon.  There are
basically a number of pieces we need in order to handle NewHash:

* Remove the dependencies on SHA-1 as much as possible.
* Get the tests to pass with a different hash (almost done for a
  160-bit hash; in progress for 256-bit hashes).
* Write pack code.
* Write loose object index code.
* Write read-as-SHA-1 code.
* Force the codebase to always use SHA-1 when dealing with fetch/push.
* Distinguish between code which needs to use compatObjectFormat and
  code which needs to use objectFormat (sketched at the end of this
  mail).
* Decide on NewHash.

I'm working on the top two bullet points right now.  Others are
welcome to pick up other pieces, or I'll get to them eventually.

As much as I'm dreading having the bikeshedding discussion over what
we're going to pick for NewHash, some of these pieces require knowing
what algorithm it will be.  For example, we have some tests which
either need to be completely rewritten or need a translation table
written for them (think the ones that use colliding short names).  In
order for those tests to have the translation table written, we need
to be able to compute colliding values.  I'm annotating these with
prerequisites, but there are quite a few tests which are skipped.

I expect writing the pack, loose object index, and read-as-SHA-1 code
is going to require having some code for NewHash, or a stand-in for
it, present in order for it to compile and be tested.  It's possible
that others could come up with more imaginative solutions that don't
require that, but I have my doubts.

> For trailing hashes for example, there's no need to move to a new
> hash which only costs us more cycles. We just use it as a fancy
> checksum to avoid bit flips. But then my assumption about cost may be
> completely wrong without experimenting.

I would argue that consistency is helpful.  Also, do we really want
people to be able to (eventually) create colliding packs that contain
different data?  That doesn't seem like a good idea.

But also, some of the candidates we're considering for NewHash are
actually faster than SHA-1.  So for performance reasons alone, it
might be useful to adopt a consistent scheme.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204
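
P.S. To make the objectFormat/compatObjectFormat distinction above a
bit more concrete, here is a minimal sketch of what the split might
look like at a call site.  The struct and field names are illustrative
stand-ins (loosely modeled on the the_hash_algo idea quoted upthread),
not Git's actual API, and a 256-bit NewHash is assumed purely for the
sake of the example:

#include <stdio.h>
#include <stddef.h>

/* Hypothetical hash descriptor; names are illustrative, not Git's. */
struct hash_algo {
	const char *name;
	size_t rawsz;	/* binary object-name length in bytes */
	size_t hexsz;	/* hex object-name length in characters */
};

static const struct hash_algo sha1    = { "sha1",    20, 40 };
static const struct hash_algo newhash = { "newhash", 32, 64 };

/* objectFormat: what the repository stores on disk. */
static const struct hash_algo *object_format = &newhash;
/* compatObjectFormat: what fetch/push keeps speaking for now. */
static const struct hash_algo *compat_object_format = &sha1;

int main(void)
{
	/* Storage code sizes its names from objectFormat... */
	printf("loose object name: %zu hex chars (%s)\n",
	       object_format->hexsz, object_format->name);
	/* ...while wire code stays pinned to the compat (SHA-1) format. */
	printf("wire object name:  %zu hex chars (%s)\n",
	       compat_object_format->hexsz, compat_object_format->name);
	return 0;
}

The point is simply that storage code sizes its buffers from whatever
hash the repository uses, while anything on the fetch/push path stays
pinned to SHA-1 until the protocol can deal with the new hash.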