On 2021-04-15 at 08:47:00, Ævar Arnfjörð Bjarmason wrote:
> 
> On Wed, Apr 14 2021, brian m. carlson wrote:
> > I will do some performance numbers on these patches, but it will likely
> > be the weekend before I can get to it.  I think this will add 4 bytes on
> > most platforms, since int is typically 32 bits, and the alignment
> > requirement would be for the most strictly aligned member, which is the
> > int, so a 4-byte alignment.  I don't think the alignment requirements
> > are especially onerous here.
> 
> I think if you're doing such a perf test, one where we have SHA-1 mode
> with a SHA-1 length OID vs. SHA-256 (the current behavior) would be
> interesting as well.
> 
> It seems like good SHA1s to test that with are 5a0cc8aca79 and
> 13eeedb5d17.  Running:
> 
>     GIT_PERF_REPEAT_COUNT=10 \
>     GIT_SKIP_TESTS="p0001.[3-9] p1450.2" \
>     GIT_TEST_OPTS= GIT_PERF_MAKE_OPTS='-j8 CFLAGS=-O3' \
>     ./run 5a0cc8aca79 13eeedb5d17 -- p0001-rev-list.sh p1450-fsck.sh
> 
> (I added a fsck --connectivity-only test.)
> 
> Gives us:
> 
>     Test                               5a0cc8aca79         13eeedb5d17
>     --------------------------------------------------------------------------
>     0001.1: rev-list --all             2.46(2.22+0.14)     2.48(2.25+0.14) +0.8%
>     0001.2: rev-list --all --objects   10.79(10.22+0.14)   10.92(10.24+0.20) +1.2%
>     1450.1: fsck --connectivity-only   16.61(15.42+0.34)   16.94(15.60+0.32) +2.0%
> 
> So at least on my box none of those are outside of the confidence
> intervals.  This was against my copy of git.git.  Perhaps it matters
> more under memory pressure.

I do plan to take a deeper look at this over the weekend and do some
performance numbers, and I think these are great examples to use.  I
just have a limited amount of time most weeknights because, among other
things, I am taking French a couple of nights a week.

I talked with Stolee today about this approach and the desire for
performance, so I think we're on the same page about trying to make this
as fast as possible.  I plan to try a couple of alternative solutions
which may work as well (or at least, I will make notes this time about
why they didn't work) and be less impactful.
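To make the size question above concrete, here's a minimal standalone
sketch of the effect; the "algo" member name is just an assumption for
illustration, not necessarily what the series ends up using:

    #include <stdio.h>

    #define GIT_SHA1_RAWSZ 20
    #define GIT_SHA256_RAWSZ 32
    #define GIT_MAX_RAWSZ GIT_SHA256_RAWSZ

    /* Today's layout: 32 bytes, alignment requirement of 1. */
    struct object_id_plain {
            unsigned char hash[GIT_MAX_RAWSZ];
    };

    /*
     * With an added algorithm member, the int raises the struct's
     * alignment requirement to 4, so the struct grows by 4 bytes on
     * typical platforms (36 bytes total, no trailing padding needed).
     */
    struct object_id_tagged {
            unsigned char hash[GIT_MAX_RAWSZ];
            int algo;
    };

    int main(void)
    {
            printf("plain:  %zu bytes\n", sizeof(struct object_id_plain));
            printf("tagged: %zu bytes\n", sizeof(struct object_id_tagged));
            return 0;
    }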
> >> I assume that we already checked what happened when GIT_MAX_RAWSZ
> >> increased, but that seemed worth the cost so we could have SHA-256
> >> at all.  I find the justification for this interoperability mode to
> >> be less significant, and potentially adding too much of a tax onto
> >> both SHA-1 repos that will never upgrade, and SHA-256 repos that
> >> upgrade all at once (or start as SHA-256).
> >
> > The entire goal of the interoperability is to let people seamlessly and
> > transparently move from SHA-1 to SHA-256.  Currently, the only way
> > people can move a SHA-1 repository to a SHA-256 repository is with
> > fast-import and fast-export, which loses all digital signatures and
> > tags to blobs.  This also requires a flag day.
> >
> > SHA-1 can now be attacked for USD 45,000.  That means it is within the
> > budget of a dedicated professional and virtually all medium or large
> > corporations, including even most municipal governments, to create a
> > SHA-1 collision.
> 
> Is that for vanilla SHA-1, or SHA-1DC?

Well, for SHA-1 in general.  SHA-1DC doesn't do anything except detect
collisions.  People can still create collisions, but we detect them.

> > Unfortunately, the way we deal with this is to die, so as soon as this
> > happens, the repository fails closed.  While an attacker cannot make
> > use of the collisions to spread malicious objects, because of the way
> > Git works, they can effectively DoS a repository, which is in itself a
> > security issue.  Fixing this requires major surgery.
> 
> Can you elaborate on this?  I believe that we die on any known collision
> via the SHA1-DC code, and even if we didn't have that we'd detect the
> collision (see [1] for the code) and die while the object is in the
> temporary quarantine.
> 
> I believe such a request is cheaper to serve than one that doesn't
> upload colliding objects, we die earlier (less CPU etc.), and don't add
> objects to the store.
> 
> So what's the DoS vector?

That assumes that the server is using SHA-1DC and that the object can't
be uploaded any way except a push where its hash is checked.  Those are
true for many, but not all, hosting providers.  For example, anyone
using Git in a FIPS-validated environment can only use FIPS-validated
crypto, and I'm not aware of any SHA-1DC implementations that are.
Also, some implementations, like Dulwich, don't use SHA-1DC.

Once someone can get that object into a standard Git which does use
SHA-1DC, the repository is pretty much hosed.  I can probably make this
better by just dropping the non-SHA-1DC mode here and in libgit2, to at
least disincentivize poor choices in the most common implementations.

> > We need the interoperability code to let people transition their
> > repositories away from SHA-1, even if it has some performance impact,
> > because without that most SHA-1 repositories will never transition.
> > That's what's outlined in the transition plan, and why that approach
> > was proposed, even though it would be nicer to avoid having to
> > implement it at all.
> 
> There's no question that we need working interop.
> 
> The question at least in my mind is why that interop is happening by
> annotating every object held in memory with whether they're SHA-1 or
> SHA-256, as opposed to having some translation layer earlier in the
> chain.

This is a good question.  Let me provide an example.

When we speak the remote protocol with a remote system, we'll parse out
several object IDs of the appropriate algorithm.  We will then pass
those around to various places in our transport code.  It makes it a
lot easier if we can just run every object ID through an inline mapping
function which immediately does nothing if the object ID is of the
appropriate type, rather than adding additional code to keep a check of
the current algorithm that's being used in the transport code.

It also makes it a lot easier when we let people store data in SHA-256
and then print things in SHA-1 for compatibility if we can add an
oid_to_hex_output and just map every object automatically on output,
regardless of where it came from, without needing to keep track of what
algorithm it's in.  For example, think about situations where the user
may have passed in a SHA-1 object ID and we reuse that value instead of
reading a SHA-256 object from the store.

So it's not completely impossible to avoid a hash algorithm member, but
it does significantly complicate our code not to do it.
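To illustrate the shape of that inline mapping helper, here's a
self-contained sketch; every name in it is illustrative rather than the
in-tree API:

    #include <stdio.h>

    #define GIT_MAX_RAWSZ 32

    enum hash_algo_id { HASH_ALGO_SHA1, HASH_ALGO_SHA256 };

    struct object_id {
            unsigned char hash[GIT_MAX_RAWSZ];
            int algo;
    };

    /* Stand-in for a lookup in the SHA-1 <-> SHA-256 object mapping. */
    static void translate_oid(struct object_id *oid, enum hash_algo_id to)
    {
            /* ...look up the equivalent object ID and rewrite *oid... */
            oid->algo = to;
    }

    /*
     * Callers run every object ID they touch through this.  It's a no-op
     * when the ID is already in the requested algorithm, so transport
     * code never has to track which algorithm it currently holds.
     */
    static inline void oid_map_to(struct object_id *oid, enum hash_algo_id to)
    {
            if (oid->algo == (int)to)
                    return;
            translate_oid(oid, to);
    }

    int main(void)
    {
            struct object_id oid = { .algo = HASH_ALGO_SHA1 };
            oid_map_to(&oid, HASH_ALGO_SHA256);  /* translates */
            oid_map_to(&oid, HASH_ALGO_SHA256);  /* no-op */
            printf("algo is now %d\n", oid.algo);
            return 0;
    }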
> Not all our file or in-memory structures are like that, e.g. the
> commit graph has a header saying "this is a bunch of SHA-1/256", and
> the objects that follow are padded to that actual hash size, not the
> max size we know about.
> 
> My understanding of the transition plan was that we'd e.g. have a
> SHA-1<->SHA-256 mapping of objects, which we'd, say, use to push/pull.

Correct.  That code exists and mostly works.  There are still a lot of
failing tests, but I have a pack index v3 that stores that data and a
loose object store.

> But that by the time I ran, say, a "git commit", none of that machinery
> needed to care that I was interoping with a SHA-1 repo on the other
> end.  It would just happily have all SHA-256 objects, create new ones,
> and only by the time I needed to push them would something kick in to
> re-hash them.
> 
> I *think* the answer is just "it's easier on the C level" and "the
> wastage doesn't seem to matter much", which is fair enough.

I think that's accurate.

> *Goes and digs in the ML archive*:
> 
> https://lore.kernel.org/git/1399147942-165308-1-git-send-email-sandals@crustytoothpaste.net/#t
> https://lore.kernel.org/git/55016A3A.6010100@alum.mit.edu/
> 
> To answer (some) of that myself:
> 
> Digging up some of the initial discussion, that seems to be the case;
> at that point there was a suggestion of:
> 
>     struct object_id {
>             unsigned char hash_type;
>             union {
>                     unsigned char sha1[GIT_SHA1_RAWSZ];
>                     unsigned char sha256[GIT_SHA256_RAWSZ];
>             } hash;
>     };
> 
> To which you replied:
> 
>     What I think might be more beneficial is to make the hash function
>     specific to a particular repository, and therefore you could
>     maintain something like this:
> 
>     struct object_id {
>             unsigned char hash[GIT_MAX_RAWSZ];
>     };
> 
> It wouldn't matter for the memory use to make it a union, but my reading
> of the above is that the reason for the current object_id not-a-union
> structure might not be valid now that there's a "hash_type" member, no?

Probably at that time we didn't have the formal transition plan and
didn't anticipate interoperability as a concern.

I do think that using a single hash member instead of a union makes
things easier, because we generally don't want to look up two different
members in cases like printing a hex OID.  We ultimately just want to
print the right number of bytes from that data, and the union just
makes things trickier with the compiler when we do that.
-- 
brian m. carlson (he/him or they/them)
Houston, Texas, US