From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: "brian m. carlson" <sandals@crustytoothpaste.net>
Cc: Derrick Stolee <stolee@gmail.com>,
git@vger.kernel.org, Derrick Stolee <dstolee@microsoft.com>
Subject: Re: [PATCH 03/15] cache: add an algo member to struct object_id
Date: Thu, 15 Apr 2021 10:47:00 +0200 [thread overview]
Message-ID: <87pmyw0waj.fsf@evledraar.gmail.com> (raw)
In-Reply-To: <YHZApIpPBhpmwscP@camp.crustytoothpaste.net>
On Wed, Apr 14 2021, brian m. carlson wrote:
> On 2021-04-13 at 12:12:21, Derrick Stolee wrote:
>> On 4/10/2021 11:21 AM, brian m. carlson wrote:
>> > Now that we're working with multiple hash algorithms in the same repo,
>> > it's best if we label each object ID with its algorithm so we can
>> > determine how to format a given object ID. Add a member called algo to
>> > struct object_id.
>> >
>> > Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net>
>> > ---
>> > hash.h | 1 +
>> > 1 file changed, 1 insertion(+)
>> >
>> > diff --git a/hash.h b/hash.h
>> > index 3fb0c3d400..dafdcb3335 100644
>> > --- a/hash.h
>> > +++ b/hash.h
>> > @@ -181,6 +181,7 @@ static inline int hash_algo_by_ptr(const struct git_hash_algo *p)
>> >
>> > struct object_id {
>> > unsigned char hash[GIT_MAX_RAWSZ];
>> > + int algo;
>> > };
>>
>> What are the performance implications of adding this single bit
>> (that actually costs us 4 to 8 bytes, based on alignment)? Later
>> in the series you add longer hash comparisons, too. These seem
>> like they will affect performance for existing SHA-1 repos, and
>> it would be nice to know how much we are paying for this support.
>
> I will do some performance numbers on these patches, but it will likely
> be the weekend before I can get to it. I think this will add 4 bytes on
> most platforms, since int is typically 32 bits, and the alignment
> requirement would be for the most strictly aligned member, which is the
> int, so a 4-byte alignment. I don't think the alignment requirements
> are especially onerous here.
I think if you're doing such a perf test one where we have SHA-1 mode
with SHA-1 length OID v.s. SHA-256 (the current behavior) would be
interesting as well.
It seems like good SHA1s to test that are 5a0cc8aca79 and
13eeedb5d17. Running:
GIT_PERF_REPEAT_COUNT=10 \
GIT_SKIP_TESTS="p0001.[3-9] p1450.2" \
GIT_TEST_OPTS= GIT_PERF_MAKE_OPTS='-j8 CFLAGS=-O3' \
./run 5a0cc8aca79 13eeedb5d17 -- p0001-rev-list.sh p1450-fsck.sh
(I added a fsck --connectivity-only test)
Gives us:
Test 5a0cc8aca79 13eeedb5d17
------------------------------------------------------------------------------
0001.1: rev-list --all 2.46(2.22+0.14) 2.48(2.25+0.14) +0.8%
0001.2: rev-list --all --objects 10.79(10.22+0.14) 10.92(10.24+0.20) +1.2%
1450.1: fsck --connectivity-only 16.61(15.42+0.34) 16.94(15.60+0.32) +2.0%
So at least on my box none of those are outside of the confidence
intervals. This was against my copy of git.git. Perhaps it matters more
under memory pressure.
>> I assume that we already checked what happened when GIT_MAX_RAWSZ
>> increased, but that seemed worth the cost so we could have SHA-256
>> at all. I find the justification for this interoperability mode to
>> be less significant, and potentially adding too much of a tax onto
>> both SHA-1 repos that will never upgrade, and SHA-256 repos that
>> upgrade all at once (or start as SHA-256).
>
> The entire goal of the interoperability is to let people seamlessly and
> transparently move from SHA-1 to SHA-256. Currently, the only way
> people can move a SHA-1 repository to a SHA-256 repository is with
> fast-import and fast-export, which loses all digital signatures and tags
> to blobs. This also requires a flag day.
>
> SHA-1 can now be attacked for USD 45,000. That means it is within the
> budget of a dedicated professional and virtually all medium or large
> corporations, including even most municipal governments, to create a
> SHA-1 collision.
Is that for vanilla SHA-1, or SHA-1DC?
> Unfortunately, the way we deal with this is to die, so
> as soon as this happens, the repository fails closed. While an attacker
> cannot make use of the collisions to spread malicious objects, because
> of the way Git works, they can effectively DoS a repository, which is in
> itself a security issue. Fixing this requires major surgery.
Can you elaborate on this? I believe that we die on any known collision
via the SHA1-DC code, and even if we didn't have that we'd detect the
collision (see [1] for the code) and die while the object is in the
temporary quarantine.
I believe such a request is cheaper to serve than one that doesn't
upload colliding objects, we die earlier (less CPU etc.), and don't add
objects to the store.
So what's the DoS vector?
> We need the interoperability code to let people transition their
> repositories away from SHA-1, even if it has some performance impact,
> because without that most SHA-1 repositories will never transition.
> That's what's outlined in the transition plan, and why that approach was
> proposed, even though it would be nicer to avoid having to implement it
> at all.
There's no question that we need working interop.
The question at least in my mind is why that interop is happening by
annotating every object held in memory with whether they're SHA-1 or
SHA-256, as opposed to having some translation layer earlier in the
chain.
Not all our file or in-memory structures are are like that, e.g. the
commit graph has a header saying "this is a bunch of SHA-1/256", and the
objects that follow are padded to that actual hash size, not the max
size we know about.
My understanding of the transition plan was that we'd e.g. have a
SHA-1<->SHA-256 mapping of objects, which we'd say use to push/pull.
But that by the time I ran say a "git commit" none of that machinery
needed to care that I was interoping with a SHA-1 repo on the other end.
It would just happily have all SHA-256 objects, create new ones, and
only by the time I needed to push them would something kick in to
re-hash them.
I *think* the anwer is just "it's easier on the C-level" and "the
wastage doesn't seem to matter much", which is fair enough.
*Goes and digs in the ML archive*:
https://lore.kernel.org/git/1399147942-165308-1-git-send-email-sandals@crustytoothpaste.net/#t
https://lore.kernel.org/git/55016A3A.6010100@alum.mit.edu/
To answer (some) of that myself:
Digging up some of the initial discussion that seems to be the case, at
that point there was a suggestion of:
struct object_id {
unsigned char hash_type;
union {
unsigned char sha1[GIT_SHA1_RAWSZ];
unsigned char sha256[GIT_SHA256_RAWSZ];
} hash;
};
To which you replied:
What I think might be more beneficial is to make the hash function
specific to a particular repository, and therefore you could maintain
something like this:
struct object_id {
unsigned char hash[GIT_MAX_RAWSZ];
};
It wouldn't matter for the memory use to make it a union, but my reading
of the above is that the reason for the current object_id not-a-union
structure might not be valid now that there's a "hash_type" member, no?
> I will endeavor to make the performance impact as small as possible, of
> course, and ideally there will be none. I am sensitive to the fact that
> people do run absurdly large workloads on Git, as we both know, and I do
> want to support that.
All of the above being said I do wonder if for those who worry about
hash size inflating their object store whether a more sensible end-goal
if that's an issue wouldn't be to store abbreviated hashes.
As long as you'd re-hash/inflate the size in the case of collisions
(which would be a more expensive check at the "fetch" boundary) you
could do so safely, and the result would be less memory consumption.
But maybe it's just a non-issue :)
1. https://lore.kernel.org/git/20181113201910.11518-1-avarab@gmail.com/
next prev parent reply other threads:[~2021-04-15 8:47 UTC|newest]
Thread overview: 71+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-04-10 15:21 [PATCH 00/15] SHA-256 / SHA-1 interop, part 1 brian m. carlson
2021-04-10 15:21 ` [PATCH 01/15] sha1-file: allow hashing objects literally with any algorithm brian m. carlson
2021-04-15 8:55 ` Denton Liu
2021-04-15 23:03 ` brian m. carlson
2021-04-16 15:04 ` Ævar Arnfjörð Bjarmason
2021-04-16 18:55 ` Junio C Hamano
2021-04-10 15:21 ` [PATCH 02/15] builtin/hash-object: allow literally hashing with a given algorithm brian m. carlson
2021-04-11 8:52 ` Ævar Arnfjörð Bjarmason
2021-04-11 21:07 ` brian m. carlson
2021-04-16 15:21 ` Ævar Arnfjörð Bjarmason
2021-04-16 17:27 ` Ævar Arnfjörð Bjarmason
2021-04-10 15:21 ` [PATCH 03/15] cache: add an algo member to struct object_id brian m. carlson
2021-04-11 11:55 ` Ævar Arnfjörð Bjarmason
2021-04-11 21:37 ` brian m. carlson
2021-04-13 12:12 ` Derrick Stolee
2021-04-14 1:08 ` brian m. carlson
2021-04-15 8:47 ` Ævar Arnfjörð Bjarmason [this message]
2021-04-15 23:51 ` brian m. carlson
2021-04-10 15:21 ` [PATCH 04/15] Always use oidread to read into " brian m. carlson
2021-04-10 15:21 ` [PATCH 05/15] hash: add a function to finalize object IDs brian m. carlson
2021-04-10 15:21 ` [PATCH 06/15] Use the final_oid_fn to finalize hashing of " brian m. carlson
2021-04-10 15:21 ` [PATCH 07/15] builtin/pack-redundant: avoid casting buffers to struct object_id brian m. carlson
2021-04-10 15:21 ` [PATCH 08/15] cache: compare the entire buffer for " brian m. carlson
2021-04-11 8:17 ` Chris Torek
2021-04-11 11:36 ` Ævar Arnfjörð Bjarmason
2021-04-11 21:05 ` brian m. carlson
2021-04-10 15:21 ` [PATCH 09/15] hash: set and copy algo field in " brian m. carlson
2021-04-11 11:57 ` Ævar Arnfjörð Bjarmason
2021-04-11 21:48 ` brian m. carlson
2021-04-11 22:12 ` Ævar Arnfjörð Bjarmason
2021-04-11 23:52 ` brian m. carlson
2021-04-12 11:02 ` [PATCH 0/2] C99: harder dependency on variadic macros Ævar Arnfjörð Bjarmason
2021-04-12 11:02 ` [PATCH 1/2] git-compat-util.h: clarify comment on GCC-specific code Ævar Arnfjörð Bjarmason
2021-04-13 7:57 ` Jeff King
2021-04-13 21:07 ` Junio C Hamano
2021-04-14 5:21 ` Jeff King
2021-04-14 6:12 ` Ævar Arnfjörð Bjarmason
2021-04-14 7:31 ` Jeff King
2021-05-21 2:06 ` Jonathan Nieder
2021-04-12 11:02 ` [PATCH 2/2] C99 support: remove non-HAVE_VARIADIC_MACROS code Ævar Arnfjörð Bjarmason
2021-04-12 17:58 ` Junio C Hamano
2021-04-13 8:00 ` Jeff King
2021-05-21 2:50 ` Jonathan Nieder
2021-04-12 12:14 ` [PATCH 0/2] C99: harder dependency on variadic macros Bagas Sanjaya
2021-04-12 12:41 ` Ævar Arnfjörð Bjarmason
2021-04-12 22:57 ` brian m. carlson
2021-04-12 23:19 ` Junio C Hamano
2022-01-28 11:11 ` [PATCH v2 0/2] C99: remove hardcoded-out !HAVE_VARIADIC_MACROS code Ævar Arnfjörð Bjarmason
2022-01-28 11:11 ` [PATCH v2 1/2] git-compat-util.h: clarify GCC v.s. C99-specific in comment Ævar Arnfjörð Bjarmason
2022-01-28 11:11 ` [PATCH v2 2/2] C99: remove hardcoded-out !HAVE_VARIADIC_MACROS code Ævar Arnfjörð Bjarmason
2022-01-28 22:40 ` Junio C Hamano
2022-02-19 10:41 ` [PATCH v3 0/3] C99: remove dead " Ævar Arnfjörð Bjarmason
2022-02-19 10:41 ` [PATCH v3 1/3] git-compat-util.h: clarify GCC v.s. C99-specific in comment Ævar Arnfjörð Bjarmason
2022-02-19 10:41 ` [PATCH v3 2/3] C99: remove hardcoded-out !HAVE_VARIADIC_MACROS code Ævar Arnfjörð Bjarmason
2022-02-19 10:41 ` [PATCH v3 3/3] trace.h: remove never-used TRACE_CONTEXT Ævar Arnfjörð Bjarmason
2022-02-20 12:02 ` Junio C Hamano
2022-02-20 12:38 ` Ævar Arnfjörð Bjarmason
2022-02-20 20:12 ` Junio C Hamano
2022-02-21 16:05 ` [PATCH v4 0/2] C99: remove dead !HAVE_VARIADIC_MACROS code Ævar Arnfjörð Bjarmason
2022-02-21 16:05 ` [PATCH v4 1/2] git-compat-util.h: clarify GCC v.s. C99-specific in comment Ævar Arnfjörð Bjarmason
2022-02-21 16:05 ` [PATCH v4 2/2] C99: remove hardcoded-out !HAVE_VARIADIC_MACROS code Ævar Arnfjörð Bjarmason
2021-04-12 10:53 ` [PATCH 09/15] hash: set and copy algo field in struct object_id Junio C Hamano
2021-04-12 11:13 ` Ævar Arnfjörð Bjarmason
2021-04-10 15:21 ` [PATCH 10/15] hash: provide per-algorithm null OIDs brian m. carlson
2021-04-11 14:03 ` Junio C Hamano
2021-04-11 21:51 ` brian m. carlson
2021-04-10 15:21 ` [PATCH 11/15] builtin/show-index: set the algorithm for object IDs brian m. carlson
2021-04-10 15:21 ` [PATCH 12/15] commit-graph: don't store file hashes as struct object_id brian m. carlson
2021-04-10 15:21 ` [PATCH 13/15] builtin/pack-objects: avoid using struct object_id for pack hash brian m. carlson
2021-04-10 15:21 ` [PATCH 14/15] hex: default to the_hash_algo on zero algorithm value brian m. carlson
2021-04-10 15:21 ` [PATCH 15/15] hex: print objects using the hash algorithm member brian m. carlson
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: http://vger.kernel.org/majordomo-info.html
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87pmyw0waj.fsf@evledraar.gmail.com \
--to=avarab@gmail.com \
--cc=dstolee@microsoft.com \
--cc=git@vger.kernel.org \
--cc=sandals@crustytoothpaste.net \
--cc=stolee@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).