git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Derrick Stolee <stolee@gmail.com>
To: Jeff King <peff@peff.net>,
	Thomas Braun <thomas.braun@virtuell-zuhause.de>
Cc: git@vger.kernel.org
Subject: Re: [PATCH 0/5] handling 4GB .idx files
Date: Mon, 16 Nov 2020 08:30:34 -0500	[thread overview]
Message-ID: <42080870-1a92-e76f-d83a-f15642a96329@gmail.com> (raw)
In-Reply-To: <20201116041051.GA883199@coredump.intra.peff.net>

On 11/15/2020 11:10 PM, Jeff King wrote:
> On Sun, Nov 15, 2020 at 03:43:39PM +0100, Thomas Braun wrote:
> 
>> On 13.11.2020 06:06, Jeff King wrote:
>>> I recently ran into a case where Git could not read the pack it had
>>> produced via running "git repack". The culprit turned out to be an .idx
>>> file which crossed the 4GB barrier (in bytes, not number of objects).
>>> This series fixes the problems I saw, along with similar ones I couldn't
>>> trigger in practice, and protects the .idx loading code against integer
>>> overflows that would fool the size checks.
>>
>> Would it be feasible to have a test case for this large index case? This
>> should very certainly have an EXPENSIVE tag, or might even not yet work
>> on windows. But hopefully someday I'll find some more time to push large
>> object support on windows forward, and these kind of tests would really
>> help then.
> 
> I think it would be a level beyond what we usually consider even for
> EXPENSIVE. The cheapest I could come up with to generate the case is:

I agree that the cost of this test is more than I would expect for
EXPENSIVE.

>   perl -e '
> 	for (0..154_000_000) {
> 		print "blob\n";
> 		print "data <<EOF\n";
> 		print "$_\n";
> 		print "EOF\n";
> 	}
>   ' |
>   git fast-import
> 
> which took almost 13 minutes of CPU to run, and peaked around 15GB of
> RAM (and takes about 6.7GB on disk).

I was thinking that maybe the RAM requirements would be lower
if we batched the fast-import calls and then repacked, but then
the repack would probably be just as expensive.

> In the resulting repo, the old code barfed on lookups:
> 
>   $ blob=$(echo 0 | git hash-object --stdin)
>   $ git cat-file blob $blob
>   error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
>   error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
>   error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
>   error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
>   error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
>   error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
>   fatal: git cat-file 573541ac9702dd3969c9bc859d2b91ec1f7e6e56: bad file
> 
> whereas now it works:
> 
>   $ git cat-file blob $blob
>   0
> 
> That's the most basic test I think you could do. More interesting is
> looking at entries that are actually after the 4GB mark. That requires
> dumping the whole index:
> 
>   final=$(git show-index <.git/objects/pack/*.idx | tail -1 | awk '{print $2}')
>   git cat-file blob $final

Could you also (after running the test once) determine the largest
SHA-1, at least up to unique short-SHA? Then run something like

	git cat-file blob fffffe

Since your loop is hard-coded, you could even use the largest full
SHA-1.

Naturally, nothing short of a full .idx verification would be
completely sound, and we are already generating an enormous repo.

> So I dunno. I wouldn't be opposed to codifying some of that in a script,
> but I can't imagine anybody ever running it unless they were working on
> this specific problem.

It would be good to have this available somewhere in the codebase to
run whenever testing .idx changes. Perhaps create a new prerequisite
specifically for EXPENSIVE_IDX tests, triggered only by a GIT_TEST_*
environment variable?

It would be helpful to also write a multi-pack-index on top of this
.idx to ensure we can handle that case, too.

Thanks,
-Stolee

  reply	other threads:[~2020-11-16 13:32 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-11-13  5:06 [PATCH 0/5] handling 4GB .idx files Jeff King
2020-11-13  5:06 ` [PATCH 1/5] compute pack .idx byte offsets using size_t Jeff King
2020-11-13  5:07 ` [PATCH 2/5] use size_t to store pack .idx byte offsets Jeff King
2020-11-13  5:07 ` [PATCH 3/5] fsck: correctly compute checksums on idx files larger than 4GB Jeff King
2020-11-13  5:07 ` [PATCH 4/5] block-sha1: take a size_t length parameter Jeff King
2020-11-13  5:07 ` [PATCH 5/5] packfile: detect overflow in .idx file size checks Jeff King
2020-11-13 11:02   ` Johannes Schindelin
2020-11-15 14:43 ` [PATCH 0/5] handling 4GB .idx files Thomas Braun
2020-11-16  4:10   ` Jeff King
2020-11-16 13:30     ` Derrick Stolee [this message]
2020-11-16 23:49       ` Jeff King
2020-11-30 22:57     ` Thomas Braun
2020-12-01 11:23       ` Jeff King
2020-12-01 11:39         ` t7900's new expensive test Jeff King
2020-12-01 20:55           ` Derrick Stolee
2020-12-02  2:47             ` [PATCH] t7900: speed up " Jeff King
2020-12-03 15:23               ` Derrick Stolee
2020-12-01 18:27         ` [PATCH 0/5] handling 4GB .idx files Taylor Blau
2020-12-02 13:12           ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=42080870-1a92-e76f-d83a-f15642a96329@gmail.com \
    --to=stolee@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=peff@peff.net \
    --cc=thomas.braun@virtuell-zuhause.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).