git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jeff King <peff@peff.net>
To: Thomas Braun <thomas.braun@virtuell-zuhause.de>
Cc: git@vger.kernel.org
Subject: Re: [PATCH 0/5] handling 4GB .idx files
Date: Sun, 15 Nov 2020 23:10:51 -0500	[thread overview]
Message-ID: <20201116041051.GA883199@coredump.intra.peff.net> (raw)
In-Reply-To: <323fd904-a7ee-061d-d846-5da5afbc88b2@virtuell-zuhause.de>

On Sun, Nov 15, 2020 at 03:43:39PM +0100, Thomas Braun wrote:

> On 13.11.2020 06:06, Jeff King wrote:
> > I recently ran into a case where Git could not read the pack it had
> > produced via running "git repack". The culprit turned out to be an .idx
> > file which crossed the 4GB barrier (in bytes, not number of objects).
> > This series fixes the problems I saw, along with similar ones I couldn't
> > trigger in practice, and protects the .idx loading code against integer
> > overflows that would fool the size checks.
> 
> Would it be feasible to have a test case for this large index case? This
> should very certainly have an EXPENSIVE tag, or might even not yet work
> on windows. But hopefully someday I'll find some more time to push large
> object support on windows forward, and these kind of tests would really
> help then.

I think it would be a level beyond what we usually consider even for
EXPENSIVE. The cheapest I could come up with to generate the case is:

  perl -e '
	for (0..154_000_000) {
		print "blob\n";
		print "data <<EOF\n";
		print "$_\n";
		print "EOF\n";
	}
  ' |
  git fast-import

which took almost 13 minutes of CPU to run, and peaked around 15GB of
RAM (and takes about 6.7GB on disk).

In the resulting repo, the old code barfed on lookups:

  $ blob=$(echo 0 | git hash-object --stdin)
  $ git cat-file blob $blob
  error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
  error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
  error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
  error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
  error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
  error: wrong index v2 file size in .git/objects/pack/pack-f8f43ae56c25c1c8ff49ad6320df6efb393f551e.idx
  fatal: git cat-file 573541ac9702dd3969c9bc859d2b91ec1f7e6e56: bad file

whereas now it works:

  $ git cat-file blob $blob
  0

That's the most basic test I think you could do. More interesting is
looking at entries that are actually after the 4GB mark. That requires
dumping the whole index:

  final=$(git show-index <.git/objects/pack/*.idx | tail -1 | awk '{print $2}')
  git cat-file blob $final

That takes ~35s to run. Curiously, it also allocates 5GB of heap. For
some reason it decides to make an internal copy of the entries table. I
guess because it reads the file sequentially rather than mmap-ing it,
and 64-bit offsets in v2 idx files can't be resolved until we've read
the whole entry table (and it wants to output the entries in sha1
order).

The checksum bug requires running git-fsck on the repo. That's another 5
minutes of CPU (and even higher peak memory; I think we create a "struct
blob" for each one, and it seems to hit 20GB).

Hitting the other cases that I fixed but never triggered in practice
would need a repo about 4x as large. So figure an hour of CPU and 60GB
of RAM.

So I dunno. I wouldn't be opposed to codifying some of that in a script,
but I can't imagine anybody ever running it unless they were working on
this specific problem.

-Peff

  reply	other threads:[~2020-11-16  4:12 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-11-13  5:06 [PATCH 0/5] handling 4GB .idx files Jeff King
2020-11-13  5:06 ` [PATCH 1/5] compute pack .idx byte offsets using size_t Jeff King
2020-11-13  5:07 ` [PATCH 2/5] use size_t to store pack .idx byte offsets Jeff King
2020-11-13  5:07 ` [PATCH 3/5] fsck: correctly compute checksums on idx files larger than 4GB Jeff King
2020-11-13  5:07 ` [PATCH 4/5] block-sha1: take a size_t length parameter Jeff King
2020-11-13  5:07 ` [PATCH 5/5] packfile: detect overflow in .idx file size checks Jeff King
2020-11-13 11:02   ` Johannes Schindelin
2020-11-15 14:43 ` [PATCH 0/5] handling 4GB .idx files Thomas Braun
2020-11-16  4:10   ` Jeff King [this message]
2020-11-16 13:30     ` Derrick Stolee
2020-11-16 23:49       ` Jeff King
2020-11-30 22:57     ` Thomas Braun
2020-12-01 11:23       ` Jeff King
2020-12-01 11:39         ` t7900's new expensive test Jeff King
2020-12-01 20:55           ` Derrick Stolee
2020-12-02  2:47             ` [PATCH] t7900: speed up " Jeff King
2020-12-03 15:23               ` Derrick Stolee
2020-12-01 18:27         ` [PATCH 0/5] handling 4GB .idx files Taylor Blau
2020-12-02 13:12           ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20201116041051.GA883199@coredump.intra.peff.net \
    --to=peff@peff.net \
    --cc=git@vger.kernel.org \
    --cc=thomas.braun@virtuell-zuhause.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).