git@vger.kernel.org mailing list mirror (one of many)
From: Jeff King <peff@peff.net>
To: Derrick Stolee <stolee@gmail.com>
Cc: Thomas Braun <thomas.braun@virtuell-zuhause.de>, git@vger.kernel.org
Subject: Re: [PATCH 0/5] handling 4GB .idx files
Date: Mon, 16 Nov 2020 18:49:39 -0500
Message-ID: <20201116234939.GA5051@coredump.intra.peff.net>
In-Reply-To: <42080870-1a92-e76f-d83a-f15642a96329@gmail.com>

On Mon, Nov 16, 2020 at 08:30:34AM -0500, Derrick Stolee wrote:

> > which took almost 13 minutes of CPU to run, and peaked around 15GB of
> > RAM (and takes about 6.7GB on disk).
> 
> I was thinking that maybe the RAM requirements would be lower
> if we batched the fast-import calls and then repacked, but then
> the repack would probably be just as expensive.

I think it's even worse. Fast-import just holds enough data to create
the index (sha1, etc), but pack-objects is also holding data to support
the delta search, etc. A quick (well, quick to invoke, not to run):

   git show-index <.git/objects/pack/pack-*.idx |
   awk '{print $2}' |
   git pack-objects foo --all-progress

on the fast-import pack seems to peak at around 27GB of RAM.

I doubt you could do much better overall than fast-import in terms of
CPU. The trick is really that you need to have a matching content/sha1
pair for 154M objects, and that's where most of the time goes. If we
lied about what's in each object (just generating an index with sha1
...0001, ...0002, etc), we could go much faster. But it's a much less
interesting test then.
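
For anybody wondering why the object count has to be that high, the
arithmetic can be sketched as below. It assumes a version-2 .idx with
sha1 object names and ignores the optional 8-byte entries used for pack
offsets past 2GB, so it is a lower bound on the real file size. The
helper name idx_v2_size is made up for illustration.

```shell
# Approximate size in bytes of a v2 pack .idx holding $1 sha1 objects.
# Assumption: no 8-byte large-offset entries are counted.
idx_v2_size () {
	n=$1
	header=8                    # 4-byte magic + 4-byte version
	fanout=$((256 * 4))         # 256 four-byte cumulative counts
	per_object=$((20 + 4 + 4))  # object name + CRC32 + 4-byte offset
	trailer=$((20 + 20))        # pack checksum + idx checksum
	echo $((header + fanout + n * per_object + trailer))
}

four_gb=4294967296  # 2^32

idx_v2_size 154000000   # 4312001072, just past 2^32
idx_v2_size 153391650   # largest count that still fits below 2^32
```

At 28 bytes per object, roughly 153.4M objects is the crossover point,
which is why a test repo needs on the order of 154M objects.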

> > That's the most basic test I think you could do. More interesting is
> > looking at entries that are actually after the 4GB mark. That requires
> > dumping the whole index:
> > 
> >   final=$(git show-index <.git/objects/pack/*.idx | tail -1 | awk '{print $2}')
> >   git cat-file blob $final
> 
> Could you also (after running the test once) determine the largest
> SHA-1, at least up to unique short-SHA? Then run something like
> 
> 	git cat-file blob fffffe
> 
> Since your loop is hard-coded, you could even use the largest full
> SHA-1.

That $final is the highest sha1. We could hard-code it, yes (and the
resulting lookup via cat-file is quite fast; it's the linear index dump
that's slow). We'd need the matching sha256 version, too. But it's
really the generation of the data that's the main issue.
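
The reason "tail -1" yields the highest sha1 is that show-index walks
the .idx in its stored order, which is sorted by object name. A minimal
sketch of that extraction follows; last_oid is a hypothetical helper,
and the three input lines are made-up samples in the "offset oid
(crc32)" shape that show-index prints for a v2 index.

```shell
# Print the object name from the final line of show-index output.
# Input lines look like "offset oid (crc32)"; the index is sorted by
# object name, so the last line carries the highest one.
last_oid () {
	awk '{ oid = $2 } END { print oid }'
}

# Made-up sample lines, only to show the shape of the input.
printf '%s\n' \
	'1234 1111111111111111111111111111111111111111 (1a2b3c4d)' \
	'12 8888888888888888888888888888888888888888 (deadbeef)' \
	'567 ffffffffffffffffffffffffffffffffffffffff (0badcafe)' |
last_oid
```

Against a real repository the input would instead come from
"git show-index <.git/objects/pack/pack-*.idx", which is the slow,
linear part.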

> Naturally, nothing short of a full .idx verification would be
> completely sound, and we are already generating an enormous repo.

Yep.

> > So I dunno. I wouldn't be opposed to codifying some of that in a script,
> > but I can't imagine anybody ever running it unless they were working on
> > this specific problem.
> 
> It would be good to have this available somewhere in the codebase to
> run whenever testing .idx changes. Perhaps create a new prerequisite
> specifically for EXPENSIVE_IDX tests, triggered only by a GIT_TEST_*
> environment variable?

My feeling is that anybody who's really interested in playing with this
topic can find this thread in the archive. I don't think they're really
any worse off there than with a bit-rotting script in the repo that
nobody ever runs.

But if somebody wants to write up a test script, I'm happy to review it.
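
If somebody does pick that up, the opt-in gating could look roughly
like this. It is plain shell rather than the test harness, and
GIT_TEST_EXPENSIVE_IDX is a hypothetical name following the GIT_TEST_*
suggestion above, not an existing knob. In the test suite itself this
would more likely be expressed as a lazy prerequisite.

```shell
# Succeed only when the (hypothetical) opt-in variable is set to a
# true-ish value; anything else, including unset, means "skip".
expensive_idx_enabled () {
	case "${GIT_TEST_EXPENSIVE_IDX:-}" in
	1|true|yes|on) return 0 ;;
	*) return 1 ;;
	esac
}

if expensive_idx_enabled
then
	echo "running expensive .idx checks"
	# ... generate the large repo and run the lookups here ...
else
	echo "skipping expensive .idx checks (set GIT_TEST_EXPENSIVE_IDX=1 to run)"
fi
```

Defaulting to "skip" keeps the multi-GB generation step out of normal
test runs.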

> It would be helpful to also write a multi-pack-index on top of this
> .idx to ensure we can handle that case, too.

I did run "git multi-pack-index write" on the resulting repo, which
completed in a reasonable amount of time (maybe 30-60s). And then
confirmed that lookups in the midx work just fine.

-Peff

