git@vger.kernel.org mailing list mirror (one of many)
From: Jeff King <peff@peff.net>
To: Taylor Blau <me@ttaylorr.com>
Cc: git@vger.kernel.org, gitster@pobox.com, derrickstolee@github.com
Subject: Re: [PATCH] midx.c: use `pack-objects --stdin-packs` when repacking
Date: Tue, 20 Sep 2022 16:06:26 -0400
Message-ID: <YyodQg5diyr/UYK1@coredump.intra.peff.net>
In-Reply-To: <YyoZM1V5S53dz6U6@nand.local>

On Tue, Sep 20, 2022 at 03:49:07PM -0400, Taylor Blau wrote:

> > Is that true also of "multi-pack-index repack"? I guess it would depend
> > on how you invoke it. I admit I don't think I've ever used it myself,
> > since the new "repack --geometric --write-midx" approach matches my
> > mental model. I'm not sure when you'd actually run the "multi-pack-index
> > repack" command. But if you did it with --batch-size=0 (the default), I
> > think we'd end up traversing every object in history.
> 
> We could probably benefit from it, but only if there is a MIDX bitmap
> around to begin with. For instance, you could first try to look up each
> object you're missing a namehash for and then read its value out of the
> hashcache extension in the MIDX bitmap (assuming the MIDX bitmap exists,
> and has a hashcache).
> 
> But if you don't have a MIDX bitmap, or it has a poor selection of
> commits, then you're out of luck.

You could also use a pack bitmap if there is one (and it's one of the
included packs). But yes, if you have neither, it's no help.

Mostly I'm just concerned that this could have an outsized negative
performance effect if you have a setup like:

  - you have a gigantic repository, say one that takes 15 minutes for a
    full "rev-list --objects" (like linux.git with all its forks)

  - most of that is in one big pack, but you acquire new packs
    occasionally via pushes, etc.

  - doing "git repack --geometric" rolls up the new packs, nicely
    traversing just the new objects

  - doing "git multi-pack-index repack" before your patch is fast-ish.
    It stuffs all the objects into a new pack. But after your patch, it
    does that 15-minute traversal.

But I don't know if that's even realistic, because I'm still wondering
why somebody would run "git multi-pack-index repack" in the first place,
and whether they'd always do so with --batch-size anyway, which would
mitigate this (because it gives a geometric-ish approach where we leave
the huge pack untouched).
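
To make the --batch-size point concrete, here is a rough sketch of the
selection rule as the git-multi-pack-index documentation describes it
(oldest packs first, skip any pack whose "expected size" is at or above
the batch size, stop once the running total reaches the batch size). It
is not a transcription of midx.c; the Pack class and select_batch() are
hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Pack:
    name: str
    mtime: int          # modification time, seconds
    size: int           # on-disk size in bytes
    total_objects: int  # objects stored in the pack
    midx_objects: int   # objects the MIDX references in this pack


def select_batch(packs, batch_size):
    """Return the names of packs that would be rolled into a new pack."""
    selected = []
    total = 0
    for p in sorted(packs, key=lambda p: p.mtime):  # oldest to newest
        if batch_size == 0:
            # A zero batch size means "repack everything".
            selected.append(p.name)
            continue
        if total >= batch_size:
            break
        if p.total_objects == 0:
            continue
        # Expected size: the share of the pack the MIDX actually uses.
        expected = p.size * p.midx_objects // p.total_objects
        if expected >= batch_size:
            continue  # too big for this batch; leave it untouched
        selected.append(p.name)
        total += expected
    # Rewriting a single pack into a single pack accomplishes nothing.
    return selected if len(selected) > 1 else []


packs = [
    Pack("pack-big", mtime=1, size=10**10, total_objects=1000, midx_objects=1000),
    Pack("pack-a", mtime=2, size=100, total_objects=10, midx_objects=10),
    Pack("pack-b", mtime=3, size=200, total_objects=10, midx_objects=10),
]
print(select_batch(packs, 1000))  # the huge pack is skipped
print(select_batch(packs, 0))     # everything is included
```

With a non-zero batch size the huge pack never enters the list, so it
would end up as an excluded pack rather than something to traverse.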

If it is, then one thing to consider is tying the "do the extra
traversal" feature to the presence / size of excluded packs. And
possibly considering the presence of a bitmap to indicate that it's
worth doing (assuming the optimization there is implemented).

But that sounds like a lot of work to get right, and again, I'm not
really sure of the benefit.
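
For what it's worth, the kind of check meant here could look something
like the sketch below. Everything in it is an assumption: the function
name, the "excluded bytes at least as large as included bytes" cutoff,
and the reading that a large excluded pack is what keeps the traversal
bounded. Nothing like it exists in midx.c.

```python
def worth_extra_traversal(included_bytes, excluded_bytes, has_bitmap):
    """Guess whether the namehash traversal is likely to be cheap.

    included_bytes / excluded_bytes: lists of pack sizes in bytes.
    has_bitmap: True if a usable MIDX or pack bitmap exists (and the
    bitmap-based lookup is actually implemented).
    """
    if has_bitmap:
        return True
    if not excluded_bytes:
        # Nothing is excluded, so the walk would cover all of history.
        return False
    # Hypothetical cutoff: traverse only when the excluded packs hold
    # at least as much data as the packs being rolled up.
    return sum(excluded_bytes) >= sum(included_bytes)


# Small new packs rolled up next to one huge excluded pack.
print(worth_extra_traversal([100, 200], [10**10], has_bitmap=False))
# --batch-size=0: everything included, nothing excluded, no bitmap.
print(worth_extra_traversal([10**10, 100], [], has_bitmap=False))
```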

> > The old code went in object order within the midx. Is this sorted by
> > sha1, or the pack pseudo-order? If the former, then that will yield a
> > different order of objects inside pack-objects (since it is seeing the
> > packs in order of our m->pack_names array). I don't _think_ it matters,
> > but I just wanted to double check.
> 
> Good point. This ends up ordering the packs based on their SHA-1
> checksum, and probably should stick to the pack mtimes instead.
> 
> Unfortunately, we discard that information by the time we get to this
> point in midx_repack(). We don't even have it written durably in the
> MIDX, either, so we reconstruct it on-the-fly in
> fill_included_packs_batch() (see the `QSORT()` call there with
> `compare_by_mtime()`).
> 
> I agree that it probably doesn't matter in practice, but it's worth
> trying to match the existing behavior, at least.

Yeah, sorting the packs by mtime might be sensible. I know in the final
midx, we use object order to find the "preferred" pack. And you could
iterate the objects here, passing along their de-duped pack name. But I
don't think we have the objects here in that useful order; that is
really the order of the midx's .rev file, IIRC, and this is probably the
actual sha1 order.
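
As an illustration of the mtime idea, a minimal sketch of building the
input for "pack-objects --stdin-packs" with the pack names de-duped and
ordered by mtime. The function name is hypothetical, oldest-first is an
assumption, and whether pack-objects honors the input order at all is
not something this sketch checks. The one documented piece it relies on
is the input format: one pack basename per line, with a "^" prefix
marking an excluded pack.

```python
def stdin_packs_input(included, excluded=()):
    """Build the text fed to `git pack-objects --stdin-packs`.

    included: iterable of (pack_basename, mtime) pairs, possibly with
              repeats (e.g. one entry per object's source pack).
    excluded: iterable of pack basenames to list with a '^' prefix.
    """
    seen = set()
    lines = []
    # Assumed order: oldest pack first.
    for name, _mtime in sorted(included, key=lambda p: p[1]):
        if name not in seen:
            seen.add(name)
            lines.append(name)
    lines.extend("^" + name for name in excluded)
    return "\n".join(lines) + "\n"


text = stdin_packs_input(
    [("pack-b.pack", 20), ("pack-a.pack", 10), ("pack-b.pack", 20)],
    excluded=["pack-huge.pack"],
)
print(text, end="")
```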

-Peff
