git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / mirror / code / Atom feed
* [PATCH] midx.c: use `pack-objects --stdin-packs` when repacking
@ 2022-09-20  2:08 Taylor Blau
  2022-09-20  2:14 ` Taylor Blau
  2022-09-20 19:28 ` Jeff King
  0 siblings, 2 replies; 6+ messages in thread
From: Taylor Blau @ 2022-09-20  2:08 UTC (permalink / raw)
  To: git; +Cc: gitster, derrickstolee, peff

The `git multi-pack-index repack` command aggregates all objects in
certain packs contained in the MIDX, largely dictated by the value of
its `--batch-size` parameter.

To create the new pack, the MIDX machinery spawns a `pack-objects`
sub-process, which it feeds the object IDs in each of the pack(s) that
it wants to aggregate.

This implementation, which dates back to ce1e4a105b (midx: implement
midx_repack(), 2019-06-10), predates the `--stdin-packs` mode in
`pack-objects`. This mode (which was introduced a couple of years later
in 339bce27f4 (builtin/pack-objects.c: add '--stdin-packs' option,
2021-02-22)) allows `pack-objects` callers to specify the contents of
the output pack by indicating which packs to aggregate.

This patch replaces the pre-`--stdin-packs` invocation (where each
object is given to `pack-objects` one by one) with the more modern
`--stdin-packs` option.

This allows us to avoid some CPU cycles serializing and deserializing
every object ID in all of the packs we're aggregating. It also avoids us
having to send a potentially large amount of data down to
`pack-objects`.

But more importantly, it generates slightly higher quality (read: more
tightly compressed) packs, because of the reachability traversal that
`--stdin-packs` does after the fact in order to gather namehash values
which seed the delta selection process.

In practice, this seems to add a slight amount of overhead (on the order
of a few seconds for git.git broken up into ~100 packs), in exchange for
a modest reduction (on the order of ~3.5%) in the resulting pack size.

Signed-off-by: Taylor Blau <me@ttaylorr.com>
---
Noticed this while working on a semi-related series in:

  https://lore.kernel.org/git/cover.1663638929.git.me@ttaylorr.com/T/

the savings here are pretty modest, but this is in line with the
strategy we use in the `--geometric` repack mode, which performs a
similar task.

 midx.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/midx.c b/midx.c
index c27d0e5f15..c06f0f00e6 100644
--- a/midx.c
+++ b/midx.c
@@ -1973,6 +1973,7 @@ int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
 	struct child_process cmd = CHILD_PROCESS_INIT;
 	FILE *cmd_in;
 	struct strbuf base_name = STRBUF_INIT;
+	struct strbuf scratch = STRBUF_INIT;
 	struct multi_pack_index *m = lookup_multi_pack_index(r, object_dir);

 	/*
@@ -1997,7 +1998,7 @@ int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
 	repo_config_get_bool(r, "repack.usedeltabaseoffset", &delta_base_offset);
 	repo_config_get_bool(r, "repack.usedeltaislands", &use_delta_islands);

-	strvec_push(&cmd.args, "pack-objects");
+	strvec_pushl(&cmd.args, "pack-objects", "--stdin-packs", NULL);

 	strbuf_addstr(&base_name, object_dir);
 	strbuf_addstr(&base_name, "/pack/pack");
@@ -2026,17 +2027,17 @@ int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,

 	cmd_in = xfdopen(cmd.in, "w");

-	for (i = 0; i < m->num_objects; i++) {
-		struct object_id oid;
-		uint32_t pack_int_id = nth_midxed_pack_int_id(m, i);
+	for (i = 0; i < m->num_packs; i++) {
+		strbuf_reset(&scratch);

-		if (!include_pack[pack_int_id])
-			continue;
+		strbuf_addstr(&scratch, m->pack_names[i]);
+		strbuf_strip_suffix(&scratch, ".idx");
+		strbuf_addstr(&scratch, ".pack");

-		nth_midxed_object_oid(&oid, m, i);
-		fprintf(cmd_in, "%s\n", oid_to_hex(&oid));
+		fprintf(cmd_in, "%s%s\n", include_pack[i] ? "" : "^", scratch.buf);
 	}
 	fclose(cmd_in);
+	strbuf_release(&scratch);

 	if (finish_command(&cmd)) {
 		error(_("could not finish pack-objects"));
--
2.37.0.1.g1379af2e9d

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] midx.c: use `pack-objects --stdin-packs` when repacking
  2022-09-20  2:08 [PATCH] midx.c: use `pack-objects --stdin-packs` when repacking Taylor Blau
@ 2022-09-20  2:14 ` Taylor Blau
  2022-09-20 19:28 ` Jeff King
  1 sibling, 0 replies; 6+ messages in thread
From: Taylor Blau @ 2022-09-20  2:14 UTC (permalink / raw)
  To: git; +Cc: gitster, derrickstolee, peff

On Mon, Sep 19, 2022 at 10:08:35PM -0400, Taylor Blau wrote:
> Noticed this while working on a semi-related series in:
>
>   https://lore.kernel.org/git/cover.1663638929.git.me@ttaylorr.com/T/
>
> the savings here are pretty modest, but this is in line with the
> strategy we use in the `--geometric` repack mode, which performs a
> similar task.

To expand on my setup a little more, I ran the following script:

--- >8 ---
#!/bin/sh

repack_into_n () {
	rm -rf staging &&
	mkdir staging &&

	git rev-list --first-parent HEAD |
	perl -e '
		my $n = shift;
		while (<>) {
			last unless @commits < $n;
			push @commits, $_ if $. % 5 == 1;
		}
		print reverse @commits;
	' "$1" >pushes &&

	# create base packfile
	base_pack=$(
		head -n 1 pushes |
		git pack-objects --delta-base-offset --revs staging/pack
	) &&
	export base_pack &&

	# create an empty packfile
	empty_pack=$(git pack-objects staging/pack </dev/null) &&
	export empty_pack &&

	# and then incrementals between each pair of commits
	last= &&
	while read rev
	do
		if test -n "$last"; then
			{
				echo "$rev" &&
				echo "^$last"
			} |
			git pack-objects --delta-base-offset --revs \
				staging/pack || return 1
		fi
		last=$rev
	done <pushes &&

	(
		find staging -type f -name 'pack-*.pack' |
			xargs -n 1 basename | grep -v "$base_pack" &&
		printf "^pack-%s.pack\n" $base_pack
	) >stdin.packs

	# and install the whole thing
	rm -f .git/objects/pack/* &&
	mv staging/* .git/objects/pack/
}

# Pretend we just have a single branch and no reflogs, and that everything is
# in objects/pack; that makes our fake pack-building via repack_into_n()
# much simpler.
simplify_reachability() {
	tip=$(git rev-parse --verify HEAD) &&
	git for-each-ref --format="option no-deref%0adelete %(refname)" |
	git update-ref --stdin &&
	rm -rf .git/logs &&
	git update-ref refs/heads/master $tip &&
	git symbolic-ref HEAD refs/heads/master
}

simplify_reachability

for i in 100 1000 5000
do
  echo >&2 "==> $i pack(s)"

  repack_into_n $i

  rm .git/objects/pack/multi-pack-index
  find .git/objects/pack -type f | sort >before

  hyperfine -p './prepare.sh' \
	  'git multi-pack-index repack --batch-size=1G && ./report.sh' \
	  'git.compile multi-pack-index repack --batch-size=1G && ./report.sh'
done
--- 8< ---

...where `git.compile` is has this patch and `git` does not. The two
other scripts (prepare.sh, and report.sh, respectively) look as follows:

--- >8 ---
#!/bin/sh

find .git/objects/pack -type f | sort >after
comm -13 before after | xargs rm -f
rm -f .git/objects/pack/multi-pack-index
git multi-pack-index write
--- 8< ---

...and report.sh:

--- >8 ---
#!/bin/sh

find .git/objects/pack -type f | sort >after
for new in $(comm -13 before after)
do
	echo "==> $new ($(wc -c <$new))"
done
echo "-------------"
--- 8< ---

In general, the timings on git.git packed into 100 packs look something
like:

  Benchmark 1: git multi-pack-index repack --batch-size=1G
    Time (mean ± σ):      4.342 s ±  0.087 s    [User: 12.864 s, System: 0.396 s]
    Range (min … max):    4.235 s …  4.517 s    10 runs

  Benchmark 2: git.compile multi-pack-index repack --batch-size=1G
    Time (mean ± σ):      7.016 s ±  0.119 s    [User: 11.170 s, System: 0.469 s]
    Range (min … max):    6.858 s …  7.233 s    10 runs

But if I rip out the traversal pass towards the end of
`read_packs_list_from_stdin()` in `builtin/pack-objects.c`, the two
timings are equal. So the slow-down here really is from the traversal
pass.

The savings are modest, probably because we're already starting with a
pretty well packed baseline, since we're feeding objects in pack order.
On average, I was able to see around a ~3.5% reduction in pack size or
so.

So, not amazing, but mirroring the behavior of `git repack
--geometric=<n>` is worthwhile for all of the reasons that we do this
there.

I should also mention that this applies cleanly against `master`, and
doesn't depend on or interact with my changes in the series above.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] midx.c: use `pack-objects --stdin-packs` when repacking
  2022-09-20  2:08 [PATCH] midx.c: use `pack-objects --stdin-packs` when repacking Taylor Blau
  2022-09-20  2:14 ` Taylor Blau
@ 2022-09-20 19:28 ` Jeff King
  2022-09-20 19:49   ` Taylor Blau
  1 sibling, 1 reply; 6+ messages in thread
From: Jeff King @ 2022-09-20 19:28 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, gitster, derrickstolee

On Mon, Sep 19, 2022 at 10:08:35PM -0400, Taylor Blau wrote:

> This patch replaces the pre-`--stdin-packs` invocation (where each
> object is given to `pack-objects` one by one) with the more modern
> `--stdin-packs` option.
> 
> This allows us to avoid some CPU cycles serializing and deserializing
> every object ID in all of the packs we're aggregating. It also avoids us
> having to send a potentially large amount of data down to
> `pack-objects`.

Makes sense. Just playing devil's advocate for a moment: is there any
way that getting the list of packs could be worse? I'm thinking
particularly of a race condition where a pack goes away while we're
running, but if we had the actual object list, we could fall back to
finding it elsewhere.

I think that could only happen if we had two gc's running
simultaneously, which is something we try to avoid already. And the
worst case would be that one would say "oops, this pack went away" and
bail, and not any kind of corruption.

So I think it's fine, but just trying to talk through any unexpected
implications.

> But more importantly, it generates slightly higher quality (read: more
> tightly compressed) packs, because of the reachability traversal that
> `--stdin-packs` does after the fact in order to gather namehash values
> which seed the delta selection process.

I think we _could_ do that same traversal even in objects mode. Or do
--stdin-packs without it. If we were starting from scratch, it might be
nice for the two features to be orthogonal so we could evaluate the
changes independently. But I don't think it's worth going back and
trying to split them out now. Although...

> In practice, this seems to add a slight amount of overhead (on the order
> of a few seconds for git.git broken up into ~100 packs), in exchange for
> a modest reduction (on the order of ~3.5%) in the resulting pack size.

Hmm. I thought we'd have some code to reuse the cached name-hashes in
the .bitmap file, if one is present. But I don't see any such code in
the stdin-packs feature. I think for "repack --geometric" it doesn't
matter. There the "main" pack with the bitmap would also be excluded
from the rollup (unless we are rolling all-into-one, in which case we do
the full from-scratch repack with a traversal).

Is that true also of "multi-pack-index repack"? I guess it would depend
on how you invoke it. I admit I don't think I've ever used it myself,
since the new "repack --geometric --write-midx" approach matches my
mental model. I'm not sure when you'd actually run the "multi-pack-index
repack" command. But if you did it with --batch-size=0 (the default), I
think we'd end up traversing every object in history.

>  midx.c | 17 +++++++++--------
>  1 file changed, 9 insertions(+), 8 deletions(-)

The patch itself is mostly as expected, but I did have one question:

> @@ -2026,17 +2027,17 @@ int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
> 
>  	cmd_in = xfdopen(cmd.in, "w");
> 
> -	for (i = 0; i < m->num_objects; i++) {
> -		struct object_id oid;
> -		uint32_t pack_int_id = nth_midxed_pack_int_id(m, i);
> +	for (i = 0; i < m->num_packs; i++) {
> +		strbuf_reset(&scratch);

The old code went in object order within the midx. Is this sorted by
sha1, or the pack pseudo-order? If the former, then that will yield a
different order of objects inside pack-objects (since it is seeing the
packs in order of our m->pack_names array). I don't _think_ it matters,
but I just wanted to double check.

-Peff

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] midx.c: use `pack-objects --stdin-packs` when repacking
  2022-09-20 19:28 ` Jeff King
@ 2022-09-20 19:49   ` Taylor Blau
  2022-09-20 20:06     ` Jeff King
  0 siblings, 1 reply; 6+ messages in thread
From: Taylor Blau @ 2022-09-20 19:49 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, git, gitster, derrickstolee

On Tue, Sep 20, 2022 at 03:28:37PM -0400, Jeff King wrote:
> On Mon, Sep 19, 2022 at 10:08:35PM -0400, Taylor Blau wrote:
>
> > This patch replaces the pre-`--stdin-packs` invocation (where each
> > object is given to `pack-objects` one by one) with the more modern
> > `--stdin-packs` option.
> >
> > This allows us to avoid some CPU cycles serializing and deserializing
> > every object ID in all of the packs we're aggregating. It also avoids us
> > having to send a potentially large amount of data down to
> > `pack-objects`.
>
> Makes sense. Just playing devil's advocate for a moment: is there any
> way that getting the list of packs could be worse? I'm thinking
> particularly of a race condition where a pack goes away while we're
> running, but if we had the actual object list, we could fall back to
> finding it elsewhere.
>
> I think that could only happen if we had two gc's running
> simultaneously, which is something we try to avoid already. And the
> worst case would be that one would say "oops, this pack went away" and
> bail, and not any kind of corruption.
>
> So I think it's fine, but just trying to talk through any unexpected
> implications.

Your assumption is right. We perform those pack validity checks pretty
early these days, see: 5045759de8 (builtin/pack-objects.c: ensure
included `--stdin-packs` exist, 2022-05-24).

We *could* handle the case where we know the object names, but the pack
file has gone away (either we could open the .idx but not the .pack, or
we opened both but then had to close the .pack because of hitting the
max open file-descriptor limit).

...But it gets tricky in practice, and 5045759de8 has some of those
details. At worst we'd complain that one of the packs listed is gone,
and then fail to repack (while maintaining the non-corruptedness of the
repository).

> > But more importantly, it generates slightly higher quality (read: more
> > tightly compressed) packs, because of the reachability traversal that
> > `--stdin-packs` does after the fact in order to gather namehash values
> > which seed the delta selection process.
>
> I think we _could_ do that same traversal even in objects mode. Or do
> --stdin-packs without it. If we were starting from scratch, it might be
> nice for the two features to be orthogonal so we could evaluate the
> changes independently. But I don't think it's worth going back and
> trying to split them out now. Although...

It's relatively easy to do `--stdin-packs` without the traversal. I
wouldn't be opposed to doing that here.

> > In practice, this seems to add a slight amount of overhead (on the order
> > of a few seconds for git.git broken up into ~100 packs), in exchange for
> > a modest reduction (on the order of ~3.5%) in the resulting pack size.
>
> Hmm. I thought we'd have some code to reuse the cached name-hashes in
> the .bitmap file, if one is present. But I don't see any such code in
> the stdin-packs feature. I think for "repack --geometric" it doesn't
> matter. There the "main" pack with the bitmap would also be excluded
> from the rollup (unless we are rolling all-into-one, in which case we do
> the full from-scratch repack with a traversal).

Right.

> Is that true also of "multi-pack-index repack"? I guess it would depend
> on how you invoke it. I admit I don't think I've ever used it myself,
> since the new "repack --geometric --write-midx" approach matches my
> mental model. I'm not sure when you'd actually run the "multi-pack-index
> repack" command. But if you did it with --batch-size=0 (the default), I
> think we'd end up traversing every object in history.

We could probably benefit from it, but only if there is a MIDX bitmap
around to begin with. For instance, you could first try and lookup each
object you're missing a namehash for and then read its value out of the
hashcache extension in the MIDX bitmap (assuming the MIDX bitmap exists,
and has a hashcache).

But if you don't have a MIDX bitmap, or it has a poor selection of
commits, then you're out of luck.

> > @@ -2026,17 +2027,17 @@ int midx_repack(struct repository *r, const char *object_dir, size_t batch_size,
> >
> >  	cmd_in = xfdopen(cmd.in, "w");
> >
> > -	for (i = 0; i < m->num_objects; i++) {
> > -		struct object_id oid;
> > -		uint32_t pack_int_id = nth_midxed_pack_int_id(m, i);
> > +	for (i = 0; i < m->num_packs; i++) {
> > +		strbuf_reset(&scratch);
>
> The old code went in object order within the midx. Is this sorted by
> sha1, or the pack pseudo-order? If the former, then that will yield a
> different order of objects inside pack-objects (since it is seeing the
> packs in order of our m->pack_names array). I don't _think_ it matters,
> but I just wanted to double check.

Good point. This ends up ordering the packs based on their SHA-1
checksum, and probably should stick to the pack mtimes instead.

Unfortunately, we discard that information by the time we get to this
point in midx_repack(). We don't even have it written durably in the
MIDX, either, so we reconstruct it on-the-fly in
fill_included_packs_batch() (see the `QSORT()` call there with
`compare_by_mtime()`).

I agree that it probably doesn't matter in practice, but it's worth
trying to match the existing behavior, at least.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] midx.c: use `pack-objects --stdin-packs` when repacking
  2022-09-20 19:49   ` Taylor Blau
@ 2022-09-20 20:06     ` Jeff King
  2022-09-20 20:35       ` Taylor Blau
  0 siblings, 1 reply; 6+ messages in thread
From: Jeff King @ 2022-09-20 20:06 UTC (permalink / raw)
  To: Taylor Blau; +Cc: git, gitster, derrickstolee

On Tue, Sep 20, 2022 at 03:49:07PM -0400, Taylor Blau wrote:

> > Is that true also of "multi-pack-index repack"? I guess it would depend
> > on how you invoke it. I admit I don't think I've ever used it myself,
> > since the new "repack --geometric --write-midx" approach matches my
> > mental model. I'm not sure when you'd actually run the "multi-pack-index
> > repack" command. But if you did it with --batch-size=0 (the default), I
> > think we'd end up traversing every object in history.
> 
> We could probably benefit from it, but only if there is a MIDX bitmap
> around to begin with. For instance, you could first try and lookup each
> object you're missing a namehash for and then read its value out of the
> hashcache extension in the MIDX bitmap (assuming the MIDX bitmap exists,
> and has a hashcache).
> 
> But if you don't have a MIDX bitmap, or it has a poor selection of
> commits, then you're out of luck.

You could also use a pack bitmap if there is one (and it's one of the
included packs). But yes, if you have neither, it's no help.

Mostly I'm just concerned that this could have an outsized negative
performance effect if you have a setup like:

  - you have a gigantic repository, say that takes 15 minutes to do a
    full "rev-list --objects" on (like linux.git with all its forks)

  - most of that is in one big pack, but you acquire new packs
    occasionally via pushes, etc

  - doing "git repack --geometric" rolls up the new packs, nicely
    traversing just the new objects

  - doing "git multi-pack-index repack" before your patch is fast-ish.
    It stuffs all the objects into a new pack. But after your patch, it
    does that 15-minute traversal.

But I don't know if that's even realistic, because I'm still wondering
why somebody would run "git multi-pack-index repack" in the first place.
And if they'd always do so with --batch-size anyway, which would
mitigate this (because it gives a geometric-ish approach where we leave
the huge pack untouched).

If it is, then one thing to consider is tying the "do the extra
traversal" feature to the presence / size of excluded packs. And
possibly considering the presence of a bitmap to indicate that it's
worth doing (assuming the optimization there is implemented).

But that sounds like a lot of work to get right, and again, I'm not
really sure of the benefit.

> > The old code went in object order within the midx. Is this sorted by
> > sha1, or the pack pseudo-order? If the former, then that will yield a
> > different order of objects inside pack-objects (since it is seeing the
> > packs in order of our m->pack_names array). I don't _think_ it matters,
> > but I just wanted to double check.
> 
> Good point. This ends up ordering the packs based on their SHA-1
> checksum, and probably should stick to the pack mtimes instead.
> 
> Unfortunately, we discard that information by the time we get to this
> point in midx_repack(). We don't even have it written durably in the
> MIDX, either, so we reconstruct it on-the-fly in
> fill_included_packs_batch() (see the `QSORT()` call there with
> `compare_by_mtime()`).
> 
> I agree that it probably doesn't matter in practice, but it's worth
> trying to match the existing behavior, at least.

Yeah, sorting the packs by mtime might be sensible. I know in the final
midx, we use object order to find the "preferred" pack. And you could
iterate the objects here, passing along their de-duped pack name. But I
don't think we have the objects here in that useful order; that is
really the order of the midx's .rev file, IIRC, and this is probably the
actual sha1 order.

-Peff

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] midx.c: use `pack-objects --stdin-packs` when repacking
  2022-09-20 20:06     ` Jeff King
@ 2022-09-20 20:35       ` Taylor Blau
  0 siblings, 0 replies; 6+ messages in thread
From: Taylor Blau @ 2022-09-20 20:35 UTC (permalink / raw)
  To: Jeff King; +Cc: Taylor Blau, git, gitster, derrickstolee

On Tue, Sep 20, 2022 at 04:06:26PM -0400, Jeff King wrote:
> On Tue, Sep 20, 2022 at 03:49:07PM -0400, Taylor Blau wrote:
>
> > > Is that true also of "multi-pack-index repack"? I guess it would depend
> > > on how you invoke it. I admit I don't think I've ever used it myself,
> > > since the new "repack --geometric --write-midx" approach matches my
> > > mental model. I'm not sure when you'd actually run the "multi-pack-index
> > > repack" command. But if you did it with --batch-size=0 (the default), I
> > > think we'd end up traversing every object in history.
> >
> > We could probably benefit from it, but only if there is a MIDX bitmap
> > around to begin with. For instance, you could first try and lookup each
> > object you're missing a namehash for and then read its value out of the
> > hashcache extension in the MIDX bitmap (assuming the MIDX bitmap exists,
> > and has a hashcache).
> >
> > But if you don't have a MIDX bitmap, or it has a poor selection of
> > commits, then you're out of luck.
>
> You could also use a pack bitmap if there is one (and it's one of the
> included packs). But yes, if you have neither, it's no help.

Good point. But, yeah, you have to have them to begin with.

> Mostly I'm just concerned that this could have an outsized negative
> performance effect if you have a setup like:
>
>   - you have a gigantic repository, say that takes 15 minutes to do a
>     full "rev-list --objects" on (like linux.git with all its forks)
>
>   - most of that is in one big pack, but you acquire new packs
>     occasionally via pushes, etc
>
>   - doing "git repack --geometric" rolls up the new packs, nicely
>     traversing just the new objects
>
>   - doing "git multi-pack-index repack" before your patch is fast-ish.
>     It stuffs all the objects into a new pack. But after your patch, it
>     does that 15-minute traversal.
>
> But I don't know if that's even realistic, because I'm still wondering
> why somebody would run "git multi-pack-index repack" in the first place.
> And if they'd always do so with --batch-size anyway, which would
> mitigate this (because it gives a geometric-ish approach where we leave
> the huge pack untouched).

Yeah, the `--geometric` path(s) don't have this problem, because the big
pack will already be covered by either a pack or MIDX bitmap, and we can
read out all of the namehash values from there.

But I tend to agree that this is pretty unrealistic, so I'm hopeful that
it isn't a huge deal. If it is, though, we can always just "turn off"
the traversal parts. (Though I have to imagine that a repository large
enough to care about the existence of namehash values probably isn't
using `git multi-pack-index repack` anyway).

> Yeah, sorting the packs by mtime might be sensible. I know in the final
> midx, we use object order to find the "preferred" pack. And you could
> iterate the objects here, passing along their de-duped pack name. But I
> don't think we have the objects here in that useful order; that is
> really the order of the midx's .rev file, IIRC, and this is probably the
> actual sha1 order.

We already need the sorted order in order to compute the rollup for
non-zero `--batch-size` arguments, so using that to construct the pack
is just a matter of dragging the sort out of the function to compute the
rollup itself (and into `midx_repack()` instead).

Patches incoming... ;-)

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2022-09-20 20:36 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-20  2:08 [PATCH] midx.c: use `pack-objects --stdin-packs` when repacking Taylor Blau
2022-09-20  2:14 ` Taylor Blau
2022-09-20 19:28 ` Jeff King
2022-09-20 19:49   ` Taylor Blau
2022-09-20 20:06     ` Jeff King
2022-09-20 20:35       ` Taylor Blau

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).