From: Taylor Blau <me@ttaylorr.com>
To: git@vger.kernel.org
Cc: gitster@pobox.com, derrickstolee@github.com, peff@peff.net
Subject: Re: [PATCH] midx.c: use `pack-objects --stdin-packs` when repacking
Date: Mon, 19 Sep 2022 22:14:54 -0400
Message-ID: <YykiHrWvrktCLRCB@nand.local>
In-Reply-To: <9195a9ecd11a19f2c7fb1c70136d2d13fa308010.1663639662.git.me@ttaylorr.com>

On Mon, Sep 19, 2022 at 10:08:35PM -0400, Taylor Blau wrote:
> Noticed this while working on a semi-related series in:
>
>   https://lore.kernel.org/git/cover.1663638929.git.me@ttaylorr.com/T/
>
> the savings here are pretty modest, but this is in line with the
> strategy we use in the `--geometric` repack mode, which performs a
> similar task.

To expand on my setup a little more, I ran the following script:

--- >8 ---
#!/bin/sh

repack_into_n () {
	rm -rf staging &&
	mkdir staging &&

	git rev-list --first-parent HEAD |
	perl -e '
		my $n = shift;
		while (<>) {
			last unless @commits < $n;
			push @commits, $_ if $. % 5 == 1;
		}
		print reverse @commits;
	' "$1" >pushes &&

	# create base packfile
	base_pack=$(
		head -n 1 pushes |
		git pack-objects --delta-base-offset --revs staging/pack
	) &&
	export base_pack &&

	# create an empty packfile
	empty_pack=$(git pack-objects staging/pack </dev/null) &&
	export empty_pack &&

	# and then incrementals between each pair of commits
	last= &&
	while read rev
	do
		if test -n "$last"; then
			{
				echo "$rev" &&
				echo "^$last"
			} |
			git pack-objects --delta-base-offset --revs \
				staging/pack || return 1
		fi
		last=$rev
	done <pushes &&

	(
		find staging -type f -name 'pack-*.pack' |
			xargs -n 1 basename | grep -v "$base_pack" &&
		printf "^pack-%s.pack\n" $base_pack
	) >stdin.packs &&

	# and install the whole thing
	rm -f .git/objects/pack/* &&
	mv staging/* .git/objects/pack/
}

# Pretend we just have a single branch and no reflogs, and that everything is
# in objects/pack; that makes our fake pack-building via repack_into_n()
# much simpler.
simplify_reachability() {
	tip=$(git rev-parse --verify HEAD) &&
	git for-each-ref --format="option no-deref%0adelete %(refname)" |
	git update-ref --stdin &&
	rm -rf .git/logs &&
	git update-ref refs/heads/master $tip &&
	git symbolic-ref HEAD refs/heads/master
}

simplify_reachability

for i in 100 1000 5000
do
  echo >&2 "==> $i pack(s)"

  repack_into_n $i

  rm -f .git/objects/pack/multi-pack-index
  find .git/objects/pack -type f | sort >before

  hyperfine -p './prepare.sh' \
	  'git multi-pack-index repack --batch-size=1G && ./report.sh' \
	  'git.compile multi-pack-index repack --batch-size=1G && ./report.sh'
done
--- 8< ---

...where `git.compile` has this patch and `git` does not. The two
helper scripts, prepare.sh and report.sh, look as follows. First,
prepare.sh:

--- >8 ---
#!/bin/sh

find .git/objects/pack -type f | sort >after
comm -13 before after | xargs rm -f
rm -f .git/objects/pack/multi-pack-index
git multi-pack-index write
--- 8< ---

...and report.sh:

--- >8 ---
#!/bin/sh

find .git/objects/pack -type f | sort >after
for new in $(comm -13 before after)
do
	echo "==> $new ($(wc -c <$new))"
done
echo "-------------"
--- 8< ---

In general, the timings on git.git packed into 100 packs look something
like:

  Benchmark 1: git multi-pack-index repack --batch-size=1G
    Time (mean ± σ):      4.342 s ±  0.087 s    [User: 12.864 s, System: 0.396 s]
    Range (min … max):    4.235 s …  4.517 s    10 runs

  Benchmark 2: git.compile multi-pack-index repack --batch-size=1G
    Time (mean ± σ):      7.016 s ±  0.119 s    [User: 11.170 s, System: 0.469 s]
    Range (min … max):    6.858 s …  7.233 s    10 runs

But if I rip out the traversal pass towards the end of
`read_packs_list_from_stdin()` in `builtin/pack-objects.c`, the two
timings are equal. So the slow-down here really is from the traversal
pass.
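
To put a number on that slow-down, here is a minimal sketch that divides
the two mean wall-clock times reported above. `slowdown` is a
hypothetical helper written for this illustration; it is not part of the
benchmark scripts.

```shell
#!/bin/sh

# slowdown <baseline_seconds> <candidate_seconds>
# Prints how many times slower the candidate is than the baseline.
# (Hypothetical helper, for illustration only.)
slowdown () {
	awk -v base="$1" -v cand="$2" \
		'BEGIN { printf "%.2f\n", cand / base }'
}

# Mean times for `git` and `git.compile` from the hyperfine output.
slowdown 4.342 7.016
```

With the means above this prints 1.62, i.e. the patched binary takes
roughly 1.6x as long in wall-clock time on this repository.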

The savings are modest, probably because we're already starting with a
pretty well packed baseline, since we're feeding objects in pack order.
On average, I saw roughly a 3.5% reduction in pack size.
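
The reduction can be derived from the byte counts that report.sh prints
for each run. A minimal sketch, assuming you have the new pack's size
from both binaries; `pct_reduction` is a hypothetical helper and the
sizes in the example call are made up, not measured:

```shell
#!/bin/sh

# pct_reduction <old_bytes> <new_bytes>
# Prints the size reduction from old to new as a percentage.
# (Hypothetical helper, for illustration only.)
pct_reduction () {
	awk -v before="$1" -v after="$2" \
		'BEGIN { printf "%.1f\n", (before - after) * 100 / before }'
}

# Made-up sizes, chosen only to illustrate the arithmetic.
pct_reduction 1000000 965000
```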

So, not amazing, but mirroring the behavior of `git repack
--geometric=<n>` is worthwhile for all of the reasons that we do this
there.
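
For anyone unfamiliar with the input that `--stdin-packs` expects: it
reads pack basenames on stdin, one per line, and treats names prefixed
with `^` as excluded. The sketch below builds such a list the same way
the `stdin.packs` step in the script above does. `list_stdin_packs` is a
hypothetical helper, not something in git.git.

```shell
#!/bin/sh

# list_stdin_packs <pack_dir> <base_pack_hash>
# Prints every pack-*.pack basename in <pack_dir> except the base pack,
# followed by the base pack marked as excluded with a leading "^".
# (Hypothetical helper, for illustration only.)
list_stdin_packs () {
	for p in "$1"/pack-*.pack
	do
		# Skip the unexpanded glob if the directory has no packs.
		test -e "$p" || continue
		name=${p##*/}
		test "$name" = "pack-$2.pack" || printf '%s\n' "$name"
	done &&
	printf '^pack-%s.pack\n' "$2"
}

# Possible usage (not run here; assumes $base_pack is set as in the
# script above):
#
#   list_stdin_packs .git/objects/pack "$base_pack" |
#   git pack-objects --stdin-packs .git/objects/pack/pack
```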

I should also mention that this applies cleanly against `master`, and
doesn't depend on or interact with my changes in the series above.

Thanks,
Taylor
