From: Taylor Blau <me@ttaylorr.com>
To: git@vger.kernel.org
Cc: gitster@pobox.com, derrickstolee@github.com, peff@peff.net
Subject: Re: [PATCH] midx.c: use `pack-objects --stdin-packs` when repacking
Date: Mon, 19 Sep 2022 22:14:54 -0400 [thread overview]
Message-ID: <YykiHrWvrktCLRCB@nand.local> (raw)
In-Reply-To: <9195a9ecd11a19f2c7fb1c70136d2d13fa308010.1663639662.git.me@ttaylorr.com>
On Mon, Sep 19, 2022 at 10:08:35PM -0400, Taylor Blau wrote:
> Noticed this while working on a semi-related series in:
>
> https://lore.kernel.org/git/cover.1663638929.git.me@ttaylorr.com/T/
>
> the savings here are pretty modest, but this is in line with the
> strategy we use in the `--geometric` repack mode, which performs a
> similar task.
To expand on my setup a little more, I ran the following script:
--- >8 ---
#!/bin/sh
repack_into_n () {
rm -rf staging &&
mkdir staging &&
git rev-list --first-parent HEAD |
perl -e '
my $n = shift;
while (<>) {
last unless @commits < $n;
push @commits, $_ if $. % 5 == 1;
}
print reverse @commits;
' "$1" >pushes &&
# create base packfile
base_pack=$(
head -n 1 pushes |
git pack-objects --delta-base-offset --revs staging/pack
) &&
export base_pack &&
# create an empty packfile
empty_pack=$(git pack-objects staging/pack </dev/null) &&
export empty_pack &&
# and then incrementals between each pair of commits
last= &&
while read rev
do
if test -n "$last"; then
{
echo "$rev" &&
echo "^$last"
} |
git pack-objects --delta-base-offset --revs \
staging/pack || return 1
fi
last=$rev
done <pushes &&
(
find staging -type f -name 'pack-*.pack' |
xargs -n 1 basename | grep -v "$base_pack" &&
printf "^pack-%s.pack\n" $base_pack
) >stdin.packs
# and install the whole thing
rm -f .git/objects/pack/* &&
mv staging/* .git/objects/pack/
}
# Pretend we just have a single branch and no reflogs, and that everything is
# in objects/pack; that makes our fake pack-building via repack_into_n()
# much simpler.
simplify_reachability() {
tip=$(git rev-parse --verify HEAD) &&
git for-each-ref --format="option no-deref%0adelete %(refname)" |
git update-ref --stdin &&
rm -rf .git/logs &&
git update-ref refs/heads/master $tip &&
git symbolic-ref HEAD refs/heads/master
}
simplify_reachability
for i in 100 1000 5000
do
echo >&2 "==> $i pack(s)"
repack_into_n $i
rm .git/objects/pack/multi-pack-index
find .git/objects/pack -type f | sort >before
hyperfine -p './prepare.sh' \
'git multi-pack-index repack --batch-size=1G && ./report.sh' \
'git.compile multi-pack-index repack --batch-size=1G && ./report.sh'
done
--- 8< ---
...where `git.compile` is has this patch and `git` does not. The two
other scripts (prepare.sh, and report.sh, respectively) look as follows:
--- >8 ---
#!/bin/sh
find .git/objects/pack -type f | sort >after
comm -13 before after | xargs rm -f
rm -f .git/objects/pack/multi-pack-index
git multi-pack-index write
--- 8< ---
...and report.sh:
--- >8 ---
#!/bin/sh
find .git/objects/pack -type f | sort >after
for new in $(comm -13 before after)
do
echo "==> $new ($(wc -c <$new))"
done
echo "-------------"
--- 8< ---
In general, the timings on git.git packed into 100 packs look something
like:
Benchmark 1: git multi-pack-index repack --batch-size=1G
Time (mean ± σ): 4.342 s ± 0.087 s [User: 12.864 s, System: 0.396 s]
Range (min … max): 4.235 s … 4.517 s 10 runs
Benchmark 2: git.compile multi-pack-index repack --batch-size=1G
Time (mean ± σ): 7.016 s ± 0.119 s [User: 11.170 s, System: 0.469 s]
Range (min … max): 6.858 s … 7.233 s 10 runs
But if I rip out the traversal pass towards the end of
`read_packs_list_from_stdin()` in `builtin/pack-objects.c`, the two
timings are equal. So the slow-down here really is from the traversal
pass.
The savings are modest, probably because we're already starting with a
pretty well packed baseline, since we're feeding objects in pack order.
On average, I was able to see around a ~3.5% reduction in pack size or
so.
So, not amazing, but mirroring the behavior of `git repack
--geometric=<n>` is worthwhile for all of the reasons that we do this
there.
I should also mention that this applies cleanly against `master`, and
doesn't depend on or interact with my changes in the series above.
Thanks,
Taylor
next prev parent reply other threads:[~2022-09-20 2:15 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-09-20 2:08 [PATCH] midx.c: use `pack-objects --stdin-packs` when repacking Taylor Blau
2022-09-20 2:14 ` Taylor Blau [this message]
2022-09-20 19:28 ` Jeff King
2022-09-20 19:49 ` Taylor Blau
2022-09-20 20:06 ` Jeff King
2022-09-20 20:35 ` Taylor Blau
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: http://vger.kernel.org/majordomo-info.html
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=YykiHrWvrktCLRCB@nand.local \
--to=me@ttaylorr.com \
--cc=derrickstolee@github.com \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=peff@peff.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).