From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Jeff King <peff@peff.net>
Cc: "Iucha\, Florin" <Florin.Iucha@amd.com>,
"git\@vger.kernel.org" <git@vger.kernel.org>
Subject: Re: High locking contention during repack?
Date: Wed, 12 Dec 2018 15:04:19 +0100 [thread overview]
Message-ID: <87wooen8nw.fsf@evledraar.gmail.com> (raw)
In-Reply-To: <20181212112409.GB30673@sigill.intra.peff.net>
On Wed, Dec 12 2018, Jeff King wrote:
> On Wed, Dec 12, 2018 at 03:01:47AM +0000, Iucha, Florin wrote:
>
>> I am running “git-repack -A -d -f -F --window=250 --depth=250” on a
>> Git repository converted using git-svn.
>
> Sort of tangential to your question, but:
>
> - Using "-F" implies "-f" already, so you don't need both. That said,
> you are probably wasting CPU to use "-F", unless you have adjusted
> zlib compression settings since the last pack. (Whereas using "-f"
> is useful, if you're willing to pay the CPU tradeoff).
>
> - Using --depth=250 does not actually decrease the packfile size very
> much, and results in a packfile which is more expensive for
> subsequent processes to use. Some experimentation showed that
> --depth=50 is a sweet spot, and that is the default for both normal
> "git gc" and "git gc --aggressive" these days.
>
> See 07e7dbf0db (gc: default aggressive depth to 50, 2016-08-11) for
> more discussion.
I wonder if the size is still too high. I'd rather take the 5-10%
speedup quoted in that commit than a slight decrease in disk space use.
The git.git repository now with the repack command in that commit
message, with --depth=$n:
573M /tmp/git-1
200M /tmp/git-2
118M /tmp/git-3
91M /tmp/git-4
82M /tmp/git-5
77M /tmp/git-6
74M /tmp/git-7
72M /tmp/git-8
71M /tmp/git-9
70M /tmp/git-10
67M /tmp/git-15
66M /tmp/git-20
65M /tmp/git-40
65M /tmp/git-35
65M /tmp/git-30
65M /tmp/git-25
64M /tmp/git-50
64M /tmp/git-45
Produced via:
parallel 'rm -rf /tmp/git-{}; cp -Rvp /tmp/git.git/ /tmp/git-{} && git -C /tmp/git-{} repack -adf --depth={} --window=250' ::: {1..10} 15 20 25 30 35 40 45 50
And then the log -S benchmarks:
s/iter log -S 1 log -S 50 log -S 45 log -S 35 log -S 30 log -S 25 log -S 20 log -S 15 log -S 10 log -S 5
log -S 1 12.6 -- -26% -27% -32% -36% -38% -38% -39% -40% -42%
log -S 50 9.28 36% -- -0% -7% -13% -15% -16% -18% -19% -21%
log -S 45 9.25 36% 0% -- -7% -12% -15% -16% -17% -19% -20%
log -S 35 8.62 46% 8% 7% -- -6% -9% -10% -11% -13% -14%
log -S 30 8.12 55% 14% 14% 6% -- -3% -4% -6% -8% -9%
log -S 25 7.86 60% 18% 18% 10% 3% -- -1% -3% -5% -6%
log -S 20 7.77 62% 19% 19% 11% 4% 1% -- -2% -3% -5%
log -S 15 7.64 65% 21% 21% 13% 6% 3% 2% -- -2% -4%
log -S 10 7.51 68% 24% 23% 15% 8% 5% 3% 2% -- -2%
log -S 5 7.37 71% 26% 25% 17% 10% 7% 5% 4% 2% --
So we're getting a 20% speedup in log -S by using --depth=5 instead of
--depth=50, neatly at the cost of 20% more disk space, but could also
get a 19% speedup for 8% (--depth=10) more disk space, etc.
Then in rev-list it makes less of a difference, narrowing this down a
bit:
s/iter rev-list 50 rev-list 25 rev-list 10 rev-list 5
rev-list 50 1.61 -- -1% -4% -5%
rev-list 25 1.60 1% -- -3% -4%
rev-list 10 1.55 4% 3% -- -1%
rev-list 5 1.54 5% 4% 1% --
It seems to me we should at least move this to --depth=25 by default. A
~15% speedup in log -S & other mass-blob lookups for 1.5% more disk
space seems like a good trade-off. As your commit notes RAM (or disk
space) is not commonly the bottleneck anymore.
>> The system is a 16 core / 32 thread Threadripper with 128GB of RAM and
>> NVMe storage. The repack starts strong, with 32 threads but it fairly
>> quickly gets to 99% done and the number of threads drops to 4 then 3
>> then 2. However, running “dstat 5” I see lots of “sys” time without
>> any IO time (the network traffic you see is caused by SSH).
>
> This sounds mostly normal and expected. The parallel part of a repack is
> the delta search, which is not infinitely parallelizable. Each worker
> thread is given a "chunk" of objects, and it uses a sliding window to
> search for delta pairs through that chunk. You don't want a chunk that
> approaches the window size, since at every chunk boundary you're missing
> delta possibilities.
>
> The default chunk size is about 1/nr_threads of the total list size
> (i.e., we portion out all the work). And then when a thread finishes, we
> take work from the thread with the most work remaining, and portion it
> out. However, at some point the chunks approach their minimum, and we
> stop dividing. So the number of threads will drop, eventually to 1, and
> you'll be waiting on it to finish that final chunk.
>
> So that's all working as planned. Having high sys load does seem a bit
> odd. Most of the effort should be going to reading the mmap'd data from
> disk, zlib-inflating it and computing a fingerprint, and then comparing
> the fingerprints. So that would mostly be user time.
>
>> Running a strace on the running git-repack process shows only these:
>> --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
>> --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
>> --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
>> --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
>> --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
>> --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
>>
>> Any idea on how to debug this? I have ran git-repack under gdb, but it seems to spin on builtin/repack.c line 409.
>
> The heavy lifting here is done by the pack-objects child process, not
> git-repack itself. Try running with GIT_TRACE=1 in the environment to
> see the exact invocation, but timing and debugging:
>
> git pack-objects --all --no-reuse-delta --delta-base-offset --stdout \
> </dev/null >/dev/null
>
> should produce interesting results.
>
> The SIGALRM loop you see above is likely just the progress meter
> triggering once per second (the actual worker threads are updating an
> int, and then at least once per second we'll show the int).
>
> -Peff
next prev parent reply other threads:[~2018-12-12 14:04 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-12-12 3:01 High locking contention during repack? Iucha, Florin
2018-12-12 11:24 ` Jeff King
2018-12-12 14:04 ` Ævar Arnfjörð Bjarmason [this message]
2018-12-12 16:49 ` Iucha, Florin
2018-12-12 16:54 ` Ævar Arnfjörð Bjarmason
2018-12-12 18:08 ` Iucha, Florin
2018-12-12 18:30 ` Iucha, Florin
2018-12-12 19:05 ` Iucha, Florin
2018-12-12 21:50 ` Iucha, Florin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: http://vger.kernel.org/majordomo-info.html
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87wooen8nw.fsf@evledraar.gmail.com \
--to=avarab@gmail.com \
--cc=Florin.Iucha@amd.com \
--cc=git@vger.kernel.org \
--cc=peff@peff.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).