From: Vegard Nossum <vegard.nossum@oracle.com>
To: Jeff King <peff@peff.net>
Cc: git@vger.kernel.org,
"Quentin Casasnovas" <quentin.casasnovas@oracle.com>,
"Shawn Pearce" <spearce@spearce.org>,
"Nguyễn Thái Ngọc Duy" <pclouds@gmail.com>,
"Junio C Hamano" <gitster@pobox.com>
Subject: Re: Huge performance bottleneck reading packs
Date: Thu, 13 Oct 2016 20:18:11 +0200 [thread overview]
Message-ID: <d727ee5d-5f0f-e6df-3f83-35fe74641ace@oracle.com> (raw)
In-Reply-To: <20161013152632.ynsabbv3yrthgidy@sigill.intra.peff.net>
On 10/13/2016 05:26 PM, Jeff King wrote:
> On Thu, Oct 13, 2016 at 09:20:07AM +0200, Vegard Nossum wrote:
>
>>> Does the patch below help?
>>
>> Yes, ~2m10s -> ~1m25s when I test a git fetch this morning (the other
>> variation in time may be due to different CPU usage by other programs,
>> but I ran with/without the patch multiple times and the difference is
>> consistent).
>> [...]
>> There are some 20k refs on the remote, closer to 25k locally.
>
> OK, that makes some sense. For whatever reason, your remote has a bunch
> of tags that point to objects you do not already have. That could
> happen, I think, if the remote added a lot of tags since you cloned
> (because clone will grab all of the tags), but those tags do not point
> to history that you are otherwise fetching (since fetch by default will
> "auto-follow" such tags).
>
> It might be interesting to see the results of:
>
> # all the objects we have
> git cat-file --batch-all-objects --batch-check='%(objectname)' >us
>
> # all the objects the remote is advertising
> git ls-remote origin | cut -f1 | sort -u >them
>
> # what they are advertising that we don't have
> comm -13 us them | wc -l
>
> My guess is that the number is relatively high. And that would explain
> why nobody else has really complained much; such a pattern is probably
> uncommon.
I get ~3,700 objects "they are advertising that we don't have".
They are all indeed tags which I don't have (nor want) locally.
>> The fetch doesn't actually get anything from the remote as everything is
>> already up to date (that makes the 2m40s times even more frustrating in
>> a way :-)). Here's count-objects:
>
> If the fetch is largely a noop, then that makes me wonder why we are
> spending even a minute in the "good" case. I wonder if there is some
It was user time I reported and this is over network, so some of it
might be IO wait on the network? With the (first) patch:
$ time git fetch
real 1m9.248s
user 0m40.808s
sys 0m2.996s
Also see the profile data below.
> other spot that is wasting CPU on some inefficient data structure
> related to the number of refs you have. If you can do any profiling that
> points to a hot spot, that would be interesting to see (and also whether
> a gc improves things).
>
> I see in find_non_local_tags() that we build up a sorted string_list via
> repeated calls to string_list_insert(), which will keep the thing sorted
> at each stage. That's not as efficient as just sorting at the end, but I
> think it's probably OK in practice because we actually feed it via
> for_each_ref(), whose output is sorted, and so we'd always just be
> appending to the end.
So without your patch I get these numbers:
% cumulative self self total
time seconds seconds calls s/call s/call name
37.34 29.83 29.83 1948265116 0.00 0.00 strip_suffix_mem
27.56 51.85 22.02 11358 0.00 0.01
prepare_packed_git_one
14.24 63.23 11.38 1924457702 0.00 0.00 strip_suffix
4.88 67.13 3.90 26045319 0.00 0.00 find_pack_entry_one
4.17 70.46 3.33 84828149 0.00 0.00 hashcmp
3.28 73.08 2.62 79760199 0.00 0.00 hashcmp
2.44 75.03 1.95 const_error
the call graph data shows all the reprepare_packed_git() calls to come
from a chain like this:
do_for_each_ref
do_for_each_ref_iterator
ref_iterator_advance
files_ref_iterator_advance
ref_resolves_to_object
has_sha1_file
has_sha1_file_with_flags
reprepare_packed_git
the do_for_each_ref() calls come from prepare_replace_object(),
do_fetch(), find_common(), everything_local(),
check_for_new_submodule_commits(), and find_non_local_tags().
With your (first) patch I get this instead:
% cumulative self self total
time seconds seconds calls ms/call ms/call name
29.41 0.25 0.25 4490399 0.00 0.00 hashcmp
16.47 0.39 0.14 843447 0.00 0.00 find_pack_entry_one
10.59 0.48 0.09 60490 0.00 0.00
unpack_object_header_buffer
4.71 0.52 0.04 167441 0.00 0.00 lookup_object
3.53 0.55 0.03 843447 0.00 0.00 fill_pack_entry
3.53 0.58 0.03 213169 0.00 0.00
check_refname_component
2.35 0.60 0.02 550723 0.00 0.00 strip_suffix_mem
2.35 0.62 0.02 214607 0.00 0.00 insert_obj_hash
2.35 0.64 0.02 158866 0.00 0.00 hashcmp
2.35 0.66 0.02 153936 0.00 0.00
nth_packed_object_offset
Do you have all the profile data you were interested in before I try a
'git gc'?
I really appreciate the quick response and the work you put into fixing
this, even when it's my fault for disabling gc, thanks!
Vegard
next prev parent reply other threads:[~2016-10-13 18:18 UTC|newest]
Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-10-12 22:30 Huge performance bottleneck reading packs Vegard Nossum
2016-10-12 22:45 ` Junio C Hamano
2016-10-13 7:17 ` Vegard Nossum
2016-10-13 14:50 ` Jeff King
2016-10-12 23:01 ` Jeff King
2016-10-12 23:18 ` Jeff King
2016-10-12 23:47 ` Jeff King
2016-10-13 9:04 ` Vegard Nossum
2016-10-14 9:35 ` Jakub Narębski
2016-10-13 7:20 ` Vegard Nossum
2016-10-13 15:26 ` Jeff King
2016-10-13 16:53 ` [PATCH] fetch: use "quick" has_sha1_file for tag following Jeff King
2016-10-13 17:04 ` Jeff King
2016-10-13 20:06 ` Jeff King
2016-10-14 17:39 ` Junio C Hamano
2016-10-14 18:59 ` Jeff King
2016-10-17 17:30 ` Junio C Hamano
2016-10-18 10:28 ` Jeff King
2016-10-13 18:18 ` Vegard Nossum [this message]
2016-10-13 20:43 ` Huge performance bottleneck reading packs Jeff King
2016-10-14 6:55 ` Vegard Nossum
2016-10-14 19:00 ` Jeff King
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: http://vger.kernel.org/majordomo-info.html
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=d727ee5d-5f0f-e6df-3f83-35fe74641ace@oracle.com \
--to=vegard.nossum@oracle.com \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=pclouds@gmail.com \
--cc=peff@peff.net \
--cc=quentin.casasnovas@oracle.com \
--cc=spearce@spearce.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).