From: Julian Phillips <julian@quantumfyre.co.uk>
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org
Subject: Re: [RFC/PATCH 0/2] Speed up fetch with large number of tags
Date: Wed, 16 Sep 2009 23:32:52 +0100 (BST) [thread overview]
Message-ID: <alpine.LNX.2.00.0909162141140.13697@reaper.quantumfyre.co.uk> (raw)
In-Reply-To: <7vbplb2pi7.fsf@alter.siamese.dyndns.org>
On Wed, 16 Sep 2009, Junio C Hamano wrote:
> Julian Phillips <julian@quantumfyre.co.uk> writes:
>
>> I have a repository at $dayjob where fetch was taking ~30s to tell me
>> that there were no updates.
>>
>> It turns out that I appear to have added a nasty linear search of all
>> remote refs for every commit (i.e. tag^{}) tag ref way back in the
>> original C implementation of fetch. This doesn't scale well to large
>> numbers of refs, so this replaces it with a hash table based lookup
>> instead, which brings the time down to a few seconds even for very large
>> ref counts.
>>
>> I haven't tested it with non-native transports, but there is no reason
>> to believe that the code should be transport specific.
>
> Very interesting.
>
> A few questions (not criticisms).
>
> * 1m50s to 4.5s is quite impressive, even if it is only in a repository
> with unusual refs-vs-commits ratio, but I personally think 10 refs per
> every commit is already on the borderline of being insane, and the
> normal ratio would be more like 1 refs per every 10-20 commits.
I noticed the problem with a real repository at $dayjob, and did enough
anaylsis to identiy the problem. I deliberately created a repository that
should emphasise the problem so that it was easy to get a handle on.
Having applied the ref-dict patches fetch on my $dayjob repo has gone from
~30s to ~4.5s, which may not be as impressive - but is much easier to live
with.
> What are possible downsides with the new code in repositories with more
> reasonable refs-vs-commits ratio? A hash table (with a sensible hash
> function) would almost always outperform linear search in an randomly
> ordered collection, so my gut tells me that there won't be performance
> downsides, but are there other potential issues we should worry about?
I guess that main thing would be the extra memory usage.
> * In an insanely large refs-vs-commits case, perhaps not 50000:1 but more
> like 100:1, but with a history with far more than one commit, what is
> the memory consumption? Judging from a cursory view, I think the way
> ref-dict re-uses struct ref might be quite suboptimal, as you are using
> only next (for hash-bucket link), old_sha1[] and its name field, and
> also your ref_dict_add() calls alloc_ref() which calls one calloc() per
> requested ref, instead of attempting any bulk allocation.
Yeah, I just reused struct for speed and convience of developing, to
veryify that ref-dict would give me the speed I wanted. A final patch
would want a more optimised version. Except that I've thrown the whole
hash table thing away anyway.
> * The outer loop is walking the list of refs from a transport, and the
> inner loop is walking a copy of the same list of refs from the same
> transport, looking for each refs/tags/X^{} what record, if any, existed
> for refs/tags/X.
>
> Would it make sense to further specialize your optimization? For
> example, something like...
I actually arrived at somthing similar to this myself, after realising
that I could use string_list as a basis.
> * It is tempting to use a hash table when you have to deal with an
> unordered collection, but in this case, wouldn't the refs obtained from
> the transport (it's essentially a ls-remote output, isn't it?) be
> sorted? Can't you take advantage of that fact to optimize the loop,
> without adding a specialized hash table implementation?
I wasn't sure if we could rely on the refs list being sorted. But I've
got a new version that uses an extra string_list instead that is actually
slightly faster. I'll post that shortly.
> We find refs/tags/v0.99 immediately followed by refs/tags/v0.99^{} in
> the ls-remote output. And the inefficient loop is about finding
> refs/tags/v0.99 when we see refs/tags/v0.99^{}, so if we remember the
> tag ref we saw in the previous round, we can check with that first to
> make sure our "sorted" assumption holds true, and optimize the loop out
> that way, no?
--
Julian
---
Bilbo's First Law:
You cannot count friends that are all packed up in barrels.
next prev parent reply other threads:[~2009-09-16 22:36 UTC|newest]
Thread overview: 17+ messages / expand[flat|nested] mbox.gz Atom feed top
2009-09-16 7:53 [RFC/PATCH 0/2] Speed up fetch with large number of tags Julian Phillips
2009-09-16 7:53 ` [RFC/PATCH 1/2] ref-dict: Add a set of functions for working with a ref dictionary Julian Phillips
2009-09-16 7:53 ` [RFC/PATCH 2/2] fetch: Speed up fetch by using " Julian Phillips
2009-09-16 9:44 ` [RFC/PATCH 0/2] Speed up fetch with large number of tags Junio C Hamano
2009-09-16 22:32 ` Julian Phillips [this message]
2009-09-16 22:42 ` Shawn O. Pearce
2009-09-16 22:52 ` Junio C Hamano
2009-09-16 23:03 ` Shawn O. Pearce
2009-09-16 23:19 ` Junio C Hamano
2009-09-16 22:53 ` [RFC/PATCH v2] fetch: Speed up fetch by rewriting find_non_local_tags Julian Phillips
2009-09-16 23:15 ` Junio C Hamano
2009-09-16 23:46 ` Julian Phillips
2009-09-17 1:30 ` Julian Phillips
2009-09-17 7:13 ` Johan Herland
2009-09-17 7:33 ` [RFC/PATCH v3] " Julian Phillips
2009-09-16 22:46 ` [RFC/PATCH 0/2] Speed up fetch with large number of tags Shawn O. Pearce
2009-09-22 20:36 ` Junio C Hamano
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: http://vger.kernel.org/majordomo-info.html
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=alpine.LNX.2.00.0909162141140.13697@reaper.quantumfyre.co.uk \
--to=julian@quantumfyre.co.uk \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).