Re: [RFC/PATCH 0/2] Speed up fetch with large number of tags

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

From: Julian Phillips <julian@quantumfyre.co.uk>
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org
Subject: Re: [RFC/PATCH 0/2] Speed up fetch with large number of tags
Date: Wed, 16 Sep 2009 23:32:52 +0100 (BST)	[thread overview]
Message-ID: <alpine.LNX.2.00.0909162141140.13697@reaper.quantumfyre.co.uk> (raw)
In-Reply-To: <7vbplb2pi7.fsf@alter.siamese.dyndns.org>

On Wed, 16 Sep 2009, Junio C Hamano wrote:

> Julian Phillips <julian@quantumfyre.co.uk> writes:
>
>> I have a repository at $dayjob where fetch was taking ~30s to tell me
>> that there were no updates.
>>
>> It turns out that I appear to have added a nasty linear search of all
>> remote refs for every commit (i.e. tag^{}) tag ref way back in the
>> original C implementation of fetch.  This doesn't scale well to large
>> numbers of refs, so this replaces it with a hash table based lookup
>> instead, which brings the time down to a few seconds even for very large
>> ref counts.
>>
>> I haven't tested it with non-native transports, but there is no reason
>> to believe that the code should be transport specific.
>
> Very interesting.
>
> A few questions (not criticisms).
>
> * 1m50s to 4.5s is quite impressive, even if it is only in a repository
>   with unusual refs-vs-commits ratio, but I personally think 10 refs per
>   every commit is already on the borderline of being insane, and the
>   normal ratio would be more like 1 refs per every 10-20 commits.

I noticed the problem with a real repository at $dayjob, and did enough 
anaylsis to identiy the problem.  I deliberately created a repository that 
should emphasise the problem so that it was easy to get a handle on.

Having applied the ref-dict patches fetch on my $dayjob repo has gone from 
~30s to ~4.5s, which may not be as impressive - but is much easier to live 
with.

>   What are possible downsides with the new code in repositories with more
>   reasonable refs-vs-commits ratio?  A hash table (with a sensible hash
>   function) would almost always outperform linear search in an randomly
>   ordered collection, so my gut tells me that there won't be performance
>   downsides, but are there other potential issues we should worry about?

I guess that main thing would be the extra memory usage.

> * In an insanely large refs-vs-commits case, perhaps not 50000:1 but more
>   like 100:1, but with a history with far more than one commit, what is
>   the memory consumption?  Judging from a cursory view, I think the way
>   ref-dict re-uses struct ref might be quite suboptimal, as you are using
>   only next (for hash-bucket link), old_sha1[] and its name field, and
>   also your ref_dict_add() calls alloc_ref() which calls one calloc() per
>   requested ref, instead of attempting any bulk allocation.

Yeah, I just reused struct for speed and convience of developing, to 
veryify that ref-dict would give me the speed I wanted.  A final patch 
would want a more optimised version.  Except that I've thrown the whole 
hash table thing away anyway.

> * The outer loop is walking the list of refs from a transport, and the
>   inner loop is walking a copy of the same list of refs from the same
>   transport, looking for each refs/tags/X^{} what record, if any, existed
>   for refs/tags/X.
>
>   Would it make sense to further specialize your optimization?  For
>   example, something like...

I actually arrived at somthing similar to this myself, after realising 
that I could use string_list as a basis.

> * It is tempting to use a hash table when you have to deal with an
>   unordered collection, but in this case, wouldn't the refs obtained from
>   the transport (it's essentially a ls-remote output, isn't it?) be
>   sorted?  Can't you take advantage of that fact to optimize the loop,
>   without adding a specialized hash table implementation?

I wasn't sure if we could rely on the refs list being sorted.  But I've 
got a new version that uses an extra string_list instead that is actually 
slightly faster.  I'll post that shortly.

>   We find refs/tags/v0.99 immediately followed by refs/tags/v0.99^{} in
>   the ls-remote output.  And the inefficient loop is about finding
>   refs/tags/v0.99 when we see refs/tags/v0.99^{}, so if we remember the
>   tag ref we saw in the previous round, we can check with that first to
>   make sure our "sorted" assumption holds true, and optimize the loop out
>   that way, no?

-- 
Julian

  ---
Bilbo's First Law:
 	You cannot count friends that are all packed up in barrels.

next prev parent reply	other threads:[~2009-09-16 22:36 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-09-16  7:53 [RFC/PATCH 0/2] Speed up fetch with large number of tags Julian Phillips
2009-09-16  7:53 ` [RFC/PATCH 1/2] ref-dict: Add a set of functions for working with a ref dictionary Julian Phillips
2009-09-16  7:53 ` [RFC/PATCH 2/2] fetch: Speed up fetch by using " Julian Phillips
2009-09-16  9:44 ` [RFC/PATCH 0/2] Speed up fetch with large number of tags Junio C Hamano
2009-09-16 22:32   ` Julian Phillips [this message]
2009-09-16 22:42     ` Shawn O. Pearce
2009-09-16 22:52       ` Junio C Hamano
2009-09-16 23:03         ` Shawn O. Pearce
2009-09-16 23:19           ` Junio C Hamano
2009-09-16 22:53     ` [RFC/PATCH v2] fetch: Speed up fetch by rewriting find_non_local_tags Julian Phillips
2009-09-16 23:15       ` Junio C Hamano
2009-09-16 23:46         ` Julian Phillips
2009-09-17  1:30           ` Julian Phillips
2009-09-17  7:13             ` Johan Herland
2009-09-17  7:33               ` [RFC/PATCH v3] " Julian Phillips
2009-09-16 22:46   ` [RFC/PATCH 0/2] Speed up fetch with large number of tags Shawn O. Pearce
2009-09-22 20:36     ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=alpine.LNX.2.00.0909162141140.13697@reaper.quantumfyre.co.uk \
    --to=julian@quantumfyre.co.uk \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).