From: Junio C Hamano <gitster@pobox.com>
To: Julian Phillips <julian@quantumfyre.co.uk>
Cc: git@vger.kernel.org
Subject: Re: [RFC/PATCH 0/2] Speed up fetch with large number of tags
Date: Wed, 16 Sep 2009 02:44:00 -0700	[thread overview]
Message-ID: <7vbplb2pi7.fsf@alter.siamese.dyndns.org> (raw)
In-Reply-To: <20090916074737.58044.42776.julian@quantumfyre.co.uk> (Julian Phillips's message of "Wed, 16 Sep 2009 08:53:01 +0100")

Julian Phillips <julian@quantumfyre.co.uk> writes:

> I have a repository at $dayjob where fetch was taking ~30s to tell me
> that there were no updates.
>
> It turns out that I appear to have added a nasty linear search of all
> remote refs for every peeled tag ref (i.e. tag^{}) way back in the
> original C implementation of fetch.  This doesn't scale well to large
> numbers of refs, so this series replaces it with a hash table based
> lookup instead, which brings the time down to a few seconds even for
> very large ref counts.
>
> I haven't tested it with non-native transports, but there is no reason
> to believe that the code should be transport specific.

Very interesting.

A few questions (not criticisms).

 * 1m50s to 4.5s is quite impressive, even if it is only in a repository
   with an unusual refs-vs-commits ratio, but I personally think 10 refs
   per commit is already on the borderline of being insane, and the normal
   ratio would be more like 1 ref per every 10-20 commits.

   What are the possible downsides of the new code in repositories with a
   more reasonable refs-vs-commits ratio?  A hash table (with a sensible
   hash function) would almost always outperform linear search over a
   randomly ordered collection, so my gut tells me that there won't be
   performance downsides, but are there other potential issues we should
   worry about?

 * In an insanely large refs-vs-commits case, perhaps not 50000:1 but more
   like 100:1, and with a history of far more than one commit, what is the
   memory consumption?  Judging from a cursory look, I think the way
   ref-dict re-uses struct ref might be quite suboptimal: you are using
   only next (for the hash-bucket link), old_sha1[] and the name field,
   and your ref_dict_add() calls alloc_ref(), which does one calloc() per
   requested ref instead of attempting any bulk allocation.
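
   (For the record, the kind of thing I have in mind is something like the
   sketch below.  This is only an illustration; the names in it, such as
   tag_entry, entry_slab and entry_alloc, are made up and do not come from
   your patch.)

        /* a compact entry that carries only what the lookup needs */
        struct tag_entry {
                struct tag_entry *next;         /* hash-bucket link */
                unsigned char sha1[20];
                char name[1];                   /* refname, stored inline */
        };

        /* one xmalloc() serves many entries instead of one calloc() each */
        struct entry_slab {
                struct entry_slab *prev;
                size_t used;
                char buf[64 * 1024];
        };

        static struct entry_slab *slab;

        static struct tag_entry *entry_alloc(const char *name,
                                             const unsigned char *sha1)
        {
                size_t len = sizeof(struct tag_entry) + strlen(name);
                struct tag_entry *e;

                /* round up so the next entry stays pointer-aligned */
                len = (len + sizeof(void *) - 1) & ~(sizeof(void *) - 1);
                if (!slab || sizeof(slab->buf) - slab->used < len) {
                        struct entry_slab *s = xmalloc(sizeof(*s));
                        s->prev = slab;
                        s->used = 0;
                        slab = s;
                }
                e = (struct tag_entry *)(slab->buf + slab->used);
                slab->used += len;
                e->next = NULL;
                hashcpy(e->sha1, sha1);
                strcpy(e->name, name);
                return e;
        }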

 * The outer loop is walking the list of refs from a transport, and the
   inner loop is walking the same list of refs from the same transport
   again, looking, for each refs/tags/X^{}, for the record, if any, that
   exists for refs/tags/X.

   Would it make sense to further specialize your optimization?  For
   example, something like...

        /* Your hash records this structure */
        struct tag_ref_record {
                const char *name;
                struct ref *self;
                struct ref *peeled;
        };

        static void add_to_tail(struct ref ***tail,
                                struct string_list *existing_refs,
                                struct string_list *new_refs,
                                const struct ref *ref,
                                const unsigned char sha1[]) {
                ... the "appending to *tail" thing as a helper function ...
        }

        for (ref in all refs from transport) {
                if (ref is of form "refs/tags/X^{}")
                        look up tag_ref_record for "refs/tags/X" and store
                        ref in its peeled member;
                else if (ref is of form "refs/tags/X")
                        look up tag_ref_record for "refs/tags/X" and store
                        ref in its self member;
        }

        for (trr in all tag_ref_record database) {
                add_to_tail(tail, &existing_refs, &new_refs,
                            trr->self, trr->self->old_sha1);
                add_to_tail(tail, &existing_refs, &new_refs,
                            trr->peeled, trr->self->old_sha1);
        }
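
   To be concrete about what "look up tag_ref_record" could mean, here is
   a rough sketch of such a table; the bucket count, the hash function and
   the helper names (trr_hashval, trr_lookup) are arbitrary and only meant
   as an illustration, not as a finished implementation:

        /* as above, plus a pointer for the hash-bucket chain */
        struct tag_ref_record {
                struct tag_ref_record *next;
                const char *name;       /* "refs/tags/X" */
                struct ref *self;       /* entry for refs/tags/X */
                struct ref *peeled;     /* entry for refs/tags/X^{} */
        };

        #define TRR_HASH_SIZE 1024
        static struct tag_ref_record *trr_buckets[TRR_HASH_SIZE];

        static unsigned int trr_hashval(const char *name)
        {
                unsigned int h = 5381;
                while (*name)
                        h = (h << 5) + h + (unsigned char)*name++;
                return h % TRR_HASH_SIZE;
        }

        /* find the record for "refs/tags/X", creating it on first use */
        static struct tag_ref_record *trr_lookup(const char *name)
        {
                unsigned int h = trr_hashval(name);
                struct tag_ref_record *trr;

                for (trr = trr_buckets[h]; trr; trr = trr->next)
                        if (!strcmp(trr->name, name))
                                return trr;
                trr = xcalloc(1, sizeof(*trr));
                trr->name = xstrdup(name);
                trr->next = trr_buckets[h];
                trr_buckets[h] = trr;
                return trr;
        }

   With something like that, the first loop does one cheap hash lookup per
   remote tag ref, and the second loop walks the records once, so the
   whole thing stays linear in the number of refs.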

 * It is tempting to use a hash table when you have to deal with an
   unordered collection, but in this case, wouldn't the refs obtained from
   the transport (it's essentially the ls-remote output, isn't it?) be
   sorted?  Can't you take advantage of that fact to optimize the loop,
   without adding a specialized hash table implementation?

   We find refs/tags/v0.99 immediately followed by refs/tags/v0.99^{} in
   the ls-remote output.  And the inefficient loop is about finding
   refs/tags/v0.99 when we see refs/tags/v0.99^{}, so if we remember the
   tag ref we saw in the previous round, we can check against it first
   (which also verifies that our "sorted" assumption holds), and avoid the
   inner loop that way, no?

diff --git a/builtin-fetch.c b/builtin-fetch.c
index cb48c57..3f12e28 100644
--- a/builtin-fetch.c
+++ b/builtin-fetch.c
@@ -516,6 +516,7 @@ static void find_non_local_tags(struct transport *transport,
 	const struct ref *tag_ref;
 	struct ref *rm = NULL;
 	const struct ref *ref;
+	const struct ref *last_tag_seen = NULL;
 
 	for_each_ref(add_existing, &existing_refs);
 	for (ref = transport_get_remote_refs(transport); ref; ref = ref->next) {
@@ -528,6 +529,11 @@ static void find_non_local_tags(struct transport *transport,
 
 		if (!strcmp(ref_name + ref_name_len - 3, "^{}")) {
 			ref_name[ref_name_len - 3] = 0;
+			if (last_tag_seen &&
+			    !strcmp(last_tag_seen->name, ref_name)) {
+				ref_sha1 = last_tag_seen->old_sha1;
+				goto quick;
+			}
 			tag_ref = transport_get_remote_refs(transport);
 			while (tag_ref) {
 				if (!strcmp(tag_ref->name, ref_name)) {
@@ -536,8 +542,11 @@ static void find_non_local_tags(struct transport *transport,
 				}
 				tag_ref = tag_ref->next;
 			}
+		} else {
+			last_tag_seen = ref;
 		}
 
+	quick:
 		if (!string_list_has_string(&existing_refs, ref_name) &&
 		    !string_list_has_string(&new_refs, ref_name) &&
 		    (has_sha1_file(ref->old_sha1) ||
