git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH 0/1] fetch: Cache the want OIDs for faster lookup
@ 2019-09-15 21:18 Masaya Suzuki
  2019-09-15 21:18 ` [PATCH 1/1] " Masaya Suzuki
  2019-09-16 20:01 ` [PATCH 0/1] " Junio C Hamano
  0 siblings, 2 replies; 6+ messages in thread
From: Masaya Suzuki @ 2019-09-15 21:18 UTC (permalink / raw)
  To: git; +Cc: Masaya Suzuki

When mirroring a repository with a lot of refs, there is a case where the client
takes a long time to calculate the want OIDs. This is observable with Chromium's
repository for example.

$ mkdir testing
$ cd testing
$ git init . --bare
$ git remote add origin master https://chromium.googlesource.com/chromium/src --mirror=fetch
$ git fetch origin

With the commands above, it takes a long time before sending a fetch request. I
stopped the command after 15 minutes.

Debugging this, it seems most of the time is spent on iterating the want refs to
see OIDs are included there. This patch speeds up this process by using oid_set.
Now the client can send a fetch request almost immediately.

Masaya Suzuki (1):
  fetch: Cache the want OIDs for faster lookup

 builtin/fetch.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

-- 
2.23.0.237.gc6a4ce50a0-goog


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH 1/1] fetch: Cache the want OIDs for faster lookup
  2019-09-15 21:18 [PATCH 0/1] fetch: Cache the want OIDs for faster lookup Masaya Suzuki
@ 2019-09-15 21:18 ` Masaya Suzuki
  2019-09-16  2:35   ` Derrick Stolee
  2019-09-16  4:34   ` Jeff King
  2019-09-16 20:01 ` [PATCH 0/1] " Junio C Hamano
  1 sibling, 2 replies; 6+ messages in thread
From: Masaya Suzuki @ 2019-09-15 21:18 UTC (permalink / raw)
  To: git; +Cc: Masaya Suzuki

During git-fetch, the client checks if the advertised tags' OIDs are
already in the fetch request's want OID set. This check is done in a
linear scan. For a repository that has a lot of refs, repeating this
scan takes 15+ minutes. In order to speed this up, create a oid_set for
other refs' OIDs.

Signed-off-by: Masaya Suzuki <masayasuzuki@google.com>
---
 builtin/fetch.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/builtin/fetch.c b/builtin/fetch.c
index 54d6b01892..51a276dfaa 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -7,6 +7,7 @@
 #include "refs.h"
 #include "refspec.h"
 #include "object-store.h"
+#include "oidset.h"
 #include "commit.h"
 #include "builtin.h"
 #include "string-list.h"
@@ -243,15 +244,13 @@ static void add_merge_config(struct ref **head,
 	}
 }
 
-static int will_fetch(struct ref **head, const unsigned char *sha1)
+static void create_fetch_oidset(struct ref **head, struct oidset *out)
 {
 	struct ref *rm = *head;
 	while (rm) {
-		if (hasheq(rm->old_oid.hash, sha1))
-			return 1;
+		oidset_insert(out, &rm->old_oid);
 		rm = rm->next;
 	}
-	return 0;
 }
 
 struct refname_hash_entry {
@@ -317,6 +316,7 @@ static void find_non_local_tags(const struct ref *refs,
 {
 	struct hashmap existing_refs;
 	struct hashmap remote_refs;
+	struct oidset fetch_oids = OIDSET_INIT;
 	struct string_list remote_refs_list = STRING_LIST_INIT_NODUP;
 	struct string_list_item *remote_ref_item;
 	const struct ref *ref;
@@ -324,6 +324,7 @@ static void find_non_local_tags(const struct ref *refs,
 
 	refname_hash_init(&existing_refs);
 	refname_hash_init(&remote_refs);
+	create_fetch_oidset(head, &fetch_oids);
 
 	for_each_ref(add_one_refname, &existing_refs);
 	for (ref = refs; ref; ref = ref->next) {
@@ -340,9 +341,9 @@ static void find_non_local_tags(const struct ref *refs,
 			if (item &&
 			    !has_object_file_with_flags(&ref->old_oid,
 							OBJECT_INFO_QUICK) &&
-			    !will_fetch(head, ref->old_oid.hash) &&
+			    !oidset_contains(&fetch_oids, &ref->old_oid) &&
 			    !has_object_file_with_flags(&item->oid, OBJECT_INFO_QUICK) &&
-			    !will_fetch(head, item->oid.hash))
+			    !oidset_contains(&fetch_oids, &item->oid))
 				clear_item(item);
 			item = NULL;
 			continue;
@@ -356,7 +357,7 @@ static void find_non_local_tags(const struct ref *refs,
 		 */
 		if (item &&
 		    !has_object_file_with_flags(&item->oid, OBJECT_INFO_QUICK) &&
-		    !will_fetch(head, item->oid.hash))
+		    !oidset_contains(&fetch_oids, &item->oid))
 			clear_item(item);
 
 		item = NULL;
@@ -377,7 +378,7 @@ static void find_non_local_tags(const struct ref *refs,
 	 */
 	if (item &&
 	    !has_object_file_with_flags(&item->oid, OBJECT_INFO_QUICK) &&
-	    !will_fetch(head, item->oid.hash))
+	    !oidset_contains(&fetch_oids, &item->oid))
 		clear_item(item);
 
 	/*
@@ -404,6 +405,7 @@ static void find_non_local_tags(const struct ref *refs,
 	}
 	hashmap_free(&remote_refs, 1);
 	string_list_clear(&remote_refs_list, 0);
+	oidset_clear(&fetch_oids);
 }
 
 static struct ref *get_ref_map(struct remote *remote,
-- 
2.23.0.237.gc6a4ce50a0-goog


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH 1/1] fetch: Cache the want OIDs for faster lookup
  2019-09-15 21:18 ` [PATCH 1/1] " Masaya Suzuki
@ 2019-09-16  2:35   ` Derrick Stolee
  2019-09-16  4:28     ` Masaya Suzuki
  2019-09-16  4:34   ` Jeff King
  1 sibling, 1 reply; 6+ messages in thread
From: Derrick Stolee @ 2019-09-16  2:35 UTC (permalink / raw)
  To: Masaya Suzuki, git

On 9/15/2019 5:18 PM, Masaya Suzuki wrote:
> During git-fetch, the client checks if the advertised tags' OIDs are
> already in the fetch request's want OID set. This check is done in a
> linear scan. For a repository that has a lot of refs, repeating this
> scan takes 15+ minutes. In order to speed this up, create a oid_set for
> other refs' OIDs.

Good catch! Quadratic performance is never good.

The patch below looks like it works, but could you also share your
performance timings for the 15+ minute case after your patch is
applied?

Thanks,
-Stolee

> 
> Signed-off-by: Masaya Suzuki <masayasuzuki@google.com>
> ---
>  builtin/fetch.c | 18 ++++++++++--------
>  1 file changed, 10 insertions(+), 8 deletions(-)
> 
> diff --git a/builtin/fetch.c b/builtin/fetch.c
> index 54d6b01892..51a276dfaa 100644
> --- a/builtin/fetch.c
> +++ b/builtin/fetch.c
> @@ -7,6 +7,7 @@
>  #include "refs.h"
>  #include "refspec.h"
>  #include "object-store.h"
> +#include "oidset.h"
>  #include "commit.h"
>  #include "builtin.h"
>  #include "string-list.h"
> @@ -243,15 +244,13 @@ static void add_merge_config(struct ref **head,
>  	}
>  }
>  
> -static int will_fetch(struct ref **head, const unsigned char *sha1)
> +static void create_fetch_oidset(struct ref **head, struct oidset *out)
>  {
>  	struct ref *rm = *head;
>  	while (rm) {
> -		if (hasheq(rm->old_oid.hash, sha1))
> -			return 1;
> +		oidset_insert(out, &rm->old_oid);
>  		rm = rm->next;
>  	}
> -	return 0;
>  }
>  
>  struct refname_hash_entry {
> @@ -317,6 +316,7 @@ static void find_non_local_tags(const struct ref *refs,
>  {
>  	struct hashmap existing_refs;
>  	struct hashmap remote_refs;
> +	struct oidset fetch_oids = OIDSET_INIT;
>  	struct string_list remote_refs_list = STRING_LIST_INIT_NODUP;
>  	struct string_list_item *remote_ref_item;
>  	const struct ref *ref;
> @@ -324,6 +324,7 @@ static void find_non_local_tags(const struct ref *refs,
>  
>  	refname_hash_init(&existing_refs);
>  	refname_hash_init(&remote_refs);
> +	create_fetch_oidset(head, &fetch_oids);
>  
>  	for_each_ref(add_one_refname, &existing_refs);
>  	for (ref = refs; ref; ref = ref->next) {
> @@ -340,9 +341,9 @@ static void find_non_local_tags(const struct ref *refs,
>  			if (item &&
>  			    !has_object_file_with_flags(&ref->old_oid,
>  							OBJECT_INFO_QUICK) &&
> -			    !will_fetch(head, ref->old_oid.hash) &&
> +			    !oidset_contains(&fetch_oids, &ref->old_oid) &&
>  			    !has_object_file_with_flags(&item->oid, OBJECT_INFO_QUICK) &&
> -			    !will_fetch(head, item->oid.hash))
> +			    !oidset_contains(&fetch_oids, &item->oid))
>  				clear_item(item);
>  			item = NULL;
>  			continue;
> @@ -356,7 +357,7 @@ static void find_non_local_tags(const struct ref *refs,
>  		 */
>  		if (item &&
>  		    !has_object_file_with_flags(&item->oid, OBJECT_INFO_QUICK) &&
> -		    !will_fetch(head, item->oid.hash))
> +		    !oidset_contains(&fetch_oids, &item->oid))
>  			clear_item(item);
>  
>  		item = NULL;
> @@ -377,7 +378,7 @@ static void find_non_local_tags(const struct ref *refs,
>  	 */
>  	if (item &&
>  	    !has_object_file_with_flags(&item->oid, OBJECT_INFO_QUICK) &&
> -	    !will_fetch(head, item->oid.hash))
> +	    !oidset_contains(&fetch_oids, &item->oid))
>  		clear_item(item);
>  
>  	/*
> @@ -404,6 +405,7 @@ static void find_non_local_tags(const struct ref *refs,
>  	}
>  	hashmap_free(&remote_refs, 1);
>  	string_list_clear(&remote_refs_list, 0);
> +	oidset_clear(&fetch_oids);
>  }
>  
>  static struct ref *get_ref_map(struct remote *remote,
> 


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 1/1] fetch: Cache the want OIDs for faster lookup
  2019-09-16  2:35   ` Derrick Stolee
@ 2019-09-16  4:28     ` Masaya Suzuki
  0 siblings, 0 replies; 6+ messages in thread
From: Masaya Suzuki @ 2019-09-16  4:28 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Git Mailing List

On Sun, Sep 15, 2019 at 7:35 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 9/15/2019 5:18 PM, Masaya Suzuki wrote:
> > During git-fetch, the client checks if the advertised tags' OIDs are
> > already in the fetch request's want OID set. This check is done in a
> > linear scan. For a repository that has a lot of refs, repeating this
> > scan takes 15+ minutes. In order to speed this up, create a oid_set for
> > other refs' OIDs.
>
> Good catch! Quadratic performance is never good.
>
> The patch below looks like it works, but could you also share your
> performance timings for the 15+ minute case after your patch is
> applied?

With the following code change, I measured the time for
find_non_local_tags. It shows 215 msec with the example commands. (I
didn't measure entire fetch time as good portion of the time is spent
on the server side.)

diff --git a/builtin/fetch.c b/builtin/fetch.c
index 51a276dfaa..d3b06c733d 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -25,6 +25,7 @@
 #include "list-objects-filter-options.h"
 #include "commit-reach.h"
 #include "branch.h"
+#include <time.h>

 #define FORCED_UPDATES_DELAY_WARNING_IN_MS (10 * 1000)

@@ -322,8 +323,11 @@ static void find_non_local_tags(const struct ref *refs,
        const struct ref *ref;
        struct refname_hash_entry *item = NULL;

+       struct timespec start, end;
+
        refname_hash_init(&existing_refs);
        refname_hash_init(&remote_refs);
+       clock_gettime(CLOCK_MONOTONIC, &start);
        create_fetch_oidset(head, &fetch_oids);

        for_each_ref(add_one_refname, &existing_refs);
@@ -405,6 +409,12 @@ static void find_non_local_tags(const struct ref *refs,
        }
        hashmap_free(&remote_refs, 1);
        string_list_clear(&remote_refs_list, 0);
+       clock_gettime(CLOCK_MONOTONIC, &end);
+       {
+               uint64_t millisec = (end.tv_sec - start.tv_sec) * 1000
+ (end.tv_nsec - start.tv_nsec) / 1000000;
+               fprintf(stderr, "find_non_local_tags: %ld msec\n", millisec);
+       }
+
        oidset_clear(&fetch_oids);
 }

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH 1/1] fetch: Cache the want OIDs for faster lookup
  2019-09-15 21:18 ` [PATCH 1/1] " Masaya Suzuki
  2019-09-16  2:35   ` Derrick Stolee
@ 2019-09-16  4:34   ` Jeff King
  1 sibling, 0 replies; 6+ messages in thread
From: Jeff King @ 2019-09-16  4:34 UTC (permalink / raw)
  To: Masaya Suzuki; +Cc: git

On Sun, Sep 15, 2019 at 02:18:02PM -0700, Masaya Suzuki wrote:

> During git-fetch, the client checks if the advertised tags' OIDs are
> already in the fetch request's want OID set. This check is done in a
> linear scan. For a repository that has a lot of refs, repeating this
> scan takes 15+ minutes. In order to speed this up, create a oid_set for
> other refs' OIDs.

Good find. I was curious why nobody noticed it sooner, but I think a key
element in your chromium example is the use of "--mirror", which brings
in a bunch of refs which would not normally be grabbed. There are only
about 17k heads and tags, but over 1.5M entries in refs/changes.
Quadratically speaking, that's almost 8000 times worse, so your
15-minute delay is only 1/10th of a second in the non-mirror case.

Mostly I was wondering if this might have been a recent regression. But
it looks rather old:

> -static int will_fetch(struct ref **head, const unsigned char *sha1)
> +static void create_fetch_oidset(struct ref **head, struct oidset *out)
>  {
>  	struct ref *rm = *head;
>  	while (rm) {
> -		if (hasheq(rm->old_oid.hash, sha1))
> -			return 1;
> +		oidset_insert(out, &rm->old_oid);
>  		rm = rm->next;
>  	}
> -	return 0;
>  }

This function comes from cf7f929a10 (Teach git-fetch to grab a tag at
the same time as a commit, 2008-03-02). As explained in that commit
message, this technique can't actually get all the tags we're looking
for. It was around the same time, in 348e390b17 (Teach
fetch-pack/upload-pack about --include-tag, 2008-03-03), that we added
an extension to do this reliably.

So it _might_ even be possible to get rid of this code entirely, if we
don't care about pre-2008 versions of Git (we'd still get the right
answer from them; it would just take an extra round-trip).

That said, it seems like a good idea to do this obvious and safe
conversion in the near term, and we can look at removing it entirely as
a separate topic. The cost of the oidset isn't too bad (~30MB, I guess
even for that pathological chromium case).

> @@ -324,6 +324,7 @@ static void find_non_local_tags(const struct ref *refs,
>  
>  	refname_hash_init(&existing_refs);
>  	refname_hash_init(&remote_refs);
> +	create_fetch_oidset(head, &fetch_oids);
>  
>  	for_each_ref(add_one_refname, &existing_refs);
>  	for (ref = refs; ref; ref = ref->next) {
> @@ -340,9 +341,9 @@ static void find_non_local_tags(const struct ref *refs,
>  			if (item &&
>  			    !has_object_file_with_flags(&ref->old_oid,
>  							OBJECT_INFO_QUICK) &&
> -			    !will_fetch(head, ref->old_oid.hash) &&
> +			    !oidset_contains(&fetch_oids, &ref->old_oid) &&
>  			    !has_object_file_with_flags(&item->oid, OBJECT_INFO_QUICK) &&
> -			    !will_fetch(head, item->oid.hash))
> +			    !oidset_contains(&fetch_oids, &item->oid))
>  				clear_item(item);
>  			item = NULL;
>  			continue;

One thing to check is whether the list we feed to will_fetch ever gets
updated in the loop that checks it (i.e., do we need to be adding or
clearing elements in the oidset).

I _think_ the answer is no, because we modify only the "refs" list in
that loop (via clear_item()), but the will_fetch() check is on the
head/tail we receive (which isn't updated until later). Both of those
lists are passed into the function, but I didn't trace it all the way up
to make sure they are not aliases.

-Peff

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 0/1] fetch: Cache the want OIDs for faster lookup
  2019-09-15 21:18 [PATCH 0/1] fetch: Cache the want OIDs for faster lookup Masaya Suzuki
  2019-09-15 21:18 ` [PATCH 1/1] " Masaya Suzuki
@ 2019-09-16 20:01 ` Junio C Hamano
  1 sibling, 0 replies; 6+ messages in thread
From: Junio C Hamano @ 2019-09-16 20:01 UTC (permalink / raw)
  To: Masaya Suzuki; +Cc: git

Masaya Suzuki <masayasuzuki@google.com> writes:

> Debugging this, it seems most of the time is spent on iterating the want refs to
> see OIDs are included there. This patch speeds up this process by using oid_set.
> Now the client can send a fetch request almost immediately.

Wonderful.  Thanks.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2019-09-16 20:01 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-15 21:18 [PATCH 0/1] fetch: Cache the want OIDs for faster lookup Masaya Suzuki
2019-09-15 21:18 ` [PATCH 1/1] " Masaya Suzuki
2019-09-16  2:35   ` Derrick Stolee
2019-09-16  4:28     ` Masaya Suzuki
2019-09-16  4:34   ` Jeff King
2019-09-16 20:01 ` [PATCH 0/1] " Junio C Hamano

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).