git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jeff King <peff@peff.net>
To: Masaya Suzuki <masayasuzuki@google.com>
Cc: git@vger.kernel.org
Subject: Re: [PATCH 1/1] fetch: Cache the want OIDs for faster lookup
Date: Mon, 16 Sep 2019 00:34:55 -0400	[thread overview]
Message-ID: <20190916043454.GA18026@sigill.intra.peff.net> (raw)
In-Reply-To: <20190915211802.207715-2-masayasuzuki@google.com>

On Sun, Sep 15, 2019 at 02:18:02PM -0700, Masaya Suzuki wrote:

> During git-fetch, the client checks if the advertised tags' OIDs are
> already in the fetch request's want OID set. This check is done in a
> linear scan. For a repository that has a lot of refs, repeating this
> scan takes 15+ minutes. In order to speed this up, create a oid_set for
> other refs' OIDs.

Good find. I was curious why nobody noticed it sooner, but I think a key
element in your chromium example is the use of "--mirror", which brings
in a bunch of refs which would not normally be grabbed. There are only
about 17k heads and tags, but over 1.5M entries in refs/changes.
Quadratically speaking, that's almost 8000 times worse, so your
15-minute delay is only 1/10th of a second in the non-mirror case.

Mostly I was wondering if this might have been a recent regression. But
it looks rather old:

> -static int will_fetch(struct ref **head, const unsigned char *sha1)
> +static void create_fetch_oidset(struct ref **head, struct oidset *out)
>  {
>  	struct ref *rm = *head;
>  	while (rm) {
> -		if (hasheq(rm->old_oid.hash, sha1))
> -			return 1;
> +		oidset_insert(out, &rm->old_oid);
>  		rm = rm->next;
>  	}
> -	return 0;
>  }

This function comes from cf7f929a10 (Teach git-fetch to grab a tag at
the same time as a commit, 2008-03-02). As explained in that commit
message, this technique can't actually get all the tags we're looking
for. It was around the same time, in 348e390b17 (Teach
fetch-pack/upload-pack about --include-tag, 2008-03-03), that we added
an extension to do this reliably.

So it _might_ even be possible to get rid of this code entirely, if we
don't care about pre-2008 versions of Git (we'd still get the right
answer from them; it would just take an extra round-trip).

That said, it seems like a good idea to do this obvious and safe
conversion in the near term, and we can look at removing it entirely as
a separate topic. The cost of the oidset isn't too bad (~30MB, I guess
even for that pathological chromium case).

> @@ -324,6 +324,7 @@ static void find_non_local_tags(const struct ref *refs,
>  
>  	refname_hash_init(&existing_refs);
>  	refname_hash_init(&remote_refs);
> +	create_fetch_oidset(head, &fetch_oids);
>  
>  	for_each_ref(add_one_refname, &existing_refs);
>  	for (ref = refs; ref; ref = ref->next) {
> @@ -340,9 +341,9 @@ static void find_non_local_tags(const struct ref *refs,
>  			if (item &&
>  			    !has_object_file_with_flags(&ref->old_oid,
>  							OBJECT_INFO_QUICK) &&
> -			    !will_fetch(head, ref->old_oid.hash) &&
> +			    !oidset_contains(&fetch_oids, &ref->old_oid) &&
>  			    !has_object_file_with_flags(&item->oid, OBJECT_INFO_QUICK) &&
> -			    !will_fetch(head, item->oid.hash))
> +			    !oidset_contains(&fetch_oids, &item->oid))
>  				clear_item(item);
>  			item = NULL;
>  			continue;

One thing to check is whether the list we feed to will_fetch ever gets
updated in the loop that checks it (i.e., do we need to be adding or
clearing elements in the oidset).

I _think_ the answer is no, because we modify only the "refs" list in
that loop (via clear_item()), but the will_fetch() check is on the
head/tail we receive (which isn't updated until later). Both of those
lists are passed into the function, but I didn't trace it all the way up
to make sure they are not aliases.

-Peff

  parent reply	other threads:[~2019-09-16  4:34 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-09-15 21:18 [PATCH 0/1] fetch: Cache the want OIDs for faster lookup Masaya Suzuki
2019-09-15 21:18 ` [PATCH 1/1] " Masaya Suzuki
2019-09-16  2:35   ` Derrick Stolee
2019-09-16  4:28     ` Masaya Suzuki
2019-09-16  4:34   ` Jeff King [this message]
2019-09-16 20:01 ` [PATCH 0/1] " Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190916043454.GA18026@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=git@vger.kernel.org \
    --cc=masayasuzuki@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).