git@vger.kernel.org list mirror (unofficial, one of many)
From: Jeff King <peff@peff.net>
To: Jonathan Tan <jonathantanmy@google.com>
Cc: git@vger.kernel.org
Subject: Re: [PATCH] diff: batch fetching of missing blobs
Date: Thu, 28 Mar 2019 02:52:52 -0400
Message-ID: <20190328065252.GA1930@sigill.intra.peff.net>
In-Reply-To: <20190326220906.111879-1-jonathantanmy@google.com>

On Tue, Mar 26, 2019 at 03:09:06PM -0700, Jonathan Tan wrote:

> When running a command like "git show" or "git diff" in a partial clone,
> batch all missing blobs to be fetched as one request.
> 
> This is similar to c0c578b33c ("unpack-trees: batch fetching of missing
> blobs", 2017-12-08), but for another command.

Sounds like a good idea, and this should make some cases much better
without making any cases worse. Two observations about how we might do
even better, though:

> @@ -6067,6 +6068,32 @@ static void diff_flush_patch_all_file_pairs(struct diff_options *o)
> [...]

At this stage we're looking at actually diffing the contents themselves.
But we'd also potentially need the blobs during rename and break
detection. It's not always the same set of blobs (e.g., unless you've
cranked up the copy-detection flags, renames only look at added/deleted
entries). We could have each phase do its own bulk fetch, which
worst-case gives us probably three fetches. But I wonder if we could
figure out a complete plausible set immediately after the tree diff.

> +	if (repository_format_partial_clone) {
> +		/*
> +		 * Prefetch the diff pairs that are about to be flushed.
> +		 */
> +		struct oid_array to_fetch = OID_ARRAY_INIT;
> +		int fetch_if_missing_store = fetch_if_missing;
> +
> +		fetch_if_missing = 0;
> +		for (i = 0; i < q->nr; i++) {
> +			struct diff_filepair *p = q->queue[i];
> +			if (!check_pair_status(p))
> +				continue;
> +			if (p->one && p->one->oid_valid &&
> +			    !has_object_file(&p->one->oid))
> +				oid_array_append(&to_fetch, &p->one->oid);
> +			if (p->two && p->two->oid_valid &&
> +			    !has_object_file(&p->two->oid))
> +				oid_array_append(&to_fetch, &p->two->oid);
> +		}

These has_object_file() calls may repeatedly re-scan the pack directory,
once per call.  Since it's likely that some may be missing, that may be
a noticeable amount of wasted work for a big diff (still way less than
individually fetching each missing object, so it's a net win, but read
on).

If you use the QUICK flag, that avoids the re-scans, but we may
erroneously say "we don't have it" if we race with a repack. For that,
we can either:

  1. Just ignore it. It's relatively rare, and the worst case is that we
     re-fetch an object.

  2. Do a series of QUICK checks, followed by a single
     reprepare_packed_git() if we had any missing, and then another
     series of QUICK checks. Then worst-case we have a single re-scan.

Something like:

  int object_is_missing_cb(const struct object_id *oid, void *data)
  {
	return !has_object_file_with_flags(oid, OBJECT_INFO_QUICK);
  }
  ...

  /* collect all of the possible blobs we need */
  for (i = 0; i < q->nr; i++) {
	...
	oid_array_append(&to_fetch, &p->one->oid);
	oid_array_append(&to_fetch, &p->two->oid);
  }

  /* drop any we already have */
  oid_array_filter(&to_fetch, object_is_missing_cb, NULL);

  /* any missing ones might actually be a race; try again */
  if (to_fetch.nr) {
	reprepare_packed_git(the_repository);
	oid_array_filter(&to_fetch, object_is_missing_cb, NULL);
  }

  /* and now anything we have left is definitely not here */
  if (to_fetch.nr)
	fetch_objects(..., to_fetch.oid, to_fetch.nr);
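
For anyone who wants to play with the control flow outside of git, here
is a minimal self-contained sketch of option 2. Everything in it is a
hypothetical stand-in (struct fake_store, store_has_quick(),
store_rescan(), find_missing()); none of it is git's real API. The
"pending" list models objects that landed in a new pack after our last
scan, which is the race described above.

```c
#include <stddef.h>

/* Hypothetical stand-ins for git's object store; not git's real API. */
#define MAX_IDS 16

struct id_array {
	int id[MAX_IDS];
	size_t nr;
};

struct fake_store {
	int visible[MAX_IDS];	/* ids a cheap lookup can find */
	size_t nr_visible;
	int pending[MAX_IDS];	/* ids in packs created since the last scan */
	size_t nr_pending;
	int rescans;		/* how many expensive re-scans we did */
};

/* Cheap check: never re-scans, so it can miss recently repacked objects. */
static int store_has_quick(const struct fake_store *s, int id)
{
	size_t i;

	for (i = 0; i < s->nr_visible; i++)
		if (s->visible[i] == id)
			return 1;
	return 0;
}

/* Expensive step: pick up everything that appeared since the last scan. */
static void store_rescan(struct fake_store *s)
{
	size_t i;

	for (i = 0; i < s->nr_pending && s->nr_visible < MAX_IDS; i++)
		s->visible[s->nr_visible++] = s->pending[i];
	s->nr_pending = 0;
	s->rescans++;
}

/* Keep only the ids the quick check cannot find, compacting in place. */
static void drop_present(struct id_array *a, const struct fake_store *s)
{
	size_t src, dst = 0;

	for (src = 0; src < a->nr; src++)
		if (!store_has_quick(s, a->id[src]))
			a->id[dst++] = a->id[src];
	a->nr = dst;
}

/* Returns how many ids are still missing and would need fetching. */
static size_t find_missing(struct id_array *a, struct fake_store *s)
{
	drop_present(a, s);
	if (a->nr) {
		/* anything missing might be a race; re-scan exactly once */
		store_rescan(s);
		drop_present(a, s);
	}
	return a->nr;
}
```

The point of the shape is that the expensive step runs at most once per
batch, and not at all when the first pass finds everything.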

One thing I noticed while writing this: we don't seem to do any
de-duplication of the list (in yours or mine), and it doesn't look like
fetch_objects() does either. I wonder if an oidset would be a better
data structure.
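
To illustrate the de-duplication point, here is a rough standalone
sketch. Note that it swaps in a different technique from the oidset
idea: it keeps a plain array and does a sort followed by a pass that
squeezes out adjacent duplicates. struct id, ID_LEN, and dedup_ids()
are made-up names for illustration and are not part of git.

```c
#include <stdlib.h>
#include <string.h>

#define ID_LEN 20	/* assumed hash size for this sketch */

struct id {
	unsigned char hash[ID_LEN];
};

/* Byte-wise ordering, usable directly as a qsort() comparator. */
static int id_cmp(const void *a, const void *b)
{
	return memcmp(((const struct id *)a)->hash,
		      ((const struct id *)b)->hash, ID_LEN);
}

/* Sort the list, then drop adjacent duplicates; returns the new count. */
static size_t dedup_ids(struct id *ids, size_t nr)
{
	size_t src, dst;

	if (nr < 2)
		return nr;
	qsort(ids, nr, sizeof(*ids), id_cmp);
	for (src = 1, dst = 1; src < nr; src++)
		if (id_cmp(&ids[dst - 1], &ids[src]))
			ids[dst++] = ids[src];
	return dst;
}
```

A set would avoid the sort and reject duplicates at insertion time; the
array version is only meant to show how little is needed if the list is
built first and cleaned up once before the fetch.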

-Peff


Thread overview: 29+ messages
2019-03-26 22:09 Jonathan Tan
2019-03-27 10:10 ` SZEDER Gábor
2019-03-27 22:02 ` Johannes Schindelin
2019-03-28  6:52 ` Jeff King [this message]
2019-03-29 21:39 ` [PATCH v2 0/2] Batch fetching of missing blobs in diff and show Jonathan Tan
2019-03-29 21:39   ` [PATCH v2 1/2] sha1-file: support OBJECT_INFO_FOR_PREFETCH Jonathan Tan
2019-04-05 14:13     ` Johannes Schindelin
2019-04-05 22:00     ` Jeff King
2019-03-29 21:39   ` [PATCH v2 2/2] diff: batch fetching of missing blobs Jonathan Tan
2019-04-04  2:47     ` SZEDER Gábor
2019-04-05 13:38       ` Johannes Schindelin
2019-04-07  6:00         ` Christian Couder
2019-04-08  2:36           ` Junio C Hamano
2019-04-08  5:51             ` Junio C Hamano
2019-04-08  6:03               ` Junio C Hamano
2019-04-08  6:45                 ` Christian Couder
2019-04-08  6:40             ` Christian Couder
2019-04-08  7:59               ` Junio C Hamano
2019-04-08  9:56                 ` Christian Couder
2019-04-05  9:39     ` Duy Nguyen
2019-04-05 17:09       ` [PATCH] fixup! " Jonathan Tan
2019-04-05 20:16         ` Johannes Schindelin
2019-04-06  4:17         ` Duy Nguyen
2019-04-08  3:46           ` Junio C Hamano
2019-04-08  4:06           ` Junio C Hamano
2019-04-08  9:58             ` Duy Nguyen
2019-04-09  6:36               ` Junio C Hamano
2019-04-05 14:17     ` [PATCH v2 2/2] " Johannes Schindelin
2019-04-05 22:12   ` [PATCH v2 0/2] Batch fetching of missing blobs in diff and show Jeff King
