git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Derrick Stolee <stolee@gmail.com>
To: Jonathan Tan <jonathantanmy@google.com>, git@vger.kernel.org
Cc: peff@peff.net
Subject: Re: [PATCH] diff: restrict when prefetching occurs
Date: Tue, 31 Mar 2020 08:14:49 -0400	[thread overview]
Message-ID: <b956528c-412b-2f38-bd90-1fa2ae4b8604@gmail.com> (raw)
In-Reply-To: <20200331020418.55640-1-jonathantanmy@google.com>

On 3/30/2020 10:04 PM, Jonathan Tan wrote:
> Commit 7fbbcb21b1 ("diff: batch fetching of missing blobs", 2019-04-08)
> optimized "diff" by prefetching blobs in a partial clone, but there are
> some cases wherein blobs do not need to be prefetched. In particular, if
> (1) no blob data is included in the output of the diff, (2)
> break-rewrite detection is not requested, and (3) no inexact rename
> detection is needed, then no blobs are read at all.
> 
> Therefore, in such a case, do not prefetch. Change diffcore_std() to
> only prefetch if (1) and/or (2) is not true (currently, it always
> prefetches); change diffcore_rename() to prefetch if (3) is not true and
> no prefetch has yet occurred.

This conflicts with [3], so please keep that in mind.

Maybe [3] should be adjusted to assume this patch, because that change
is mostly about disabling the batch download when no renames are required.
As Peff said [2] the full rename detection trigger is "overly broad".

However, the changed-path Bloom filters are an excellent test for this
patch, as computing them in a partial clone will trigger downloading all
blobs without [3].

[3] https://lore.kernel.org/git/55824cda89c1dca7756c8c2d831d6e115f4a9ddb.1585528298.git.gitgitgadget@gmail.com/T/#u

> [1] https://lore.kernel.org/git/20200128213508.31661-1-jonathantanmy@google.com/
> [2] https://lore.kernel.org/git/20200130055136.GA2184413@coredump.intra.peff.net/
> ---
>  diff.c                        | 26 +++++++++++++++----
>  diffcore-rename.c             | 40 +++++++++++++++++++++++++++-
>  diffcore.h                    |  2 +-
>  t/t4067-diff-partial-clone.sh | 49 +++++++++++++++++++++++++++++++++++
>  4 files changed, 110 insertions(+), 7 deletions(-)
> 
> diff --git a/diff.c b/diff.c
> index 1010d806f5..19c5d638d6 100644
> --- a/diff.c
> +++ b/diff.c
> @@ -6507,10 +6507,24 @@ static void add_if_missing(struct repository *r,
>  
>  void diffcore_std(struct diff_options *options)
>  {
> -	if (options->repo == the_repository && has_promisor_remote()) {
> -		/*
> -		 * Prefetch the diff pairs that are about to be flushed.
> -		 */
> +	int prefetched = 0;
> +	int output_formats_to_prefetch = DIFF_FORMAT_DIFFSTAT |
> +		DIFF_FORMAT_NUMSTAT |
> +		DIFF_FORMAT_PATCH |
> +		DIFF_FORMAT_SHORTSTAT |
> +		DIFF_FORMAT_DIRSTAT;
> +
> +	/*
> +	 * Check if the user requested a blob-data-requiring diff output and/or
> +	 * break-rewrite detection (which requires blob data). If yes, prefetch
> +	 * the diff pairs.
> +	 *
> +	 * If no prefetching occurs, diffcore_rename() will prefetch if it
> +	 * decides that it needs inexact rename detection.
> +	 */
> +	if (options->repo == the_repository && has_promisor_remote() &&
> +	    (options->output_format & output_formats_to_prefetch ||
> +	     (!options->found_follow && options->break_opt != -1))) {
>  		int i;
>  		struct diff_queue_struct *q = &diff_queued_diff;
>  		struct oid_array to_fetch = OID_ARRAY_INIT;
> @@ -6520,6 +6534,8 @@ void diffcore_std(struct diff_options *options)
>  			add_if_missing(options->repo, &to_fetch, p->one);
>  			add_if_missing(options->repo, &to_fetch, p->two);
>  		}
> +		prefetched = 1;
> +
>  		if (to_fetch.nr)

It was difficult to see from the context, but the next line is
"do the prefetch", so "prefetched = 1" makes sense here, even
if to_fetch.nr is zero. We've already done the work to see that
a batch download is not needed.

>  			/*
>  			 * NEEDSWORK: Consider deduplicating the OIDs sent.
> @@ -6538,7 +6554,7 @@ void diffcore_std(struct diff_options *options)
>  			diffcore_break(options->repo,
>  				       options->break_opt);
>  		if (options->detect_rename)
> -			diffcore_rename(options);
> +			diffcore_rename(options, prefetched);
>  		if (options->break_opt != -1)
>  			diffcore_merge_broken();
>  	}
> diff --git a/diffcore-rename.c b/diffcore-rename.c
> index e189f407af..962565f066 100644
> --- a/diffcore-rename.c
> +++ b/diffcore-rename.c
> @@ -7,6 +7,7 @@
>  #include "object-store.h"
>  #include "hashmap.h"
>  #include "progress.h"
> +#include "promisor-remote.h"
>  
>  /* Table of rename/copy destinations */
>  
> @@ -448,7 +449,18 @@ static int find_renames(struct diff_score *mx, int dst_cnt, int minimum_score, i
>  	return count;
>  }
>  
> -void diffcore_rename(struct diff_options *options)
> +static void add_if_missing(struct repository *r,
> +			   struct oid_array *to_fetch,
> +			   const struct diff_filespec *filespec)
> +{
> +	if (filespec && filespec->oid_valid &&
> +	    !S_ISGITLINK(filespec->mode) &&
> +	    oid_object_info_extended(r, &filespec->oid, NULL,
> +				     OBJECT_INFO_FOR_PREFETCH))
> +		oid_array_append(to_fetch, &filespec->oid);
> +}
> +
> +void diffcore_rename(struct diff_options *options, int prefetched)
>  {
>  	int detect_rename = options->detect_rename;
>  	int minimum_score = options->rename_score;
> @@ -538,6 +550,32 @@ void diffcore_rename(struct diff_options *options)
>  		break;
>  	}
>  
> +	if (!prefetched) {
> +		/*
> +		 * At this point we know there's actual work to do: we have rename
> +		 * destinations that didn't find an exact match, and we have potential
> +		 * sources. So we'll have to do inexact rename detection, which
> +		 * requires looking at the blobs.
> +		 *
> +		 * If we haven't already prefetched, it's worth pre-fetching
> +		 * them as a group now.
> +		 */
> +		int i;
> +		struct oid_array to_fetch = OID_ARRAY_INIT;
> +
> +		for (i = 0; i < rename_dst_nr; i++) {
> +			if (rename_dst[i].pair)
> +				continue; /* already found exact match */
> +			add_if_missing(options->repo, &to_fetch, rename_dst[i].two);

Could this be reversed instead to avoid the "continue"?

	if (!rename_dst[i].pair)
		add_if_missing(options->repo, &to_fetch, rename_dst[i].two);

> +		}
> +		for (i = 0; i < rename_src_nr; i++)
> +			add_if_missing(options->repo, &to_fetch, rename_src[i].p->one);

Does this not have the equivalent "rename_src[i].pair" logic for exact
matches?

> +		if (to_fetch.nr)
> +			promisor_remote_get_direct(options->repo,
> +						   to_fetch.oid, to_fetch.nr);

Perhaps promisor_remote_get_direct() could have the check for
nr == 0 to exit early instead of putting that upon all the
callers?

> +		oid_array_clear(&to_fetch);
> +	}
> +
>  	if (options->show_rename_progress) {
>  		progress = start_delayed_progress(
>  				_("Performing inexact rename detection"),
> diff --git a/diffcore.h b/diffcore.h
> index 7c07347e42..9f69506574 100644
> --- a/diffcore.h
> +++ b/diffcore.h
> @@ -144,7 +144,7 @@ struct diff_filepair *diff_queue(struct diff_queue_struct *,
>  void diff_q(struct diff_queue_struct *, struct diff_filepair *);
>  
>  void diffcore_break(struct repository *, int);
> -void diffcore_rename(struct diff_options *);
> +void diffcore_rename(struct diff_options *, int prefetched);
>  void diffcore_merge_broken(void);
>  void diffcore_pickaxe(struct diff_options *);
>  void diffcore_order(const char *orderfile);
> diff --git a/t/t4067-diff-partial-clone.sh b/t/t4067-diff-partial-clone.sh
> index 4831ad35e6..7acb64727d 100755
> --- a/t/t4067-diff-partial-clone.sh
> +++ b/t/t4067-diff-partial-clone.sh
> @@ -131,4 +131,53 @@ test_expect_success 'diff with rename detection batches blobs' '
>  	test_line_count = 1 done_lines
>  '
>  
> +test_expect_success 'diff does not fetch anything if inexact rename detection is not needed' '
> +	test_when_finished "rm -rf server client trace" &&
> +
> +	test_create_repo server &&
> +	echo a >server/a &&
> +	printf "b\nb\nb\nb\nb\n" >server/b &&
> +	git -C server add a b &&
> +	git -C server commit -m x &&

> +	rm server/b &&
> +	printf "b\nb\nb\nb\nb\n" >server/c &&

Would "mv server/b server/c" make it more clear that
this is an exact rename?

> +	git -C server add c &&
> +	git -C server commit -a -m x &&
> +
> +	test_config -C server uploadpack.allowfilter 1 &&
> +	test_config -C server uploadpack.allowanysha1inwant 1 &&
> +	git clone --bare --filter=blob:limit=0 "file://$(pwd)/server" client &&
> +
> +	# Ensure no fetches.
> +	GIT_TRACE_PACKET="$(pwd)/trace" git -C client diff --raw -M HEAD^ HEAD &&
> +	! test_path_exists trace
> +'
> +
> +test_expect_success 'diff --break-rewrites fetches only if necessary, and batches blobs if it does' '
> +	test_when_finished "rm -rf server client trace" &&
> +
> +	test_create_repo server &&
> +	echo a >server/a &&
> +	printf "b\nb\nb\nb\nb\n" >server/b &&
> +	git -C server add a b &&
> +	git -C server commit -m x &&
> +	printf "c\nc\nc\nc\nc\n" >server/b &&
> +	git -C server commit -a -m x &&
> +
> +	test_config -C server uploadpack.allowfilter 1 &&
> +	test_config -C server uploadpack.allowanysha1inwant 1 &&
> +	git clone --bare --filter=blob:limit=0 "file://$(pwd)/server" client &&
> +
> +	# Ensure no fetches.
> +	GIT_TRACE_PACKET="$(pwd)/trace" git -C client diff --raw -M HEAD^ HEAD &&
> +	! test_path_exists trace &&
> +
> +	# But with --break-rewrites, ensure that there is exactly 1 negotiation
> +	# by checking that there is only 1 "done" line sent. ("done" marks the
> +	# end of negotiation.)
> +	GIT_TRACE_PACKET="$(pwd)/trace" git -C client diff --break-rewrites --raw -M HEAD^ HEAD &&
> +	grep "git> done" trace >done_lines &&
> +	test_line_count = 1 done_lines
> +'
> +
>  test_done
> 

Thanks,
-Stolee


  reply	other threads:[~2020-03-31 12:14 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-03-31  2:04 [PATCH] diff: restrict when prefetching occurs Jonathan Tan
2020-03-31 12:14 ` Derrick Stolee [this message]
2020-03-31 16:50   ` Jonathan Tan
2020-03-31 17:48     ` Derrick Stolee
2020-03-31 18:21       ` Junio C Hamano
2020-03-31 18:15 ` Junio C Hamano
2020-04-02 19:19 ` [PATCH v2 0/2] Restrict when prefetcing occurs Jonathan Tan
2020-04-02 19:19   ` [PATCH v2 1/2] promisor-remote: accept 0 as oid_nr in function Jonathan Tan
2020-04-02 19:46     ` Junio C Hamano
2020-04-02 23:01       ` Jonathan Tan
2020-04-02 19:19   ` [PATCH v2 2/2] diff: restrict when prefetching occurs Jonathan Tan
2020-04-02 20:08     ` Junio C Hamano
2020-04-02 23:09       ` Jonathan Tan
2020-04-02 23:25         ` Junio C Hamano
2020-04-02 23:54         ` Junio C Hamano
2020-04-03 21:35           ` Jonathan Tan
2020-04-03 22:12             ` Junio C Hamano
2020-04-02 20:28   ` [PATCH v2 0/2] Restrict when prefetcing occurs Junio C Hamano
2020-04-06 11:44     ` Derrick Stolee
2020-04-06 11:57       ` Garima Singh
2020-04-07 22:11 ` [PATCH v3 0/4] " Jonathan Tan
2020-04-07 22:11   ` [PATCH v3 1/4] promisor-remote: accept 0 as oid_nr in function Jonathan Tan
2020-04-07 22:11   ` [PATCH v3 2/4] diff: make diff_populate_filespec_options struct Jonathan Tan
2020-04-07 23:44     ` Junio C Hamano
2020-04-07 22:11   ` [PATCH v3 3/4] diff: refactor object read Jonathan Tan
2020-04-07 22:11   ` [PATCH v3 4/4] diff: restrict when prefetching occurs Jonathan Tan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=b956528c-412b-2f38-bd90-1fa2ae4b8604@gmail.com \
    --to=stolee@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=jonathantanmy@google.com \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).