[PATCH 0/2] Optimization batch 6: make full use of exact renames

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* [PATCH 0/2] Optimization batch 6: make full use of exact renames
@ 2021-02-03  5:49 Elijah Newren via GitGitGadget
  2021-02-03  5:49 ` [PATCH 1/2] diffcore-rename: no point trying to find a match better than exact Elijah Newren via GitGitGadget
                   ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-03  5:49 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Jeff King, Karsten Blees, Elijah Newren

This series depends on en/merge-ort-perf.

For the very curious who are wondering about the first five optimization
batches; see the end of this email.

This series makes full use of exact renames; see commit messages for
details. It represents "Optimization #1" from my Git Merge 2020 talk[1]. For
the testcases mentioned in commit 557ac0350d ("merge-ort: begin performance
work; instrument with trace2_region_* calls", 2020-10-28), the changes in
just this series improves the performance as follows:

                     Before Series           After Series
no-renames:       14.263 s ±  0.053 s    13.815 s ±  0.062 s
mega-renames:   5504.231 s ±  5.150 s  1799.937 s ±  0.493 s
just-one-mega:   158.534 s ±  0.498 s    51.289 s ±  0.019 s


As a reminder, before any merge-ort/diffcore-rename performance work, the
performance results we started with (as noted in the same commit message)
were:

no-renames-am:      6.940 s ±  0.485 s
no-renames:        18.912 s ±  0.174 s
mega-renames:    5964.031 s ± 10.459 s
just-one-mega:    149.583 s ±  0.751 s


[1]
https://github.com/newren/presentations/blob/pdfs/merge-performance/merge-performance-slides.pdf

=== Previous optimization batches ===

I'm labeling this as the "6th" batch, due to other optimizations submitted
previously, and a number of optimizations baked into the design of
fast-rebase and merge-ort.

 1. Previously submitted hashmap/strmap optimizations 1a) 33f20d8217
    (hashmap: introduce a new hashmap_partial_clear()) 1b) 6ccdfc2a20
    (strmap: enable faster clearing and reusing of strmaps) 1c) a208ec1f0b
    (strmap: enable allocations to come from a mem_pool) 1d) 23a276a9c4
    (strmap: take advantage of FLEXPTR_ALLOC_STR when relevant)

 2. Previously submitted diffcore-rename optimizations 2a) b970b4ef62
    (diffcore-rename: simplify and accelerate register_rename_src()) 2b)
    9db2ac5616 (diffcore-rename: accelerate rename_dst setup) 2c) 350410f6b1
    (diffcore-rename: remove unnecessary duplicate entry checks)

 3. fast-rebase optimizations 3a) Avoid updating working-tree/index with
    every intermediate patch 3b) avoid reading/writing rebase metadata until
    conflict or completion

 4. Small stuff baked into merge-ort design 4a) Using pahole to note I can
    reduce size of merged_info by 8 bytes 4b) Avoid recomparing hashes (due
    to use of match_masks) 4c) Avoid unconditional dropping and re-reading
    of the index 4d) avoid checking index matches HEAD with every patch; do
    it at start only

 5. Big stuff baked into merge-ort design 5a) Avoid quadratic behavior with
    O(N) insertions/removals of index entries 5b) Avoid numerous expensive
    mini-tree traversals done by merge-recursive 5c) Avoid recursing into
    trees where both sides match merge base

Elijah Newren (2):
  diffcore-rename: no point trying to find a match better than exact
  diffcore-rename: filter rename_src list when possible

 diffcore-rename.c | 69 ++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 62 insertions(+), 7 deletions(-)


base-commit: 557ac0350d9efa1f59c708779ca3fb3aee121131
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-842%2Fnewren%2Fort-perf-batch-6-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-842/newren/ort-perf-batch-6-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/842
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH 1/2] diffcore-rename: no point trying to find a match better than exact
  2021-02-03  5:49 [PATCH 0/2] Optimization batch 6: make full use of exact renames Elijah Newren via GitGitGadget
@ 2021-02-03  5:49 ` Elijah Newren via GitGitGadget
  2021-02-03 11:44   ` Derrick Stolee
  2021-02-03  5:49 ` [PATCH 2/2] diffcore-rename: filter rename_src list when possible Elijah Newren via GitGitGadget
  2021-02-03 20:03 ` [PATCH v2 0/2] Optimization batch 6: make full use of exact renames Elijah Newren via GitGitGadget
  2 siblings, 1 reply; 25+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-03  5:49 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Jeff King, Karsten Blees, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

diffcore_rename() had some code to avoid having destination paths that
already had an exact rename detected from being re-checked for other
renames.  Source paths, however, were re-checked because we wanted to
allow the possibility of detecting copies.  But if copy detection isn't
turned on, then this merely amounts to attempting to find a
better-than-exact match, which naturally ends up being an expensive
no-op.  In particular, copy detection is never turned on by the merge
machinery.

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:

                            Before                  After
    no-renames:       14.263 s ±  0.053 s    14.119 s ±  0.101 s
    mega-renames:   5504.231 s ±  5.150 s  1802.044 s ±  0.828 s
    just-one-mega:   158.534 s ±  0.498 s    51.391 s ±  0.028 s

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 8fe6c9384bc..e3047da3aaf 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -463,6 +463,7 @@ void diffcore_rename(struct diff_options *options)
 	struct diff_score *mx;
 	int i, j, rename_count, skip_unmodified = 0;
 	int num_destinations, dst_cnt;
+	int num_sources;
 	struct progress *progress = NULL;
 
 	trace2_region_enter("diff", "setup", options->repo);
@@ -532,12 +533,15 @@ void diffcore_rename(struct diff_options *options)
 	 * files still remain as options for rename/copies!)
 	 */
 	num_destinations = (rename_dst_nr - rename_count);
+	num_sources = rename_src_nr;
+	if (detect_rename != DIFF_DETECT_COPY)
+		num_sources -= rename_count;
 
 	/* All done? */
-	if (!num_destinations)
+	if (!num_destinations || !num_sources)
 		goto cleanup;
 
-	switch (too_many_rename_candidates(num_destinations, rename_src_nr,
+	switch (too_many_rename_candidates(num_destinations, num_sources,
 					   options)) {
 	case 1:
 		goto cleanup;
@@ -553,7 +557,7 @@ void diffcore_rename(struct diff_options *options)
 	if (options->show_rename_progress) {
 		progress = start_delayed_progress(
 				_("Performing inexact rename detection"),
-				(uint64_t)num_destinations * (uint64_t)rename_src_nr);
+				(uint64_t)num_destinations * (uint64_t)num_sources);
 	}
 
 	mx = xcalloc(st_mult(NUM_CANDIDATE_PER_DST, num_destinations),
@@ -573,6 +577,10 @@ void diffcore_rename(struct diff_options *options)
 			struct diff_filespec *one = rename_src[j].p->one;
 			struct diff_score this_src;
 
+			if (one->rename_used &&
+			    detect_rename != DIFF_DETECT_COPY)
+				continue;
+
 			if (skip_unmodified &&
 			    diff_unmodified_pair(rename_src[j].p))
 				continue;
@@ -594,7 +602,7 @@ void diffcore_rename(struct diff_options *options)
 		}
 		dst_cnt++;
 		display_progress(progress,
-				 (uint64_t)dst_cnt * (uint64_t)rename_src_nr);
+				 (uint64_t)dst_cnt * (uint64_t)num_sources);
 	}
 	stop_progress(&progress);
 
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH 2/2] diffcore-rename: filter rename_src list when possible
  2021-02-03  5:49 [PATCH 0/2] Optimization batch 6: make full use of exact renames Elijah Newren via GitGitGadget
  2021-02-03  5:49 ` [PATCH 1/2] diffcore-rename: no point trying to find a match better than exact Elijah Newren via GitGitGadget
@ 2021-02-03  5:49 ` Elijah Newren via GitGitGadget
       [not found]   ` <13feb106-c3a7-a26d-0e6e-013aa45c58d4@gmail.com>
  2021-02-03 19:12   ` Junio C Hamano
  2021-02-03 20:03 ` [PATCH v2 0/2] Optimization batch 6: make full use of exact renames Elijah Newren via GitGitGadget
  2 siblings, 2 replies; 25+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-03  5:49 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Jeff King, Karsten Blees, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

We have to look at each entry in rename_src a total of rename_dst_nr
times.  When we're not detecting copies, any exact renames or ignorable
rename paths will just be skipped over.  While checking that these can
be skipped over is a relatively cheap check, it's still a waste of time
to do that check more than once, let alone rename_dst_nr times.  When
rename_src_nr is a few thousand times bigger than the number of relevant
sources (such as when cherry-picking a commit that only touched a
handful of files, but from a side of history that has different names
for some high level directories), this time can add up.

First make an initial pass over the rename_src array and move all the
relevant entries to the front, so that we can iterate over just those
relevant entries.

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:

                            Before                  After
    no-renames:       14.119 s ±  0.101 s    13.815 s ±  0.062 s
    mega-renames:   1802.044 s ±  0.828 s  1799.937 s ±  0.493 s
    just-one-mega:    51.391 s ±  0.028 s    51.289 s ±  0.019 s

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 67 ++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 57 insertions(+), 10 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index e3047da3aaf..2a8e7b84b9c 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -454,6 +454,55 @@ static int find_renames(struct diff_score *mx, int dst_cnt, int minimum_score, i
 	return count;
 }
 
+static int remove_unneeded_paths_from_src(int num_src,
+					  int detecting_copies)
+{
+	int i, new_num_src;
+
+	/*
+	 * Note on reasons why we cull unneeded sources but not destinations:
+	 *   1) Pairings are stored in rename_dst (not rename_src), which we
+	 *      need to keep around.  So, we just can't cull rename_dst even
+	 *      if we wanted to.  But doing so wouldn't help because...
+	 *
+	 *   2) There is a matrix pairwise comparison that follows the
+	 *      "Performing inexact rename detection" progress message.
+	 *      Iterating over the destinations is done in the outer loop,
+	 *      hence we only iterate over each of those once and we can
+	 *      easily skip the outer loop early if the destination isn't
+	 *      relevant.  That's only one check per destination path to
+	 *      skip.
+	 *
+	 *      By contrast, the sources are iterated in the inner loop; if
+	 *      we check whether a source can be skipped, then we'll be
+	 *      checking it N separate times, once for each destination.
+	 *      We don't want to have to iterate over known-not-needed
+	 *      sources N times each, so avoid that by removing the sources
+	 *      from rename_src here.
+	 */
+	if (detecting_copies)
+		return num_src; /* nothing to remove */
+	if (break_idx)
+		return num_src; /* culling incompatbile with break detection */
+
+	for (i = 0, new_num_src = 0; i < num_src; i++) {
+		/*
+		 * renames are stored in rename_dst, so if a rename has
+		 * already been detected using this source, we can just
+		 * remove the source knowing rename_dst has its info.
+		 */
+		if (rename_src[i].p->one->rename_used)
+			continue;
+
+		if (new_num_src < i)
+			memcpy(&rename_src[new_num_src], &rename_src[i],
+			       sizeof(struct diff_rename_src));
+		new_num_src++;
+	}
+
+	return new_num_src;
+}
+
 void diffcore_rename(struct diff_options *options)
 {
 	int detect_rename = options->detect_rename;
@@ -463,10 +512,11 @@ void diffcore_rename(struct diff_options *options)
 	struct diff_score *mx;
 	int i, j, rename_count, skip_unmodified = 0;
 	int num_destinations, dst_cnt;
-	int num_sources;
+	int num_sources, want_copies;
 	struct progress *progress = NULL;
 
 	trace2_region_enter("diff", "setup", options->repo);
+	want_copies = (detect_rename == DIFF_DETECT_COPY);
 	if (!minimum_score)
 		minimum_score = DEFAULT_RENAME_SCORE;
 
@@ -529,13 +579,10 @@ void diffcore_rename(struct diff_options *options)
 		goto cleanup;
 
 	/*
-	 * Calculate how many renames are left (but all the source
-	 * files still remain as options for rename/copies!)
+	 * Calculate how many renames are left
 	 */
 	num_destinations = (rename_dst_nr - rename_count);
-	num_sources = rename_src_nr;
-	if (detect_rename != DIFF_DETECT_COPY)
-		num_sources -= rename_count;
+	num_sources = remove_unneeded_paths_from_src(rename_src_nr, want_copies);
 
 	/* All done? */
 	if (!num_destinations || !num_sources)
@@ -573,13 +620,13 @@ void diffcore_rename(struct diff_options *options)
 		for (j = 0; j < NUM_CANDIDATE_PER_DST; j++)
 			m[j].dst = -1;
 
-		for (j = 0; j < rename_src_nr; j++) {
+		for (j = 0; j < num_sources; j++) {
 			struct diff_filespec *one = rename_src[j].p->one;
 			struct diff_score this_src;
 
-			if (one->rename_used &&
-			    detect_rename != DIFF_DETECT_COPY)
-				continue;
+			assert(!one->rename_used ||
+			       detect_rename == DIFF_DETECT_COPY ||
+			       break_idx);
 
 			if (skip_unmodified &&
 			    diff_unmodified_pair(rename_src[j].p))
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] diffcore-rename: no point trying to find a match better than exact
  2021-02-03  5:49 ` [PATCH 1/2] diffcore-rename: no point trying to find a match better than exact Elijah Newren via GitGitGadget
@ 2021-02-03 11:44   ` Derrick Stolee
  2021-02-03 16:31     ` Elijah Newren
  2021-02-03 18:46     ` Junio C Hamano
  0 siblings, 2 replies; 25+ messages in thread
From: Derrick Stolee @ 2021-02-03 11:44 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget, git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Jeff King, Karsten Blees, Elijah Newren

On 2/3/2021 12:49 AM, Elijah Newren via GitGitGadget wrote:
> From: Elijah Newren <newren@gmail.com>
> 
> diffcore_rename() had some code to avoid having destination paths that
> already had an exact rename detected from being re-checked for other
> renames.  Source paths, however, were re-checked because we wanted to
> allow the possibility of detecting copies.  But if copy detection isn't
> turned on, then this merely amounts to attempting to find a
> better-than-exact match, which naturally ends up being an expensive
> no-op.  In particular, copy detection is never turned on by the merge
> machinery.
...
> +	num_sources = rename_src_nr;
> +	if (detect_rename != DIFF_DETECT_COPY)
> +		num_sources -= rename_count;

Ok, delete the renamed files from the sources. Using a new variable
because rename_src_nr is actually a static global to diffcore-rename.c,
describing the number of entries in the rename_src table. This is
scary, but I think your new local is a good way to change the local
logic of this method without adjusting that global.

>  
>  	/* All done? */
> -	if (!num_destinations)
> +	if (!num_destinations || !num_sources)
>  		goto cleanup;

And add an extra quit condition which is very possible to hit.
Is it only hit when every "delete" is actually a rename?

> -	switch (too_many_rename_candidates(num_destinations, rename_src_nr,
> +	switch (too_many_rename_candidates(num_destinations, num_sources,
>  					   options)) {

This is all about checking if we need to skip inexact renames. Makes
sense to use the new number.
  
> +			if (one->rename_used &&
> +			    detect_rename != DIFF_DETECT_COPY)
> +				continue;
> +

Have we "consumed" this input? Skip over it. Good. And this is inside
a double-loop:

	for (dst_cnt = i = 0; i < rename_dst_nr; i++) {
		...
		for (j = 0; j < rename_src_nr; j++) {

Keeping rename_src_nr in the inner loop makes sense, but this new
'continue;' gives most of the speedup, I imagine.

This is a nice speedup for such a simple optimization.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] diffcore-rename: no point trying to find a match better than exact
  2021-02-03 11:44   ` Derrick Stolee
@ 2021-02-03 16:31     ` Elijah Newren
  2021-02-03 18:46     ` Junio C Hamano
  1 sibling, 0 replies; 25+ messages in thread
From: Elijah Newren @ 2021-02-03 16:31 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee,
	Jonathan Tan, Taylor Blau, Junio C Hamano, Jeff King,
	Karsten Blees

On Wed, Feb 3, 2021 at 3:44 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 2/3/2021 12:49 AM, Elijah Newren via GitGitGadget wrote:
> > From: Elijah Newren <newren@gmail.com>
> >
> > diffcore_rename() had some code to avoid having destination paths that
> > already had an exact rename detected from being re-checked for other
> > renames.  Source paths, however, were re-checked because we wanted to
> > allow the possibility of detecting copies.  But if copy detection isn't
> > turned on, then this merely amounts to attempting to find a
> > better-than-exact match, which naturally ends up being an expensive
> > no-op.  In particular, copy detection is never turned on by the merge
> > machinery.
> ...
> > +     num_sources = rename_src_nr;
> > +     if (detect_rename != DIFF_DETECT_COPY)
> > +             num_sources -= rename_count;
>
> Ok, delete the renamed files from the sources. Using a new variable
> because rename_src_nr is actually a static global to diffcore-rename.c,
> describing the number of entries in the rename_src table. This is
> scary, but I think your new local is a good way to change the local
> logic of this method without adjusting that global.

I thought about changing rename_src, rename_src_nr, rename_dst, and
rename_dst_nr to all be in some struct and make one of those on the
stack locally in diffcore_rename() and then pass that structure
around.  Would be nice to get rid of more global state.  But I've got
enough things in the queue that I never made the jump.

> >
> >       /* All done? */
> > -     if (!num_destinations)
> > +     if (!num_destinations || !num_sources)
> >               goto cleanup;
>
> And add an extra quit condition which is very possible to hit.
> Is it only hit when every "delete" is actually a rename?

Right, when every "delete" gets paired by exact rename detection to
some "add" and is marked as a rename, meaning we have no more
"deletes" to pair with anything.  In later series, there will be
additional reasons for num_sources to decrease and possibly hit 0.

> > -     switch (too_many_rename_candidates(num_destinations, rename_src_nr,
> > +     switch (too_many_rename_candidates(num_destinations, num_sources,
> >                                          options)) {
>
> This is all about checking if we need to skip inexact renames. Makes
> sense to use the new number.
>
> > +                     if (one->rename_used &&
> > +                         detect_rename != DIFF_DETECT_COPY)
> > +                             continue;
> > +
>
> Have we "consumed" this input? Skip over it. Good. And this is inside
> a double-loop:
>
>         for (dst_cnt = i = 0; i < rename_dst_nr; i++) {
>                 ...
>                 for (j = 0; j < rename_src_nr; j++) {
>
> Keeping rename_src_nr in the inner loop makes sense, but this new
> 'continue;' gives most of the speedup, I imagine.
>
> This is a nice speedup for such a simple optimization.
>
> Thanks,
> -Stolee

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] diffcore-rename: filter rename_src list when possible
       [not found]   ` <13feb106-c3a7-a26d-0e6e-013aa45c58d4@gmail.com>
@ 2021-02-03 17:12     ` Elijah Newren
  0 siblings, 0 replies; 25+ messages in thread
From: Elijah Newren @ 2021-02-03 17:12 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee,
	Jonathan Tan, Taylor Blau, Junio C Hamano, Jeff King,
	Karsten Blees

On Wed, Feb 3, 2021 at 4:00 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 2/3/2021 12:49 AM, Elijah Newren via GitGitGadget wrote:
> > From: Elijah Newren <newren@gmail.com>
> >
> > We have to look at each entry in rename_src a total of rename_dst_nr
> > times.  When we're not detecting copies, any exact renames or ignorable
> > rename paths will just be skipped over.  While checking that these can
> > be skipped over is a relatively cheap check, it's still a waste of time
> > to do that check more than once, let alone rename_dst_nr times.  When
> > rename_src_nr is a few thousand times bigger than the number of relevant
> > sources (such as when cherry-picking a commit that only touched a
> > handful of files, but from a side of history that has different names
> > for some high level directories), this time can add up.
> >
> > First make an initial pass over the rename_src array and move all the
> > relevant entries to the front, so that we can iterate over just those
> > relevant entries.
> >
> > For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
> > performance work; instrument with trace2_region_* calls", 2020-10-28),
> > this change improves the performance as follows:
> >
> >                             Before                  After
> >     no-renames:       14.119 s ±  0.101 s    13.815 s ±  0.062 s
> >     mega-renames:   1802.044 s ±  0.828 s  1799.937 s ±  0.493 s
> >     just-one-mega:    51.391 s ±  0.028 s    51.289 s ±  0.019 s
> >
> > Signed-off-by: Elijah Newren <newren@gmail.com>
> > ---
> >  diffcore-rename.c | 67 ++++++++++++++++++++++++++++++++++++++++-------
> >  1 file changed, 57 insertions(+), 10 deletions(-)
> >
> > diff --git a/diffcore-rename.c b/diffcore-rename.c
> > index e3047da3aaf..2a8e7b84b9c 100644
> > --- a/diffcore-rename.c
> > +++ b/diffcore-rename.c
> > @@ -454,6 +454,55 @@ static int find_renames(struct diff_score *mx, int dst_cnt, int minimum_score, i
> >       return count;
> >  }
> >
> > +static int remove_unneeded_paths_from_src(int num_src,
> > +                                       int detecting_copies)
> > +{
> > +     int i, new_num_src;
> > +
> > +     /*
> > +      * Note on reasons why we cull unneeded sources but not destinations:
> > +      *   1) Pairings are stored in rename_dst (not rename_src), which we
> > +      *      need to keep around.  So, we just can't cull rename_dst even
> > +      *      if we wanted to.  But doing so wouldn't help because...
> > +      *
> > +      *   2) There is a matrix pairwise comparison that follows the
> > +      *      "Performing inexact rename detection" progress message.
> > +      *      Iterating over the destinations is done in the outer loop,
> > +      *      hence we only iterate over each of those once and we can
> > +      *      easily skip the outer loop early if the destination isn't
> > +      *      relevant.  That's only one check per destination path to
> > +      *      skip.
> > +      *
> > +      *      By contrast, the sources are iterated in the inner loop; if
> > +      *      we check whether a source can be skipped, then we'll be
> > +      *      checking it N separate times, once for each destination.
> > +      *      We don't want to have to iterate over known-not-needed
> > +      *      sources N times each, so avoid that by removing the sources
> > +      *      from rename_src here.
> > +      */
> > +     if (detecting_copies)
> > +             return num_src; /* nothing to remove */
> > +     if (break_idx)
> > +             return num_src; /* culling incompatbile with break detection */
>
> While the comments here are certainly descriptive, they could probably
> be sidelined to "return when culling is incompatible" with this before
> the big comment:

Yeah, I guess at this stage a bigger explanation isn't needed.  But I
think I'd end up just adding these comments back in later at an
unrelated time if I were to take them out at this point.  By the time
merge-ort is finished, I'll have over a dozen additional
optimizations, many of them in diffcore-rename.c.  There will be files
for which four different optimizations are all applicable, but they
are mutually exclusive so we have to pick one of the four and exclude
the other three.  Best results come from knowing which one provides
the best speedup and why (and which is second if the best doesn't
apply, etc.).  So what I'll present as four optimizations originally
felt to me like 10 or so different ones, because I didn't start
knowing the best, meaning I kept adding later additional optimizations
that fed the appropriate data in all the right places so that I could
tweak which one(s) took priority and get the best speedups.  You'll
get the cleaner version to review that pretends I knew the right
answers all along.  (Not to mention that I think I tried a lot more
optimization ideas that failed than the number I tried that succeeded,
and longer comments of mine sometimes highlight details that could
have prevented the need to attempt some of the failed ideas.)  Anyway,
long story short: at some point along the way, the more detailed
description felt very appropriate, but I don't have a clear breakpoint
for where that is and the longer description most logically goes along
with this early patch so I squashed it in here.  *shrug*

>         if (detecting_copies || break_idx)
>                 return num_src;
>
> This is a very minor nit pick, though.

Moving these conditions above the comment makes sense, yes.  I guess
combining the two if conditions here could also be done, though the
reasoning is different as noted by the comments so I would have to
toss the comments.  Since a later optimization will need to add an
additional condition that is only relevant for one of the two reasons,
I would just need to split again later and I think I would want the
comments back at that time.  So I think the split and keeping the
comments seems more natural to me.

> > +
> > +     for (i = 0, new_num_src = 0; i < num_src; i++) {
> > +             /*
> > +              * renames are stored in rename_dst, so if a rename has
> > +              * already been detected using this source, we can just
> > +              * remove the source knowing rename_dst has its info.
> > +              */
> > +             if (rename_src[i].p->one->rename_used)
> > +                     continue;
> > +
> > +             if (new_num_src < i)
> > +                     memcpy(&rename_src[new_num_src], &rename_src[i],
> > +                            sizeof(struct diff_rename_src));
>
> 'struct diff_rename_src' is currently just a pointer and a short, so
> this is a very small amount of data, but this conditional move doesn't
> hurt.
>
> > +             new_num_src++;
> > +     }
> > +
> > +     return new_num_src;
> > +}
> > +
>
>
> >  void diffcore_rename(struct diff_options *options)
> >  {
> >       int detect_rename = options->detect_rename;
> > @@ -463,10 +512,11 @@ void diffcore_rename(struct diff_options *options)
> >       struct diff_score *mx;
> >       int i, j, rename_count, skip_unmodified = 0;
> >       int num_destinations, dst_cnt;
> > -     int num_sources;
> > +     int num_sources, want_copies;
> >       struct progress *progress = NULL;
> >
> >       trace2_region_enter("diff", "setup", options->repo);
> > +     want_copies = (detect_rename == DIFF_DETECT_COPY);
>
> I was looking at this and thinking "why wasn't this in the previous
> patch?" I see that you delete the "if (detect_rename != DIFF_DETECT_COPY)"
> below in favor of the call to the helper.
>
> _If_ you were re-rolling, then maybe it would be slightly cleaner
> to add 'want_copies' to the previous patch?

Oh, yeah, I think that does make sense.  And you highlight some good
reasons to re-roll below, so I'll do that.

> >       /*
> > -      * Calculate how many renames are left (but all the source
> > -      * files still remain as options for rename/copies!)
> > +      * Calculate how many renames are left
> >        */
> >       num_destinations = (rename_dst_nr - rename_count);
> > -     num_sources = rename_src_nr;
> > -     if (detect_rename != DIFF_DETECT_COPY)
> > -             num_sources -= rename_count;
> > +     num_sources = remove_unneeded_paths_from_src(rename_src_nr, want_copies);
>
> So here we update the local variable...
>
> >
> >       /* All done? */
> >       if (!num_destinations || !num_sources)
> > @@ -573,13 +620,13 @@ void diffcore_rename(struct diff_options *options)
> >               for (j = 0; j < NUM_CANDIDATE_PER_DST; j++)
> >                       m[j].dst = -1;
> >
> > -             for (j = 0; j < rename_src_nr; j++) {
> > +             for (j = 0; j < num_sources; j++) {
>
> ...and drop the use of the static global. However, don't we need to
> update the global itself? There are other loops limited by
> rename_src_nr that should only work on this smaller list, right?
>
> In particular, it seems that the prefetch() method would benefit
> from looking at this smaller list. It fetches the blob contents in
> a partial clone inside of estimate_similarity(). When there are a
> lot of exact renames but even one possible inexact rename, it seems
> to currently download the content for all of the blobs including the
> exactly-matched ones.

Ah, indeed.  The prefetch() method didn't exist at the time I wrote
this patch, but there's always the possibility of other reasons.  It
turns out I did make rename_src_nr match num_sources, but that diff
hunk got placed in a later series.  I agree with you that it should go
here instead.

> For interesting tests around this prefetch() call, I think the
> 'diff with rename detection batches blobs' test in t4067 should help.
>
> >                       struct diff_filespec *one = rename_src[j].p->one;
> >                       struct diff_score this_src;
> >
> > -                     if (one->rename_used &&
> > -                         detect_rename != DIFF_DETECT_COPY)
> > -                             continue;
> > +                     assert(!one->rename_used ||
> > +                            detect_rename == DIFF_DETECT_COPY ||
> > +                            break_idx);
>
> Looks like you could use 'want_copies' here (and in the previous
> patch).

Yep, good catch.  I'll make the change.

> Thanks,
> -Stolee

Thanks for reviewing!

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] diffcore-rename: no point trying to find a match better than exact
  2021-02-03 11:44   ` Derrick Stolee
  2021-02-03 16:31     ` Elijah Newren
@ 2021-02-03 18:46     ` Junio C Hamano
  2021-02-03 19:10       ` Elijah Newren
  1 sibling, 1 reply; 25+ messages in thread
From: Junio C Hamano @ 2021-02-03 18:46 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Elijah Newren via GitGitGadget, git, Derrick Stolee, Jonathan Tan,
	Taylor Blau, Jeff King, Karsten Blees, Elijah Newren

Derrick Stolee <stolee@gmail.com> writes:

>>  	/* All done? */
>> -	if (!num_destinations)
>> +	if (!num_destinations || !num_sources)
>>  		goto cleanup;
>
> And add an extra quit condition which is very possible to hit.
> Is it only hit when every "delete" is actually a rename?

When every delete is actually an unmodified exact rename, I think.
If this simple change gives us good performance gain, that is
superb.

Nicely done.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 1/2] diffcore-rename: no point trying to find a match better than exact
  2021-02-03 18:46     ` Junio C Hamano
@ 2021-02-03 19:10       ` Elijah Newren
  0 siblings, 0 replies; 25+ messages in thread
From: Elijah Newren @ 2021-02-03 19:10 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Derrick Stolee, Elijah Newren via GitGitGadget, Git Mailing List,
	Derrick Stolee, Jonathan Tan, Taylor Blau, Jeff King,
	Karsten Blees

On Wed, Feb 3, 2021 at 10:46 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Derrick Stolee <stolee@gmail.com> writes:
>
> >>      /* All done? */
> >> -    if (!num_destinations)
> >> +    if (!num_destinations || !num_sources)
> >>              goto cleanup;
> >
> > And add an extra quit condition which is very possible to hit.
> > Is it only hit when every "delete" is actually a rename?
>
> When every delete is actually an unmodified exact rename, I think.

Yes, exactly.

> If this simple change gives us good performance gain, that is
> superb.
>
> Nicely done.

Thanks.

I probably should have separated this from the speculative
not-quite-working changes that weren't ready yet and pushed just it
sooner.[1]  But I kinda wanted to get all the performance changes
together, and...well, all the other performance changes took a while
to complete (especially given multiple starts and stops).

[1] https://lore.kernel.org/git/20171110222156.23221-2-newren@gmail.com/

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] diffcore-rename: filter rename_src list when possible
  2021-02-03  5:49 ` [PATCH 2/2] diffcore-rename: filter rename_src list when possible Elijah Newren via GitGitGadget
       [not found]   ` <13feb106-c3a7-a26d-0e6e-013aa45c58d4@gmail.com>
@ 2021-02-03 19:12   ` Junio C Hamano
  2021-02-03 19:19     ` Elijah Newren
  1 sibling, 1 reply; 25+ messages in thread
From: Junio C Hamano @ 2021-02-03 19:12 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget
  Cc: git, Derrick Stolee, Jonathan Tan, Taylor Blau, Jeff King,
	Karsten Blees, Elijah Newren

"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +static int remove_unneeded_paths_from_src(int num_src,
> +					  int detecting_copies)
> +{
> +	int i, new_num_src;
> +
> +	/*
> +	 * Note on reasons why we cull unneeded sources but not destinations:
> +	 *   1) Pairings are stored in rename_dst (not rename_src), which we
> +	 *      need to keep around.  So, we just can't cull rename_dst even
> +	 *      if we wanted to.  But doing so wouldn't help because...
> +	 *
> +	 *   2) There is a matrix pairwise comparison that follows the
> +	 *      "Performing inexact rename detection" progress message.
> +	 *      Iterating over the destinations is done in the outer loop,
> +	 *      hence we only iterate over each of those once and we can
> +	 *      easily skip the outer loop early if the destination isn't
> +	 *      relevant.  That's only one check per destination path to
> +	 *      skip.
> +	 *
> +	 *      By contrast, the sources are iterated in the inner loop; if
> +	 *      we check whether a source can be skipped, then we'll be
> +	 *      checking it N separate times, once for each destination.
> +	 *      We don't want to have to iterate over known-not-needed
> +	 *      sources N times each, so avoid that by removing the sources
> +	 *      from rename_src here.
> +	 */
> +	if (detecting_copies)
> +		return num_src; /* nothing to remove */
> +	if (break_idx)
> +		return num_src; /* culling incompatbile with break detection */
> +
> +	for (i = 0, new_num_src = 0; i < num_src; i++) {
> +		/*
> +		 * renames are stored in rename_dst, so if a rename has
> +		 * already been detected using this source, we can just
> +		 * remove the source knowing rename_dst has its info.
> +		 */
> +		if (rename_src[i].p->one->rename_used)
> +			continue;
> +
> +		if (new_num_src < i)
> +			memcpy(&rename_src[new_num_src], &rename_src[i],
> +			       sizeof(struct diff_rename_src));
> +		new_num_src++;
> +	}
> +
> +	return new_num_src;
> +}

Essentially we are compacting rename_src[] array from num_src
elements down to new_num_src elements; we are losing pointers, but I
presume these are all borrowed pointers that we do not own and we
are not responsible for freeing?  If we were to free them, the
compaction would leave duplicates after the new tail (new_num_src)
and we'd end up having to worry about double-freeing, so hopefully
all we need to do is just to free the entire array of pointers, and
not the pointees.

Having to do this just once and being able to reduce the number of
entries we need to iterate over does sound like a good simple
optimization.

>  void diffcore_rename(struct diff_options *options)
>  {
>  	int detect_rename = options->detect_rename;
> @@ -463,10 +512,11 @@ void diffcore_rename(struct diff_options *options)
>  	struct diff_score *mx;
>  	int i, j, rename_count, skip_unmodified = 0;
>  	int num_destinations, dst_cnt;
> -	int num_sources;
> +	int num_sources, want_copies;
>  	struct progress *progress = NULL;
>  
>  	trace2_region_enter("diff", "setup", options->repo);
> +	want_copies = (detect_rename == DIFF_DETECT_COPY);
>  	if (!minimum_score)
>  		minimum_score = DEFAULT_RENAME_SCORE;
>  
> @@ -529,13 +579,10 @@ void diffcore_rename(struct diff_options *options)
>  		goto cleanup;
>  
>  	/*
> -	 * Calculate how many renames are left (but all the source
> -	 * files still remain as options for rename/copies!)
> +	 * Calculate how many renames are left
>  	 */
>  	num_destinations = (rename_dst_nr - rename_count);
> -	num_sources = rename_src_nr;
> -	if (detect_rename != DIFF_DETECT_COPY)
> -		num_sources -= rename_count;
> +	num_sources = remove_unneeded_paths_from_src(rename_src_nr, want_copies);

OK, this is in a sense an extended version of the previous step.

I am not sure if rename_src_nr can be left out of sync with reality
like this patch does, though.  The reference to that variable in
register_rename_src() and find_exact_renames() are OK as we are not
going to call them after we futz with the rename_src[] array, but
the reference in prefetch(), which does not actually happen early
but only when we start running estimate_similarity(), which is after
we compacted the rename_src[] array, would be affected, no?

Thanks.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 2/2] diffcore-rename: filter rename_src list when possible
  2021-02-03 19:12   ` Junio C Hamano
@ 2021-02-03 19:19     ` Elijah Newren
  0 siblings, 0 replies; 25+ messages in thread
From: Elijah Newren @ 2021-02-03 19:19 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee,
	Jonathan Tan, Taylor Blau, Jeff King, Karsten Blees

On Wed, Feb 3, 2021 at 11:12 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > +static int remove_unneeded_paths_from_src(int num_src,
> > +                                       int detecting_copies)
> > +{
> > +     int i, new_num_src;
> > +
> > +     /*
> > +      * Note on reasons why we cull unneeded sources but not destinations:
> > +      *   1) Pairings are stored in rename_dst (not rename_src), which we
> > +      *      need to keep around.  So, we just can't cull rename_dst even
> > +      *      if we wanted to.  But doing so wouldn't help because...
> > +      *
> > +      *   2) There is a matrix pairwise comparison that follows the
> > +      *      "Performing inexact rename detection" progress message.
> > +      *      Iterating over the destinations is done in the outer loop,
> > +      *      hence we only iterate over each of those once and we can
> > +      *      easily skip the outer loop early if the destination isn't
> > +      *      relevant.  That's only one check per destination path to
> > +      *      skip.
> > +      *
> > +      *      By contrast, the sources are iterated in the inner loop; if
> > +      *      we check whether a source can be skipped, then we'll be
> > +      *      checking it N separate times, once for each destination.
> > +      *      We don't want to have to iterate over known-not-needed
> > +      *      sources N times each, so avoid that by removing the sources
> > +      *      from rename_src here.
> > +      */
> > +     if (detecting_copies)
> > +             return num_src; /* nothing to remove */
> > +     if (break_idx)
> > +             return num_src; /* culling incompatbile with break detection */
> > +
> > +     for (i = 0, new_num_src = 0; i < num_src; i++) {
> > +             /*
> > +              * renames are stored in rename_dst, so if a rename has
> > +              * already been detected using this source, we can just
> > +              * remove the source knowing rename_dst has its info.
> > +              */
> > +             if (rename_src[i].p->one->rename_used)
> > +                     continue;
> > +
> > +             if (new_num_src < i)
> > +                     memcpy(&rename_src[new_num_src], &rename_src[i],
> > +                            sizeof(struct diff_rename_src));
> > +             new_num_src++;
> > +     }
> > +
> > +     return new_num_src;
> > +}
>
> Essentially we are compacting rename_src[] array from num_src
> elements down to new_num_src elements; we are losing pointers, but I
> presume these are all borrowed pointers that we do not own and we
> are not responsible for freeing?  If we were to free them, the
> compaction would leave duplicates after the new tail (new_num_src)
> and we'd end up having to worry about double-freeing, so hopefully
> all we need to do is just to free the entire array of pointers, and
> not the pointees.
>
> Having to do this just once and being able to reduce the number of
> entries we need to iterate over does sound like a good simple
> optimization.

Correct, they are all just borrowed pointers and we only need to free
the array, not the pointers within the array.

> >  void diffcore_rename(struct diff_options *options)
> >  {
> >       int detect_rename = options->detect_rename;
> > @@ -463,10 +512,11 @@ void diffcore_rename(struct diff_options *options)
> >       struct diff_score *mx;
> >       int i, j, rename_count, skip_unmodified = 0;
> >       int num_destinations, dst_cnt;
> > -     int num_sources;
> > +     int num_sources, want_copies;
> >       struct progress *progress = NULL;
> >
> >       trace2_region_enter("diff", "setup", options->repo);
> > +     want_copies = (detect_rename == DIFF_DETECT_COPY);
> >       if (!minimum_score)
> >               minimum_score = DEFAULT_RENAME_SCORE;
> >
> > @@ -529,13 +579,10 @@ void diffcore_rename(struct diff_options *options)
> >               goto cleanup;
> >
> >       /*
> > -      * Calculate how many renames are left (but all the source
> > -      * files still remain as options for rename/copies!)
> > +      * Calculate how many renames are left
> >        */
> >       num_destinations = (rename_dst_nr - rename_count);
> > -     num_sources = rename_src_nr;
> > -     if (detect_rename != DIFF_DETECT_COPY)
> > -             num_sources -= rename_count;
> > +     num_sources = remove_unneeded_paths_from_src(rename_src_nr, want_copies);
>
> OK, this is in a sense an extended version of the previous step.
>
> I am not sure if rename_src_nr can be left out of sync with reality
> like this patch does, though.  The reference to that variable in
> register_rename_src() and find_exact_renames() are OK as we are not
> going to call them after we futz with the rename_src[] array, but
> the reference in prefetch(), which does not actually happen early
> but only when we start running estimate_similarity(), which is after
> we compacted the rename_src[] array, would be affected, no?

Yes, good catch.  This is the same issue Stolee caught.  I did set
rename_src_nr to the new num_sources in my "ort" branch, but when
breaking up the changes to upstream them, that line of code somehow
got separated into a later patch (that I haven't submitted yet) and I
just didn't notice it when reviewing this early series.  It belongs in
this patch as both of you pointed out.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v2 0/2] Optimization batch 6: make full use of exact renames
  2021-02-03  5:49 [PATCH 0/2] Optimization batch 6: make full use of exact renames Elijah Newren via GitGitGadget
  2021-02-03  5:49 ` [PATCH 1/2] diffcore-rename: no point trying to find a match better than exact Elijah Newren via GitGitGadget
  2021-02-03  5:49 ` [PATCH 2/2] diffcore-rename: filter rename_src list when possible Elijah Newren via GitGitGadget
@ 2021-02-03 20:03 ` Elijah Newren via GitGitGadget
  2021-02-03 20:03   ` [PATCH v2 1/2] diffcore-rename: no point trying to find a match better than exact Elijah Newren via GitGitGadget
                     ` (3 more replies)
  2 siblings, 4 replies; 25+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-03 20:03 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Jeff King, Karsten Blees, Derrick Stolee, Elijah Newren,
	Elijah Newren

This series depends on en/merge-ort-perf and makes full use of exact
renames; see commit messages for details.

Thanks to Stolee and Junio for reviewing v1.

Changes since v1:

 * Update rename_src_nr when updating rename_src
 * Introduce want_copies in the first patch and use it in a few more places
 * Move a comment below a few exit-early if-checks.

Elijah Newren (2):
  diffcore-rename: no point trying to find a match better than exact
  diffcore-rename: filter rename_src list when possible

 diffcore-rename.c | 69 +++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 61 insertions(+), 8 deletions(-)


base-commit: 557ac0350d9efa1f59c708779ca3fb3aee121131
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-842%2Fnewren%2Fort-perf-batch-6-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-842/newren/ort-perf-batch-6-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/842

Range-diff vs v1:

 1:  3e69857f37e ! 1:  770e894b4ab diffcore-rename: no point trying to find a match better than exact
     @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
       	struct diff_score *mx;
       	int i, j, rename_count, skip_unmodified = 0;
       	int num_destinations, dst_cnt;
     -+	int num_sources;
     ++	int num_sources, want_copies;
       	struct progress *progress = NULL;
       
       	trace2_region_enter("diff", "setup", options->repo);
     ++	want_copies = (detect_rename == DIFF_DETECT_COPY);
     + 	if (!minimum_score)
     + 		minimum_score = DEFAULT_RENAME_SCORE;
     + 
     +@@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
     + 				p->one->rename_used++;
     + 			register_rename_src(p);
     + 		}
     +-		else if (detect_rename == DIFF_DETECT_COPY) {
     ++		else if (want_copies) {
     + 			/*
     + 			 * Increment the "rename_used" score by
     + 			 * one, to indicate ourselves as a user.
      @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
       	 * files still remain as options for rename/copies!)
       	 */
       	num_destinations = (rename_dst_nr - rename_count);
      +	num_sources = rename_src_nr;
     -+	if (detect_rename != DIFF_DETECT_COPY)
     ++	if (!want_copies)
      +		num_sources -= rename_count;
       
       	/* All done? */
     @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
       			struct diff_filespec *one = rename_src[j].p->one;
       			struct diff_score this_src;
       
     -+			if (one->rename_used &&
     -+			    detect_rename != DIFF_DETECT_COPY)
     ++			if (one->rename_used && !want_copies)
      +				continue;
      +
       			if (skip_unmodified &&
     @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
       	}
       	stop_progress(&progress);
       
     +@@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
     + 	STABLE_QSORT(mx, dst_cnt * NUM_CANDIDATE_PER_DST, score_compare);
     + 
     + 	rename_count += find_renames(mx, dst_cnt, minimum_score, 0);
     +-	if (detect_rename == DIFF_DETECT_COPY)
     ++	if (want_copies)
     + 		rename_count += find_renames(mx, dst_cnt, minimum_score, 1);
     + 	free(mx);
     + 	trace2_region_leave("diff", "inexact renames", options->repo);
 2:  580ba9a10f5 ! 2:  7ae9460d3db diffcore-rename: filter rename_src list when possible
     @@ diffcore-rename.c: static int find_renames(struct diff_score *mx, int dst_cnt, i
       	return count;
       }
       
     -+static int remove_unneeded_paths_from_src(int num_src,
     -+					  int detecting_copies)
     ++static void remove_unneeded_paths_from_src(int detecting_copies)
      +{
      +	int i, new_num_src;
      +
     ++	if (detecting_copies)
     ++		return; /* nothing to remove */
     ++	if (break_idx)
     ++		return; /* culling incompatbile with break detection */
     ++
      +	/*
      +	 * Note on reasons why we cull unneeded sources but not destinations:
      +	 *   1) Pairings are stored in rename_dst (not rename_src), which we
     @@ diffcore-rename.c: static int find_renames(struct diff_score *mx, int dst_cnt, i
      +	 *      sources N times each, so avoid that by removing the sources
      +	 *      from rename_src here.
      +	 */
     -+	if (detecting_copies)
     -+		return num_src; /* nothing to remove */
     -+	if (break_idx)
     -+		return num_src; /* culling incompatbile with break detection */
     -+
     -+	for (i = 0, new_num_src = 0; i < num_src; i++) {
     ++	for (i = 0, new_num_src = 0; i < rename_src_nr; i++) {
      +		/*
      +		 * renames are stored in rename_dst, so if a rename has
      +		 * already been detected using this source, we can just
     @@ diffcore-rename.c: static int find_renames(struct diff_score *mx, int dst_cnt, i
      +		new_num_src++;
      +	}
      +
     -+	return new_num_src;
     ++	rename_src_nr = new_num_src;
      +}
      +
       void diffcore_rename(struct diff_options *options)
       {
       	int detect_rename = options->detect_rename;
     -@@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
     - 	struct diff_score *mx;
     - 	int i, j, rename_count, skip_unmodified = 0;
     - 	int num_destinations, dst_cnt;
     --	int num_sources;
     -+	int num_sources, want_copies;
     - 	struct progress *progress = NULL;
     - 
     - 	trace2_region_enter("diff", "setup", options->repo);
     -+	want_copies = (detect_rename == DIFF_DETECT_COPY);
     - 	if (!minimum_score)
     - 		minimum_score = DEFAULT_RENAME_SCORE;
     - 
      @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
       		goto cleanup;
       
     @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
      +	 * Calculate how many renames are left
       	 */
       	num_destinations = (rename_dst_nr - rename_count);
     --	num_sources = rename_src_nr;
     --	if (detect_rename != DIFF_DETECT_COPY)
     ++	remove_unneeded_paths_from_src(want_copies);
     + 	num_sources = rename_src_nr;
     +-	if (!want_copies)
      -		num_sources -= rename_count;
     -+	num_sources = remove_unneeded_paths_from_src(rename_src_nr, want_copies);
       
       	/* All done? */
       	if (!num_destinations || !num_sources)
      @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
     - 		for (j = 0; j < NUM_CANDIDATE_PER_DST; j++)
     - 			m[j].dst = -1;
     - 
     --		for (j = 0; j < rename_src_nr; j++) {
     -+		for (j = 0; j < num_sources; j++) {
       			struct diff_filespec *one = rename_src[j].p->one;
       			struct diff_score this_src;
       
     --			if (one->rename_used &&
     --			    detect_rename != DIFF_DETECT_COPY)
     +-			if (one->rename_used && !want_copies)
      -				continue;
     -+			assert(!one->rename_used ||
     -+			       detect_rename == DIFF_DETECT_COPY ||
     -+			       break_idx);
     ++			assert(!one->rename_used || want_copies || break_idx);
       
       			if (skip_unmodified &&
       			    diff_unmodified_pair(rename_src[j].p))

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v2 1/2] diffcore-rename: no point trying to find a match better than exact
  2021-02-03 20:03 ` [PATCH v2 0/2] Optimization batch 6: make full use of exact renames Elijah Newren via GitGitGadget
@ 2021-02-03 20:03   ` Elijah Newren via GitGitGadget
  2021-02-03 20:03   ` [PATCH v2 2/2] diffcore-rename: filter rename_src list when possible Elijah Newren via GitGitGadget
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 25+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-03 20:03 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Jeff King, Karsten Blees, Derrick Stolee, Elijah Newren,
	Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

diffcore_rename() had some code to avoid having destination paths that
already had an exact rename detected from being re-checked for other
renames.  Source paths, however, were re-checked because we wanted to
allow the possibility of detecting copies.  But if copy detection isn't
turned on, then this merely amounts to attempting to find a
better-than-exact match, which naturally ends up being an expensive
no-op.  In particular, copy detection is never turned on by the merge
machinery.

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:

                            Before                  After
    no-renames:       14.263 s ±  0.053 s    14.119 s ±  0.101 s
    mega-renames:   5504.231 s ±  5.150 s  1802.044 s ±  0.828 s
    just-one-mega:   158.534 s ±  0.498 s    51.391 s ±  0.028 s

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 8fe6c9384bc..8b118628b4e 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -463,9 +463,11 @@ void diffcore_rename(struct diff_options *options)
 	struct diff_score *mx;
 	int i, j, rename_count, skip_unmodified = 0;
 	int num_destinations, dst_cnt;
+	int num_sources, want_copies;
 	struct progress *progress = NULL;
 
 	trace2_region_enter("diff", "setup", options->repo);
+	want_copies = (detect_rename == DIFF_DETECT_COPY);
 	if (!minimum_score)
 		minimum_score = DEFAULT_RENAME_SCORE;
 
@@ -502,7 +504,7 @@ void diffcore_rename(struct diff_options *options)
 				p->one->rename_used++;
 			register_rename_src(p);
 		}
-		else if (detect_rename == DIFF_DETECT_COPY) {
+		else if (want_copies) {
 			/*
 			 * Increment the "rename_used" score by
 			 * one, to indicate ourselves as a user.
@@ -532,12 +534,15 @@ void diffcore_rename(struct diff_options *options)
 	 * files still remain as options for rename/copies!)
 	 */
 	num_destinations = (rename_dst_nr - rename_count);
+	num_sources = rename_src_nr;
+	if (!want_copies)
+		num_sources -= rename_count;
 
 	/* All done? */
-	if (!num_destinations)
+	if (!num_destinations || !num_sources)
 		goto cleanup;
 
-	switch (too_many_rename_candidates(num_destinations, rename_src_nr,
+	switch (too_many_rename_candidates(num_destinations, num_sources,
 					   options)) {
 	case 1:
 		goto cleanup;
@@ -553,7 +558,7 @@ void diffcore_rename(struct diff_options *options)
 	if (options->show_rename_progress) {
 		progress = start_delayed_progress(
 				_("Performing inexact rename detection"),
-				(uint64_t)num_destinations * (uint64_t)rename_src_nr);
+				(uint64_t)num_destinations * (uint64_t)num_sources);
 	}
 
 	mx = xcalloc(st_mult(NUM_CANDIDATE_PER_DST, num_destinations),
@@ -573,6 +578,9 @@ void diffcore_rename(struct diff_options *options)
 			struct diff_filespec *one = rename_src[j].p->one;
 			struct diff_score this_src;
 
+			if (one->rename_used && !want_copies)
+				continue;
+
 			if (skip_unmodified &&
 			    diff_unmodified_pair(rename_src[j].p))
 				continue;
@@ -594,7 +602,7 @@ void diffcore_rename(struct diff_options *options)
 		}
 		dst_cnt++;
 		display_progress(progress,
-				 (uint64_t)dst_cnt * (uint64_t)rename_src_nr);
+				 (uint64_t)dst_cnt * (uint64_t)num_sources);
 	}
 	stop_progress(&progress);
 
@@ -602,7 +610,7 @@ void diffcore_rename(struct diff_options *options)
 	STABLE_QSORT(mx, dst_cnt * NUM_CANDIDATE_PER_DST, score_compare);
 
 	rename_count += find_renames(mx, dst_cnt, minimum_score, 0);
-	if (detect_rename == DIFF_DETECT_COPY)
+	if (want_copies)
 		rename_count += find_renames(mx, dst_cnt, minimum_score, 1);
 	free(mx);
 	trace2_region_leave("diff", "inexact renames", options->repo);
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v2 2/2] diffcore-rename: filter rename_src list when possible
  2021-02-03 20:03 ` [PATCH v2 0/2] Optimization batch 6: make full use of exact renames Elijah Newren via GitGitGadget
  2021-02-03 20:03   ` [PATCH v2 1/2] diffcore-rename: no point trying to find a match better than exact Elijah Newren via GitGitGadget
@ 2021-02-03 20:03   ` Elijah Newren via GitGitGadget
  2021-02-13  1:04     ` Junio C Hamano
  2021-02-13  1:06     ` Junio C Hamano
  2021-02-03 21:56   ` [PATCH v2 0/2] Optimization batch 6: make full use of exact renames Junio C Hamano
  2021-02-14  7:34   ` [PATCH v3 " Elijah Newren via GitGitGadget
  3 siblings, 2 replies; 25+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-03 20:03 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Jeff King, Karsten Blees, Derrick Stolee, Elijah Newren,
	Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

We have to look at each entry in rename_src a total of rename_dst_nr
times.  When we're not detecting copies, any exact renames or ignorable
rename paths will just be skipped over.  While checking that these can
be skipped over is a relatively cheap check, it's still a waste of time
to do that check more than once, let alone rename_dst_nr times.  When
rename_src_nr is a few thousand times bigger than the number of relevant
sources (such as when cherry-picking a commit that only touched a
handful of files, but from a side of history that has different names
for some high level directories), this time can add up.

First make an initial pass over the rename_src array and move all the
relevant entries to the front, so that we can iterate over just those
relevant entries.

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:

                            Before                  After
    no-renames:       14.119 s ±  0.101 s    13.815 s ±  0.062 s
    mega-renames:   1802.044 s ±  0.828 s  1799.937 s ±  0.493 s
    just-one-mega:    51.391 s ±  0.028 s    51.289 s ±  0.019 s

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 57 ++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 51 insertions(+), 6 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 8b118628b4e..74930716e70 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -454,6 +454,54 @@ static int find_renames(struct diff_score *mx, int dst_cnt, int minimum_score, i
 	return count;
 }
 
+static void remove_unneeded_paths_from_src(int detecting_copies)
+{
+	int i, new_num_src;
+
+	if (detecting_copies)
+		return; /* nothing to remove */
+	if (break_idx)
+		return; /* culling incompatbile with break detection */
+
+	/*
+	 * Note on reasons why we cull unneeded sources but not destinations:
+	 *   1) Pairings are stored in rename_dst (not rename_src), which we
+	 *      need to keep around.  So, we just can't cull rename_dst even
+	 *      if we wanted to.  But doing so wouldn't help because...
+	 *
+	 *   2) There is a matrix pairwise comparison that follows the
+	 *      "Performing inexact rename detection" progress message.
+	 *      Iterating over the destinations is done in the outer loop,
+	 *      hence we only iterate over each of those once and we can
+	 *      easily skip the outer loop early if the destination isn't
+	 *      relevant.  That's only one check per destination path to
+	 *      skip.
+	 *
+	 *      By contrast, the sources are iterated in the inner loop; if
+	 *      we check whether a source can be skipped, then we'll be
+	 *      checking it N separate times, once for each destination.
+	 *      We don't want to have to iterate over known-not-needed
+	 *      sources N times each, so avoid that by removing the sources
+	 *      from rename_src here.
+	 */
+	for (i = 0, new_num_src = 0; i < rename_src_nr; i++) {
+		/*
+		 * renames are stored in rename_dst, so if a rename has
+		 * already been detected using this source, we can just
+		 * remove the source knowing rename_dst has its info.
+		 */
+		if (rename_src[i].p->one->rename_used)
+			continue;
+
+		if (new_num_src < i)
+			memcpy(&rename_src[new_num_src], &rename_src[i],
+			       sizeof(struct diff_rename_src));
+		new_num_src++;
+	}
+
+	rename_src_nr = new_num_src;
+}
+
 void diffcore_rename(struct diff_options *options)
 {
 	int detect_rename = options->detect_rename;
@@ -530,13 +578,11 @@ void diffcore_rename(struct diff_options *options)
 		goto cleanup;
 
 	/*
-	 * Calculate how many renames are left (but all the source
-	 * files still remain as options for rename/copies!)
+	 * Calculate how many renames are left
 	 */
 	num_destinations = (rename_dst_nr - rename_count);
+	remove_unneeded_paths_from_src(want_copies);
 	num_sources = rename_src_nr;
-	if (!want_copies)
-		num_sources -= rename_count;
 
 	/* All done? */
 	if (!num_destinations || !num_sources)
@@ -578,8 +624,7 @@ void diffcore_rename(struct diff_options *options)
 			struct diff_filespec *one = rename_src[j].p->one;
 			struct diff_score this_src;
 
-			if (one->rename_used && !want_copies)
-				continue;
+			assert(!one->rename_used || want_copies || break_idx);
 
 			if (skip_unmodified &&
 			    diff_unmodified_pair(rename_src[j].p))
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 0/2] Optimization batch 6: make full use of exact renames
  2021-02-03 20:03 ` [PATCH v2 0/2] Optimization batch 6: make full use of exact renames Elijah Newren via GitGitGadget
  2021-02-03 20:03   ` [PATCH v2 1/2] diffcore-rename: no point trying to find a match better than exact Elijah Newren via GitGitGadget
  2021-02-03 20:03   ` [PATCH v2 2/2] diffcore-rename: filter rename_src list when possible Elijah Newren via GitGitGadget
@ 2021-02-03 21:56   ` Junio C Hamano
  2021-02-03 23:06     ` Elijah Newren
  2021-02-14  7:34   ` [PATCH v3 " Elijah Newren via GitGitGadget
  3 siblings, 1 reply; 25+ messages in thread
From: Junio C Hamano @ 2021-02-03 21:56 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget
  Cc: git, Derrick Stolee, Jonathan Tan, Taylor Blau, Jeff King,
	Karsten Blees, Derrick Stolee, Elijah Newren

"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> This series depends on en/merge-ort-perf and makes full use of exact
> renames; see commit messages for details.
>
> Thanks to Stolee and Junio for reviewing v1.
>
> Changes since v1:
>
>  * Update rename_src_nr when updating rename_src
>  * Introduce want_copies in the first patch and use it in a few more places
>  * Move a comment below a few exit-early if-checks.
>
> Elijah Newren (2):
>   diffcore-rename: no point trying to find a match better than exact
>   diffcore-rename: filter rename_src list when possible
>
>  diffcore-rename.c | 69 +++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 61 insertions(+), 8 deletions(-)

Thanks, these look bettrer.

With these changes, I guess there are only two things I find myself
somewhat embarrassing in the rename machinery that is still there
since I invented it.

 - We still need to go full matrix while finding the "best"
   pairing.  I cannot think of a way to avoid it (that is what makes
   it embarrassing) but wish there were some way to.

   In an early attempt, I tried to retire rename_src[j], once
   rename_dst[i] has been found to be a "good enough" match for it,
   from the pool of rename src candidates to find a good match for
   rename_dst[k] for i < k, but naive implementation of it would not
   work well for obvious reasons---rename_src[j] may match a lot
   better with rename_dst[k] than rename_dst[i] but we do not know
   that until we try to estimate similarity with rename_dst[k].

 - The .cnt_data member was designed to be a concise summary of the
   blob characteristics so that two .cnt_data can be "compared"
   fairly cheaply to see how "similar" two blobs are [*], but (1) it
   is rather big to be called a "concise summary", and (2) it was
   not chosen after real performance measurement, and we've been
   using it for the past 15 years without revisiting its design.

   Side note: In a very early prototype, the approach to assess
   similarity between two blobs was very different---there was no
   attempt to compute "concise summary" for each blob, but we just
   attempted to create delta (as in the pack data) between src and
   dst blobs and measured how small a delta we can use to transform
   from src to dst.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 0/2] Optimization batch 6: make full use of exact renames
  2021-02-03 21:56   ` [PATCH v2 0/2] Optimization batch 6: make full use of exact renames Junio C Hamano
@ 2021-02-03 23:06     ` Elijah Newren
  2021-02-03 23:26       ` Junio C Hamano
  2021-02-03 23:36       ` Jeff King
  0 siblings, 2 replies; 25+ messages in thread
From: Elijah Newren @ 2021-02-03 23:06 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee,
	Jonathan Tan, Taylor Blau, Jeff King, Karsten Blees,
	Derrick Stolee

On Wed, Feb 3, 2021 at 1:56 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > This series depends on en/merge-ort-perf and makes full use of exact
> > renames; see commit messages for details.
> >
> > Thanks to Stolee and Junio for reviewing v1.
> >
> > Changes since v1:
> >
> >  * Update rename_src_nr when updating rename_src
> >  * Introduce want_copies in the first patch and use it in a few more places
> >  * Move a comment below a few exit-early if-checks.
> >
> > Elijah Newren (2):
> >   diffcore-rename: no point trying to find a match better than exact
> >   diffcore-rename: filter rename_src list when possible
> >
> >  diffcore-rename.c | 69 +++++++++++++++++++++++++++++++++++++++++------
> >  1 file changed, 61 insertions(+), 8 deletions(-)
>
> Thanks, these look bettrer.
>
> With these changes, I guess there are only two things I find myself
> somewhat embarrassing in the rename machinery that is still there
> since I invented it.
>
>  - We still need to go full matrix while finding the "best"
>    pairing.  I cannot think of a way to avoid it (that is what makes
>    it embarrassing) but wish there were some way to.

It turns out that exact renames aren't the only thing that can reduce
the size of the matrix.  My next few series will add some more.

The matrix does remain, even at the end of all my performance work,
but multiple ways to shrink it certainly help a lot.

>    In an early attempt, I tried to retire rename_src[j], once
>    rename_dst[i] has been found to be a "good enough" match for it,
>    from the pool of rename src candidates to find a good match for
>    rename_dst[k] for i < k, but naive implementation of it would not
>    work well for obvious reasons---rename_src[j] may match a lot
>    better with rename_dst[k] than rename_dst[i] but we do not know
>    that until we try to estimate similarity with rename_dst[k].

You may really like the next two series I submit.  I have a smarter
way to find a "good enough" match (comparing to exactly one other file
and often finding sufficient similarity), and one that'll make
intuitive sense to users.

Then I have other series after those that optimize in different ways.

>  - The .cnt_data member was designed to be a concise summary of the
>    blob characteristics so that two .cnt_data can be "compared"
>    fairly cheaply to see how "similar" two blobs are [*], but (1) it
>    is rather big to be called a "concise summary", and (2) it was
>    not chosen after real performance measurement, and we've been
>    using it for the past 15 years without revisiting its design.

.cnt_data seemed kind of clever to me, though it did have a few things
that seemed surprising:

1) Small "lines".  The similarity works by mapping each "line" to an
integer using some simple hash, and keeps track of the list of hashed
integers for each file.  Once you have a list of hashed integers for
each file, similarity is determined by looking through the two lists
for two files and seeing what percentage of integers is found in both
(but see item #3 for how percentage is defined).  One special
consideration here was "lines".

I think because of a desire to handle binary files, there was a
decision to split at 64-bytes or LF, whichever comes first, so each
real line of code might be broken into multiple "lines".  The 64-bytes
thing might make things a little weird, though.  If you have a lot of
lines that are just over 64-characters long, they'll be split into two
"lines", with the second ones being very short.  Really short lines
are much more likely to look like other really short lines, which
might pad your similarity count.  I'm not sure what to do differently,
though.

2) Size-agnostic hashing.  The integer that each "line" is mapped to
is strictly between 0 and HASHBASE-1, where HASHBASE is #define'd to
107927.

By the pigeon-hole principle, since this hash size is fixed, any two
sufficiently large files will be marked by this algorithm as similar.
I created some testcases and verified that this was indeed true.  I
created two files at random using disjoint "alphabets" (e.g. one file
only contained letters, the other file only contained digits), and
found that at a certain size the algorithm would always return >50%
similarity.  The size was somewhere between 1MB and 8MB; I can't find
the table where I recorded that anymore.  So, basically, sufficiently
large files that are sufficiently close in size (since the algorithm
checks for size similarity as an optimization) will always be marked
as a rename or copy.  I think we haven't gotten reports of this "bug"
because files that large are for all intents and purposes binaries
that people likely won't modify.  And if they do modify the files,
then they'll conflict, but people won't look at the files to resolve
the conflict; instead, they'll generate a new "good" binary externally
and then stick it into place.  I investigated this just a little bit,
and even considered turning off rename detection for sufficiently
large files, but sizes aren't available until the tight inner loop and
adding any extra checking there actually costs enough to offset most
any benefit.  There might have been a way to avoid that, but
ultimately I just dropped the issue and did nothing.

3) It uses a similarity measure that diverges from what researches
used for MinHash and other fancy algorithms.  In particular,

   size(A intersect B) / size(A union B)  != size(A intersect B) /
max(size(A), size(B))

The formula on the right hand side would mean that if file A is a
subset of file B, say the first 10% of file B, then it will be treated
as 100% similar when most humans would look at it and say it is only
10% similar.  The MinHash definition for similarity measure seems like
it makes more sense...though if people have been passing numbers like
"60%" to -M, then this change of definition will force them to
recalibrate what numbers they pass.  I think it'd be really nice to
use the MinHash definition and that it might even fix bugs to do so,
but I was worried folks might perceive it as a backward incompatible
change if they need to recalibrate their numberings.

So, I ultimately didn't do anything here either...yet.  But this is
the one that I'd most like to do something about.  Are others
concerned about the possible need to recalibrate, or is that okay?
Maybe the performance gains I'm adding elsewhere will offset possible
grumpy users?

>    Side note: In a very early prototype, the approach to assess
>    similarity between two blobs was very different---there was no
>    attempt to compute "concise summary" for each blob, but we just
>    attempted to create delta (as in the pack data) between src and
>    dst blobs and measured how small a delta we can use to transform
>    from src to dst.

Interesting; I didn't know that.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 0/2] Optimization batch 6: make full use of exact renames
  2021-02-03 23:06     ` Elijah Newren
@ 2021-02-03 23:26       ` Junio C Hamano
  2021-02-03 23:36       ` Jeff King
  1 sibling, 0 replies; 25+ messages in thread
From: Junio C Hamano @ 2021-02-03 23:26 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee,
	Jonathan Tan, Taylor Blau, Jeff King, Karsten Blees,
	Derrick Stolee

Elijah Newren <newren@gmail.com> writes:

> 3) It uses a similarity measure that diverges from what researches
> used for MinHash and other fancy algorithms.  In particular,
>
>    size(A intersect B) / size(A union B)  != size(A intersect B) /
> max(size(A), size(B))
>
> The formula on the right hand side would mean that if file A is a
> subset of file B, say the first 10% of file B, then it will be treated
> as 100% similar when most humans would look at it and say it is only
> 10% similar.

If you are talking about "you start from 100 lines file and appended
900 lines of your own, then you still have 100% of the original
material remaining in the file", it is quite deliberate that we used
it as an indication that the original "100 lines" file is a good
candidate to have been renamed to the resulting "1000 lines" file.
It is "what you have kept from the original" measure.

Of course, taken to the extreme, this means that rename does not
have to be symmetrical.  "diff A B" may find that the original
100-line file in A has grown into 1000-line file in B elsewhere, but
"diff B A" or "diff -R A B" would not necessarily pair these two
blobs as matching.

> Maybe the performance gains I'm adding elsewhere will offset possible
> grumpy users?

Users, as they are, it would never happen.  When they have something
to complain about, they will, regardless of what else you do.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 0/2] Optimization batch 6: make full use of exact renames
  2021-02-03 23:06     ` Elijah Newren
  2021-02-03 23:26       ` Junio C Hamano
@ 2021-02-03 23:36       ` Jeff King
  2021-02-04  0:05         ` Elijah Newren
  1 sibling, 1 reply; 25+ messages in thread
From: Jeff King @ 2021-02-03 23:36 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Junio C Hamano, Elijah Newren via GitGitGadget, Git Mailing List,
	Derrick Stolee, Jonathan Tan, Taylor Blau, Karsten Blees,
	Derrick Stolee

On Wed, Feb 03, 2021 at 03:06:26PM -0800, Elijah Newren wrote:

> >    In an early attempt, I tried to retire rename_src[j], once
> >    rename_dst[i] has been found to be a "good enough" match for it,
> >    from the pool of rename src candidates to find a good match for
> >    rename_dst[k] for i < k, but naive implementation of it would not
> >    work well for obvious reasons---rename_src[j] may match a lot
> >    better with rename_dst[k] than rename_dst[i] but we do not know
> >    that until we try to estimate similarity with rename_dst[k].
> 
> You may really like the next two series I submit.  I have a smarter
> way to find a "good enough" match (comparing to exactly one other file
> and often finding sufficient similarity), and one that'll make
> intuitive sense to users.

Here's a really old thread with an approach that may or may not be
similar to what you're thinking of:

  https://lore.kernel.org/git/596909b30710220240g665054d8hc40bc5d2234ba9e1@mail.gmail.com/

Though maybe start with this summary message:

  https://lore.kernel.org/git/596909b30710220309l1a28e646r9fd47f967dc32574@mail.gmail.com/

I remember experimenting some with it at the time, but never making
things faster. It's entirely possible (likely, even) that I was simply
not doing it well enough.

It's also been long enough since I looked at the rename code that I'm
not sure how different it is in practice. It still has a quadratic
matrix in the end. We basically do a similar strategy of
rolling-hash-fingerprint-and-see-where-things-collide, but I think we
may end up with more work during the quadratic part (again, it's been
a while, so I may just be totally off-base).

I've also wondered if something similar might be helpful for delta
compression (which is now done with an O(m*n) sliding window).

-Peff

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 0/2] Optimization batch 6: make full use of exact renames
  2021-02-03 23:36       ` Jeff King
@ 2021-02-04  0:05         ` Elijah Newren
  0 siblings, 0 replies; 25+ messages in thread
From: Elijah Newren @ 2021-02-04  0:05 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, Elijah Newren via GitGitGadget, Git Mailing List,
	Derrick Stolee, Jonathan Tan, Taylor Blau, Karsten Blees,
	Derrick Stolee

On Wed, Feb 3, 2021 at 3:37 PM Jeff King <peff@peff.net> wrote:
>
> On Wed, Feb 03, 2021 at 03:06:26PM -0800, Elijah Newren wrote:
>
> > >    In an early attempt, I tried to retire rename_src[j], once
> > >    rename_dst[i] has been found to be a "good enough" match for it,
> > >    from the pool of rename src candidates to find a good match for
> > >    rename_dst[k] for i < k, but naive implementation of it would not
> > >    work well for obvious reasons---rename_src[j] may match a lot
> > >    better with rename_dst[k] than rename_dst[i] but we do not know
> > >    that until we try to estimate similarity with rename_dst[k].
> >
> > You may really like the next two series I submit.  I have a smarter
> > way to find a "good enough" match (comparing to exactly one other file
> > and often finding sufficient similarity), and one that'll make
> > intuitive sense to users.
>
> Here's a really old thread with an approach that may or may not be
> similar to what you're thinking of:
>
>   https://lore.kernel.org/git/596909b30710220240g665054d8hc40bc5d2234ba9e1@mail.gmail.com/
>
> Though maybe start with this summary message:
>
>   https://lore.kernel.org/git/596909b30710220309l1a28e646r9fd47f967dc32574@mail.gmail.com/

Interesting thread; thanks for the link.  It's not remotely similar to
what I have done, but a brief glance through it reminds me of the
ideas at https://github.com/gitgitgadget/git/issues/519.

> I remember experimenting some with it at the time, but never making
> things faster. It's entirely possible (likely, even) that I was simply
> not doing it well enough.
>
> It's also been long enough since I looked at the rename code that I'm
> not sure how different it is in practice. It still has a quadratic
> matrix in the end. We basically do a similar strategy of
> rolling-hash-fingerprint-and-see-where-things-collide, but I think we
> may end up with more work during the quadratic part (again, it's been
> a while, so I may just be totally off-base).

I'm not sure if I should spoil the surprise for the patchsets I
haven't yet submitted but... when I get done, for the testcases I have
looked at, rename detection is no longer the slowest part of
merging/rebasing/cherry-picking -- not even when there are tens of
thousands of renames on one side of history.  And I didn't achieve
that by making other parts of the code slower.

If someone can come up with a real-world testcase where rename
detection is still really slow in a merge/rebase/cherry-pick with my
implemented optimizations, I've got at least one more optimization
that I hadn't bothered implementing because everything was fast enough
for me already.  And the idea you linked above and the ideas in the
gitgitgadget issue linked above all have more possibilities that are
complementary to what I have done that might help further.

> I've also wondered if something similar might be helpful for delta
> compression (which is now done with an O(m*n) sliding window).

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 2/2] diffcore-rename: filter rename_src list when possible
  2021-02-03 20:03   ` [PATCH v2 2/2] diffcore-rename: filter rename_src list when possible Elijah Newren via GitGitGadget
@ 2021-02-13  1:04     ` Junio C Hamano
  2021-02-13  4:24       ` Elijah Newren
  2021-02-13  1:06     ` Junio C Hamano
  1 sibling, 1 reply; 25+ messages in thread
From: Junio C Hamano @ 2021-02-13  1:04 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget
  Cc: git, Derrick Stolee, Jonathan Tan, Taylor Blau, Jeff King,
	Karsten Blees, Derrick Stolee, Elijah Newren

"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +		return; /* culling incompatbile with break detection */

incompatible.

>  	/*
> -	 * Calculate how many renames are left (but all the source
> -	 * files still remain as options for rename/copies!)
> +	 * Calculate how many renames are left
>  	 */

This no longer needs to be a multi-line comment, does it?


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 2/2] diffcore-rename: filter rename_src list when possible
  2021-02-03 20:03   ` [PATCH v2 2/2] diffcore-rename: filter rename_src list when possible Elijah Newren via GitGitGadget
  2021-02-13  1:04     ` Junio C Hamano
@ 2021-02-13  1:06     ` Junio C Hamano
  2021-02-13  4:43       ` Elijah Newren
  1 sibling, 1 reply; 25+ messages in thread
From: Junio C Hamano @ 2021-02-13  1:06 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget
  Cc: git, Derrick Stolee, Jonathan Tan, Taylor Blau, Jeff King,
	Karsten Blees, Derrick Stolee, Elijah Newren

"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> @@ -578,8 +624,7 @@ void diffcore_rename(struct diff_options *options)
>  			struct diff_filespec *one = rename_src[j].p->one;
>  			struct diff_score this_src;
>  
> -			if (one->rename_used && !want_copies)
> -				continue;
> +			assert(!one->rename_used || want_copies || break_idx);

Doesn't assert becomes a no-op in production builds?  Shouldn't this
be a BUG() instead?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 2/2] diffcore-rename: filter rename_src list when possible
  2021-02-13  1:04     ` Junio C Hamano
@ 2021-02-13  4:24       ` Elijah Newren
  0 siblings, 0 replies; 25+ messages in thread
From: Elijah Newren @ 2021-02-13  4:24 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee,
	Jonathan Tan, Taylor Blau, Jeff King, Karsten Blees,
	Derrick Stolee

On Fri, Feb 12, 2021 at 5:04 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > +             return; /* culling incompatbile with break detection */
>
> incompatible.

Thanks.

> >       /*
> > -      * Calculate how many renames are left (but all the source
> > -      * files still remain as options for rename/copies!)
> > +      * Calculate how many renames are left
> >        */
>
> This no longer needs to be a multi-line comment, does it?

Indeed.

Will fix both.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 2/2] diffcore-rename: filter rename_src list when possible
  2021-02-13  1:06     ` Junio C Hamano
@ 2021-02-13  4:43       ` Elijah Newren
  0 siblings, 0 replies; 25+ messages in thread
From: Elijah Newren @ 2021-02-13  4:43 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee,
	Jonathan Tan, Taylor Blau, Jeff King, Karsten Blees,
	Derrick Stolee

On Fri, Feb 12, 2021 at 5:06 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > @@ -578,8 +624,7 @@ void diffcore_rename(struct diff_options *options)
> >                       struct diff_filespec *one = rename_src[j].p->one;
> >                       struct diff_score this_src;
> >
> > -                     if (one->rename_used && !want_copies)
> > -                             continue;
> > +                     assert(!one->rename_used || want_copies || break_idx);
>
> Doesn't assert becomes a no-op in production builds?  Shouldn't this
> be a BUG() instead?

I guess it depends on the reason for the line.  If we're just trying
to help others understand the code by documenting conditions that are
true (and perhaps help them when they are refactoring), then comments
like

    /* The following is true at this point: !one->rename_used ||
want_copies || break_idx */

could also be used, but it's kind of verbose.  The form

    assert(!one->rename_used || want_copies || break_idx);

is shorter, clearer, and is likely to be up-to-date because even if
most builds and users turn off assertions, some folks will build and
run with assertions on.  If it's a refactoring-aid, then the latter
form is also more likely to help the future developer (who may be
future me).

If, however, the purpose is to check for bad input or worries about
programming logic possibly have edge cases, and we're afraid that such
conditions might cause severe and hard to debug problems later in the
code, then we absolutely would rather use a BUG().

Most of my uses of assert() fall in the former category.  I use BUG()
when it's a line meant for the latter category.  There are a few that
perhaps fall in between, where I just have to make a judgement call.
I'd like to say that I'm more likely to use BUG() for those, but the
lack of a BUG_ON() does tend to make it pretty annoying to use and
certainly discourages it.

Granted, that's the way I look at it.  I'm curious if others have a
different view.  It certainly seems like the project likes to use both
forms nearly equally:
$ git grep assert\( origin/master | wc -l
468
$ git grep BUG\( origin/master | wc -l
506

so I'm curious if there are other factors that make folks pick one or the other.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v3 0/2] Optimization batch 6: make full use of exact renames
  2021-02-03 20:03 ` [PATCH v2 0/2] Optimization batch 6: make full use of exact renames Elijah Newren via GitGitGadget
                     ` (2 preceding siblings ...)
  2021-02-03 21:56   ` [PATCH v2 0/2] Optimization batch 6: make full use of exact renames Junio C Hamano
@ 2021-02-14  7:34   ` Elijah Newren via GitGitGadget
  2021-02-14  7:35     ` [PATCH v3 1/2] diffcore-rename: no point trying to find a match better than exact Elijah Newren via GitGitGadget
  2021-02-14  7:35     ` [PATCH v3 2/2] diffcore-rename: filter rename_src list when possible Elijah Newren via GitGitGadget
  3 siblings, 2 replies; 25+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-14  7:34 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Jeff King, Karsten Blees, Derrick Stolee, Elijah Newren,
	Elijah Newren

This series makes full use of exact renames; removing not only a destination
pair, but a source pair as well when an exact rename is found and copy
detection is not turned on.

Changes since v2:

 * Fix a comment typo, and fix a multi-line comment that didn't need to be a
   multi-line comment

Elijah Newren (2):
  diffcore-rename: no point trying to find a match better than exact
  diffcore-rename: filter rename_src list when possible

 diffcore-rename.c | 71 ++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 61 insertions(+), 10 deletions(-)


base-commit: f0117958910fbc734457a83a9f8ecc3c62463417
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-842%2Fnewren%2Fort-perf-batch-6-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-842/newren/ort-perf-batch-6-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/842

Range-diff vs v2:

 1:  770e894b4abd = 1:  a59c1960f614 diffcore-rename: no point trying to find a match better than exact
 2:  7ae9460d3dba ! 2:  dd6595b45640 diffcore-rename: filter rename_src list when possible
     @@ diffcore-rename.c: static int find_renames(struct diff_score *mx, int dst_cnt, i
      +	if (detecting_copies)
      +		return; /* nothing to remove */
      +	if (break_idx)
     -+		return; /* culling incompatbile with break detection */
     ++		return; /* culling incompatible with break detection */
      +
      +	/*
      +	 * Note on reasons why we cull unneeded sources but not destinations:
     @@ diffcore-rename.c: static int find_renames(struct diff_score *mx, int dst_cnt, i
       {
       	int detect_rename = options->detect_rename;
      @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
     + 	if (minimum_score == MAX_SCORE)
       		goto cleanup;
       
     - 	/*
     +-	/*
      -	 * Calculate how many renames are left (but all the source
      -	 * files still remain as options for rename/copies!)
     -+	 * Calculate how many renames are left
     - 	 */
     +-	 */
     ++	/* Calculate how many renames are left */
       	num_destinations = (rename_dst_nr - rename_count);
      +	remove_unneeded_paths_from_src(want_copies);
       	num_sources = rename_src_nr;

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v3 1/2] diffcore-rename: no point trying to find a match better than exact
  2021-02-14  7:34   ` [PATCH v3 " Elijah Newren via GitGitGadget
@ 2021-02-14  7:35     ` Elijah Newren via GitGitGadget
  2021-02-14  7:35     ` [PATCH v3 2/2] diffcore-rename: filter rename_src list when possible Elijah Newren via GitGitGadget
  1 sibling, 0 replies; 25+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-14  7:35 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Jeff King, Karsten Blees, Derrick Stolee, Elijah Newren,
	Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

diffcore_rename() had some code to avoid having destination paths that
already had an exact rename detected from being re-checked for other
renames.  Source paths, however, were re-checked because we wanted to
allow the possibility of detecting copies.  But if copy detection isn't
turned on, then this merely amounts to attempting to find a
better-than-exact match, which naturally ends up being an expensive
no-op.  In particular, copy detection is never turned on by the merge
machinery.

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:

                            Before                  After
    no-renames:       14.263 s ±  0.053 s    14.119 s ±  0.101 s
    mega-renames:   5504.231 s ±  5.150 s  1802.044 s ±  0.828 s
    just-one-mega:   158.534 s ±  0.498 s    51.391 s ±  0.028 s

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 8fe6c9384bcb..8b118628b4ef 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -463,9 +463,11 @@ void diffcore_rename(struct diff_options *options)
 	struct diff_score *mx;
 	int i, j, rename_count, skip_unmodified = 0;
 	int num_destinations, dst_cnt;
+	int num_sources, want_copies;
 	struct progress *progress = NULL;
 
 	trace2_region_enter("diff", "setup", options->repo);
+	want_copies = (detect_rename == DIFF_DETECT_COPY);
 	if (!minimum_score)
 		minimum_score = DEFAULT_RENAME_SCORE;
 
@@ -502,7 +504,7 @@ void diffcore_rename(struct diff_options *options)
 				p->one->rename_used++;
 			register_rename_src(p);
 		}
-		else if (detect_rename == DIFF_DETECT_COPY) {
+		else if (want_copies) {
 			/*
 			 * Increment the "rename_used" score by
 			 * one, to indicate ourselves as a user.
@@ -532,12 +534,15 @@ void diffcore_rename(struct diff_options *options)
 	 * files still remain as options for rename/copies!)
 	 */
 	num_destinations = (rename_dst_nr - rename_count);
+	num_sources = rename_src_nr;
+	if (!want_copies)
+		num_sources -= rename_count;
 
 	/* All done? */
-	if (!num_destinations)
+	if (!num_destinations || !num_sources)
 		goto cleanup;
 
-	switch (too_many_rename_candidates(num_destinations, rename_src_nr,
+	switch (too_many_rename_candidates(num_destinations, num_sources,
 					   options)) {
 	case 1:
 		goto cleanup;
@@ -553,7 +558,7 @@ void diffcore_rename(struct diff_options *options)
 	if (options->show_rename_progress) {
 		progress = start_delayed_progress(
 				_("Performing inexact rename detection"),
-				(uint64_t)num_destinations * (uint64_t)rename_src_nr);
+				(uint64_t)num_destinations * (uint64_t)num_sources);
 	}
 
 	mx = xcalloc(st_mult(NUM_CANDIDATE_PER_DST, num_destinations),
@@ -573,6 +578,9 @@ void diffcore_rename(struct diff_options *options)
 			struct diff_filespec *one = rename_src[j].p->one;
 			struct diff_score this_src;
 
+			if (one->rename_used && !want_copies)
+				continue;
+
 			if (skip_unmodified &&
 			    diff_unmodified_pair(rename_src[j].p))
 				continue;
@@ -594,7 +602,7 @@ void diffcore_rename(struct diff_options *options)
 		}
 		dst_cnt++;
 		display_progress(progress,
-				 (uint64_t)dst_cnt * (uint64_t)rename_src_nr);
+				 (uint64_t)dst_cnt * (uint64_t)num_sources);
 	}
 	stop_progress(&progress);
 
@@ -602,7 +610,7 @@ void diffcore_rename(struct diff_options *options)
 	STABLE_QSORT(mx, dst_cnt * NUM_CANDIDATE_PER_DST, score_compare);
 
 	rename_count += find_renames(mx, dst_cnt, minimum_score, 0);
-	if (detect_rename == DIFF_DETECT_COPY)
+	if (want_copies)
 		rename_count += find_renames(mx, dst_cnt, minimum_score, 1);
 	free(mx);
 	trace2_region_leave("diff", "inexact renames", options->repo);
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v3 2/2] diffcore-rename: filter rename_src list when possible
  2021-02-14  7:34   ` [PATCH v3 " Elijah Newren via GitGitGadget
  2021-02-14  7:35     ` [PATCH v3 1/2] diffcore-rename: no point trying to find a match better than exact Elijah Newren via GitGitGadget
@ 2021-02-14  7:35     ` Elijah Newren via GitGitGadget
  1 sibling, 0 replies; 25+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-14  7:35 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Jeff King, Karsten Blees, Derrick Stolee, Elijah Newren,
	Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

We have to look at each entry in rename_src a total of rename_dst_nr
times.  When we're not detecting copies, any exact renames or ignorable
rename paths will just be skipped over.  While checking that these can
be skipped over is a relatively cheap check, it's still a waste of time
to do that check more than once, let alone rename_dst_nr times.  When
rename_src_nr is a few thousand times bigger than the number of relevant
sources (such as when cherry-picking a commit that only touched a
handful of files, but from a side of history that has different names
for some high level directories), this time can add up.

First make an initial pass over the rename_src array and move all the
relevant entries to the front, so that we can iterate over just those
relevant entries.

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:

                            Before                  After
    no-renames:       14.119 s ±  0.101 s    13.815 s ±  0.062 s
    mega-renames:   1802.044 s ±  0.828 s  1799.937 s ±  0.493 s
    just-one-mega:    51.391 s ±  0.028 s    51.289 s ±  0.019 s

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 59 ++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 51 insertions(+), 8 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 8b118628b4ef..6fd0c4a2f485 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -454,6 +454,54 @@ static int find_renames(struct diff_score *mx, int dst_cnt, int minimum_score, i
 	return count;
 }
 
+static void remove_unneeded_paths_from_src(int detecting_copies)
+{
+	int i, new_num_src;
+
+	if (detecting_copies)
+		return; /* nothing to remove */
+	if (break_idx)
+		return; /* culling incompatible with break detection */
+
+	/*
+	 * Note on reasons why we cull unneeded sources but not destinations:
+	 *   1) Pairings are stored in rename_dst (not rename_src), which we
+	 *      need to keep around.  So, we just can't cull rename_dst even
+	 *      if we wanted to.  But doing so wouldn't help because...
+	 *
+	 *   2) There is a matrix pairwise comparison that follows the
+	 *      "Performing inexact rename detection" progress message.
+	 *      Iterating over the destinations is done in the outer loop,
+	 *      hence we only iterate over each of those once and we can
+	 *      easily skip the outer loop early if the destination isn't
+	 *      relevant.  That's only one check per destination path to
+	 *      skip.
+	 *
+	 *      By contrast, the sources are iterated in the inner loop; if
+	 *      we check whether a source can be skipped, then we'll be
+	 *      checking it N separate times, once for each destination.
+	 *      We don't want to have to iterate over known-not-needed
+	 *      sources N times each, so avoid that by removing the sources
+	 *      from rename_src here.
+	 */
+	for (i = 0, new_num_src = 0; i < rename_src_nr; i++) {
+		/*
+		 * renames are stored in rename_dst, so if a rename has
+		 * already been detected using this source, we can just
+		 * remove the source knowing rename_dst has its info.
+		 */
+		if (rename_src[i].p->one->rename_used)
+			continue;
+
+		if (new_num_src < i)
+			memcpy(&rename_src[new_num_src], &rename_src[i],
+			       sizeof(struct diff_rename_src));
+		new_num_src++;
+	}
+
+	rename_src_nr = new_num_src;
+}
+
 void diffcore_rename(struct diff_options *options)
 {
 	int detect_rename = options->detect_rename;
@@ -529,14 +577,10 @@ void diffcore_rename(struct diff_options *options)
 	if (minimum_score == MAX_SCORE)
 		goto cleanup;
 
-	/*
-	 * Calculate how many renames are left (but all the source
-	 * files still remain as options for rename/copies!)
-	 */
+	/* Calculate how many renames are left */
 	num_destinations = (rename_dst_nr - rename_count);
+	remove_unneeded_paths_from_src(want_copies);
 	num_sources = rename_src_nr;
-	if (!want_copies)
-		num_sources -= rename_count;
 
 	/* All done? */
 	if (!num_destinations || !num_sources)
@@ -578,8 +622,7 @@ void diffcore_rename(struct diff_options *options)
 			struct diff_filespec *one = rename_src[j].p->one;
 			struct diff_score this_src;
 
-			if (one->rename_used && !want_copies)
-				continue;
+			assert(!one->rename_used || want_copies || break_idx);
 
 			if (skip_unmodified &&
 			    diff_unmodified_pair(rename_src[j].p))
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2021-02-14  7:37 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-03  5:49 [PATCH 0/2] Optimization batch 6: make full use of exact renames Elijah Newren via GitGitGadget
2021-02-03  5:49 ` [PATCH 1/2] diffcore-rename: no point trying to find a match better than exact Elijah Newren via GitGitGadget
2021-02-03 11:44   ` Derrick Stolee
2021-02-03 16:31     ` Elijah Newren
2021-02-03 18:46     ` Junio C Hamano
2021-02-03 19:10       ` Elijah Newren
2021-02-03  5:49 ` [PATCH 2/2] diffcore-rename: filter rename_src list when possible Elijah Newren via GitGitGadget
     [not found]   ` <13feb106-c3a7-a26d-0e6e-013aa45c58d4@gmail.com>
2021-02-03 17:12     ` Elijah Newren
2021-02-03 19:12   ` Junio C Hamano
2021-02-03 19:19     ` Elijah Newren
2021-02-03 20:03 ` [PATCH v2 0/2] Optimization batch 6: make full use of exact renames Elijah Newren via GitGitGadget
2021-02-03 20:03   ` [PATCH v2 1/2] diffcore-rename: no point trying to find a match better than exact Elijah Newren via GitGitGadget
2021-02-03 20:03   ` [PATCH v2 2/2] diffcore-rename: filter rename_src list when possible Elijah Newren via GitGitGadget
2021-02-13  1:04     ` Junio C Hamano
2021-02-13  4:24       ` Elijah Newren
2021-02-13  1:06     ` Junio C Hamano
2021-02-13  4:43       ` Elijah Newren
2021-02-03 21:56   ` [PATCH v2 0/2] Optimization batch 6: make full use of exact renames Junio C Hamano
2021-02-03 23:06     ` Elijah Newren
2021-02-03 23:26       ` Junio C Hamano
2021-02-03 23:36       ` Jeff King
2021-02-04  0:05         ` Elijah Newren
2021-02-14  7:34   ` [PATCH v3 " Elijah Newren via GitGitGadget
2021-02-14  7:35     ` [PATCH v3 1/2] diffcore-rename: no point trying to find a match better than exact Elijah Newren via GitGitGadget
2021-02-14  7:35     ` [PATCH v3 2/2] diffcore-rename: filter rename_src list when possible Elijah Newren via GitGitGadget

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).