git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / mirror / code / Atom feed
* [PATCH 0/7] Optimization batch 11: avoid repeatedly detecting same renames
@ 2021-03-24 21:32 Elijah Newren via GitGitGadget
  2021-03-24 21:32 ` [PATCH 1/7] merge-ort: add data structures for in-memory caching of rename detection Elijah Newren via GitGitGadget
                   ` (8 more replies)
  0 siblings, 9 replies; 32+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-03-24 21:32 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Elijah Newren

This series builds on ort-readiness, but it's semantically orthogonal to the
previous series and represents a new independent optimization.

=== Basic Optimization idea ===

This series avoids repeatedly detecting the same renames in a sequence of
merges such as a rebase or cherry-pick of several commits. When there are
many renames between the old base and the new base, traditionally all those
renames are re-detected for every commit that is transplanted. This
optimization avoids redoing that work.

This represents "Optimization #4" from my Git Merge 2020 talk[1]; the
details are a bit more involved than I realized at the time, but the high
level idea is the same.

=== Comparison to previous series ===

I previously noted that we had three major rename-related optimizations:

 * exact rename detection (applies when unmodified on renamed side)
 * skip-because-irrelevant (applies when unmodified on unrenamed side)
 * basename-guided rename detection (applies when basename unchanged)

This one adds a fourth (remember-renames), with some interesting properties:

 * unlike basename-guided rename detection, there are no behavioral changes
   (there is no heuristic involved)[2].

 * like skip-because-irrelevant, this optimization does not apply to all git
   commands using the rename machinery. In fact, this one is even more
   restrictive since it is ONLY useful for rebases and cherry-picks (not
   even merges), and only for second and later commits in a linear series.

 * unlike the three previous optimizations, there are no requirements about
   the types of changes done to the file; it just caches renames on the
   "upstream" side of history for subsequent commit picking.

It's also worth noting despite wording about "remembering" or "caching"
renames, that this optimization does NOT write this cache to disk; it's an
in-memory only cache. When the rebase or cherry-pick completes (or hits a
conflict and stops), the cache is discarded.

=== Results ===

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28), the
changes in just this series improves the performance as follows:

                     Before Series           After Series
no-renames:        5.665 s ±  0.129 s     5.624 s ±  0.077 s 
mega-renames:     11.435 s ±  0.158 s    10.213 s ±  0.032 s
just-one-mega:   494.2  ms ±  6.1  ms   497.6  ms ±  5.3  ms


By design, this optimization could not help the just-one-mega testcase. The
gains for the other two testcases may look somewhat smaller than one would
expect given the description (only ~10% for the mega-renames testcase), but
the point was to spend less time detecting renames...and there just wasn't
that much time spent in renames for these testcases before this series for
us to remove. However, if we undid the basename-guided rename detection and
skip-because-unnecessary optimizations, then this series alone would have
improved performance as follows:

                   Before Basename Series   After Just This Series
    no-renames:      13.815 s ±  0.062 s      5.814 s ±  0.094 s
    mega-renames:  1799.937 s ±  0.493 s    303.225 s ±  1.330 s


Showing that this optimization has the ability to improve things when the
other optimizations do not apply. In fact, when I originally implemented
this optimization, it improved the mega-renames testcase by a factor of 2
(at the time, I did not have all the optimizations from ort-perf-batch-7
thru ort-perf-batch-10 in their current shape).

As a reminder, before any merge-ort/diffcore-rename performance work, the
performance results we started with were:

no-renames-am:      6.940 s ±  0.485 s
no-renames:        18.912 s ±  0.174 s
mega-renames:    5964.031 s ± 10.459 s
just-one-mega:    149.583 s ±  0.751 s


=== Further discussion of results ===

If we change our focus from absolute time taken, to the percentage of
overall time spent on rename detection, then we find the following picture
comparing our starting point at the beginning of the performance work to
what we achieve at the end of this series:

       Percentage of time spent on rename detection
   
                  commit 557ac0350d      After this Series
no-renames:             39.4%                   0.2%
mega-renames:           96.6%                   2.7%
just-one-mega:          95.0%                   9.0%


Since this optimization is only applicable for the first two testcases
(because the third only involves rebasing a single commit), even if this
optimization had removed all the remaining time in rename detection it would
have only sped it up the mega-renames case by ~12% rather than the 10% it
achieved. This table makes it clear that our attempts to accelerate rename
detection have succeeded, and any further work to accelerate merges needs to
start concentrating on other areas.

[1]
https://github.com/newren/presentations/blob/pdfs/merge-performance/merge-performance-slides.pdf

[2] Well, almost no changes. There's technically a very narrow way that this
could change the behavior; see the really long "Technically," bullet point
in patch 3 for discussion of this.

Elijah Newren (7):
  merge-ort: add data structures for in-memory caching of rename
    detection
  merge-ort: populate caches of rename detection results
  merge-ort: add code to check for whether cached renames can be reused
  merge-ort: avoid accidental API mis-use
  merge-ort: preserve cached renames for the appropriate side
  merge-ort: add helper functions for using cached renames
  merge-ort, diffcore-rename: employ cached renames when possible

 diffcore-rename.c |  22 +++-
 diffcore.h        |   3 +-
 merge-ort.c       | 273 +++++++++++++++++++++++++++++++++++++++++++---
 merge-ort.h       |   2 +
 4 files changed, 282 insertions(+), 18 deletions(-)


base-commit: c2d2a1ccaea70b7dc8c0539ba9d3a132f9687040
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-859%2Fnewren%2Fort-perf-batch-11-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-859/newren/ort-perf-batch-11-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/859
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 1/7] merge-ort: add data structures for in-memory caching of rename detection
  2021-03-24 21:32 [PATCH 0/7] Optimization batch 11: avoid repeatedly detecting same renames Elijah Newren via GitGitGadget
@ 2021-03-24 21:32 ` Elijah Newren via GitGitGadget
  2021-03-24 21:32 ` [PATCH 2/7] merge-ort: populate caches of rename detection results Elijah Newren via GitGitGadget
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 32+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-03-24 21:32 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

When there are many renames between the old base of a series of commits
and the new base for a series of commits, the sequence of merges
employed to transplant those commits (from a cherry-pick or rebase
operation) will repeatedly detect the exact same renames.  This is
wasted effort.

Add data structures which will be used to cache rename detection
results, along with the initialization and deallocation of these data
structures.  Future commits will populate these caches, detect the
appropriate circumstances when they can be used, and employ them to
avoid re-detecting the same renames repeatedly.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/merge-ort.c b/merge-ort.c
index 8258d3fd621e..0774152ea64a 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -139,6 +139,37 @@ struct rename_info {
 	int callback_data_nr, callback_data_alloc;
 	char *callback_data_traverse_path;
 
+	/*
+	 * cached_pairs: Caching of renames and deletions.
+	 *
+	 * These are mappings recording renames and deletions of individual
+	 * files (not directories).  They are thus a map from an old
+	 * filename to either NULL (for deletions) or a new filename (for
+	 * renames).
+	 */
+	struct strmap cached_pairs[3];
+
+	/*
+	 * cached_target_names: just the destinations from cached_pairs
+	 *
+	 * We sometimes want a fast lookup to determine if a given filename
+	 * is one of the destinations in cached_pairs.  cached_target_names
+	 * is thus duplicative information, but it provides a fast lookup.
+	 */
+	struct strset cached_target_names[3];
+
+	/*
+	 * cached_irrelevant: Caching of rename_sources that aren't relevant.
+	 *
+	 * cached_pairs records both renames and deletes.  Sometimes we
+	 * do not know if a path is a rename or a delete because we pass
+	 * RELEVANT_LOCATION to diffcore_rename_extended() and based on
+	 * various optimizations it returns without detecting whether that
+	 * path is actually a rename or a delete.  We need to cache such
+	 * paths too, but separately from cached_pairs.
+	 */
+	struct strset cached_irrelevant[3];
+
 	/*
 	 * needed_limit: value needed for inexact rename detection to run
 	 *
@@ -381,6 +412,8 @@ static void clear_or_reinit_internal_opts(struct merge_options_internal *opti,
 		reinitialize ? strmap_partial_clear : strmap_clear;
 	void (*strintmap_func)(struct strintmap *) =
 		reinitialize ? strintmap_partial_clear : strintmap_clear;
+	void (*strset_func)(struct strset *) =
+		reinitialize ? strset_partial_clear : strset_clear;
 
 	/*
 	 * We marked opti->paths with strdup_strings = 0, so that we
@@ -424,6 +457,9 @@ static void clear_or_reinit_internal_opts(struct merge_options_internal *opti,
 		strmap_func(&renames->dir_renames[i], 0);
 
 		strintmap_func(&renames->relevant_sources[i]);
+		strset_func(&renames->cached_target_names[i]);
+		strmap_func(&renames->cached_pairs[i], 1);
+		strset_func(&renames->cached_irrelevant[i]);
 	}
 
 	if (!reinitialize) {
@@ -3675,6 +3711,12 @@ static void merge_start(struct merge_options *opt, struct merge_result *result)
 					 NULL, 0);
 		strintmap_init_with_options(&renames->relevant_sources[i],
 					    0, NULL, 0);
+		strmap_init_with_options(&renames->cached_pairs[i],
+					 NULL, 1);
+		strset_init_with_options(&renames->cached_irrelevant[i],
+					 NULL, 1);
+		strset_init_with_options(&renames->cached_target_names[i],
+					 NULL, 0);
 	}
 
 	/*
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 2/7] merge-ort: populate caches of rename detection results
  2021-03-24 21:32 [PATCH 0/7] Optimization batch 11: avoid repeatedly detecting same renames Elijah Newren via GitGitGadget
  2021-03-24 21:32 ` [PATCH 1/7] merge-ort: add data structures for in-memory caching of rename detection Elijah Newren via GitGitGadget
@ 2021-03-24 21:32 ` Elijah Newren via GitGitGadget
  2021-03-24 21:32 ` [PATCH 3/7] merge-ort: add code to check for whether cached renames can be reused Elijah Newren via GitGitGadget
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 32+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-03-24 21:32 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

Fill in cache_pairs, cached_target_names, and cached_irrelevant based on
rename detection results.  Future commits will make use of these values.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 52 insertions(+), 2 deletions(-)

diff --git a/merge-ort.c b/merge-ort.c
index 0774152ea64a..2303d88e6a92 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -2333,6 +2333,47 @@ static void resolve_diffpair_statuses(struct diff_queue_struct *q)
 	}
 }
 
+static void possibly_cache_new_pair(struct rename_info *renames,
+				    struct diff_filepair *p,
+				    unsigned side,
+				    char *new_path)
+{
+	char *old_value;
+
+	if (!new_path) {
+		int val = strintmap_get(&renames->relevant_sources[side],
+					p->one->path);
+		if (val == 0) {
+			assert(p->status == 'D');
+			strset_add(&renames->cached_irrelevant[side],
+				   p->one->path);
+		}
+		if (val <= 0)
+			return;
+	}
+	if (p->status == 'D') {
+		/*
+		 * If we already had this delete, we'll just set it's value
+		 * to NULL again, so no harm.
+		 */
+		strmap_put(&renames->cached_pairs[side], p->one->path, NULL);
+	} else if (p->status == 'R') {
+		if (!new_path)
+			new_path = p->two->path;
+		new_path = xstrdup(new_path);
+		old_value = strmap_put(&renames->cached_pairs[side],
+				       p->one->path, new_path);
+		strset_add(&renames->cached_target_names[side], new_path);
+		free(old_value);
+	} else if (p->status == 'A' && new_path) {
+		new_path = xstrdup(new_path);
+		old_value = strmap_put(&renames->cached_pairs[side],
+				       p->two->path, new_path);
+		strset_add(&renames->cached_target_names[side], new_path);
+		assert(!old_value);
+	}
+}
+
 static int compare_pairs(const void *a_, const void *b_)
 {
 	const struct diff_filepair *a = *((const struct diff_filepair **)a_);
@@ -2414,6 +2455,7 @@ static int collect_renames(struct merge_options *opt,
 		struct diff_filepair *p = side_pairs->queue[i];
 		char *new_path; /* non-NULL only with directory renames */
 
+		possibly_cache_new_pair(renames, p, side_index, NULL);
 		if (p->status != 'A' && p->status != 'R') {
 			diff_free_filepair(p);
 			continue;
@@ -2430,7 +2472,7 @@ static int collect_renames(struct merge_options *opt,
 			diff_free_filepair(p);
 			continue;
 		}
-
+		possibly_cache_new_pair(renames, p, side_index, new_path);
 		if (new_path)
 			apply_directory_rename_modifications(opt, p, new_path);
 
@@ -3709,8 +3751,16 @@ static void merge_start(struct merge_options *opt, struct merge_result *result)
 					 NULL, 1);
 		strmap_init_with_options(&renames->dir_renames[i],
 					 NULL, 0);
+		/*
+		 * relevant_sources uses -1 for the default, because we need
+		 * to be able to distinguish not-in-strintmap from valid
+		 * relevant_source values from enum file_rename_relevance.
+		 * In particular, possibly_cache_new_pair() expects a negative
+		 * value for not-found entries.
+		 */
 		strintmap_init_with_options(&renames->relevant_sources[i],
-					    0, NULL, 0);
+					    -1 /* explicitly invalid */,
+					    NULL, 0);
 		strmap_init_with_options(&renames->cached_pairs[i],
 					 NULL, 1);
 		strset_init_with_options(&renames->cached_irrelevant[i],
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 3/7] merge-ort: add code to check for whether cached renames can be reused
  2021-03-24 21:32 [PATCH 0/7] Optimization batch 11: avoid repeatedly detecting same renames Elijah Newren via GitGitGadget
  2021-03-24 21:32 ` [PATCH 1/7] merge-ort: add data structures for in-memory caching of rename detection Elijah Newren via GitGitGadget
  2021-03-24 21:32 ` [PATCH 2/7] merge-ort: populate caches of rename detection results Elijah Newren via GitGitGadget
@ 2021-03-24 21:32 ` Elijah Newren via GitGitGadget
  2021-03-24 21:32 ` [PATCH 4/7] merge-ort: avoid accidental API mis-use Elijah Newren via GitGitGadget
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 32+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-03-24 21:32 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

We need to know when renames detected in a previous merge operation can
be reused in a later merge operation.  Consider the following setup
(from the git-rebase manpage):

                     A---B---C topic
                    /
               D---E---F---G master

After rebasing, this will appear as:

                             A'--B'--C' topic
                            /
               D---E---F---G master

Further, let's say that 'oldfile' was renamed to 'newfile' between E
and G.  The rebase or cherry-pick of A onto G will involve a three-way
merge between E (as the merge base) and G and A.  After detecting the
rename between E:oldfile and G:newfile, there will be a three-way
content merge of the following:
    E:oldfile
    G:newfile
    A:oldfile
and produce a new result:
    A':newfile

Now, when we want to pick B onto A', we will need to do a three-way
merge between A (as the merge-base) and A' and B.  This will involve
a three-way content merge of
    A:oldfile
    A':newfile
    B:oldfile
but only if we can detect that A:oldfile is similar enough to A':newfile
to be used together in a three-way content merge, i.e. only if we can
detect that A:oldfile and A':newfile are a rename.  But we already know
that A:oldfile and A':newfile are similar enough to be used in a
three-way content merge, because that is precisely where A':newfile came
from in the previous merge.

Note that A & A' both appear in both merges.  That gives us the
condition under which we can reuse renames.

There are a couple important points about this optimization:

  - If the rebase or cherry-pick halts for user conflicts, these caches
    are NOT saved anywhere.  Thus, resuming a halted rebase or
    cherry-pick will result in no reused renames for the next commit.
    This is intentional, as user resolution can change files
    significantly and in ways that violate the similarity assumptions
    here.

  - Technically, in a *very* narrow case this might give slightly
    different results for rename detection.  Using the example above,
    if:
      * E:oldfile had 20 lines
      * G:newfile added 10 new lines at the beginning of the file
      * A:oldfile deleted all but the first three lines of the file
    then
      => A':newfile would have 13 lines, 3 of which matches those
         in A:oldfile.

    Consider the two cases:
      * Without this optimization:
        - the next step of the rebase operation (moving B to B')
          would not detect the rename betwen A:oldfile and A':newfile
        - we'd thus get a modify/delete conflict with the rebase
          operation halting for the user to resolve, and have both
          A':newfile and B:oldfile sitting in the working tree.
      * With this optimization:
        - the rename between A:oldfile and A':newfile would be detected
          via the cache of renames
        - a three-way merge between A:oldfile, A':newfile, and B:oldfile
          would commence and be written to A':newfile

    Now, is the difference in behavior a bug...or a bugfix?  I can't
    tell.  Given that A:oldfile and A':newfile are not very similar,
    when we three-way merge with B:oldfile it seems likely we'll hit a
    conflict for the user to resolve.  And it shouldn't be too hard for
    users to see why we did that three-way merge; oldfile and newfile
    *were* renames somewhere in the sequence.  So, most of these corner
    cases will still behave similarly -- namely, a conflict given to the
    user to resolve.  Also, consider the interesting case when commit B
    is a clean revert of commit A.  Without this optimization, a rebase
    could not both apply a weird patch like A and then immediately
    revert it; users would be forced to resolve merge conflicts.  With
    this optimization, it would successfully apply the clean revert.
    So, there is certainly at least one case that behaves better.  Even
    if it's considered a "difference in behavior", I think both behaviors
    are reasonable, and the time savings provided by this optimization
    justify using the slightly altered rename heuristics.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 66 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 64 insertions(+), 2 deletions(-)

diff --git a/merge-ort.c b/merge-ort.c
index 2303d88e6a92..bb47fa91a339 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -139,6 +139,30 @@ struct rename_info {
 	int callback_data_nr, callback_data_alloc;
 	char *callback_data_traverse_path;
 
+	/*
+	 * merge_trees: trees passed to the merge algorithm for the merge
+	 *
+	 * merge_trees records the trees passed to the merge algorithm.  But,
+	 * this data also is stored in merge_result->priv.  If a sequence of
+	 * merges are being done (such as when cherry-picking or rebasing),
+	 * the next merge can look at this and re-use information from
+	 * previous merges under certain cirumstances.
+	 *
+	 * See also all the cached_* variables.
+	 */
+	struct tree *merge_trees[3];
+
+	/*
+	 * cached_pairs_valid_side: which side's cached info can be reused
+	 *
+	 * See the description for merge_trees.  For repeated merges, at most
+	 * only one side's cached information can be used.  Valid values:
+	 *   MERGE_SIDE2: cached data from side2 can be reused
+	 *   MERGE_SIDE1: cached data from side1 can be reused
+	 *   0:           no cached data can be reused
+	 */
+	int cached_pairs_valid_side;
+
 	/*
 	 * cached_pairs: Caching of renames and deletions.
 	 *
@@ -461,6 +485,8 @@ static void clear_or_reinit_internal_opts(struct merge_options_internal *opti,
 		strmap_func(&renames->cached_pairs[i], 1);
 		strset_func(&renames->cached_irrelevant[i]);
 	}
+	renames->cached_pairs_valid_side = 0;
+	renames->dir_rename_mask = 0;
 
 	if (!reinitialize) {
 		struct hashmap_iter iter;
@@ -483,8 +509,6 @@ static void clear_or_reinit_internal_opts(struct merge_options_internal *opti,
 		strmap_clear(&opti->output, 0);
 	}
 
-	renames->dir_rename_mask = 0;
-
 	/* Clean out callback_data as well. */
 	FREE_AND_NULL(renames->callback_data);
 	renames->callback_data_nr = renames->callback_data_alloc = 0;
@@ -3792,6 +3816,35 @@ static void merge_start(struct merge_options *opt, struct merge_result *result)
 	trace2_region_leave("merge", "allocate/init", opt->repo);
 }
 
+static void merge_check_renames_reusable(struct merge_options *opt,
+					 struct merge_result *result,
+					 struct tree *merge_base,
+					 struct tree *side1,
+					 struct tree *side2)
+{
+	struct rename_info *renames;
+	struct tree **merge_trees;
+	struct merge_options_internal *opti = result->priv;
+
+	if (!opti)
+		return;
+
+	renames = &opti->renames;
+	merge_trees = renames->merge_trees;
+	/* merge_trees[0..2] will only be NULL if opti is */
+	assert(merge_trees[0] && merge_trees[1] && merge_trees[2]);
+
+	/* Check if we meet a condition for re-using cached_pairs */
+	if (     oideq(&merge_base->object.oid, &merge_trees[2]->object.oid) &&
+		 oideq(     &side1->object.oid, &result->tree->object.oid))
+		renames->cached_pairs_valid_side = MERGE_SIDE1;
+	else if (oideq(&merge_base->object.oid, &merge_trees[1]->object.oid) &&
+		 oideq(     &side2->object.oid, &result->tree->object.oid))
+		renames->cached_pairs_valid_side = MERGE_SIDE2;
+	else
+		renames->cached_pairs_valid_side = 0; /* neither side valid */
+}
+
 /*** Function Grouping: merge_incore_*() and their internal variants ***/
 
 /*
@@ -3939,7 +3992,16 @@ void merge_incore_nonrecursive(struct merge_options *opt,
 
 	trace2_region_enter("merge", "merge_start", opt->repo);
 	assert(opt->ancestor != NULL);
+	merge_check_renames_reusable(opt, result, merge_base, side1, side2);
 	merge_start(opt, result);
+	/*
+	 * Record the trees used in this merge, so if there's a next merge in
+	 * a cherry-pick or rebase sequence it might be able to take advantage
+	 * of the cached_pairs in that next merge.
+	 */
+	opt->priv->renames.merge_trees[0] = merge_base;
+	opt->priv->renames.merge_trees[1] = side1;
+	opt->priv->renames.merge_trees[2] = side2;
 	trace2_region_leave("merge", "merge_start", opt->repo);
 
 	merge_ort_nonrecursive_internal(opt, merge_base, side1, side2, result);
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 4/7] merge-ort: avoid accidental API mis-use
  2021-03-24 21:32 [PATCH 0/7] Optimization batch 11: avoid repeatedly detecting same renames Elijah Newren via GitGitGadget
                   ` (2 preceding siblings ...)
  2021-03-24 21:32 ` [PATCH 3/7] merge-ort: add code to check for whether cached renames can be reused Elijah Newren via GitGitGadget
@ 2021-03-24 21:32 ` Elijah Newren via GitGitGadget
  2021-03-24 21:32 ` [PATCH 5/7] merge-ort: preserve cached renames for the appropriate side Elijah Newren via GitGitGadget
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 32+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-03-24 21:32 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

Previously, callers of the merge-ort API could have passed an
uninitialized value for struct merge_result *result.  However, we want
to check result to see if it has cached renames from a previous merge
that we can reuse; such values would be found behind result->priv.
However, if result->priv is uninitialized, attempting to access behind
it will give a segfault.  So, we need result->priv to be NULL (which
will be the case if the caller does a memset(&result, 0)), or be written
by a previous call to the merge-ort machinery.  Documenting this
requirement may help, but despite being the person who introduced this
requirement, I still missed it once and it did not fail in a very clear
way and led to a long debugging session.

Add a _properly_initialized field to merge_result; that value will be
0 if the caller zero'ed the merge_result, it will be set to a very
specific value by a previous run by the merge-ort machinery, and if it's
uninitialized it will most likely either be 0 or some value that does
not match the specific one we'd expect allowing us to throw a much more
meaningful error.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 7 +++++++
 merge-ort.h | 2 ++
 2 files changed, 9 insertions(+)

diff --git a/merge-ort.c b/merge-ort.c
index bb47fa91a339..5d56b0f90128 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -52,6 +52,8 @@ enum merge_side {
 	MERGE_SIDE2 = 2
 };
 
+static unsigned RESULT_INITIALIZED = 0x1abe11ed; /* unlikely accidental value */
+
 struct traversal_callback_data {
 	unsigned long mask;
 	unsigned long dirmask;
@@ -3736,6 +3738,10 @@ static void merge_start(struct merge_options *opt, struct merge_result *result)
 	assert(opt->obuf.len == 0);
 
 	assert(opt->priv == NULL);
+	if (result->_properly_initialized != 0 &&
+	    result->_properly_initialized != RESULT_INITIALIZED)
+		BUG("struct merge_result passed to merge_incore_*recursive() must be zeroed or filled with values from a previous run");
+	assert(!!result->priv == !!result->_properly_initialized);
 	if (result->priv) {
 		opt->priv = result->priv;
 		result->priv = NULL;
@@ -3895,6 +3901,7 @@ static void merge_ort_nonrecursive_internal(struct merge_options *opt,
 	result->clean &= strmap_empty(&opt->priv->conflicted);
 	if (!opt->priv->call_depth) {
 		result->priv = opt->priv;
+		result->_properly_initialized = RESULT_INITIALIZED;
 		opt->priv = NULL;
 	}
 }
diff --git a/merge-ort.h b/merge-ort.h
index d53a0a339f33..c011864ffeb1 100644
--- a/merge-ort.h
+++ b/merge-ort.h
@@ -29,6 +29,8 @@ struct merge_result {
 	 * !clean) and to print "CONFLICT" messages.  Not for external use.
 	 */
 	void *priv;
+	/* Also private */
+	unsigned _properly_initialized;
 };
 
 /*
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 5/7] merge-ort: preserve cached renames for the appropriate side
  2021-03-24 21:32 [PATCH 0/7] Optimization batch 11: avoid repeatedly detecting same renames Elijah Newren via GitGitGadget
                   ` (3 preceding siblings ...)
  2021-03-24 21:32 ` [PATCH 4/7] merge-ort: avoid accidental API mis-use Elijah Newren via GitGitGadget
@ 2021-03-24 21:32 ` Elijah Newren via GitGitGadget
  2021-03-24 21:32 ` [PATCH 6/7] merge-ort: add helper functions for using cached renames Elijah Newren via GitGitGadget
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 32+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-03-24 21:32 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

Previous commits created an in-memory cache of the results of rename
detection, and added logic to detect when that cache could appropriately
be used in a subsequent merge operation -- but we were still
unconditionally clearing the cache with each new merge operation anyway.
If it is valid to reuse the cache from one of the two sides of history,
preserve that side.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/merge-ort.c b/merge-ort.c
index 5d56b0f90128..0dffd65ee1a3 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -475,17 +475,18 @@ static void clear_or_reinit_internal_opts(struct merge_options_internal *opti,
 	/* Free memory used by various renames maps */
 	for (i = MERGE_SIDE1; i <= MERGE_SIDE2; ++i) {
 		strintmap_func(&renames->dirs_removed[i]);
-
-		partial_clear_dir_rename_count(&renames->dir_rename_count[i]);
-		if (!reinitialize)
-			strmap_clear(&renames->dir_rename_count[i], 1);
-
 		strmap_func(&renames->dir_renames[i], 0);
-
 		strintmap_func(&renames->relevant_sources[i]);
-		strset_func(&renames->cached_target_names[i]);
-		strmap_func(&renames->cached_pairs[i], 1);
-		strset_func(&renames->cached_irrelevant[i]);
+		if (!reinitialize)
+			assert(renames->cached_pairs_valid_side == 0);
+		if (i != renames->cached_pairs_valid_side) {
+			strset_func(&renames->cached_target_names[i]);
+			strmap_func(&renames->cached_pairs[i], 1);
+			strset_func(&renames->cached_irrelevant[i]);
+			partial_clear_dir_rename_count(&renames->dir_rename_count[i]);
+			if (!reinitialize)
+				strmap_clear(&renames->dir_rename_count[i], 1);
+		}
 	}
 	renames->cached_pairs_valid_side = 0;
 	renames->dir_rename_mask = 0;
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 6/7] merge-ort: add helper functions for using cached renames
  2021-03-24 21:32 [PATCH 0/7] Optimization batch 11: avoid repeatedly detecting same renames Elijah Newren via GitGitGadget
                   ` (4 preceding siblings ...)
  2021-03-24 21:32 ` [PATCH 5/7] merge-ort: preserve cached renames for the appropriate side Elijah Newren via GitGitGadget
@ 2021-03-24 21:32 ` Elijah Newren via GitGitGadget
  2021-03-24 21:32 ` [PATCH 7/7] merge-ort, diffcore-rename: employ cached renames when possible Elijah Newren via GitGitGadget
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 32+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-03-24 21:32 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

If we have a usable rename cache, then we can remove from
relevant_sources all the paths that were cached;
diffcore_rename_extended() can then consider an even smaller set of
relevant_sources in its rename detection.

However, when diffcore_rename_extended() is done, we will need to take
the renames it detected and then add back in all the ones we had cached
from before.

Add helper functions for doing these two operations; the next commit
will make use of them.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)

diff --git a/merge-ort.c b/merge-ort.c
index 0dffd65ee1a3..8d0353ffbca2 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -2360,6 +2360,53 @@ static void resolve_diffpair_statuses(struct diff_queue_struct *q)
 	}
 }
 
+MAYBE_UNUSED
+static void prune_cached_from_relevant(struct rename_info *renames,
+				       unsigned side)
+{
+	/* Reason for this function described in add_pair() */
+	struct hashmap_iter iter;
+	struct strmap_entry *entry;
+
+	/* Remove from relevant_sources all entries in cached_pairs[side] */
+	strmap_for_each_entry(&renames->cached_pairs[side], &iter, entry) {
+		strintmap_remove(&renames->relevant_sources[side],
+				 entry->key);
+	}
+	/* Remove from relevant_sources all entries in cached_irrelevant[side] */
+	strset_for_each_entry(&renames->cached_irrelevant[side], &iter, entry) {
+		strintmap_remove(&renames->relevant_sources[side],
+				 entry->key);
+	}
+}
+
+MAYBE_UNUSED
+static void use_cached_pairs(struct merge_options *opt,
+			     struct strmap *cached_pairs,
+			     struct diff_queue_struct *pairs)
+{
+	struct hashmap_iter iter;
+	struct strmap_entry *entry;
+
+	/*
+	 * Add to side_pairs all entries from renames->cached_pairs[side_index].
+	 * (Info in cached_irrelevant[side_index] is not relevant here.)
+	 */
+	strmap_for_each_entry(cached_pairs, &iter, entry) {
+		struct diff_filespec *one, *two;
+		const char *old_name = entry->key;
+		const char *new_name = entry->value;
+		if (!new_name)
+			new_name = old_name;
+
+		/* We don't care about oid/mode, only filenames and status */
+		one = alloc_filespec(old_name);
+		two = alloc_filespec(new_name);
+		diff_queue(pairs, one, two);
+		pairs->queue[pairs->nr-1]->status = entry->value ? 'R' : 'D';
+	}
+}
+
 static void possibly_cache_new_pair(struct rename_info *renames,
 				    struct diff_filepair *p,
 				    unsigned side,
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 7/7] merge-ort, diffcore-rename: employ cached renames when possible
  2021-03-24 21:32 [PATCH 0/7] Optimization batch 11: avoid repeatedly detecting same renames Elijah Newren via GitGitGadget
                   ` (5 preceding siblings ...)
  2021-03-24 21:32 ` [PATCH 6/7] merge-ort: add helper functions for using cached renames Elijah Newren via GitGitGadget
@ 2021-03-24 21:32 ` Elijah Newren via GitGitGadget
  2021-03-24 22:04 ` [PATCH 0/7] Optimization batch 11: avoid repeatedly detecting same renames Junio C Hamano
  2021-05-04  2:12 ` [PATCH v2 00/13] " Elijah Newren via GitGitGadget
  8 siblings, 0 replies; 32+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-03-24 21:32 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

When there are many renames between the old base of a series of commits
and the new base, the way sequencer.c, merge-recursive.c, and
diffcore-rename.c have traditionally split the work resulted in
redetecting the same renames with each and every commit being
transplanted.  To address this, the last several commits have been
creating a cache of rename detection results, determining when it was
safe to use such a cache in subsequent merge operations, adding helper
functions, and so on.  See the previous half dozen commit messages for
additional discussion of this optimization, particularly the message a
few commits ago entitled "add code to check for whether cached renames
can be reused".  This commit finally ties all of that work together,
modifying the merge algorithm to make use of these cached renames.

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:

                            Before                  After
    no-renames:        5.665 s ±  0.129 s     5.624 s ±  0.077 s
    mega-renames:     11.435 s ±  0.158 s    10.213 s ±  0.032 s
    just-one-mega:   494.2  ms ±  6.1  ms   497.6  ms ±  5.3  ms

That's a fairly small improvement, but mostly because the previous
optimizations were so effective for these particular testcases; this
optimization only kicks in when the others don't.  If we undid the
basename-guided rename detection and skip-irrelevant-renames
optimizations, then we'd see that this series by itself improved
performance as follows:

                   Before Basename Series   After Just This Series
    no-renames:      13.815 s ±  0.062 s      5.814 s ±  0.094 s
    mega-renames:  1799.937 s ±  0.493 s    303.225 s ±  1.330 s

Since this optimization kicks in to help accelerate cases where the
previous optimizations do not apply, this last comparison shows that
this cached-renames optimization has the potential to help signficantly
in cases that don't meet the requirements for the other optimizations to
be effective.

The changes made in this optimization also lay some important groundwork
for a future optimization around having collect_merge_info() avoid
recursing into subtrees in more cases.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 22 ++++++++++++++++++----
 diffcore.h        |  3 ++-
 merge-ort.c       | 48 ++++++++++++++++++++++++++++++++++++++++++-----
 3 files changed, 63 insertions(+), 10 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 7cc24592617e..dfbe65c917e9 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -568,7 +568,8 @@ static void update_dir_rename_counts(struct dir_rename_info *info,
 static void initialize_dir_rename_info(struct dir_rename_info *info,
 				       struct strintmap *relevant_sources,
 				       struct strintmap *dirs_removed,
-				       struct strmap *dir_rename_count)
+				       struct strmap *dir_rename_count,
+				       struct strmap *cached_pairs)
 {
 	struct hashmap_iter iter;
 	struct strmap_entry *entry;
@@ -633,6 +634,17 @@ static void initialize_dir_rename_info(struct dir_rename_info *info,
 					 rename_dst[i].p->two->path);
 	}
 
+	/* Add cached_pairs to counts */
+	strmap_for_each_entry(cached_pairs, &iter, entry) {
+		const char *old_name = entry->key;
+		const char *new_name = entry->value;
+		if (!new_name)
+			/* known delete; ignore it */
+			continue;
+
+		update_dir_rename_counts(info, dirs_removed, old_name, new_name);
+	}
+
 	/*
 	 * Now we collapse
 	 *    dir_rename_count: old_directory -> {new_directory -> count}
@@ -1247,7 +1259,8 @@ static void handle_early_known_dir_renames(struct dir_rename_info *info,
 void diffcore_rename_extended(struct diff_options *options,
 			      struct strintmap *relevant_sources,
 			      struct strintmap *dirs_removed,
-			      struct strmap *dir_rename_count)
+			      struct strmap *dir_rename_count,
+			      struct strmap *cached_pairs)
 {
 	int detect_rename = options->detect_rename;
 	int minimum_score = options->rename_score;
@@ -1363,7 +1376,8 @@ void diffcore_rename_extended(struct diff_options *options,
 		/* Preparation for basename-driven matching. */
 		trace2_region_enter("diff", "dir rename setup", options->repo);
 		initialize_dir_rename_info(&info, relevant_sources,
-					   dirs_removed, dir_rename_count);
+					   dirs_removed, dir_rename_count,
+					   cached_pairs);
 		trace2_region_leave("diff", "dir rename setup", options->repo);
 
 		/* Utilize file basenames to quickly find renames. */
@@ -1561,5 +1575,5 @@ void diffcore_rename_extended(struct diff_options *options,
 
 void diffcore_rename(struct diff_options *options)
 {
-	diffcore_rename_extended(options, NULL, NULL, NULL);
+	diffcore_rename_extended(options, NULL, NULL, NULL, NULL);
 }
diff --git a/diffcore.h b/diffcore.h
index cf8d4cb8617d..de01e64becaf 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -181,7 +181,8 @@ void diffcore_rename(struct diff_options *);
 void diffcore_rename_extended(struct diff_options *options,
 			      struct strintmap *relevant_sources,
 			      struct strintmap *dirs_removed,
-			      struct strmap *dir_rename_count);
+			      struct strmap *dir_rename_count,
+			      struct strmap *cached_pairs);
 void diffcore_merge_broken(void);
 void diffcore_pickaxe(struct diff_options *);
 void diffcore_order(const char *orderfile);
diff --git a/merge-ort.c b/merge-ort.c
index 8d0353ffbca2..719222aa4364 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -752,15 +752,48 @@ static void add_pair(struct merge_options *opt,
 	struct rename_info *renames = &opt->priv->renames;
 	int names_idx = is_add ? side : 0;
 
-	if (!is_add) {
+	if (is_add) {
+		if (strset_contains(&renames->cached_target_names[side],
+				    pathname))
+			return;
+	} else {
 		unsigned content_relevant = (match_mask == 0);
 		unsigned location_relevant = (dir_rename_mask == 0x07);
 
+		/*
+		 * If pathname is found in cached_irrelevant[side] due to
+		 * previous pick but for this commit content is relevant,
+		 * then we need to remove it from cached_irrelevant.
+		 */
+		if (content_relevant)
+			/* strset_remove is no-op if strset doesn't have key */
+			strset_remove(&renames->cached_irrelevant[side],
+				      pathname);
+
+		/*
+		 * We do not need to re-detect renames for paths that we already
+		 * know the pairing, i.e. for cached_pairs (or
+		 * cached_irrelevant).  However, handle_deferred_entries() needs
+		 * to loop over the union of keys from relevant_sources[side] and
+		 * cached_pairs[side], so for simplicity we set relevant_sources
+		 * for all the cached_pairs too and then strip them back out in
+		 * prune_cached_from_relevant() at the beginning of
+		 * detect_regular_renames().
+		 */
 		if (content_relevant || location_relevant) {
 			/* content_relevant trumps location_relevant */
 			strintmap_set(&renames->relevant_sources[side], pathname,
 				      content_relevant ? RELEVANT_CONTENT : RELEVANT_LOCATION);
 		}
+
+		/*
+		 * Avoid creating pair if we've already cached rename results.
+		 * Note that we do this after setting relevant_sources[side]
+		 * as noted in the comment above.
+		 */
+		if (strmap_contains(&renames->cached_pairs[side], pathname) ||
+		    strset_contains(&renames->cached_irrelevant[side], pathname))
+			return;
 	}
 
 	one = alloc_filespec(pathname);
@@ -2336,7 +2369,9 @@ static inline int possible_side_renames(struct rename_info *renames,
 static inline int possible_renames(struct rename_info *renames)
 {
 	return possible_side_renames(renames, 1) ||
-	       possible_side_renames(renames, 2);
+	       possible_side_renames(renames, 2) ||
+	       !strmap_empty(&renames->cached_pairs[1]) ||
+	       !strmap_empty(&renames->cached_pairs[2]);
 }
 
 static void resolve_diffpair_statuses(struct diff_queue_struct *q)
@@ -2360,7 +2395,6 @@ static void resolve_diffpair_statuses(struct diff_queue_struct *q)
 	}
 }
 
-MAYBE_UNUSED
 static void prune_cached_from_relevant(struct rename_info *renames,
 				       unsigned side)
 {
@@ -2380,7 +2414,6 @@ static void prune_cached_from_relevant(struct rename_info *renames,
 	}
 }
 
-MAYBE_UNUSED
 static void use_cached_pairs(struct merge_options *opt,
 			     struct strmap *cached_pairs,
 			     struct diff_queue_struct *pairs)
@@ -2463,6 +2496,7 @@ static void detect_regular_renames(struct merge_options *opt,
 	struct diff_options diff_opts;
 	struct rename_info *renames = &opt->priv->renames;
 
+	prune_cached_from_relevant(renames, side_index);
 	if (!possible_side_renames(renames, side_index)) {
 		/*
 		 * No rename detection needed for this side, but we still need
@@ -2473,6 +2507,7 @@ static void detect_regular_renames(struct merge_options *opt,
 		return;
 	}
 
+	partial_clear_dir_rename_count(&renames->dir_rename_count[side_index]);
 	repo_diff_setup(opt->repo, &diff_opts);
 	diff_opts.flags.recursive = 1;
 	diff_opts.flags.rename_empty = 0;
@@ -2490,7 +2525,8 @@ static void detect_regular_renames(struct merge_options *opt,
 	diffcore_rename_extended(&diff_opts,
 				 &renames->relevant_sources[side_index],
 				 &renames->dirs_removed[side_index],
-				 &renames->dir_rename_count[side_index]);
+				 &renames->dir_rename_count[side_index],
+				 &renames->cached_pairs[side_index]);
 	trace2_region_leave("diff", "diffcore_rename", opt->repo);
 	resolve_diffpair_statuses(&diff_queued_diff);
 
@@ -2597,6 +2633,8 @@ static int detect_and_process_renames(struct merge_options *opt,
 	trace2_region_enter("merge", "regular renames", opt->repo);
 	detect_regular_renames(opt, MERGE_SIDE1);
 	detect_regular_renames(opt, MERGE_SIDE2);
+	use_cached_pairs(opt, &renames->cached_pairs[1], &renames->pairs[1]);
+	use_cached_pairs(opt, &renames->cached_pairs[2], &renames->pairs[2]);
 	trace2_region_leave("merge", "regular renames", opt->repo);
 
 	trace2_region_enter("merge", "directory renames", opt->repo);
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/7] Optimization batch 11: avoid repeatedly detecting same renames
  2021-03-24 21:32 [PATCH 0/7] Optimization batch 11: avoid repeatedly detecting same renames Elijah Newren via GitGitGadget
                   ` (6 preceding siblings ...)
  2021-03-24 21:32 ` [PATCH 7/7] merge-ort, diffcore-rename: employ cached renames when possible Elijah Newren via GitGitGadget
@ 2021-03-24 22:04 ` Junio C Hamano
  2021-03-24 23:25   ` Elijah Newren
  2021-05-04  2:12 ` [PATCH v2 00/13] " Elijah Newren via GitGitGadget
  8 siblings, 1 reply; 32+ messages in thread
From: Junio C Hamano @ 2021-03-24 22:04 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget
  Cc: git, Derrick Stolee, Jonathan Tan, Taylor Blau, Elijah Newren

"Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:

> === Basic Optimization idea ===
>
> This series avoids repeatedly detecting the same renames in a sequence of
> merges such as a rebase or cherry-pick of several commits. When there are
> many renames between the old base and the new base, traditionally all those
> renames are re-detected for every commit that is transplanted. This
> optimization avoids redoing that work.

Unless this section is easily understandable, the readers have no
incentive to read on, but the above is a bit too hand wavy.

> This one adds a fourth (remember-renames), with some interesting properties:
>
>  * unlike basename-guided rename detection, there are no behavioral changes
>    (there is no heuristic involved)[2].
>
>  * like skip-because-irrelevant, this optimization does not apply to all git
>    commands using the rename machinery. In fact, this one is even more
>    restrictive since it is ONLY useful for rebases and cherry-picks (not
>    even merges), and only for second and later commits in a linear series.

So, is it correct to understand that one case this would help is
this scenario?

 ---o---o---o---X---o---o---o---O ours
     \
      A---B---C topic

where there is a side branch A--B--C that touched some files, while
on our side, there is a commit X that is unknown to the side branch
that renamed these files.  Now we want to transplant the side topic
to the tip of our history, replaying the changes A--B--C made to
these files under their original name to the corresponding files
that have been renamed.

And each step in this "rebase" is a 3-way merge of commits A, B and
C onto HEAD, using the parent of the commit being cherrk-picked as a
virtual common ancestor.  Which means

 - To transplant A (i.e. the first step), we'd compare the diff of
   A^..O (i.e. what our side did, including the renames done at X)
   and diff of A^..A (i.e. what the first commit did in the range),
   and the former does quite a lot of rename detection.

 - After transplanting B (i.e. the second step), then we'd compare
   the diff of A^..A' (where A' is A cherry-picked on O, i.e. the
   result of the previous step).  If we are lucky, O..A' did not
   rename anything so the renames done in A^..O (i.e. what we
   detected during the first step) and A^..A' (i.e. what we should
   be computing for this second step) should be quite similar.

   If we assume that the "quite similar" is good enough, then we can
   blindly reuse the record of "<path in A^> correspnds to <path in
   O>" as if it were "<path in A^> corresponds to <path in A'>".

 - Do the same for C, pretending that renames discovered between A^
   and O is identical to the renames between A^ and B' (i.e. the
   result of cherry-picking A--B on top of O).


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/7] Optimization batch 11: avoid repeatedly detecting same renames
  2021-03-24 22:04 ` [PATCH 0/7] Optimization batch 11: avoid repeatedly detecting same renames Junio C Hamano
@ 2021-03-24 23:25   ` Elijah Newren
  2021-03-25 18:59     ` Junio C Hamano
  0 siblings, 1 reply; 32+ messages in thread
From: Elijah Newren @ 2021-03-24 23:25 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee,
	Jonathan Tan, Taylor Blau

On Wed, Mar 24, 2021 at 3:04 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > === Basic Optimization idea ===
> >
> > This series avoids repeatedly detecting the same renames in a sequence of
> > merges such as a rebase or cherry-pick of several commits. When there are
> > many renames between the old base and the new base, traditionally all those
> > renames are re-detected for every commit that is transplanted. This
> > optimization avoids redoing that work.
>
> Unless this section is easily understandable, the readers have no
> incentive to read on, but the above is a bit too hand wavy.

Oh, yeah, it's very hand wavy.  I figured the commit messages were the
right place to include the details, and just wanted to give a flavor
of the idea in the cover letter.

> > This one adds a fourth (remember-renames), with some interesting properties:
> >
> >  * unlike basename-guided rename detection, there are no behavioral changes
> >    (there is no heuristic involved)[2].
> >
> >  * like skip-because-irrelevant, this optimization does not apply to all git
> >    commands using the rename machinery. In fact, this one is even more
> >    restrictive since it is ONLY useful for rebases and cherry-picks (not
> >    even merges), and only for second and later commits in a linear series.
>
> So, is it correct to understand that one case this would help is
> this scenario?
>
>  ---o---o---o---X---o---o---o---O ours
>      \
>       A---B---C topic
>
> where there is a side branch A--B--C that touched some files, while
> on our side, there is a commit X that is unknown to the side branch
> that renamed these files.  Now we want to transplant the side topic
> to the tip of our history, replaying the changes A--B--C made to
> these files under their original name to the corresponding files
> that have been renamed.
>
> And each step in this "rebase" is a 3-way merge of commits A, B and
> C onto HEAD, using the parent of the commit being cherrk-picked as a
> virtual common ancestor.  Which means

You generated nearly the same description and diagram I used in the
commit message (the one in 3/7) describing this.  :-)

>  - To transplant A (i.e. the first step), we'd compare the diff of
>    A^..O (i.e. what our side did, including the renames done at X)
>    and diff of A^..A (i.e. what the first commit did in the range),
>    and the former does quite a lot of rename detection.
>
>  - After transplanting B (i.e. the second step), then we'd compare
>    the diff of A^..A' (where A' is A cherry-picked on O, i.e. the

Close, but for transplanting B we do the diff of B^..A', not A^...A'.
(And in this diagram, B^ is A.)  That's critical below...

>    result of the previous step).  If we are lucky, O..A' did not
>    rename anything so the renames done in A^..O (i.e. what we
>    detected during the first step) and A^..A' (i.e. what we should
>    be computing for this second step) should be quite similar.

Again, B^..A' rather than A^..A'.

Luck is not involved here.  If O..A' did rename anything, it's one of
two reasons:

- There were conflicts when trying to transplant A, and when we stop
for conflict resolution, the user added some renames at that point.
- There were renames in A^..A.

In the first case, the presence of conflicts means we drop the cache
and this optimization doesn't try to kick in.  In the second case,
those renames in A' came from A.  Even without this optimization,
since those renames in A' came from A, doing rename detection on A..A'
wouldn't re-detect them and transplanting wouldn't try to reapply
them, so they just aren't relevant anymore -- with or without this
optimization.

>    If we assume that the "quite similar" is good enough, then we can
>    blindly reuse the record of "<path in A^> correspnds to <path in
>    O>" as if it were "<path in A^> corresponds to <path in A'>".

Again, B^ rather than A^ on the last line.

I disagree with the use of the term "blindly" here.  As spelled out in
the third commit message, the transplant of A involved a three-way
content merge of the form:
    A^:oldfile
    O:newfile
    A:oldfile
and produce a new result:
    A':newfile

The point of rename detection is to determine what files are similar
enough to use in a three-way content merge.  In particular, we'd use
rename detection when transplanting B to notice the oldfile -> newfile
rename so that we can do a three-way content merge of the form:
    A:oldfile
    A':newfile
    B:oldfile
and produce a new result:
    B':newfile

But, instead of asking rename detection whether A:oldfile and
A':newfile are similar enough to use together in a three-way content
merge, we could ask ourselves -- do we have any _other_ reason to
believe these files are similar enough to be used in a three-way
content merge?  And the answer that comes back is: these files were
*already* involved in the same three-way content merge -- the one that
A':newfile came from.  It was a three-way content merge with no
conflicts.  (Because when conflicts are triggered we turn this
optimization off.)

>  - Do the same for C, pretending that renames discovered between A^
>    and O is identical to the renames between A^ and B' (i.e. the
>    result of cherry-picking A--B on top of O).

Now you've changed your off-by-one mistake to an off-by-two mistake;
the rename detection is between C^ and B', not A^ and B'.  I think
this error might be critical to why you used terms like "pretend" and
"blindly" and "lucky".  I agree that it would require
luck/blindness/pretending to assume that the renames between A^ and O
are identical to those between A^ and B', but that's not what the
original algorithm would have been using for computing renames; it
would be using C^ and B'.

It's actually quite difficult to generate a case where this
optimization gets a possibly different result.  It requires there were
changes to the content on both sides of history that merge cleanly,
and in particular that need a significant size reduction of the file
by the unrenamed side of history.  If you take the changes on the
*renamed* side of history, which represent <50% changes since it was
detected as a rename, those same changes need to represent a >50%
change when applied to the smaller file.  This is discussed in the
third commit message, as noted in the cover letter:

>> [2] Well, almost no changes. There's technically a very narrow way that this
>> could change the behavior; see the really long "Technically," bullet point
>> in patch 3 for discussion of this.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/7] Optimization batch 11: avoid repeatedly detecting same renames
  2021-03-24 23:25   ` Elijah Newren
@ 2021-03-25 18:59     ` Junio C Hamano
  2021-03-29 22:34       ` Elijah Newren
  0 siblings, 1 reply; 32+ messages in thread
From: Junio C Hamano @ 2021-03-25 18:59 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee,
	Jonathan Tan, Taylor Blau

Elijah Newren <newren@gmail.com> writes:

>> And each step in this "rebase" is a 3-way merge of commits A, B and
>> C onto HEAD, using the parent of the commit being cherrk-picked as a
>> virtual common ancestor.  Which means
>
> You generated nearly the same description and diagram I used in the
> commit message (the one in 3/7) describing this.  :-)
>
>>  - To transplant A (i.e. the first step), we'd compare the diff of
>>    A^..O (i.e. what our side did, including the renames done at X)
>>    and diff of A^..A (i.e. what the first commit did in the range),
>>    and the former does quite a lot of rename detection.
>>
>>  - After transplanting B (i.e. the second step), then we'd compare
>>    the diff of A^..A' (where A' is A cherry-picked on O, i.e. the
>
> Close, but for transplanting B we do the diff of B^..A', not A^...A'.
> (And in this diagram, B^ is A.)  That's critical below...

Yes, I upfront said "pretend that the parent of the commit being
picked is the common ancestor and run 3-way merge", but then got
confused by the ancestry graph myself, forgetting that the reason
why A^ is used in the first "pick" is *not* because the it is the
fork point of our history and the side branch, but it is because it
is A's parent.

And if the renames in B^..A' and A^..A' are different that must have
come only from the difference between A..B (which is B^..B), but
that comparison is what we do when cherry-picking B on top of A',
so it is easy to take into account to reuse the renames precisely
without "assuming they are the same".

Thanks.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/7] Optimization batch 11: avoid repeatedly detecting same renames
  2021-03-25 18:59     ` Junio C Hamano
@ 2021-03-29 22:34       ` Elijah Newren
  2021-03-30 12:07         ` Derrick Stolee
  0 siblings, 1 reply; 32+ messages in thread
From: Elijah Newren @ 2021-03-29 22:34 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee,
	Jonathan Tan, Taylor Blau

On Thu, Mar 25, 2021 at 12:00 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> >> And each step in this "rebase" is a 3-way merge of commits A, B and
> >> C onto HEAD, using the parent of the commit being cherrk-picked as a
> >> virtual common ancestor.  Which means
> >
> > You generated nearly the same description and diagram I used in the
> > commit message (the one in 3/7) describing this.  :-)
> >
> >>  - To transplant A (i.e. the first step), we'd compare the diff of
> >>    A^..O (i.e. what our side did, including the renames done at X)
> >>    and diff of A^..A (i.e. what the first commit did in the range),
> >>    and the former does quite a lot of rename detection.
> >>
> >>  - After transplanting B (i.e. the second step), then we'd compare
> >>    the diff of A^..A' (where A' is A cherry-picked on O, i.e. the
> >
> > Close, but for transplanting B we do the diff of B^..A', not A^...A'.
> > (And in this diagram, B^ is A.)  That's critical below...
>
> Yes, I upfront said "pretend that the parent of the commit being
> picked is the common ancestor and run 3-way merge", but then got
> confused by the ancestry graph myself, forgetting that the reason
> why A^ is used in the first "pick" is *not* because the it is the
> fork point of our history and the side branch, but it is because it
> is A's parent.
>
> And if the renames in B^..A' and A^..A' are different that must have
> come only from the difference between A..B (which is B^..B), but
> that comparison is what we do when cherry-picking B on top of A',
> so it is easy to take into account to reuse the renames precisely
> without "assuming they are the same".
>
> Thanks.

After sending the initial series, I decided to type up a more thorough
document that
  * spelled out in more detail how the sequence of cherry-picks work
  * proved why the renames in one pick are always a superset of the
renames in the next
  * proved why the renames in one pick are _almost_ always also a
rename in the next
  * discussed the counterexample cases in more detail, and why the
optimization is still reasonable
I figured the more extended document would be useful in case people
decide to change how things work in the future (e.g. what if someone
wants to turn on break detection?), and wants to be able to check
whether all the conditions and cases still hold.

I then also added details about how things work with directory
renames, in the case that merge.directoryRenames is not the default of
"conflict" (which is trivially handled by stopping and dropping the
cache) but is set to true...and found a case that needed more care due
to interactions with some of the earlier optimizations.  (The earlier
optimizations could result in bypassing directory rename detection in
one merge because there was no file added to the old directory, but
the no-directory-rename would be cached for subsequent rebases.)

So I need to get that fixed up and resubmit this series.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 0/7] Optimization batch 11: avoid repeatedly detecting same renames
  2021-03-29 22:34       ` Elijah Newren
@ 2021-03-30 12:07         ` Derrick Stolee
  0 siblings, 0 replies; 32+ messages in thread
From: Derrick Stolee @ 2021-03-30 12:07 UTC (permalink / raw)
  To: Elijah Newren, Junio C Hamano
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee,
	Jonathan Tan, Taylor Blau

On 3/29/2021 6:34 PM, Elijah Newren wrote:...
> After sending the initial series, I decided to type up a more thorough
> document that
>   * spelled out in more detail how the sequence of cherry-picks work
>   * proved why the renames in one pick are always a superset of the
> renames in the next
>   * proved why the renames in one pick are _almost_ always also a
> rename in the next
>   * discussed the counterexample cases in more detail, and why the
> optimization is still reasonable
> I figured the more extended document would be useful in case people
> decide to change how things work in the future (e.g. what if someone
> wants to turn on break detection?), and wants to be able to check
> whether all the conditions and cases still hold.
> 
> I then also added details about how things work with directory
> renames, in the case that merge.directoryRenames is not the default of
> "conflict" (which is trivially handled by stopping and dropping the
> cache) but is set to true...and found a case that needed more care due
> to interactions with some of the earlier optimizations.  (The earlier
> optimizations could result in bypassing directory rename detection in
> one merge because there was no file added to the old directory, but
> the no-directory-rename would be cached for subsequent rebases.)
> 
> So I need to get that fixed up and resubmit this series.

I look forward to that document. I attempted reading this series
yesterday, but did not have the mental energy to convince myself
of the correctness (because of things like not knowing this logic
that you plan to document). Instead, I'll promise to give round 2
a quicker response.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v2 00/13] Optimization batch 11: avoid repeatedly detecting same renames
  2021-03-24 21:32 [PATCH 0/7] Optimization batch 11: avoid repeatedly detecting same renames Elijah Newren via GitGitGadget
                   ` (7 preceding siblings ...)
  2021-03-24 22:04 ` [PATCH 0/7] Optimization batch 11: avoid repeatedly detecting same renames Junio C Hamano
@ 2021-05-04  2:12 ` Elijah Newren via GitGitGadget
  2021-05-04  2:12   ` [PATCH v2 01/13] t6423: rename file within directory that other side renamed Elijah Newren via GitGitGadget
                     ` (13 more replies)
  8 siblings, 14 replies; 32+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-05-04  2:12 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Elijah Newren,
	Derrick Stolee, Elijah Newren

This series avoids repeatedly detecting the same renames in a sequence of
merges such as a rebase or cherry-pick of several commits. It's
unfortunately become a bit lengthy, but much of the length (the first five
patches) is owed to special testcases and documentation.

Changes since v1:

 * Found and fixed a few bugs affecting merge.directoryRenames=true, one of
   which would have caused excessive rename detection runs (not caching
   things right), and another that would cause conflicts to be reported when
   the merge should be able to succeed.

 * Updated timings. The speedups are approximately the same as in v1, but
   are slightly improved by fixing the above bugs. Also, my v1 cover letter
   appears to have had incorrect "percentage of overall time" reported. Not
   sure what happened there, but I have updated numbers below.

 * Five new patches added to the front of the series (explained in reverse
   order):
   
   * Patch 5: Add a bunch of testcases to cover all the special cases that
     could present problems for the remember renames optimization.
   
   * Patch 4: Extend test-tool fast-rebase slightly for the new testcases.
   
   * Patch 3: Fix an embarrassing bug in fast-rebase, for use in new
     testcases.
   
   * Patch 2: Add documentation that thoroughly explains all the nooks and
     crannies and special cases associated with this optimization to "prove"
     that it is safe. May help if future optimizations or feature changes
     call into question any assumptions in play (e.g. if break detection
     were ever turned on in the merge machinery).
   
   * Patch 1: While thoroughly covering all the special cases, I also found
     and documented a minor merge.directoryRenames=true bug that affects
     both merge-recursive and merge-ort, with or without this optimization;
     this bug has been there for years.

 * One additional patch inserted near the end of the series:
   
   * Patch 11: Special handling for rename/rename(1to1) situations, as
     discussed in Patch 2.

=== Basic Optimization idea ===

When there are many renames between the old base and the new base,
traditionally all those renames are re-detected for every commit that is
transplanted. This optimization avoids redoing that work. While that
description is a simple summary of the high level idea, the reasons why this
optimization are safe and correct can be somewhat intricate; the second
patch adds a document that goes to great length to explain every relevant
detail.

This represents "Optimization #4" from my Git Merge 2020 talk[1]; the
details are a bit more involved than I realized at the time, but the high
level idea is the same.

=== Comparison to previous series ===

I previously noted that we had three major rename-related optimizations:

 * exact rename detection (applies when unmodified on renamed side)
 * skip-because-irrelevant (applies when unmodified on unrenamed side)
 * basename-guided rename detection (applies when basename unchanged)

This one adds a fourth (remember-renames), with some interesting properties:

 * unlike basename-guided rename detection, there are no behavioral changes
   (there is no heuristic involved)[2].

 * like skip-because-irrelevant, this optimization does not apply to all git
   commands using the rename machinery. In fact, this one is even more
   restrictive since it is ONLY useful for rebases and cherry-picks (not
   even merges), and only for second and later commits in a linear series.

 * unlike the three previous optimizations, there are no requirements about
   the types of changes done to the file; it just caches renames on the
   "upstream" side of history for subsequent commit picking.

It's also worth noting despite wording about "remembering" or "caching"
renames, that this optimization does NOT write this cache to disk; it's an
in-memory only cache. When the rebase or cherry-pick completes (or hits a
conflict and stops), the cache is discarded.

=== Results ===

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28), the
changes in just this series improves the performance as follows:

                     Before Series           After Series
no-renames:        5.665 s ±  0.129 s     5.622 s ±  0.059 s
mega-renames:     11.435 s ±  0.158 s    10.127 s ±  0.073 s
just-one-mega:   494.2  ms ±  6.1  ms   500.3  ms ±  3.8  ms


By design, this optimization could not help the just-one-mega testcase. The
gains for the other two testcases may look somewhat smaller than one would
expect given the description (only ~13% for the mega-renames testcase), but
the point was to spend less time detecting renames...and there just wasn't
that much time spent in renames for these testcases before this series for
us to remove. However, if we undid the basename-guided rename detection and
skip-because-unnecessary optimizations, then this series alone would have
improved performance as follows:

               Before Basename Series   After Just This Series
no-renames:      13.815 s ±  0.062 s      5.697 s ±  0.080 s
mega-renames:  1799.937 s ±  0.493 s    205.709 s ±  0.457 s


Showing that this optimization has the ability to improve things when the
other optimizations do not apply. In fact, when I originally implemented
this optimization, it improved the mega-renames testcase by a factor of 2
(at the time, I did not have all the optimizations from ort-perf-batch-7
thru ort-perf-batch-10 in their current shape).

As a reminder, before any merge-ort/diffcore-rename performance work, the
performance results we started with were:

no-renames-am:      6.940 s ±  0.485 s
no-renames:        18.912 s ±  0.174 s
mega-renames:    5964.031 s ± 10.459 s
just-one-mega:    149.583 s ±  0.751 s


=== Further discussion of results ===

If we change our focus from absolute time taken, to the percentage of
overall time spent on rename detection, then we find the following picture
comparing our starting point at the beginning of the performance work to
what we achieve at the end of this series:

         Percentage of time spent on rename detection
   
                  commit 557ac0350d      After this Series
no-renames:             39.4%                   0.2%
mega-renames:           96.6%                   8.7%
just-one-mega:          95.0%                  15.6%


This optimization is only applicable for the first two testcases (because
the third only involves rebasing a single commit). This table makes it clear
that our attempts to accelerate rename detection have succeeded, and any
further work to accelerate merges needs to start concentrating on other
areas.

[1]
https://github.com/newren/presentations/blob/pdfs/merge-performance/merge-performance-slides.pdf

[2] Well, almost no changes. There's technically a very narrow way that this
could change the behavior...though in a way that does not affect correctness
of the merge; see section 5 of the new document in the second patch for the
details.

Elijah Newren (13):
  t6423: rename file within directory that other side renamed
  Documentation/technical: describe remembering renames optimization
  fast-rebase: change assert() to BUG()
  fast-rebase: write conflict state to working tree, index, and HEAD
  t6429: testcases for remembering renames
  merge-ort: add data structures for in-memory caching of rename
    detection
  merge-ort: populate caches of rename detection results
  merge-ort: add code to check for whether cached renames can be reused
  merge-ort: avoid accidental API mis-use
  merge-ort: preserve cached renames for the appropriate side
  merge-ort: add helper functions for using cached renames
  merge-ort: handle interactions of caching and rename/rename(1to1)
    cases
  merge-ort, diffcore-rename: employ cached renames when possible

 .../technical/remembering-renames.txt         | 671 +++++++++++++++++
 diffcore-rename.c                             |  22 +-
 diffcore.h                                    |   3 +-
 merge-ort.c                                   | 319 +++++++-
 merge-ort.h                                   |   2 +
 t/helper/test-fast-rebase.c                   |  54 +-
 t/t6423-merge-rename-directories.sh           |  58 ++
 t/t6429-merge-sequence-rename-caching.sh      | 700 ++++++++++++++++++
 8 files changed, 1791 insertions(+), 38 deletions(-)
 create mode 100644 Documentation/technical/remembering-renames.txt
 create mode 100755 t/t6429-merge-sequence-rename-caching.sh


base-commit: 311531c9de557d25ac087c1637818bd2aad6eb3a
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-859%2Fnewren%2Fort-perf-batch-11-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-859/newren/ort-perf-batch-11-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/859

Range-diff vs v1:

  -:  ------------ >  1:  129136a10c9d t6423: rename file within directory that other side renamed
  -:  ------------ >  2:  027c446286d9 Documentation/technical: describe remembering renames optimization
  -:  ------------ >  3:  7dc273d458ea fast-rebase: change assert() to BUG()
  -:  ------------ >  4:  887b151c26ff fast-rebase: write conflict state to working tree, index, and HEAD
  -:  ------------ >  5:  54c126f38098 t6429: testcases for remembering renames
  1:  689a7de56483 =  6:  2a9e73de2bee merge-ort: add data structures for in-memory caching of rename detection
  2:  5f34332c9605 !  7:  02d517f052a3 merge-ort: populate caches of rename detection results
     @@ merge-ort.c: static void resolve_diffpair_statuses(struct diff_queue_struct *q)
      +				    char *new_path)
      +{
      +	char *old_value;
     ++	int dir_renamed_side = 0;
      +
     -+	if (!new_path) {
     ++	if (new_path) {
     ++		/*
     ++		 * Directory renames happen on the other side of history from
     ++		 * the side that adds new files to the old directory.
     ++		 */
     ++		dir_renamed_side = 3 - side;
     ++	} else {
      +		int val = strintmap_get(&renames->relevant_sources[side],
      +					p->one->path);
     -+		if (val == 0) {
     ++		if (val == RELEVANT_NO_MORE) {
      +			assert(p->status == 'D');
      +			strset_add(&renames->cached_irrelevant[side],
      +				   p->one->path);
     @@ merge-ort.c: static void resolve_diffpair_statuses(struct diff_queue_struct *q)
      +		if (val <= 0)
      +			return;
      +	}
     ++
      +	if (p->status == 'D') {
      +		/*
      +		 * If we already had this delete, we'll just set it's value
     @@ merge-ort.c: static void resolve_diffpair_statuses(struct diff_queue_struct *q)
      +		 */
      +		strmap_put(&renames->cached_pairs[side], p->one->path, NULL);
      +	} else if (p->status == 'R') {
     ++		if (new_path) {
     ++			new_path = xstrdup(new_path);
     ++			old_value = strmap_put(&renames->cached_pairs[dir_renamed_side],
     ++					       p->two->path, new_path);
     ++			strset_add(&renames->cached_target_names[dir_renamed_side],
     ++				   new_path);
     ++			assert(!old_value);
     ++		}
      +		if (!new_path)
      +			new_path = p->two->path;
      +		new_path = xstrdup(new_path);
      +		old_value = strmap_put(&renames->cached_pairs[side],
      +				       p->one->path, new_path);
     -+		strset_add(&renames->cached_target_names[side], new_path);
     ++		strset_add(&renames->cached_target_names[side],
     ++			   new_path);
      +		free(old_value);
      +	} else if (p->status == 'A' && new_path) {
      +		new_path = xstrdup(new_path);
     -+		old_value = strmap_put(&renames->cached_pairs[side],
     ++		old_value = strmap_put(&renames->cached_pairs[dir_renamed_side],
      +				       p->two->path, new_path);
     -+		strset_add(&renames->cached_target_names[side], new_path);
     ++		strset_add(&renames->cached_target_names[dir_renamed_side],
     ++			   new_path);
      +		assert(!old_value);
      +	}
      +}
     @@ merge-ort.c: static void resolve_diffpair_statuses(struct diff_queue_struct *q)
       {
       	const struct diff_filepair *a = *((const struct diff_filepair **)a_);
      @@ merge-ort.c: static int collect_renames(struct merge_options *opt,
     - 		struct diff_filepair *p = side_pairs->queue[i];
       		char *new_path; /* non-NULL only with directory renames */
       
     -+		possibly_cache_new_pair(renames, p, side_index, NULL);
       		if (p->status != 'A' && p->status != 'R') {
     ++			possibly_cache_new_pair(renames, p, side_index, NULL);
       			diff_free_filepair(p);
       			continue;
     + 		}
      @@ merge-ort.c: static int collect_renames(struct merge_options *opt,
     + 						      &collisions,
     + 						      &clean);
     + 
     ++		possibly_cache_new_pair(renames, p, side_index, new_path);
     + 		if (p->status != 'R' && !new_path) {
       			diff_free_filepair(p);
       			continue;
       		}
      -
     -+		possibly_cache_new_pair(renames, p, side_index, new_path);
       		if (new_path)
       			apply_directory_rename_modifications(opt, p, new_path);
       
  3:  aa131c21d14f =  8:  731b6bd15531 merge-ort: add code to check for whether cached renames can be reused
  4:  df12cb5a158e =  9:  becd45103018 merge-ort: avoid accidental API mis-use
  5:  39f7e384611d ! 10:  2163dded5798 merge-ort: preserve cached renames for the appropriate side
     @@ merge-ort.c: static void clear_or_reinit_internal_opts(struct merge_options_inte
       	}
       	renames->cached_pairs_valid_side = 0;
       	renames->dir_rename_mask = 0;
     +@@ merge-ort.c: static void detect_regular_renames(struct merge_options *opt,
     + 		return;
     + 	}
     + 
     ++	partial_clear_dir_rename_count(&renames->dir_rename_count[side_index]);
     + 	repo_diff_setup(opt->repo, &diff_opts);
     + 	diff_opts.flags.recursive = 1;
     + 	diff_opts.flags.rename_empty = 0;
  6:  7d5d94b34ba5 = 11:  c0c1956e75c6 merge-ort: add helper functions for using cached renames
  -:  ------------ > 12:  28b622a8261b merge-ort: handle interactions of caching and rename/rename(1to1) cases
  7:  d1016c342d69 ! 13:  91b6768adf2d merge-ort, diffcore-rename: employ cached renames when possible
     @@ Commit message
          this change improves the performance as follows:
      
                                      Before                  After
     -        no-renames:        5.665 s ±  0.129 s     5.624 s ±  0.077 s
     -        mega-renames:     11.435 s ±  0.158 s    10.213 s ±  0.032 s
     -        just-one-mega:   494.2  ms ±  6.1  ms   497.6  ms ±  5.3  ms
     +        no-renames:        5.665 s ±  0.129 s     5.622 s ±  0.059 s
     +        mega-renames:     11.435 s ±  0.158 s    10.127 s ±  0.073 s
     +        just-one-mega:   494.2  ms ±  6.1  ms   500.3  ms ±  3.8  ms
      
          That's a fairly small improvement, but mostly because the previous
          optimizations were so effective for these particular testcases; this
     @@ Commit message
          performance as follows:
      
                             Before Basename Series   After Just This Series
     -        no-renames:      13.815 s ±  0.062 s      5.814 s ±  0.094 s
     -        mega-renames:  1799.937 s ±  0.493 s    303.225 s ±  1.330 s
     +        no-renames:      13.815 s ±  0.062 s      5.697 s ±  0.080 s
     +        mega-renames:  1799.937 s ±  0.493 s    205.709 s ±  0.457 s
      
          Since this optimization kicks in to help accelerate cases where the
          previous optimizations do not apply, this last comparison shows that
     @@ Commit message
          for a future optimization around having collect_merge_info() avoid
          recursing into subtrees in more cases.
      
     +    However, for this optimization to be effective, merge_switch_to_result()
     +    should only be called when the rebase or cherry-pick operation has
     +    either completed or hit a case where the user needs to resolve a
     +    conflict or edit the result.  If it is called after every commit, as
     +    sequencer.c does, then the working tree and index are needlessly updated
     +    with every commit and the cached metadata is tossed, defeating this
     +    optimization.  Refactoring sequencer.c to only call
     +    merge_switch_to_result() at the end of the operation is a bigger
     +    undertaking, and the practical benefits of this optimization will not be
     +    realized until that work is performed.  Since `test-tool fast-rebase`
     +    only updates at the end of the operation, it was used to obtain the
     +    timings above.
     +
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
       ## diffcore-rename.c ##
     @@ merge-ort.c: static void detect_regular_renames(struct merge_options *opt,
       	if (!possible_side_renames(renames, side_index)) {
       		/*
       		 * No rename detection needed for this side, but we still need
     -@@ merge-ort.c: static void detect_regular_renames(struct merge_options *opt,
     - 		return;
     - 	}
     - 
     -+	partial_clear_dir_rename_count(&renames->dir_rename_count[side_index]);
     - 	repo_diff_setup(opt->repo, &diff_opts);
     - 	diff_opts.flags.recursive = 1;
     - 	diff_opts.flags.rename_empty = 0;
      @@ merge-ort.c: static void detect_regular_renames(struct merge_options *opt,
       	diffcore_rename_extended(&diff_opts,
       				 &renames->relevant_sources[side_index],
     @@ merge-ort.c: static int detect_and_process_renames(struct merge_options *opt,
       	trace2_region_leave("merge", "regular renames", opt->repo);
       
       	trace2_region_enter("merge", "directory renames", opt->repo);
     +
     + ## t/t6429-merge-sequence-rename-caching.sh ##
     +@@ t/t6429-merge-sequence-rename-caching.sh: test_expect_success 'caching renames does not preclude finding new ones' '
     + # dramatic change in size of the file, but remembering the rename and
     + # reusing it is reasonable too.
     + #
     +-# Rename detection (diffcore_rename_extended()) will run twice here; it is
     +-# not needed on the topic side of history for either of the two commits
     +-# being merged, but it is needed on the upstream side of history for each
     +-# commit being picked.
     ++# We do test here that we expect rename detection to only be run once total
     ++# (the topic side of history doesn't need renames, and with caching we
     ++# should be able to only run rename detection on the upstream side one
     ++# time.)
     + test_expect_success 'cherry-pick both a commit and its immediate revert' '
     + 	test_create_repo pick-commit-and-its-immediate-revert &&
     + 	(
     +@@ t/t6429-merge-sequence-rename-caching.sh: test_expect_success 'cherry-pick both a commit and its immediate revert' '
     + 		GIT_TRACE2_PERF="$(pwd)/trace.output" &&
     + 		export GIT_TRACE2_PERF &&
     + 
     +-		test_might_fail test-tool fast-rebase --onto HEAD upstream~1 topic &&
     ++		test-tool fast-rebase --onto HEAD upstream~1 topic &&
     + 		#git cherry-pick upstream~1..topic &&
     + 
     + 		grep region_enter.*diffcore_rename trace.output >calls &&
     +-		test_line_count = 2 calls
     ++		test_line_count = 1 calls
     + 	)
     + '
     + 
     +@@ t/t6429-merge-sequence-rename-caching.sh: test_expect_success 'rename same file identically, then add file to old dir' '
     + # Here we are just concerned that cached renames might prevent us from seeing
     + # the rename conflict, and we want to ensure that we do get a conflict.
     + #
     +-# While at it, also test that we do rename detection three times.  We have to
     +-# detect renames on the upstream side of history once for each merge, plus
     +-# Topic_2 has renames.
     ++# While at it, though, we do test that we only try to detect renames 2
     ++# times and not three.  (The first merge needs to detect renames on the
     ++# upstream side.  Traditionally, the second merge would need to detect
     ++# renames on both sides of history, but our caching of upstream renames
     ++# should avoid the need to re-detect upstream renames.)
     + #
     + test_expect_success 'cached dir rename does not prevent noticing later conflict' '
     + 	test_create_repo dir-rename-cache-not-occluding-later-conflict &&
     +@@ t/t6429-merge-sequence-rename-caching.sh: test_expect_success 'cached dir rename does not prevent noticing later conflict'
     + 		grep CONFLICT..rename/rename output &&
     + 
     + 		grep region_enter.*diffcore_rename trace.output >calls &&
     +-		test_line_count = 3 calls
     ++		test_line_count = 2 calls
     + 	)
     + '
     + 
     +@@ t/t6429-merge-sequence-rename-caching.sh: test_setup_upstream_rename () {
     + # commit to mess up its location either.  We want to make sure that
     + # olddir/newfile doesn't exist in the result and that newdir/newfile does.
     + #
     +-# We also expect rename detection to occur three times.  Although it is
     +-# typically needed two times per commit, there are no deleted files on the
     +-# topic side of history, so we only need to detect renames on the upstream
     +-# side for each of the 3 commits we need to pick.
     ++# We also test that we only do rename detection twice.  We never need
     ++# rename detection on the topic side of history, but we do need it twice on
     ++# the upstream side of history.  For the first topic commit, we only need
     ++# the
     ++#   relevant-rename -> renamed
     ++# rename, because olddir is unmodified by Topic_1.  For Topic_2, however,
     ++# the new file being added to olddir means files that were previously
     ++# irrelevant for rename detection are now relevant, forcing us to repeat
     ++# rename detection for the paths we don't already have cached.  Topic_3 also
     ++# tweaks olddir/newfile, but the renames in olddir/ will have been cached
     ++# from the second rename detection run.
     + #
     + test_expect_success 'dir rename unneeded, then add new file to old dir' '
     + 	test_setup_upstream_rename dir-rename-unneeded-until-new-file &&
     +@@ t/t6429-merge-sequence-rename-caching.sh: test_expect_success 'dir rename unneeded, then add new file to old dir' '
     + 		#git cherry-pick upstream..topic &&
     + 
     + 		grep region_enter.*diffcore_rename trace.output >calls &&
     +-		test_line_count = 3 calls &&
     ++		test_line_count = 2 calls &&
     + 
     + 		git ls-files >tracked &&
     + 		test_line_count = 5 tracked &&
     +@@ t/t6429-merge-sequence-rename-caching.sh: test_expect_success 'dir rename unneeded, then rename existing file into old dir
     + 		#git cherry-pick upstream..topic &&
     + 
     + 		grep region_enter.*diffcore_rename trace.output >calls &&
     +-		test_line_count = 4 calls &&
     ++		test_line_count = 3 calls &&
     + 
     + 		test_path_is_missing somefile &&
     + 		test_path_is_missing olddir/newfile &&
     +@@ t/t6429-merge-sequence-rename-caching.sh: test_expect_success 'caching renames only on upstream side, part 1' '
     + # for the wrong side of history.
     + #
     + #
     +-# This testcase should only need three calls to diffcore_rename_extended(),
     +-# because there are no renames on the topic side of history for picking
     +-# Topic_2.
     ++# This testcase should only need two calls to diffcore_rename_extended(),
     ++# both for the first merge, one for each side of history.
     + #
     + test_expect_success 'caching renames only on upstream side, part 2' '
     + 	test_setup_topic_rename cache-renames-only-upstream-rename-file &&
     +@@ t/t6429-merge-sequence-rename-caching.sh: test_expect_success 'caching renames only on upstream side, part 2' '
     + 		#git cherry-pick upstream..topic &&
     + 
     + 		grep region_enter.*diffcore_rename trace.output >calls &&
     +-		test_line_count = 3 calls &&
     ++		test_line_count = 2 calls &&
     + 
     + 		git ls-files >tracked &&
     + 		test_line_count = 4 tracked &&

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v2 01/13] t6423: rename file within directory that other side renamed
  2021-05-04  2:12 ` [PATCH v2 00/13] " Elijah Newren via GitGitGadget
@ 2021-05-04  2:12   ` Elijah Newren via GitGitGadget
  2021-05-04  2:12   ` [PATCH v2 02/13] Documentation/technical: describe remembering renames optimization Elijah Newren via GitGitGadget
                     ` (12 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-05-04  2:12 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Elijah Newren,
	Derrick Stolee, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

Add a new testcase where one side of history renames:
   olddir/ -> newdir/
and the other side of history renames:
   olddir/a -> olddir/alpha

When using merge.directoryRenames=true, it seems logical to expect the
file to end up at newdir/alpha.  Unfortunately, both merge-recursive and
merge-ort currently see this as a rename/rename conflict:

   olddir/a -> newdir/a
vs.
   olddir/a -> newdir/alpha

Suggesting that there's some extra logic we probably want to add
somewhere to allow this case to run without triggering a conflict.  For
now simply document this known issue.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 t/t6423-merge-rename-directories.sh | 58 +++++++++++++++++++++++++++++
 1 file changed, 58 insertions(+)

diff --git a/t/t6423-merge-rename-directories.sh b/t/t6423-merge-rename-directories.sh
index 7134769149fc..be84d22419d9 100755
--- a/t/t6423-merge-rename-directories.sh
+++ b/t/t6423-merge-rename-directories.sh
@@ -4966,6 +4966,64 @@ test_expect_success '12g: Testcase with two kinds of "relevant" renames' '
 	)
 '
 
+# Testcase 12h, Testcase with two kinds of "relevant" renames
+#   Commit O: olddir/{a_1, b}
+#   Commit A: newdir/{a_2, b}
+#   Commit B: olddir/{alpha_1, b}
+#   Expected: newdir/{alpha_2, b}
+
+test_setup_12h () {
+	test_create_repo 12h &&
+	(
+		cd 12h &&
+
+		mkdir olddir &&
+		test_seq 3 8 >olddir/a &&
+		>olddir/b &&
+		git add olddir &&
+		git commit -m orig &&
+
+		git branch O &&
+		git branch A &&
+		git branch B &&
+
+		git switch A &&
+		test_seq 3 10 >olddir/a &&
+		git add olddir/a &&
+		git mv olddir newdir &&
+		git commit -m A &&
+
+		git switch B &&
+
+		git mv olddir/a olddir/alpha &&
+		git commit -m B
+	)
+}
+
+test_expect_failure '12h: renaming a file within a renamed directory' '
+	test_setup_12h &&
+	(
+		cd 12h &&
+
+		git checkout A^0 &&
+
+		test_might_fail git -c merge.directoryRenames=true merge -s recursive B^0 &&
+
+		git ls-files >tracked &&
+		test_line_count = 2 tracked &&
+
+		test_path_is_missing olddir/a &&
+		test_path_is_file newdir/alpha &&
+		test_path_is_file newdir/b &&
+
+		git rev-parse >actual \
+			HEAD:newdir/alpha  HEAD:newdir/b &&
+		git rev-parse >expect \
+			A:newdir/a         O:oldir/b &&
+		test_cmp expect actual
+	)
+'
+
 ###########################################################################
 # SECTION 13: Checking informational and conflict messages
 #
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v2 02/13] Documentation/technical: describe remembering renames optimization
  2021-05-04  2:12 ` [PATCH v2 00/13] " Elijah Newren via GitGitGadget
  2021-05-04  2:12   ` [PATCH v2 01/13] t6423: rename file within directory that other side renamed Elijah Newren via GitGitGadget
@ 2021-05-04  2:12   ` Elijah Newren via GitGitGadget
  2021-05-04  2:12   ` [PATCH v2 03/13] fast-rebase: change assert() to BUG() Elijah Newren via GitGitGadget
                     ` (11 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-05-04  2:12 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Elijah Newren,
	Derrick Stolee, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

Remembering renames on the upstream side of history in an early merge of
a rebase or cherry-pick for re-use in a latter merge of the same
operation makes pretty good intuitive sense.  However, trying to show
that it doesn't cause some subtle behavioral difference or some funny
edge or corner case is much more involved.  And, in fact, it does
introduce a subtle behavioral change.

Document all the assumptions, special cases, and logic involved in such
an optimization, and describe why this optimization is safe under the
current optimizations/features/etc. -- even when the subtle behavioral
change is triggered.

Part of the point of adding this document that goes over the
optimization in such laborious detail, is that it is possible that
significant future changes (optimizations or feature changes) could
interact with this optimization in interesting ways; this document is
here to help folks making big changes sanity check that the assumptions
and arguments underlying this optimization are still valid.  (As a side
note, creating this document forced me to review things in sufficient
detail that I found I was not properly caching directory-rename-induced
renames, resulting in the code not being aware of those renames and
causing unnecessary diffcore_rename_extended() calls in subsequent
merges.)

A subsequent commit will add several testcases based on this document
meant to stress-test the implementation and also document the case with
the subtle behavioral change.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 .../technical/remembering-renames.txt         | 669 ++++++++++++++++++
 1 file changed, 669 insertions(+)
 create mode 100644 Documentation/technical/remembering-renames.txt

diff --git a/Documentation/technical/remembering-renames.txt b/Documentation/technical/remembering-renames.txt
new file mode 100644
index 000000000000..e7c298d83df2
--- /dev/null
+++ b/Documentation/technical/remembering-renames.txt
@@ -0,0 +1,669 @@
+Rebases and cherry-picks involve a sequence of merges whose results are
+recorded as new single-parent commits.  The first parent side of those
+merges represent the "upstream" side, and often include a far larger set of
+changes than the second parent side.  Traditionally, the renames on the
+first-parent side of that sequence of merges were repeatedly re-detected
+for every merge.  This file explains why it is safe and effective during
+rebases and cherry-picks to remember renames on the upstream side of
+history as an optimization, assuming all merges are automatic and clean
+(i.e. no conflicts and not interrupted for user input or editing).
+
+Outline:
+
+  0. Assumptions
+
+  1. How rebasing and cherry-picking work
+
+  2. Why the renames on MERGE_SIDE1 in any given pick are *always* a
+     superset of the renames on MERGE_SIDE1 for the next pick.
+
+  3. Why any rename on MERGE_SIDE1 in any given pick is _almost_ always also
+     a rename on MERGE_SIDE1 for the next pick
+
+  4. A detailed description of the the counter-examples to #3.
+
+  5. Why the special cases in #4 are still fully reasonable to use to pair
+     up files for three-way content merging in the merge machinery, and why
+     they do not affect the correctness of the merge.
+
+  6. Interaction with skipping of "irrelevant" renames
+
+  7. Additional items that need to be cached
+
+  8. How directory rename detection interacts with the above and why this
+     optimization is still safe even if merge.directoryRenames is set to
+     "true".
+
+
+=== 0. Assumptions ===
+
+There are two assumptions that will hold throughout this document:
+
+  * The upstream side where commits are transplanted to is treated as the
+    first parent side when rebase/cherry-pick call the merge machinery
+
+  * All merges are fully automatic
+
+and a third that will hold in sections 2-5 for simplicity, that I'll later
+address in section 8:
+
+  * No directory renames occur
+
+
+Let me explain more about each assumption and why I include it:
+
+
+The first assumption is merely for the purposes of making this document
+clearer; the optimization implementation does not actually depend upon it.
+However, the assumption does hold in all cases because it reflects the way
+that both rebase and cherry-pick where implemented; and the implementation
+of cherry-pick and rebase are not readily changeable for backwards
+compatibility reasons (see for example the discussion of the --ours and
+--theirs flag in the documentation of `git checkout`, particularly the
+comments about how they behave with rebase).  The optimization avoids
+checking first-parent-ness, though.  It checks the conditions that make the
+optimization valid instead, so it would still continue working if someone
+changed the parent ordering that cherry-pick and rebase use.  But making
+this assumption does make this document much clearer and prevents me from
+having to repeat every example twice.
+
+If the second assumption is violated, then the optimization simply is
+turned off and thus isn't relevant to consider.  The second assumption can
+also be stated as "there is no interruption for a user to resolve conflicts
+or to just further edit or tweak files".  While real rebases and
+cherry-picks are often interrupted (either because it's an interactive
+rebase where the user requested to stop and edit, or because there were
+conflicts that the user needs to resolve), the cache of renames is not
+stored on disk, and thus is thrown away as soon as the rebase or cherry
+pick stops for the user to resolve the operation.
+
+The third assumption makes sections 2-5 simpler, and allows people to
+understand the basics of why this optimization is safe and effective, and
+then I can go back and address the specifics in section 8.  It is probably
+also worth noting that if directory renames do occur, then the default of
+merge.directoryRenames being set to "conflict" means that the operation
+will stop for users to resolve the conflicts and the cache will be thrown
+away, and thus that there won't be an optimization to apply.  So, the only
+reason we need to address directory renames specifically, is that some
+users will have set merge.directoryRenames to "true" to allow the merges to
+continue to proceed automatically.  The optimization is still safe with
+this config setting, but we have to discuss a few more cases to show why;
+this discussion is deferred until section 8.
+
+
+=== 1. How rebasing and cherry-picking work ===
+
+Consider the following setup (from the git-rebase manpage):
+
+		     A---B---C topic
+		    /
+	       D---E---F---G main
+
+After rebasing or cherry-picking topic onto main, this will appear as:
+
+			     A'--B'--C' topic
+			    /
+	       D---E---F---G main
+
+The way the commits A', B', and C' are created is through a series of
+merges, where rebase or cherry-pick sequentially uses each of the three
+A-B-C commits in a special merge operation.  Let's label the three commits
+in the merge operation as MERGE_BASE, MERGE_SIDE1, and MERGE_SIDE2.  For
+this picture, the three commits for each of the three merges would be:
+
+To create A':
+   MERGE_BASE:   E
+   MERGE_SIDE1:  G
+   MERGE_SIDE2:  A
+
+To create B':
+   MERGE_BASE:   A
+   MERGE_SIDE1:  A'
+   MERGE_SIDE2:  B
+
+To create C':
+   MERGE_BASE:   B
+   MERGE_SIDE1:  B'
+   MERGE_SIDE2:  C
+
+Sometimes, folks are surprised that these three-way merges are done.  It
+can be useful in understanding these three-way merges to view them in a
+slightly different light.  For example, in creating C', you can view it as
+either:
+
+  * Apply the changes between B & C to B'
+  * Apply the changes between B & B' to C
+
+Conceptually the two statements above are the same as a three-way merge of
+B, B', and C, at least the parts before you decide to record a commit.
+
+
+=== 2. Why the renames on MERGE_SIDE1 in any given pick are always a ===
+===    superset of the renames on MERGE_SIDE1 for the next pick.     ===
+
+The merge machinery uses the filenames it is fed from MERGE_BASE,
+MERGE_SIDE1, and MERGE_SIDE2.  It will only move content to a different
+filename under one of three conditions:
+
+  * To make both pieces of a conflict available to a user during conflict
+    resolution (examples: directory/file conflict, add/add type conflict
+    such as symlink vs. regular file)
+
+  * When MERGE_SIDE1 renames the file.
+
+  * When MERGE_SIDE2 renames the file.
+
+First, let's remember what commits are involved in the first and second
+picks of the cherry-pick or rebase sequence:
+
+To create A':
+   MERGE_BASE:   E
+   MERGE_SIDE1:  G
+   MERGE_SIDE2:  A
+
+To create B':
+   MERGE_BASE:   A
+   MERGE_SIDE1:  A'
+   MERGE_SIDE2:  B
+
+So, in particular, we need to show that the renames between E and G are a
+superset of those between A and A'.
+
+A' is created by the first merge.  A' will only have renames for one of the
+three reasons listed above.  The first case, a conflict, results in a
+situation where the cache is dropped and thus this optimization doesn't
+take effect, so we need not consider that case.  The third case, a rename
+on MERGE_SIDE2 (i.e. from G to A), will show up in A' but it also shows up
+in A -- therefore when diffing A and A' that path does not show up as a
+rename.  The only remaining way for renames to show up in A' is for the
+rename to come from MERGE_SIDE1.  Therefore, all renames between A and A'
+are a subset of those between E and G.  Equivalently, all renames between E
+and G are a superset of those between A and A'.
+
+
+=== 3. Why any rename on MERGE_SIDE1 in any given pick is _almost_   ===
+===    always also a rename on MERGE_SIDE1 for the next pick.        ===
+
+Let's again look at the first two picks:
+
+To create A':
+   MERGE_BASE:   E
+   MERGE_SIDE1:  G
+   MERGE_SIDE2:  A
+
+To create B':
+   MERGE_BASE:   A
+   MERGE_SIDE1:  A'
+   MERGE_SIDE2:  B
+
+Now let's look at any given rename from MERGE_SIDE1 of the first pick, i.e.
+any given rename from E to G.  Let's use the filenames 'oldfile' and
+'newfile' for demonstration purposes.  That first pick will function as
+follows; when the rename is detected, the merge machinery will do a
+three-way content merge of the following:
+    E:oldfile
+    G:newfile
+    A:oldfile
+and produce a new result:
+    A':newfile
+
+Note above that I've assumed that E->A did not rename oldfile.  If that
+side did rename, then we most likely have a rename/rename(1to2) conflict
+that will cause the rebase or cherry-pick operation to halt and drop the
+in-memory cache of renames and thus doesn't need to be considered further.
+In the special case that E->A does rename the file but also renames it to
+newfile, then there is no conflict from the renaming and the merge can
+succeed.  In this special case, the rename is not valid to cache because
+the second merge will find A:newfile in the MERGE_BASE.  So a
+rename/rename(1to1) needs to be specially handled by pruning renames from
+the cache and decrementing the dir_rename_counts in the current and leading
+directories associated with those renames.  Or, since these are really
+rare, one could just take the easy way out and disable the remembering
+renames optimization when a rename/rename(1to1) happens.
+
+The previous paragraph handled the cases for E->A renaming oldfile, let's
+continue assuming that oldfile is not renamed in A.
+
+As per the diagram for creating B', MERGE_SIDE1 involves the changes from A
+to A'.  So, we are curious whether A:oldfile and A':newfile will be viewed
+as renames.  Note that:
+
+  * There will be no A':oldfile (because there could not have been a
+    G:oldfile as we do not do break detection in the merge machinery and
+    G:newfile was detected as a rename, and by the construction of the
+    rename above that merged cleanly, the merge machinery will ensure there
+    is no 'oldfile' in the result).
+
+  * There will be no A:newfile (if there had been, we would have had a
+    rename/add conflict).
+
+  * Clearly A:oldfile and A':newfile are "related" (A':newfile came from a
+    clean three-way content merge involving A:oldfile).
+
+We can also expound on the third point above, by noting that three-way
+content merges can also be viewed as applying the differences between the
+base and one side to the other side.  Thus we can view A':newfile as
+having been created by taking the changes between E:oldfile and G:newfile
+(which were detected as being related, i.e. <50% changed) to A:oldfile.
+
+Thus A:oldfile and A':newfile are just as related as E:oldfile and
+G:newfile are -- they have exactly identical differences.  Since the latter
+were detected as renames, A:oldfile and A':newfile should also be
+detectable as renames almost always.
+
+
+=== 4. A detailed description of the counter-examples to #3.         ===
+
+We already noted in section 3 that rename/rename(1to1) (i.e. both sides
+renaming a file the same way) was one counter-example.  The more
+interesting bit, though, is why did we need to use the "almost" qualifier
+when stating that A:oldfile and A':newfile are "almost" always detectable
+as renames?
+
+Let's repeat an earlier point that section 3 made:
+
+  A':newfile was created by applying the changes between E:oldfile and
+  G:newfile to A:oldfile.  The changes between E:oldfile and G:newfile were
+  <50% of the size of E:oldfile.
+
+If those changes that were <50% of the size of E:oldfile are also <50% of
+the size of A:oldfile, then A:oldfile and A':newfile will be detectable as
+renames.  However, if there is a dramatic size reduction between E:oldfile
+and A:oldfile (but the changes between E:oldfile, G:newfile, and A:oldfile
+still somehow merge cleanly), then traditional rename detection would not
+detect A:oldfile and A':newfile as renames.
+
+Here's an example where that can happen:
+  * E:oldfile had 20 lines
+  * G:newfile added 10 new lines at the beginning of the file
+  * A:oldfile kept the first 3 lines of the file, and deleted all the rest
+then
+  => A':newfile would have 13 lines, 3 of which matches those in A:oldfile.
+E:oldfile -> G:newfile would be detected as a rename, but A:oldfile and
+A':newfile would not be.
+
+
+=== 5. Why the special cases in #4 are still fully reasonable to use to    ===
+===    pair up files for three-way content merging in the merge machinery, ===
+===    and why they do not affect the correctness of the merge.            ===
+
+In the rename/rename(1to1) case, A:newfile and A':newfile are not renames
+since they use the *same* filename.  However, files with the same filename
+are obviously fine to pair up for three-way content merging (the merge
+machinery has never employed break detection).  The interesting
+counter-example case is thus not the rename/rename(1to1) case, but the case
+where A did not rename oldfile.  That was the case that we spent most of
+the time discussing in sections 3 and 4.  The remainder of this section
+will be devoted to that case as well.
+
+So, even if A:oldfile and A':newfile aren't detectable as renames, why is
+it still reasonable to pair them up for three-way content merging in the
+merge machinery?  There are multiple reasons:
+
+  * As noted in sections 3 and 4, the diff between A:oldfile and A':newfile
+    is *exactly* the same as the diff between E:oldfile and G:newfile.  The
+    latter pair were detected as renames, so it seems unlikely to surprise
+    users for us to treat A:oldfile and A':newfile as renames.
+
+  * In fact, "oldfile" and "newfile" were at one point detected as renames
+    due to how they were constructed in the E..G chain.  And we used that
+    information once already in this rebase/cherry-pick.  I think users
+    would be unlikely to be surprised at us continuing to treat the files
+    as renames and would quickly understand why we had done so.
+
+  * Marking or declaring files as renames is *not* the end goal for merges.
+    Merges use renames to determine which files make sense to be paired up
+    for three-way content merges.
+
+  * A:oldfile and A':newfile were _already_ paired up in a three-way
+    content merge; that is how A':newfile was created.  In fact, that
+    three-way content merge was clean.  So using them again in a later
+    three-way content merge seems very reasonable.
+
+However, the above is focusing on the common scenarios.  Let's try to look
+at all possible unusual scenarios and compare without the optimization to
+with the optimization.  Consider the following theoretical cases; we will
+the dive into each to determine which of them are possible,
+and if so, what they mean:
+
+  1. Without the optimization, the second merge results in a conflict.
+     With the optimization, the second merge also results in a conflict.
+     Questions: Are the conflicts confusingly different?  Better in one case?
+
+  2. Without the optimization, the second merge results in NO conflict.
+     With the optimization, the second merge also results in NO conflict.
+     Questions: Are the merges the same?
+
+  3. Without the optimization, the second merge results in a conflict.
+     With the optimization, the second merge results in NO conflict.
+     Questions: Possible?  Bug, bugfix, or something else?
+
+  4. Without the optimization, the second merge results in NO conflict.
+     With the optimization, the second merge results in a conflict.
+     Questions: Possible?  Bug, bugfix, or something else?
+
+I'll consider all four cases, but out of order.
+
+The fourth case is impossible.  For the code without the remembering
+renames optimization to not get a conflict, B:oldfile would need to exactly
+match A:oldfile -- if it doesn't, there would be a modify/delete conflict.
+If A:oldfile matches B:oldfile exactly, then a three-way content merge
+between A:oldfile, A':newfile, and B:oldfile would have no conflict and
+just give us the version of newfile from A' as the result.
+
+From the same logic as the above paragraph, the second case would indeed
+result in identical merges.  When A:oldfile exactly matches B:oldfile, an
+undetected rename would say, "Oh, I see one side didn't modify 'oldfile'
+and the other side deleted it.  I'll delete it.  And I see you have this
+brand new file named 'newfile' in A', so I'll keep it."  That gives the
+same results as three-way content merging A:oldfile, A':newfile, and
+B:oldfile -- a removal of oldfile with the version of newfile from A'
+showing up in the result.
+
+The third case is interesting.  It means that A:oldfile and A':newfile were
+not just similar enough, but that the changes between them did not conflict
+with the changes between A:oldfile and B:oldfile.  This would validate our
+hunch that the files were similar enough to be used in a three-way content
+merge, and thus seems entirely correct for us to have used them that way.
+(Sidenote: One particular example here may be enlightening.  Let's say that
+B was an immediate revert of A.  B clearly would have been a clean revert
+of A, since A was B's immediate parent.  One would assume that if you can
+pick a commit, you should also be able to cherry-pick its immediate revert.
+However, this is one of those funny corner cases; without this
+optimization, we just successfully picked a commit cleanly, but we are
+unable to cherry-pick its immediate revert due to the size differences
+between E:oldfile and A:oldfile.)
+
+That leaves only the first case to consider -- when we get conflicts both
+with or without the optimization.  Without the optimization, we'll have a
+modify/delete conflict, where both A':newfile and B:oldfile are left in the
+tree for the user to deal with and no hints about the potential similarity
+between the two.  With the optimization, we'll have a three-way content
+merged A:oldfile, A':newfile, and B:oldfile with conflict markers
+suggesting we though the files were related but giving the user the chance
+to resolve.  As noted above, I don't think users will find us treating
+'oldfile' and 'newfile' as related as a surprise since they were between E
+and G.  In any event, though, this case shouldn't be concerning since we
+hit a conflict in both cases, told the user what we know, and asked them to
+resolve it.
+
+So, in summary, case 4 is impossible, case 2 yields the same behavior, and
+cases 1 and 3 seem to provide as good or better behavior with the
+optimization than without.
+
+
+=== 6. Interaction with skipping of "irrelevant" renames ===
+
+Previous optimizations involved skipping rename detection for paths
+considered to be "irrelevant".  See for example the following commits:
+
+  * 32a56dfb99 ("merge-ort: precompute subset of sources for which we
+		need rename detection", 2021-03-11)
+  * 2fd9eda462 ("merge-ort: precompute whether directory rename
+		detection is needed", 2021-03-11)
+  * 9bd342137e ("diffcore-rename: determine which relevant_sources are
+		no longer relevant", 2021-03-13)
+
+However, relevance is always determined by what the _other_ side of
+history has done, in terms of modifing a file that our side renamed,
+or adding a file to a directory which our side renamed.  However, this
+means that a path that is "irrelevant" when picking the first commit
+of a series in a rebase or cherry-pick, may suddenly become "relevant"
+when picking the next commit.
+
+The upshot of this is that we can only cache rename detection results
+for relevant paths, and need to re-check relevance in subsequent
+commits.  If those subsequent commits have additional paths that are
+relevant for rename detection, then we will need to redo rename
+detection -- though we can limit it to the paths for which we have not
+already detected renames.
+
+
+=== 7. Additional items that need to be cached ===
+
+It turns out we have to cache more than just renames; we also cache:
+
+  A) non-renames (i.e. unpaired deletes)
+  B) counts of renames within directories
+  C) sources that were marked as RELEVANT_LOCATION, but which were
+     downgraded to RELEVANT_NO_MORE
+  D) the toplevel trees involved in the merge
+
+These are all stored in struct rename_info, and respectively appear in
+  * cached_pairs (along side actual renames, just with a value of NULL)
+  * dir_rename_counts
+  * cached_irrelevant
+  * merge_trees
+
+The reason for (A) comes from the irrelevant renames skipping
+optimization discussed in section 6.  The fact that irrelevant renames
+are skipped means we only get a subset of the potential renames
+detected and subsequent commits may need to run rename detection on
+the upstream side on a subset of the remaining renames (to get the
+renames that are relevant for that later commit).  Since unpaired
+deletes are involved in rename detection too, we don't want to
+repeatedly check that those paths remain unpaired on the upstream side
+with every commit we are transplanting.
+
+The reason for (B) is that diffcore_rename_extended() is what
+generates the counts of renames by directory which is needed in
+directory rename detection, and if we don't run
+diffcore_rename_extended() again then we need to have the output from
+it, including dir_rename_counts, from the previous run.
+
+The reason for (C) is that merge-ort's tree traversal will again think
+those paths are relevant (marking them as RELEVANT_LOCATION), but the
+fact that they were downgraded to RELEVANT_NO_MORE means that
+dir_rename_counts already has the information we need for directory
+rename detection.  (A path which becomes RELEVANT_CONTENT in a
+subsequent commit will be removed from cached_irrelevant.)
+
+The reason for (D) is that is how we determine whether the remember
+renames optimization can be used.  In particular, remembering that our
+sequence of merges looks like:
+
+   Merge 1:
+   MERGE_BASE:   E
+   MERGE_SIDE1:  G
+   MERGE_SIDE2:  A
+   => Creates    A'
+
+   Merge 2:
+   MERGE_BASE:   A
+   MERGE_SIDE1:  A'
+   MERGE_SIDE2:  B
+   => Creates    B'
+
+It is the fact that the trees A and A' appear both in Merge 1 and in
+Merge 2, with A as a parent of A' that allows this optimization.  So
+we store the trees to compare with what we are asked to merge next
+time.
+
+
+=== 8. How directory rename detection interacts with the above and   ===
+===    why this optimization is still safe even if                   ===
+===    merge.directoryRenames is set to "true".                      ===
+
+As noted in the assumptions section:
+
+    """
+    ...if directory renames do occur, then the default of
+    merge.directoryRenames being set to "conflict" means that the operation
+    will stop for users to resolve the conflicts and the cache will be
+    thrown away, and thus that there won't be an optimization to apply.
+    So, the only reason we need to address directory renames specifically,
+    is that some users will have set merge.directoryRenames to "true" to
+    allow the merges to continue to proceed automatically.
+    """
+
+Let's remember that we need to look at how any given pick affects the next
+one.  So let's again use the first two picks from the diagram in section
+one:
+
+  First pick does this three-way merge:
+    MERGE_BASE:   E
+    MERGE_SIDE1:  G
+    MERGE_SIDE2:  A
+    => creates A'
+
+  Second pick does this three-way merge:
+    MERGE_BASE:   A
+    MERGE_SIDE1:  A'
+    MERGE_SIDE2:  B
+    => creates B'
+
+Now, directory rename detection exists so that if one side of history
+renames a directory, and the other side adds a new file to the old
+directory, then the merge (with merge.directoryRenames=true) can move the
+file into the new directory.  There are two qualitatively different ways to
+add a new file to an old directory: create a new file, or rename a file
+into that directory.  Also, directory renames can be done on either side of
+history, so there are four cases to consider:
+
+  * MERGE_SIDE1 renames old dir, MERGE_SIDE2 adds new file to   old dir
+  * MERGE_SIDE1 renames old dir, MERGE_SIDE2 renames  file into old dir
+  * MERGE_SIDE1 adds new file to   old dir, MERGE_SIDE2 renames old dir
+  * MERGE_SIDE1 renames  file into old dir, MERGE_SIDE2 renames old dir
+
+One last note before we consider these four cases: There are some
+important properties about how we implement this optimization with
+respect to directory rename detection that we need to bear in mind
+while considering all of these cases:
+
+  * rename caching occurs *after* applying directory renames
+
+  * a rename created by directory rename detection is recorded for the side
+    of history that did the directory rename.
+
+  * dir_rename_counts, the nested map of
+	{oldname => {newname => count}},
+    is cached between runs as well.  This basically means that directory
+    rename detection is also cached, though only on the side of history
+    that we cache renames for (MERGE_SIDE1 as far as this document is
+    concerned; see the assumptions section).  Two interesting sub-notes
+    about these counts:
+
+    * If we need to perform rename-detection again on the given side (e.g.
+      some paths are relevant for rename detection that weren't before),
+      then we clear dir_rename_counts and recompute it, making use of
+      cached_pairs.  The reason it is important to do this is optimizations
+      around RELEVANT_LOCATION exist to prevent us from computing
+      unnecessary renames for directory rename detection and from computing
+      dir_rename_counts for irrelevant directories; but those same renames
+      or directories may become necessary for subsequent merges.  The
+      easiest way to "fix up" dir_rename_counts in such cases is to just
+      recompute it.
+
+    * If we prune rename/rename(1to1) entries from the cache, then we also
+      need to update dir_rename_counts to decrement the counts for the
+      involved directory and any relevant parent directories (to undo what
+      update_dir_rename_counts() in diffcore-rename.c incremented when the
+      rename was initially found).  If we instead just disable the
+      remembering renames optimization when the exceedingly rare
+      rename/rename(1to1) cases occur, then dir_rename_counts will get
+      re-computed the next time rename detection occurs, as noted above.
+
+  * the side with multiple commits to pick, is the side of history that we
+    do NOT cache renames for.  Thus, there are no additional commits to
+    change the number of renames in a directory, except for those done by
+    directory rename detection (which always pad the majority).
+
+  * the "renames" we cache are modified slightly by any directory rename,
+    as noted below.
+
+Now, with those notes out of the way, let's go through the four cases
+in order:
+
+Case 1: MERGE_SIDE1 renames old dir, MERGE_SIDE2 adds new file to old dir
+
+  This case looks like this:
+
+    MERGE_BASE:   E,   Has olddir/
+    MERGE_SIDE1:  G,   Renames olddir/ -> newdir/
+    MERGE_SIDE2:  A,   Adds olddir/newfile
+    => creates    A',  With newdir/newfile
+
+    MERGE_BASE:   A,   Has olddir/newfile
+    MERGE_SIDE1:  A',  Has newdir/newfile
+    MERGE_SIDE2:  B,   Modifies olddir/newfile
+    => expected   B',  with threeway-merged newdir/newfile from above
+
+  In this case, with the optimization, note that after the first commit:
+    * MERGE_SIDE1 remembers olddir/ -> newdir/
+    * MERGE_SIDE1 has cached olddir/newfile -> newdir/newfile
+  Given the cached rename noted above, the second merge can proceed as
+  expected without needing to perform rename detection from A -> A'.
+
+Case 2: MERGE_SIDE1 renames old dir, MERGE_SIDE2 renames  file into old dir
+
+  This case looks like this:
+    MERGE_BASE:   E    oldfile, olddir/
+    MERGE_SIDE1:  G    oldfile, olddir/ -> newdir/
+    MERGE_SIDE2:  A    oldfile -> olddir/newfile
+    => creates    A',  With newdir/newfile representing original oldfile
+
+    MERGE_BASE:   A    olddir/newfile
+    MERGE_SIDE1:  A'   newdir/newfile
+    MERGE_SIDE2:  B    modify olddir/newfile
+    => expected   B',  with threeway-merged newdir/newfile from above
+
+  In this case, with the optimization, note that after the first commit:
+    * MERGE_SIDE1 remembers olddir/ -> newdir/
+    * MERGE_SIDE1 has cached olddir/newfile -> newdir/newfile
+		  (NOT oldfile -> newdir/newfile; compare to case with
+		   (p->status == 'R' && new_path) in possibly_cache_new_pair())
+
+  Given the cached rename noted above, the second merge can proceed as
+  expected without needing to perform rename detection from A -> A'.
+
+Case 3: MERGE_SIDE1 adds new file to   old dir, MERGE_SIDE2 renames old dir
+
+  This case looks like this:
+
+    MERGE_BASE:   E,   Has olddir/
+    MERGE_SIDE1:  G,   Adds olddir/newfile
+    MERGE_SIDE2:  A,   Renames olddir/ -> newdir/
+    => creates    A',  With newdir/newfile
+
+    MERGE_BASE:   A,   Has newdir/, but no notion of newdir/newfile
+    MERGE_SIDE1:  A',  Has newdir/newfile
+    MERGE_SIDE2:  B,   Has newdir/, but no notion of newdir/newfile
+    => expected   B',  with newdir/newfile from A'
+
+  In this case, with the optimization, note that after the first commit there
+  were no renames on MERGE_SIDE1, and any renames on MERGE_SIDE2 are tossed.
+  But the second merge didn't need any renames so this is fine.
+
+Case 4: MERGE_SIDE1 renames  file into old dir, MERGE_SIDE2 renames old dir
+
+  This case looks like this:
+
+    MERGE_BASE:   E,   Has olddir/
+    MERGE_SIDE1:  G,   Renames oldfile -> olddir/newfile
+    MERGE_SIDE2:  A,   Renames olddir/ -> newdir/
+    => creates    A',  With newdir/newfile representing original oldfile
+
+    MERGE_BASE:   A,   Has oldfile
+    MERGE_SIDE1:  A',  Has newdir/newfile
+    MERGE_SIDE2:  B,   Modifies oldfile
+    => expected   B',  with threeway-merged newdir/newfile from above
+
+  In this case, with the optimization, note that after the first commit:
+    * MERGE_SIDE1 remembers oldfile -> newdir/newfile
+		  (NOT oldfile -> olddir/newfile; compare to case of second
+		   block under p->status == 'R' in possibly_cache_new_pair())
+    * MERGE_SIDE2 renames are tossed because only MERGE_SIDE1 is remembered
+
+  Given the cached rename noted above, the second merge can proceed as
+  expected without needing to perform rename detection from A -> A'.
+
+Finally, I'll just note here that interactions with the
+skip-irrelevant-renames optimization means we sometimes don't detect
+renames for any files within a directory that was renamed, in which
+case we will not have been able to detect any rename for the directory
+itself.  In such a case, we do not know whether the directory was
+renamed; we want to be careful to avoid cacheing some kind of "this
+directory was not renamed" statement.  If we did, then a subsequent
+commit being rebased could add a file to the old directory, and the
+user would expect it to end up in the correct directory -- something
+our erroneous "this directory was not renamed" cache would preclude.
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v2 03/13] fast-rebase: change assert() to BUG()
  2021-05-04  2:12 ` [PATCH v2 00/13] " Elijah Newren via GitGitGadget
  2021-05-04  2:12   ` [PATCH v2 01/13] t6423: rename file within directory that other side renamed Elijah Newren via GitGitGadget
  2021-05-04  2:12   ` [PATCH v2 02/13] Documentation/technical: describe remembering renames optimization Elijah Newren via GitGitGadget
@ 2021-05-04  2:12   ` Elijah Newren via GitGitGadget
  2021-05-04  2:12   ` [PATCH v2 04/13] fast-rebase: write conflict state to working tree, index, and HEAD Elijah Newren via GitGitGadget
                     ` (10 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-05-04  2:12 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Elijah Newren,
	Derrick Stolee, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

assert() can succinctly document expectations for the code, and do so in
a way that may be useful to future folks trying to refactor the code and
change basic assumptions; it allows them to more quickly find some
places where their violations of previous assumptions trips things up.

Unfortunately, assert() can surround a function call with important
side-effects, which is a huge mistake since some users will compile with
assertions disabled.  I've had to debug such mistakes before in other
codebases, so I should know better.  Luckily, this was only in test
code, but it's still very embarrassing.  Change an assert() to an if
(...) BUG (...).

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 t/helper/test-fast-rebase.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/t/helper/test-fast-rebase.c b/t/helper/test-fast-rebase.c
index 373212256a66..39fb7f41e8c1 100644
--- a/t/helper/test-fast-rebase.c
+++ b/t/helper/test-fast-rebase.c
@@ -124,7 +124,8 @@ int cmd__fast_rebase(int argc, const char **argv)
 	assert(oideq(&onto->object.oid, &head));
 
 	hold_locked_index(&lock, LOCK_DIE_ON_ERROR);
-	assert(repo_read_index(the_repository) >= 0);
+	if (repo_read_index(the_repository) < 0)
+		BUG("Could not read index");
 
 	repo_init_revisions(the_repository, &revs, NULL);
 	revs.verbose_header = 1;
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v2 04/13] fast-rebase: write conflict state to working tree, index, and HEAD
  2021-05-04  2:12 ` [PATCH v2 00/13] " Elijah Newren via GitGitGadget
                     ` (2 preceding siblings ...)
  2021-05-04  2:12   ` [PATCH v2 03/13] fast-rebase: change assert() to BUG() Elijah Newren via GitGitGadget
@ 2021-05-04  2:12   ` Elijah Newren via GitGitGadget
  2021-05-17 13:32     ` Derrick Stolee
  2021-05-04  2:12   ` [PATCH v2 05/13] t6429: testcases for remembering renames Elijah Newren via GitGitGadget
                     ` (9 subsequent siblings)
  13 siblings, 1 reply; 32+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-05-04  2:12 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Elijah Newren,
	Derrick Stolee, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

Previously, when fast-rebase hit a conflict, it simply aborted and left
HEAD, the index, and the working tree where they were before the
operation started.  While fast-rebase does not support restarting from a
conflicted state, write the conflicted state out anyway as it gives us a
way to see what the conflicts are and write tests that check for them.

This will be important in the upcoming commits, because sequencer.c is
only superficially integrated with merge-ort.c; in particular, it calls
merge_switch_to_result() after EACH merge instead of only calling it at
the end of all the sequence of merges (or when a conflict is hit).  This
not only causes needless updates to the working copy and index, but also
causes all intermediate data to be freed and tossed, preventing caching
information from one merge to the next.  However, integrating
sequencer.c more deeply with merge-ort.c is a big task, and making this
small extension to fast-rebase.c provides us with a simple way to test
the edge and corner cases that we want to make sure continue working.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 t/helper/test-fast-rebase.c | 51 +++++++++++++++++++++++--------------
 1 file changed, 32 insertions(+), 19 deletions(-)

diff --git a/t/helper/test-fast-rebase.c b/t/helper/test-fast-rebase.c
index 39fb7f41e8c1..fc2d46090434 100644
--- a/t/helper/test-fast-rebase.c
+++ b/t/helper/test-fast-rebase.c
@@ -91,7 +91,6 @@ int cmd__fast_rebase(int argc, const char **argv)
 	struct commit *last_commit = NULL, *last_picked_commit = NULL;
 	struct object_id head;
 	struct lock_file lock = LOCK_INIT;
-	int clean = 1;
 	struct strvec rev_walk_args = STRVEC_INIT;
 	struct rev_info revs;
 	struct commit *commit;
@@ -176,11 +175,10 @@ int cmd__fast_rebase(int argc, const char **argv)
 		free((char*)merge_opt.ancestor);
 		merge_opt.ancestor = NULL;
 		if (!result.clean)
-			die("Aborting: Hit a conflict and restarting is not implemented.");
+			break;
 		last_picked_commit = commit;
 		last_commit = create_commit(result.tree, commit, last_commit);
 	}
-	fprintf(stderr, "\nDone.\n");
 	/* TODO: There should be some kind of rev_info_free(&revs) call... */
 	memset(&revs, 0, sizeof(revs));
 
@@ -189,24 +187,39 @@ int cmd__fast_rebase(int argc, const char **argv)
 	if (result.clean < 0)
 		exit(128);
 
-	strbuf_addf(&reflog_msg, "finish rebase %s onto %s",
-		    oid_to_hex(&last_picked_commit->object.oid),
-		    oid_to_hex(&last_commit->object.oid));
-	if (update_ref(reflog_msg.buf, branch_name.buf,
-		       &last_commit->object.oid,
-		       &last_picked_commit->object.oid,
-		       REF_NO_DEREF, UPDATE_REFS_MSG_ON_ERR)) {
-		error(_("could not update %s"), argv[4]);
-		die("Failed to update %s", argv[4]);
+	if (result.clean) {
+		fprintf(stderr, "\nDone.\n");
+		strbuf_addf(&reflog_msg, "finish rebase %s onto %s",
+			    oid_to_hex(&last_picked_commit->object.oid),
+			    oid_to_hex(&last_commit->object.oid));
+		if (update_ref(reflog_msg.buf, branch_name.buf,
+			       &last_commit->object.oid,
+			       &last_picked_commit->object.oid,
+			       REF_NO_DEREF, UPDATE_REFS_MSG_ON_ERR)) {
+			error(_("could not update %s"), argv[4]);
+			die("Failed to update %s", argv[4]);
+		}
+		if (create_symref("HEAD", branch_name.buf, reflog_msg.buf) < 0)
+			die(_("unable to update HEAD"));
+		strbuf_release(&reflog_msg);
+		strbuf_release(&branch_name);
+
+		prime_cache_tree(the_repository, the_repository->index,
+				 result.tree);
+	} else {
+		fprintf(stderr, "\nAborting: Hit a conflict.\n");
+		strbuf_addf(&reflog_msg, "rebase progress up to %s",
+			    oid_to_hex(&last_picked_commit->object.oid));
+		if (update_ref(reflog_msg.buf, "HEAD",
+			       &last_commit->object.oid,
+			       &head,
+			       REF_NO_DEREF, UPDATE_REFS_MSG_ON_ERR)) {
+			error(_("could not update %s"), argv[4]);
+			die("Failed to update %s", argv[4]);
+		}
 	}
-	if (create_symref("HEAD", branch_name.buf, reflog_msg.buf) < 0)
-		die(_("unable to update HEAD"));
-	strbuf_release(&reflog_msg);
-	strbuf_release(&branch_name);
-
-	prime_cache_tree(the_repository, the_repository->index, result.tree);
 	if (write_locked_index(&the_index, &lock,
 			       COMMIT_LOCK | SKIP_IF_UNCHANGED))
 		die(_("unable to write %s"), get_index_file());
-	return (clean == 0);
+	return (result.clean == 0);
 }
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v2 05/13] t6429: testcases for remembering renames
  2021-05-04  2:12 ` [PATCH v2 00/13] " Elijah Newren via GitGitGadget
                     ` (3 preceding siblings ...)
  2021-05-04  2:12   ` [PATCH v2 04/13] fast-rebase: write conflict state to working tree, index, and HEAD Elijah Newren via GitGitGadget
@ 2021-05-04  2:12   ` Elijah Newren via GitGitGadget
  2021-05-04  2:12   ` [PATCH v2 06/13] merge-ort: add data structures for in-memory caching of rename detection Elijah Newren via GitGitGadget
                     ` (8 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-05-04  2:12 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Elijah Newren,
	Derrick Stolee, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

We will soon be adding an optimization that caches (in memory only,
never written to disk) upstream renames during a sequence of merges such
as occurs during a cherry-pick or rebase operation.  Add several tests
meant to stress such an implementation to ensure it does the right
thing, and include a test whose outcome we will later change due to this
optimization as well.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 .../technical/remembering-renames.txt         |  14 +-
 t/t6429-merge-sequence-rename-caching.sh      | 692 ++++++++++++++++++
 2 files changed, 700 insertions(+), 6 deletions(-)
 create mode 100755 t/t6429-merge-sequence-rename-caching.sh

diff --git a/Documentation/technical/remembering-renames.txt b/Documentation/technical/remembering-renames.txt
index e7c298d83df2..a1d3499ca727 100644
--- a/Documentation/technical/remembering-renames.txt
+++ b/Documentation/technical/remembering-renames.txt
@@ -214,12 +214,14 @@ in-memory cache of renames and thus doesn't need to be considered further.
 In the special case that E->A does rename the file but also renames it to
 newfile, then there is no conflict from the renaming and the merge can
 succeed.  In this special case, the rename is not valid to cache because
-the second merge will find A:newfile in the MERGE_BASE.  So a
-rename/rename(1to1) needs to be specially handled by pruning renames from
-the cache and decrementing the dir_rename_counts in the current and leading
-directories associated with those renames.  Or, since these are really
-rare, one could just take the easy way out and disable the remembering
-renames optimization when a rename/rename(1to1) happens.
+the second merge will find A:newfile in the MERGE_BASE (see also the new
+testcases in t6429 with "rename same file identically" in their
+description).  So a rename/rename(1to1) needs to be specially handled by
+pruning renames from the cache and decrementing the dir_rename_counts in
+the current and leading directories associated with those renames.  Or,
+since these are really rare, one could just take the easy way out and
+disable the remembering renames optimization when a rename/rename(1to1)
+happens.
 
 The previous paragraph handled the cases for E->A renaming oldfile, let's
 continue assuming that oldfile is not renamed in A.
diff --git a/t/t6429-merge-sequence-rename-caching.sh b/t/t6429-merge-sequence-rename-caching.sh
new file mode 100755
index 000000000000..f47d8924ee73
--- /dev/null
+++ b/t/t6429-merge-sequence-rename-caching.sh
@@ -0,0 +1,692 @@
+#!/bin/sh
+
+test_description="remember regular & dir renames in sequence of merges"
+
+. ./test-lib.sh
+
+#
+# NOTE 1: this testfile tends to not only rename files, but modify on both
+#         sides; without modifying on both sides, optimizations can kick in
+#         which make rename detection irrelevant or trivial.  We want to make
+#         sure that we are triggering rename caching rather than rename
+#         bypassing.
+#
+# NOTE 2: this testfile uses 'test-tool fast-rebase' instead of either
+#         cherry-pick or rebase.  sequencer.c is only superficially
+#         integrated with merge-ort; it calls merge_switch_to_result()
+#         after EACH merge, which updates the index and working copy AND
+#         throws away the cached results (because merge_switch_to_result()
+#         is only supposed to be called at the end of the sequence).
+#         Integrating them more deeply is a big task, so for now the tests
+#         use 'test-tool fast-rebase'.
+#
+
+
+#
+# In the following simple testcase:
+#   Base:     numbers_1, values_1
+#   Upstream: numbers_2, values_2
+#   Topic_1:  sequence_3
+#   Topic_2:  scruples_3
+# or, in english, rename numbers -> sequence in the first commit, and rename
+# values -> scruples in the second commit.
+#
+# This shouldn't be a challenge, it's just verifying that cached renames isn't
+# preventing us from finding new renames.
+#
+test_expect_success 'caching renames does not preclude finding new ones' '
+	test_create_repo caching-renames-and-new-renames &&
+	(
+		cd caching-renames-and-new-renames &&
+
+		test_seq 2 10 >numbers &&
+		test_seq 2 10 >values &&
+		git add numbers values &&
+		git commit -m orig &&
+
+		git branch upstream &&
+		git branch topic &&
+
+		git switch upstream &&
+		test_seq 1 10 >numbers &&
+		test_seq 1 10 >values &&
+		git add numbers values &&
+		git commit -m "Tweaked both files" &&
+
+		git switch topic &&
+
+		test_seq 2 12 >numbers &&
+		git add numbers &&
+		git mv numbers sequence &&
+		git commit -m A &&
+
+		test_seq 2 12 >values &&
+		git add values &&
+		git mv values scruples &&
+		git commit -m B &&
+
+		#
+		# Actual testing
+		#
+
+		git switch upstream &&
+
+		test-tool fast-rebase --onto HEAD upstream~1 topic &&
+		#git cherry-pick upstream~1..topic
+
+		git ls-files >tracked-files &&
+		test_line_count = 2 tracked-files &&
+		test_seq 1 12 >expect &&
+		test_cmp expect sequence &&
+		test_cmp expect scruples
+	)
+'
+
+#
+# In the following testcase:
+#   Base:     numbers_1
+#   Upstream: rename numbers_1 -> sequence_2
+#   Topic_1:  numbers_3
+#   Topic_2:  numbers_1
+# or, in english, the first commit on the topic branch modifies numbers by
+# shrinking it (dramatically) and the second commit on topic reverts its
+# parent.
+#
+# Can git apply both patches?
+#
+# Traditional cherry-pick/rebase will fail to apply the second commit, the
+# one that reverted its parent, because despite detecting the rename from
+# 'numbers' to 'sequence' for the first commit, it fails to detect that
+# rename when picking the second commit.  That's "reasonable" given the
+# dramatic change in size of the file, but remembering the rename and
+# reusing it is reasonable too.
+#
+# Rename detection (diffcore_rename_extended()) will run twice here; it is
+# not needed on the topic side of history for either of the two commits
+# being merged, but it is needed on the upstream side of history for each
+# commit being picked.
+test_expect_success 'cherry-pick both a commit and its immediate revert' '
+	test_create_repo pick-commit-and-its-immediate-revert &&
+	(
+		cd pick-commit-and-its-immediate-revert &&
+
+		test_seq 11 30 >numbers &&
+		git add numbers &&
+		git commit -m orig &&
+
+		git branch upstream &&
+		git branch topic &&
+
+		git switch upstream &&
+		test_seq 1 30 >numbers &&
+		git add numbers &&
+		git mv numbers sequence &&
+		git commit -m "Renamed (and modified) numbers -> sequence" &&
+
+		git switch topic &&
+
+		test_seq 11 13 >numbers &&
+		git add numbers &&
+		git commit -m A &&
+
+		git revert HEAD &&
+
+		#
+		# Actual testing
+		#
+
+		git switch upstream &&
+
+		GIT_TRACE2_PERF="$(pwd)/trace.output" &&
+		export GIT_TRACE2_PERF &&
+
+		test_might_fail test-tool fast-rebase --onto HEAD upstream~1 topic &&
+		#git cherry-pick upstream~1..topic &&
+
+		grep region_enter.*diffcore_rename trace.output >calls &&
+		test_line_count = 2 calls
+	)
+'
+
+#
+# In the following testcase:
+#   Base:     sequence_1
+#   Upstream: rename sequence_1 -> values_2
+#   Topic_1:  rename sequence_1 -> values_3
+#   Topic_2:  add unrelated sequence_4
+# or, in english, both sides rename sequence -> values, and then the second
+# commit on the topic branch adds an unrelated file called sequence.
+#
+# This testcase presents no problems for git traditionally, but having both
+# sides do the same rename in effect "uses it up" and if it remains cached,
+# could cause a spurious rename/add conflict.
+#
+test_expect_success 'rename same file identically, then reintroduce it' '
+	test_create_repo rename-rename-1to1-then-add-old-filename &&
+	(
+		cd rename-rename-1to1-then-add-old-filename &&
+
+		test_seq 3 8 >sequence &&
+		git add sequence &&
+		git commit -m orig &&
+
+		git branch upstream &&
+		git branch topic &&
+
+		git switch upstream &&
+		test_seq 1 8 >sequence &&
+		git add sequence &&
+		git mv sequence values &&
+		git commit -m "Renamed (and modified) sequence -> values" &&
+
+		git switch topic &&
+
+		test_seq 3 10 >sequence &&
+		git add sequence &&
+		git mv sequence values &&
+		git commit -m A &&
+
+		test_write_lines A B C D E F G H I J >sequence &&
+		git add sequence &&
+		git commit -m B &&
+
+		#
+		# Actual testing
+		#
+
+		git switch upstream &&
+
+		GIT_TRACE2_PERF="$(pwd)/trace.output" &&
+		export GIT_TRACE2_PERF &&
+
+		test-tool fast-rebase --onto HEAD upstream~1 topic &&
+		#git cherry-pick upstream~1..topic &&
+
+		git ls-files >tracked &&
+		test_line_count = 2 tracked &&
+		test_path_is_file values &&
+		test_path_is_file sequence &&
+
+		grep region_enter.*diffcore_rename trace.output >calls &&
+		test_line_count = 2 calls
+	)
+'
+
+#
+# In the following testcase:
+#   Base:     olddir/{valuesZ_1, valuesY_1, valuesX_1}
+#   Upstream: rename olddir/valuesZ_1 -> dirA/valuesZ_2
+#             rename olddir/valuesY_1 -> dirA/valuesY_2
+#             rename olddir/valuesX_1 -> dirB/valuesX_2
+#   Topic_1:  rename olddir/valuesZ_1 -> dirA/valuesZ_3
+#             rename olddir/valuesY_1 -> dirA/valuesY_3
+#   Topic_2:  add olddir/newfile
+#   Expected Pick1: dirA/{valuesZ, valuesY}, dirB/valuesX
+#   Expected Pick2: dirA/{valuesZ, valuesY}, dirB/{valuesX, newfile}
+#
+# This testcase presents no problems for git traditionally, but having both
+# sides do the same renames in effect "use it up" but if the renames remain
+# cached, the directory rename could put newfile in the wrong directory.
+#
+test_expect_success 'rename same file identically, then add file to old dir' '
+	test_create_repo rename-rename-1to1-then-add-file-to-old-dir &&
+	(
+		cd rename-rename-1to1-then-add-file-to-old-dir &&
+
+		mkdir olddir/ &&
+		test_seq 3 8 >olddir/valuesZ &&
+		test_seq 3 8 >olddir/valuesY &&
+		test_seq 3 8 >olddir/valuesX &&
+		git add olddir &&
+		git commit -m orig &&
+
+		git branch upstream &&
+		git branch topic &&
+
+		git switch upstream &&
+		test_seq 1 8 >olddir/valuesZ &&
+		test_seq 1 8 >olddir/valuesY &&
+		test_seq 1 8 >olddir/valuesX &&
+		git add olddir &&
+		mkdir dirA &&
+		git mv olddir/valuesZ olddir/valuesY dirA &&
+		git mv olddir/ dirB/ &&
+		git commit -m "Renamed (and modified) values*" &&
+
+		git switch topic &&
+
+		test_seq 3 10 >olddir/valuesZ &&
+		test_seq 3 10 >olddir/valuesY &&
+		git add olddir &&
+		mkdir dirA &&
+		git mv olddir/valuesZ olddir/valuesY dirA &&
+		git commit -m A &&
+
+		>olddir/newfile &&
+		git add olddir/newfile &&
+		git commit -m B &&
+
+		#
+		# Actual testing
+		#
+
+		git switch upstream &&
+		git config merge.directoryRenames true &&
+
+		GIT_TRACE2_PERF="$(pwd)/trace.output" &&
+		export GIT_TRACE2_PERF &&
+
+		test-tool fast-rebase --onto HEAD upstream~1 topic &&
+		#git cherry-pick upstream~1..topic &&
+
+		git ls-files >tracked &&
+		test_line_count = 4 tracked &&
+		test_path_is_file dirA/valuesZ &&
+		test_path_is_file dirA/valuesY &&
+		test_path_is_file dirB/valuesX &&
+		test_path_is_file dirB/newfile &&
+
+		grep region_enter.*diffcore_rename trace.output >calls &&
+		test_line_count = 3 calls
+	)
+'
+
+#
+# In the following testcase, upstream renames a directory, and the topic branch
+# first adds a file to the directory, then later renames the directory
+# differently:
+#   Base:     olddir/a
+#             olddir/b
+#   Upstream: rename olddir/ -> newdir/
+#   Topic_1:  add olddir/newfile
+#   Topic_2:  rename olddir/ -> otherdir/
+#
+# Here we are just concerned that cached renames might prevent us from seeing
+# the rename conflict, and we want to ensure that we do get a conflict.
+#
+# While at it, also test that we do rename detection three times.  We have to
+# detect renames on the upstream side of history once for each merge, plus
+# Topic_2 has renames.
+#
+test_expect_success 'cached dir rename does not prevent noticing later conflict' '
+	test_create_repo dir-rename-cache-not-occluding-later-conflict &&
+	(
+		cd dir-rename-cache-not-occluding-later-conflict &&
+
+		mkdir olddir &&
+		test_seq 3 10 >olddir/a &&
+		test_seq 3 10 >olddir/b &&
+		git add olddir &&
+		git commit -m orig &&
+
+		git branch upstream &&
+		git branch topic &&
+
+		git switch upstream &&
+		test_seq 3 10 >olddir/a &&
+		test_seq 3 10 >olddir/b &&
+		git add olddir &&
+		git mv olddir newdir &&
+		git commit -m "Dir renamed" &&
+
+		git switch topic &&
+
+		>olddir/newfile &&
+		git add olddir/newfile &&
+		git commit -m A &&
+
+		test_seq 1 8 >olddir/a &&
+		test_seq 1 8 >olddir/b &&
+		git add olddir &&
+		git mv olddir otherdir &&
+		git commit -m B &&
+
+		#
+		# Actual testing
+		#
+
+		git switch upstream &&
+		git config merge.directoryRenames true &&
+
+		GIT_TRACE2_PERF="$(pwd)/trace.output" &&
+		export GIT_TRACE2_PERF &&
+
+		test_must_fail test-tool fast-rebase --onto HEAD upstream~1 topic >output &&
+		#git cherry-pick upstream..topic &&
+
+		grep CONFLICT..rename/rename output &&
+
+		grep region_enter.*diffcore_rename trace.output >calls &&
+		test_line_count = 3 calls
+	)
+'
+
+# Helper for the next two tests
+test_setup_upstream_rename () {
+	test_create_repo $1 &&
+	(
+		cd $1 &&
+
+		test_seq 3 8 >somefile &&
+		test_seq 3 8 >relevant-rename &&
+		git add somefile relevant-rename &&
+		mkdir olddir &&
+		test_write_lines a b c d e f g >olddir/a &&
+		test_write_lines z y x w v u t >olddir/b &&
+		git add olddir &&
+		git commit -m orig &&
+
+		git branch upstream &&
+		git branch topic &&
+
+		git switch upstream &&
+		test_seq 1 8 >somefile &&
+		test_seq 1 8 >relevant-rename &&
+		git add somefile relevant-rename &&
+		git mv relevant-rename renamed &&
+		echo h >>olddir/a &&
+		echo s >>olddir/b &&
+		git add olddir &&
+		git mv olddir newdir &&
+		git commit -m "Dir renamed"
+	)
+}
+
+#
+# In the following testcase, upstream renames a file in the toplevel directory
+# as well as its only directory:
+#   Base:     relevant-rename_1
+#             somefile
+#             olddir/a
+#             olddir/b
+#   Upstream: rename relevant-rename_1 -> renamed_2
+#             rename olddir/           -> newdir/
+#   Topic_1:  relevant-rename_3
+#   Topic_2:  olddir/newfile_1
+#   Topic_3:  olddir/newfile_2
+#
+# In this testcase, since the first commit being picked only modifies a
+# file in the toplevel directory, the directory rename is irrelevant for
+# that first merge.  However, we need to notice the directory rename for
+# the merge that picks the second commit, and we don't want the third
+# commit to mess up its location either.  We want to make sure that
+# olddir/newfile doesn't exist in the result and that newdir/newfile does.
+#
+# We also expect rename detection to occur three times.  Although it is
+# typically needed two times per commit, there are no deleted files on the
+# topic side of history, so we only need to detect renames on the upstream
+# side for each of the 3 commits we need to pick.
+#
+test_expect_success 'dir rename unneeded, then add new file to old dir' '
+	test_setup_upstream_rename dir-rename-unneeded-until-new-file &&
+	(
+		cd dir-rename-unneeded-until-new-file &&
+
+		git switch topic &&
+
+		test_seq 3 10 >relevant-rename &&
+		git add relevant-rename &&
+		git commit -m A &&
+
+		echo foo >olddir/newfile &&
+		git add olddir/newfile &&
+		git commit -m B &&
+
+		echo bar >>olddir/newfile &&
+		git add olddir/newfile &&
+		git commit -m C &&
+
+		#
+		# Actual testing
+		#
+
+		git switch upstream &&
+		git config merge.directoryRenames true &&
+
+		GIT_TRACE2_PERF="$(pwd)/trace.output" &&
+		export GIT_TRACE2_PERF &&
+
+		test-tool fast-rebase --onto HEAD upstream~1 topic &&
+		#git cherry-pick upstream..topic &&
+
+		grep region_enter.*diffcore_rename trace.output >calls &&
+		test_line_count = 3 calls &&
+
+		git ls-files >tracked &&
+		test_line_count = 5 tracked &&
+		test_path_is_missing olddir/newfile &&
+		test_path_is_file newdir/newfile
+	)
+'
+
+#
+# The following testcase is *very* similar to the last one, but instead of
+# adding a new olddir/newfile, it renames somefile -> olddir/newfile:
+#   Base:     relevant-rename_1
+#             somefile_1
+#             olddir/a
+#             olddir/b
+#   Upstream: rename relevant-rename_1 -> renamed_2
+#             rename olddir/           -> newdir/
+#   Topic_1:  relevant-rename_3
+#   Topic_2:  rename somefile -> olddir/newfile_2
+#   Topic_3:  modify olddir/newfile_3
+#
+# In this testcase, since the first commit being picked only modifies a
+# file in the toplevel directory, the directory rename is irrelevant for
+# that first merge.  However, we need to notice the directory rename for
+# the merge that picks the second commit, and we don't want the third
+# commit to mess up its location either.  We want to make sure that
+# neither somefile or olddir/newfile exists in the result and that
+# newdir/newfile does.
+#
+# This testcase needs one more call to rename detection than the last
+# testcase, because of the somefile -> olddir/newfile rename in Topic_2.
+test_expect_success 'dir rename unneeded, then rename existing file into old dir' '
+	test_setup_upstream_rename dir-rename-unneeded-until-file-moved-inside &&
+	(
+		cd dir-rename-unneeded-until-file-moved-inside &&
+
+		git switch topic &&
+
+		test_seq 3 10 >relevant-rename &&
+		git add relevant-rename &&
+		git commit -m A &&
+
+		test_seq 1 10 >somefile &&
+		git add somefile &&
+		git mv somefile olddir/newfile &&
+		git commit -m B &&
+
+		test_seq 1 12 >olddir/newfile &&
+		git add olddir/newfile &&
+		git commit -m C &&
+
+		#
+		# Actual testing
+		#
+
+		git switch upstream &&
+		git config merge.directoryRenames true &&
+
+		GIT_TRACE2_PERF="$(pwd)/trace.output" &&
+		export GIT_TRACE2_PERF &&
+
+		test-tool fast-rebase --onto HEAD upstream~1 topic &&
+		#git cherry-pick upstream..topic &&
+
+		grep region_enter.*diffcore_rename trace.output >calls &&
+		test_line_count = 4 calls &&
+
+		test_path_is_missing somefile &&
+		test_path_is_missing olddir/newfile &&
+		test_path_is_file newdir/newfile &&
+		git ls-files >tracked &&
+		test_line_count = 4 tracked
+	)
+'
+
+# Helper for the next two tests
+test_setup_topic_rename () {
+	test_create_repo $1 &&
+	(
+		cd $1 &&
+
+		test_seq 3 8 >somefile &&
+		mkdir olddir &&
+		test_seq 3 8 >olddir/a &&
+		echo b >olddir/b &&
+		git add olddir somefile &&
+		git commit -m orig &&
+
+		git branch upstream &&
+		git branch topic &&
+
+		git switch topic &&
+		test_seq 1 8 >somefile &&
+		test_seq 1 8 >olddir/a &&
+		git add somefile olddir/a &&
+		git mv olddir newdir &&
+		git commit -m "Dir renamed" &&
+
+		test_seq 1 10 >somefile &&
+		git add somefile &&
+		mkdir olddir &&
+		>olddir/unrelated-file &&
+		git add olddir &&
+		git commit -m "Unrelated file in recreated old dir"
+	)
+}
+
+#
+# In the following testcase, the first commit on the topic branch renames
+# a directory, while the second recreates the old directory and places a
+# file into it:
+#   Base:     somefile
+#             olddir/a
+#             olddir/b
+#   Upstream: olddir/newfile
+#   Topic_1:  somefile_2
+#             rename olddir/ -> newdir/
+#   Topic_2:  olddir/unrelated-file
+#
+# Note that the first pick should merge:
+#   Base:     somefile
+#             olddir/{a,b}
+#   Upstream: olddir/newfile
+#   Topic_1:  rename olddir/ -> newdir/
+# For which the expected result (assuming merge.directoryRenames=true) is
+# clearly:
+#   Result:   somefile
+#             newdir/{a, b, newfile}
+#
+# While the second pick does the following three-way merge:
+#   Base (Topic_1):           somefile
+#                             newdir/{a,b}
+#   Upstream (Result from 1): same files as base, but adds newdir/newfile
+#   Topic_2:                  same files as base, but adds olddir/unrelated-file
+#
+# The second merge is pretty trivial; upstream adds newdir/newfile, and
+# topic_2 adds olddir/unrelated-file.  We're just testing that we don't
+# accidentally cache directory renames somehow and rename
+# olddir/unrelated-file to newdir/unrelated-file.
+#
+# This testcase should only need one call to diffcore_rename_extended().
+test_expect_success 'caching renames only on upstream side, part 1' '
+	test_setup_topic_rename cache-renames-only-upstream-add-file &&
+	(
+		cd cache-renames-only-upstream-add-file &&
+
+		git switch upstream &&
+
+		>olddir/newfile &&
+		git add olddir/newfile &&
+		git commit -m "Add newfile" &&
+
+		#
+		# Actual testing
+		#
+
+		git switch upstream &&
+
+		git config merge.directoryRenames true &&
+
+		GIT_TRACE2_PERF="$(pwd)/trace.output" &&
+		export GIT_TRACE2_PERF &&
+
+		test-tool fast-rebase --onto HEAD upstream~1 topic &&
+		#git cherry-pick upstream..topic &&
+
+		grep region_enter.*diffcore_rename trace.output >calls &&
+		test_line_count = 1 calls &&
+
+		git ls-files >tracked &&
+		test_line_count = 5 tracked &&
+		test_path_is_missing newdir/unrelated-file &&
+		test_path_is_file olddir/unrelated-file &&
+		test_path_is_file newdir/newfile &&
+		test_path_is_file newdir/b &&
+		test_path_is_file newdir/a &&
+		test_path_is_file somefile
+	)
+'
+
+#
+# The following testcase is *very* similar to the last one, but instead of
+# adding a new olddir/newfile, it renames somefile -> olddir/newfile:
+#   Base:     somefile
+#             olddir/a
+#             olddir/b
+#   Upstream: somefile_1 -> olddir/newfile
+#   Topic_1:  rename olddir/ -> newdir/
+#             somefile_2
+#   Topic_2:  olddir/unrelated-file
+#             somefile_3
+#
+# Much like the previous test, this case is actually trivial and we are just
+# making sure there isn't some spurious directory rename caching going on
+# for the wrong side of history.
+#
+#
+# This testcase should only need three calls to diffcore_rename_extended(),
+# because there are no renames on the topic side of history for picking
+# Topic_2.
+#
+test_expect_success 'caching renames only on upstream side, part 2' '
+	test_setup_topic_rename cache-renames-only-upstream-rename-file &&
+	(
+		cd cache-renames-only-upstream-rename-file &&
+
+		git switch upstream &&
+
+		git mv somefile olddir/newfile &&
+		git commit -m "Add newfile" &&
+
+		#
+		# Actual testing
+		#
+
+		git switch upstream &&
+
+		git config merge.directoryRenames true &&
+
+		GIT_TRACE2_PERF="$(pwd)/trace.output" &&
+		export GIT_TRACE2_PERF &&
+
+		test-tool fast-rebase --onto HEAD upstream~1 topic &&
+		#git cherry-pick upstream..topic &&
+
+		grep region_enter.*diffcore_rename trace.output >calls &&
+		test_line_count = 3 calls &&
+
+		git ls-files >tracked &&
+		test_line_count = 4 tracked &&
+		test_path_is_missing newdir/unrelated-file &&
+		test_path_is_file olddir/unrelated-file &&
+		test_path_is_file newdir/newfile &&
+		test_path_is_file newdir/b &&
+		test_path_is_file newdir/a
+	)
+'
+
+test_done
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v2 06/13] merge-ort: add data structures for in-memory caching of rename detection
  2021-05-04  2:12 ` [PATCH v2 00/13] " Elijah Newren via GitGitGadget
                     ` (4 preceding siblings ...)
  2021-05-04  2:12   ` [PATCH v2 05/13] t6429: testcases for remembering renames Elijah Newren via GitGitGadget
@ 2021-05-04  2:12   ` Elijah Newren via GitGitGadget
  2021-05-17 13:41     ` Derrick Stolee
  2021-05-04  2:12   ` [PATCH v2 07/13] merge-ort: populate caches of rename detection results Elijah Newren via GitGitGadget
                     ` (7 subsequent siblings)
  13 siblings, 1 reply; 32+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-05-04  2:12 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Elijah Newren,
	Derrick Stolee, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

When there are many renames between the old base of a series of commits
and the new base for a series of commits, the sequence of merges
employed to transplant those commits (from a cherry-pick or rebase
operation) will repeatedly detect the exact same renames.  This is
wasted effort.

Add data structures which will be used to cache rename detection
results, along with the initialization and deallocation of these data
structures.  Future commits will populate these caches, detect the
appropriate circumstances when they can be used, and employ them to
avoid re-detecting the same renames repeatedly.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/merge-ort.c b/merge-ort.c
index b1795d838eaf..8602f88a960c 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -140,6 +140,37 @@ struct rename_info {
 	int callback_data_nr, callback_data_alloc;
 	char *callback_data_traverse_path;
 
+	/*
+	 * cached_pairs: Caching of renames and deletions.
+	 *
+	 * These are mappings recording renames and deletions of individual
+	 * files (not directories).  They are thus a map from an old
+	 * filename to either NULL (for deletions) or a new filename (for
+	 * renames).
+	 */
+	struct strmap cached_pairs[3];
+
+	/*
+	 * cached_target_names: just the destinations from cached_pairs
+	 *
+	 * We sometimes want a fast lookup to determine if a given filename
+	 * is one of the destinations in cached_pairs.  cached_target_names
+	 * is thus duplicative information, but it provides a fast lookup.
+	 */
+	struct strset cached_target_names[3];
+
+	/*
+	 * cached_irrelevant: Caching of rename_sources that aren't relevant.
+	 *
+	 * cached_pairs records both renames and deletes.  Sometimes we
+	 * do not know if a path is a rename or a delete because we pass
+	 * RELEVANT_LOCATION to diffcore_rename_extended() and based on
+	 * various optimizations it returns without detecting whether that
+	 * path is actually a rename or a delete.  We need to cache such
+	 * paths too, but separately from cached_pairs.
+	 */
+	struct strset cached_irrelevant[3];
+
 	/*
 	 * needed_limit: value needed for inexact rename detection to run
 	 *
@@ -382,6 +413,8 @@ static void clear_or_reinit_internal_opts(struct merge_options_internal *opti,
 		reinitialize ? strmap_partial_clear : strmap_clear;
 	void (*strintmap_func)(struct strintmap *) =
 		reinitialize ? strintmap_partial_clear : strintmap_clear;
+	void (*strset_func)(struct strset *) =
+		reinitialize ? strset_partial_clear : strset_clear;
 
 	/*
 	 * We marked opti->paths with strdup_strings = 0, so that we
@@ -425,6 +458,9 @@ static void clear_or_reinit_internal_opts(struct merge_options_internal *opti,
 		strmap_func(&renames->dir_renames[i], 0);
 
 		strintmap_func(&renames->relevant_sources[i]);
+		strset_func(&renames->cached_target_names[i]);
+		strmap_func(&renames->cached_pairs[i], 1);
+		strset_func(&renames->cached_irrelevant[i]);
 	}
 
 	if (!reinitialize) {
@@ -3667,6 +3703,12 @@ static void merge_start(struct merge_options *opt, struct merge_result *result)
 					 NULL, 0);
 		strintmap_init_with_options(&renames->relevant_sources[i],
 					    0, NULL, 0);
+		strmap_init_with_options(&renames->cached_pairs[i],
+					 NULL, 1);
+		strset_init_with_options(&renames->cached_irrelevant[i],
+					 NULL, 1);
+		strset_init_with_options(&renames->cached_target_names[i],
+					 NULL, 0);
 	}
 
 	/*
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v2 07/13] merge-ort: populate caches of rename detection results
  2021-05-04  2:12 ` [PATCH v2 00/13] " Elijah Newren via GitGitGadget
                     ` (5 preceding siblings ...)
  2021-05-04  2:12   ` [PATCH v2 06/13] merge-ort: add data structures for in-memory caching of rename detection Elijah Newren via GitGitGadget
@ 2021-05-04  2:12   ` Elijah Newren via GitGitGadget
  2021-05-17 13:51     ` Derrick Stolee
  2021-05-04  2:12   ` [PATCH v2 08/13] merge-ort: add code to check for whether cached renames can be reused Elijah Newren via GitGitGadget
                     ` (6 subsequent siblings)
  13 siblings, 1 reply; 32+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-05-04  2:12 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Elijah Newren,
	Derrick Stolee, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

Fill in cache_pairs, cached_target_names, and cached_irrelevant based on
rename detection results.  Future commits will make use of these values.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 72 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 70 insertions(+), 2 deletions(-)

diff --git a/merge-ort.c b/merge-ort.c
index 8602f88a960c..5523fc9e86b3 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -2333,6 +2333,65 @@ static void resolve_diffpair_statuses(struct diff_queue_struct *q)
 	}
 }
 
+static void possibly_cache_new_pair(struct rename_info *renames,
+				    struct diff_filepair *p,
+				    unsigned side,
+				    char *new_path)
+{
+	char *old_value;
+	int dir_renamed_side = 0;
+
+	if (new_path) {
+		/*
+		 * Directory renames happen on the other side of history from
+		 * the side that adds new files to the old directory.
+		 */
+		dir_renamed_side = 3 - side;
+	} else {
+		int val = strintmap_get(&renames->relevant_sources[side],
+					p->one->path);
+		if (val == RELEVANT_NO_MORE) {
+			assert(p->status == 'D');
+			strset_add(&renames->cached_irrelevant[side],
+				   p->one->path);
+		}
+		if (val <= 0)
+			return;
+	}
+
+	if (p->status == 'D') {
+		/*
+		 * If we already had this delete, we'll just set it's value
+		 * to NULL again, so no harm.
+		 */
+		strmap_put(&renames->cached_pairs[side], p->one->path, NULL);
+	} else if (p->status == 'R') {
+		if (new_path) {
+			new_path = xstrdup(new_path);
+			old_value = strmap_put(&renames->cached_pairs[dir_renamed_side],
+					       p->two->path, new_path);
+			strset_add(&renames->cached_target_names[dir_renamed_side],
+				   new_path);
+			assert(!old_value);
+		}
+		if (!new_path)
+			new_path = p->two->path;
+		new_path = xstrdup(new_path);
+		old_value = strmap_put(&renames->cached_pairs[side],
+				       p->one->path, new_path);
+		strset_add(&renames->cached_target_names[side],
+			   new_path);
+		free(old_value);
+	} else if (p->status == 'A' && new_path) {
+		new_path = xstrdup(new_path);
+		old_value = strmap_put(&renames->cached_pairs[dir_renamed_side],
+				       p->two->path, new_path);
+		strset_add(&renames->cached_target_names[dir_renamed_side],
+			   new_path);
+		assert(!old_value);
+	}
+}
+
 static int compare_pairs(const void *a_, const void *b_)
 {
 	const struct diff_filepair *a = *((const struct diff_filepair **)a_);
@@ -2415,6 +2474,7 @@ static int collect_renames(struct merge_options *opt,
 		char *new_path; /* non-NULL only with directory renames */
 
 		if (p->status != 'A' && p->status != 'R') {
+			possibly_cache_new_pair(renames, p, side_index, NULL);
 			diff_free_filepair(p);
 			continue;
 		}
@@ -2426,11 +2486,11 @@ static int collect_renames(struct merge_options *opt,
 						      &collisions,
 						      &clean);
 
+		possibly_cache_new_pair(renames, p, side_index, new_path);
 		if (p->status != 'R' && !new_path) {
 			diff_free_filepair(p);
 			continue;
 		}
-
 		if (new_path)
 			apply_directory_rename_modifications(opt, p, new_path);
 
@@ -3701,8 +3761,16 @@ static void merge_start(struct merge_options *opt, struct merge_result *result)
 					 NULL, 1);
 		strmap_init_with_options(&renames->dir_renames[i],
 					 NULL, 0);
+		/*
+		 * relevant_sources uses -1 for the default, because we need
+		 * to be able to distinguish not-in-strintmap from valid
+		 * relevant_source values from enum file_rename_relevance.
+		 * In particular, possibly_cache_new_pair() expects a negative
+		 * value for not-found entries.
+		 */
 		strintmap_init_with_options(&renames->relevant_sources[i],
-					    0, NULL, 0);
+					    -1 /* explicitly invalid */,
+					    NULL, 0);
 		strmap_init_with_options(&renames->cached_pairs[i],
 					 NULL, 1);
 		strset_init_with_options(&renames->cached_irrelevant[i],
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v2 08/13] merge-ort: add code to check for whether cached renames can be reused
  2021-05-04  2:12 ` [PATCH v2 00/13] " Elijah Newren via GitGitGadget
                     ` (6 preceding siblings ...)
  2021-05-04  2:12   ` [PATCH v2 07/13] merge-ort: populate caches of rename detection results Elijah Newren via GitGitGadget
@ 2021-05-04  2:12   ` Elijah Newren via GitGitGadget
  2021-05-04  2:12   ` [PATCH v2 09/13] merge-ort: avoid accidental API mis-use Elijah Newren via GitGitGadget
                     ` (5 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-05-04  2:12 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Elijah Newren,
	Derrick Stolee, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

We need to know when renames detected in a previous merge operation can
be reused in a later merge operation.  Consider the following setup
(from the git-rebase manpage):

                     A---B---C topic
                    /
               D---E---F---G master

After rebasing, this will appear as:

                             A'--B'--C' topic
                            /
               D---E---F---G master

Further, let's say that 'oldfile' was renamed to 'newfile' between E
and G.  The rebase or cherry-pick of A onto G will involve a three-way
merge between E (as the merge base) and G and A.  After detecting the
rename between E:oldfile and G:newfile, there will be a three-way
content merge of the following:
    E:oldfile
    G:newfile
    A:oldfile
and produce a new result:
    A':newfile

Now, when we want to pick B onto A', we will need to do a three-way
merge between A (as the merge-base) and A' and B.  This will involve
a three-way content merge of
    A:oldfile
    A':newfile
    B:oldfile
but only if we can detect that A:oldfile is similar enough to A':newfile
to be used together in a three-way content merge, i.e. only if we can
detect that A:oldfile and A':newfile are a rename.  But we already know
that A:oldfile and A':newfile are similar enough to be used in a
three-way content merge, because that is precisely where A':newfile came
from in the previous merge.

Note that A & A' both appear in both merges.  That gives us the
condition under which we can reuse renames.

There are a couple important points about this optimization:

  - If the rebase or cherry-pick halts for user conflicts, these caches
    are NOT saved anywhere.  Thus, resuming a halted rebase or
    cherry-pick will result in no reused renames for the next commit.
    This is intentional, as user resolution can change files
    significantly and in ways that violate the similarity assumptions
    here.

  - Technically, in a *very* narrow case this might give slightly
    different results for rename detection.  Using the example above,
    if:
      * E:oldfile had 20 lines
      * G:newfile added 10 new lines at the beginning of the file
      * A:oldfile deleted all but the first three lines of the file
    then
      => A':newfile would have 13 lines, 3 of which matches those
         in A:oldfile.

    Consider the two cases:
      * Without this optimization:
        - the next step of the rebase operation (moving B to B')
          would not detect the rename betwen A:oldfile and A':newfile
        - we'd thus get a modify/delete conflict with the rebase
          operation halting for the user to resolve, and have both
          A':newfile and B:oldfile sitting in the working tree.
      * With this optimization:
        - the rename between A:oldfile and A':newfile would be detected
          via the cache of renames
        - a three-way merge between A:oldfile, A':newfile, and B:oldfile
          would commence and be written to A':newfile

    Now, is the difference in behavior a bug...or a bugfix?  I can't
    tell.  Given that A:oldfile and A':newfile are not very similar,
    when we three-way merge with B:oldfile it seems likely we'll hit a
    conflict for the user to resolve.  And it shouldn't be too hard for
    users to see why we did that three-way merge; oldfile and newfile
    *were* renames somewhere in the sequence.  So, most of these corner
    cases will still behave similarly -- namely, a conflict given to the
    user to resolve.  Also, consider the interesting case when commit B
    is a clean revert of commit A.  Without this optimization, a rebase
    could not both apply a weird patch like A and then immediately
    revert it; users would be forced to resolve merge conflicts.  With
    this optimization, it would successfully apply the clean revert.
    So, there is certainly at least one case that behaves better.  Even
    if it's considered a "difference in behavior", I think both behaviors
    are reasonable, and the time savings provided by this optimization
    justify using the slightly altered rename heuristics.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 66 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 64 insertions(+), 2 deletions(-)

diff --git a/merge-ort.c b/merge-ort.c
index 5523fc9e86b3..a342cc6344fd 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -140,6 +140,30 @@ struct rename_info {
 	int callback_data_nr, callback_data_alloc;
 	char *callback_data_traverse_path;
 
+	/*
+	 * merge_trees: trees passed to the merge algorithm for the merge
+	 *
+	 * merge_trees records the trees passed to the merge algorithm.  But,
+	 * this data also is stored in merge_result->priv.  If a sequence of
+	 * merges are being done (such as when cherry-picking or rebasing),
+	 * the next merge can look at this and re-use information from
+	 * previous merges under certain cirumstances.
+	 *
+	 * See also all the cached_* variables.
+	 */
+	struct tree *merge_trees[3];
+
+	/*
+	 * cached_pairs_valid_side: which side's cached info can be reused
+	 *
+	 * See the description for merge_trees.  For repeated merges, at most
+	 * only one side's cached information can be used.  Valid values:
+	 *   MERGE_SIDE2: cached data from side2 can be reused
+	 *   MERGE_SIDE1: cached data from side1 can be reused
+	 *   0:           no cached data can be reused
+	 */
+	int cached_pairs_valid_side;
+
 	/*
 	 * cached_pairs: Caching of renames and deletions.
 	 *
@@ -462,6 +486,8 @@ static void clear_or_reinit_internal_opts(struct merge_options_internal *opti,
 		strmap_func(&renames->cached_pairs[i], 1);
 		strset_func(&renames->cached_irrelevant[i]);
 	}
+	renames->cached_pairs_valid_side = 0;
+	renames->dir_rename_mask = 0;
 
 	if (!reinitialize) {
 		struct hashmap_iter iter;
@@ -484,8 +510,6 @@ static void clear_or_reinit_internal_opts(struct merge_options_internal *opti,
 		strmap_clear(&opti->output, 0);
 	}
 
-	renames->dir_rename_mask = 0;
-
 	/* Clean out callback_data as well. */
 	FREE_AND_NULL(renames->callback_data);
 	renames->callback_data_nr = renames->callback_data_alloc = 0;
@@ -3802,6 +3826,35 @@ static void merge_start(struct merge_options *opt, struct merge_result *result)
 	trace2_region_leave("merge", "allocate/init", opt->repo);
 }
 
+static void merge_check_renames_reusable(struct merge_options *opt,
+					 struct merge_result *result,
+					 struct tree *merge_base,
+					 struct tree *side1,
+					 struct tree *side2)
+{
+	struct rename_info *renames;
+	struct tree **merge_trees;
+	struct merge_options_internal *opti = result->priv;
+
+	if (!opti)
+		return;
+
+	renames = &opti->renames;
+	merge_trees = renames->merge_trees;
+	/* merge_trees[0..2] will only be NULL if opti is */
+	assert(merge_trees[0] && merge_trees[1] && merge_trees[2]);
+
+	/* Check if we meet a condition for re-using cached_pairs */
+	if (     oideq(&merge_base->object.oid, &merge_trees[2]->object.oid) &&
+		 oideq(     &side1->object.oid, &result->tree->object.oid))
+		renames->cached_pairs_valid_side = MERGE_SIDE1;
+	else if (oideq(&merge_base->object.oid, &merge_trees[1]->object.oid) &&
+		 oideq(     &side2->object.oid, &result->tree->object.oid))
+		renames->cached_pairs_valid_side = MERGE_SIDE2;
+	else
+		renames->cached_pairs_valid_side = 0; /* neither side valid */
+}
+
 /*** Function Grouping: merge_incore_*() and their internal variants ***/
 
 /*
@@ -3949,7 +4002,16 @@ void merge_incore_nonrecursive(struct merge_options *opt,
 
 	trace2_region_enter("merge", "merge_start", opt->repo);
 	assert(opt->ancestor != NULL);
+	merge_check_renames_reusable(opt, result, merge_base, side1, side2);
 	merge_start(opt, result);
+	/*
+	 * Record the trees used in this merge, so if there's a next merge in
+	 * a cherry-pick or rebase sequence it might be able to take advantage
+	 * of the cached_pairs in that next merge.
+	 */
+	opt->priv->renames.merge_trees[0] = merge_base;
+	opt->priv->renames.merge_trees[1] = side1;
+	opt->priv->renames.merge_trees[2] = side2;
 	trace2_region_leave("merge", "merge_start", opt->repo);
 
 	merge_ort_nonrecursive_internal(opt, merge_base, side1, side2, result);
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v2 09/13] merge-ort: avoid accidental API mis-use
  2021-05-04  2:12 ` [PATCH v2 00/13] " Elijah Newren via GitGitGadget
                     ` (7 preceding siblings ...)
  2021-05-04  2:12   ` [PATCH v2 08/13] merge-ort: add code to check for whether cached renames can be reused Elijah Newren via GitGitGadget
@ 2021-05-04  2:12   ` Elijah Newren via GitGitGadget
  2021-05-04  2:12   ` [PATCH v2 10/13] merge-ort: preserve cached renames for the appropriate side Elijah Newren via GitGitGadget
                     ` (4 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-05-04  2:12 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Elijah Newren,
	Derrick Stolee, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

Previously, callers of the merge-ort API could have passed an
uninitialized value for struct merge_result *result.  However, we want
to check result to see if it has cached renames from a previous merge
that we can reuse; such values would be found behind result->priv.
However, if result->priv is uninitialized, attempting to access behind
it will give a segfault.  So, we need result->priv to be NULL (which
will be the case if the caller does a memset(&result, 0)), or be written
by a previous call to the merge-ort machinery.  Documenting this
requirement may help, but despite being the person who introduced this
requirement, I still missed it once and it did not fail in a very clear
way and led to a long debugging session.

Add a _properly_initialized field to merge_result; that value will be
0 if the caller zero'ed the merge_result, it will be set to a very
specific value by a previous run by the merge-ort machinery, and if it's
uninitialized it will most likely either be 0 or some value that does
not match the specific one we'd expect allowing us to throw a much more
meaningful error.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 7 +++++++
 merge-ort.h | 2 ++
 2 files changed, 9 insertions(+)

diff --git a/merge-ort.c b/merge-ort.c
index a342cc6344fd..e6a02fa928f5 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -53,6 +53,8 @@ enum merge_side {
 	MERGE_SIDE2 = 2
 };
 
+static unsigned RESULT_INITIALIZED = 0x1abe11ed; /* unlikely accidental value */
+
 struct traversal_callback_data {
 	unsigned long mask;
 	unsigned long dirmask;
@@ -3746,6 +3748,10 @@ static void merge_start(struct merge_options *opt, struct merge_result *result)
 	assert(opt->obuf.len == 0);
 
 	assert(opt->priv == NULL);
+	if (result->_properly_initialized != 0 &&
+	    result->_properly_initialized != RESULT_INITIALIZED)
+		BUG("struct merge_result passed to merge_incore_*recursive() must be zeroed or filled with values from a previous run");
+	assert(!!result->priv == !!result->_properly_initialized);
 	if (result->priv) {
 		opt->priv = result->priv;
 		result->priv = NULL;
@@ -3905,6 +3911,7 @@ static void merge_ort_nonrecursive_internal(struct merge_options *opt,
 	result->clean &= strmap_empty(&opt->priv->conflicted);
 	if (!opt->priv->call_depth) {
 		result->priv = opt->priv;
+		result->_properly_initialized = RESULT_INITIALIZED;
 		opt->priv = NULL;
 	}
 }
diff --git a/merge-ort.h b/merge-ort.h
index d53a0a339f33..c011864ffeb1 100644
--- a/merge-ort.h
+++ b/merge-ort.h
@@ -29,6 +29,8 @@ struct merge_result {
 	 * !clean) and to print "CONFLICT" messages.  Not for external use.
 	 */
 	void *priv;
+	/* Also private */
+	unsigned _properly_initialized;
 };
 
 /*
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v2 10/13] merge-ort: preserve cached renames for the appropriate side
  2021-05-04  2:12 ` [PATCH v2 00/13] " Elijah Newren via GitGitGadget
                     ` (8 preceding siblings ...)
  2021-05-04  2:12   ` [PATCH v2 09/13] merge-ort: avoid accidental API mis-use Elijah Newren via GitGitGadget
@ 2021-05-04  2:12   ` Elijah Newren via GitGitGadget
  2021-05-04  2:12   ` [PATCH v2 11/13] merge-ort: add helper functions for using cached renames Elijah Newren via GitGitGadget
                     ` (3 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-05-04  2:12 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Elijah Newren,
	Derrick Stolee, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

Previous commits created an in-memory cache of the results of rename
detection, and added logic to detect when that cache could appropriately
be used in a subsequent merge operation -- but we were still
unconditionally clearing the cache with each new merge operation anyway.
If it is valid to reuse the cache from one of the two sides of history,
preserve that side.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/merge-ort.c b/merge-ort.c
index e6a02fa928f5..b524a2db2769 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -476,17 +476,18 @@ static void clear_or_reinit_internal_opts(struct merge_options_internal *opti,
 	/* Free memory used by various renames maps */
 	for (i = MERGE_SIDE1; i <= MERGE_SIDE2; ++i) {
 		strintmap_func(&renames->dirs_removed[i]);
-
-		partial_clear_dir_rename_count(&renames->dir_rename_count[i]);
-		if (!reinitialize)
-			strmap_clear(&renames->dir_rename_count[i], 1);
-
 		strmap_func(&renames->dir_renames[i], 0);
-
 		strintmap_func(&renames->relevant_sources[i]);
-		strset_func(&renames->cached_target_names[i]);
-		strmap_func(&renames->cached_pairs[i], 1);
-		strset_func(&renames->cached_irrelevant[i]);
+		if (!reinitialize)
+			assert(renames->cached_pairs_valid_side == 0);
+		if (i != renames->cached_pairs_valid_side) {
+			strset_func(&renames->cached_target_names[i]);
+			strmap_func(&renames->cached_pairs[i], 1);
+			strset_func(&renames->cached_irrelevant[i]);
+			partial_clear_dir_rename_count(&renames->dir_rename_count[i]);
+			if (!reinitialize)
+				strmap_clear(&renames->dir_rename_count[i], 1);
+		}
 	}
 	renames->cached_pairs_valid_side = 0;
 	renames->dir_rename_mask = 0;
@@ -2443,6 +2444,7 @@ static void detect_regular_renames(struct merge_options *opt,
 		return;
 	}
 
+	partial_clear_dir_rename_count(&renames->dir_rename_count[side_index]);
 	repo_diff_setup(opt->repo, &diff_opts);
 	diff_opts.flags.recursive = 1;
 	diff_opts.flags.rename_empty = 0;
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v2 11/13] merge-ort: add helper functions for using cached renames
  2021-05-04  2:12 ` [PATCH v2 00/13] " Elijah Newren via GitGitGadget
                     ` (9 preceding siblings ...)
  2021-05-04  2:12   ` [PATCH v2 10/13] merge-ort: preserve cached renames for the appropriate side Elijah Newren via GitGitGadget
@ 2021-05-04  2:12   ` Elijah Newren via GitGitGadget
  2021-05-04  2:12   ` [PATCH v2 12/13] merge-ort: handle interactions of caching and rename/rename(1to1) cases Elijah Newren via GitGitGadget
                     ` (2 subsequent siblings)
  13 siblings, 0 replies; 32+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-05-04  2:12 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Elijah Newren,
	Derrick Stolee, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

If we have a usable rename cache, then we can remove from
relevant_sources all the paths that were cached;
diffcore_rename_extended() can then consider an even smaller set of
relevant_sources in its rename detection.

However, when diffcore_rename_extended() is done, we will need to take
the renames it detected and then add back in all the ones we had cached
from before.

Add helper functions for doing these two operations; the next commit
will make use of them.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)

diff --git a/merge-ort.c b/merge-ort.c
index b524a2db2769..741970cd05e7 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -2360,6 +2360,53 @@ static void resolve_diffpair_statuses(struct diff_queue_struct *q)
 	}
 }
 
+MAYBE_UNUSED
+static void prune_cached_from_relevant(struct rename_info *renames,
+				       unsigned side)
+{
+	/* Reason for this function described in add_pair() */
+	struct hashmap_iter iter;
+	struct strmap_entry *entry;
+
+	/* Remove from relevant_sources all entries in cached_pairs[side] */
+	strmap_for_each_entry(&renames->cached_pairs[side], &iter, entry) {
+		strintmap_remove(&renames->relevant_sources[side],
+				 entry->key);
+	}
+	/* Remove from relevant_sources all entries in cached_irrelevant[side] */
+	strset_for_each_entry(&renames->cached_irrelevant[side], &iter, entry) {
+		strintmap_remove(&renames->relevant_sources[side],
+				 entry->key);
+	}
+}
+
+MAYBE_UNUSED
+static void use_cached_pairs(struct merge_options *opt,
+			     struct strmap *cached_pairs,
+			     struct diff_queue_struct *pairs)
+{
+	struct hashmap_iter iter;
+	struct strmap_entry *entry;
+
+	/*
+	 * Add to side_pairs all entries from renames->cached_pairs[side_index].
+	 * (Info in cached_irrelevant[side_index] is not relevant here.)
+	 */
+	strmap_for_each_entry(cached_pairs, &iter, entry) {
+		struct diff_filespec *one, *two;
+		const char *old_name = entry->key;
+		const char *new_name = entry->value;
+		if (!new_name)
+			new_name = old_name;
+
+		/* We don't care about oid/mode, only filenames and status */
+		one = alloc_filespec(old_name);
+		two = alloc_filespec(new_name);
+		diff_queue(pairs, one, two);
+		pairs->queue[pairs->nr-1]->status = entry->value ? 'R' : 'D';
+	}
+}
+
 static void possibly_cache_new_pair(struct rename_info *renames,
 				    struct diff_filepair *p,
 				    unsigned side,
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v2 12/13] merge-ort: handle interactions of caching and rename/rename(1to1) cases
  2021-05-04  2:12 ` [PATCH v2 00/13] " Elijah Newren via GitGitGadget
                     ` (10 preceding siblings ...)
  2021-05-04  2:12   ` [PATCH v2 11/13] merge-ort: add helper functions for using cached renames Elijah Newren via GitGitGadget
@ 2021-05-04  2:12   ` Elijah Newren via GitGitGadget
  2021-05-04  2:12   ` [PATCH v2 13/13] merge-ort, diffcore-rename: employ cached renames when possible Elijah Newren via GitGitGadget
  2021-05-14 17:37   ` [PATCH v2 00/13] Optimization batch 11: avoid repeatedly detecting same renames Elijah Newren
  13 siblings, 0 replies; 32+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-05-04  2:12 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Elijah Newren,
	Derrick Stolee, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

As documented in Documentation/technical/remembering-renames.txt, and as
tested for in the two testcases in t6429 with "rename same file
identically" in their description, there is one case where we need to
have renames in one commit NOT be cached for the next commit in our
rebase sequence -- namely, rename/rename(1to1) cases.  Rather than
specifically trying to uncache those and fix up dir_rename_counts() to
match (which would also be valid but more work), we simply disable the
optimization when this really rare type of rename occurs.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 30 +++++++++++++++++++++++++++++-
 1 file changed, 29 insertions(+), 1 deletion(-)

diff --git a/merge-ort.c b/merge-ort.c
index 741970cd05e7..2fc98b803d1c 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -2100,6 +2100,9 @@ static int process_renames(struct merge_options *opt,
 			VERIFY_CI(side2);
 
 			if (!strcmp(pathnames[1], pathnames[2])) {
+				struct rename_info *ri = &opt->priv->renames;
+				int j;
+
 				/* Both sides renamed the same way */
 				assert(side1 == side2);
 				memcpy(&side1->stages[0], &base->stages[0],
@@ -2109,6 +2112,16 @@ static int process_renames(struct merge_options *opt,
 				base->merged.is_null = 1;
 				base->merged.clean = 1;
 
+				/*
+				 * Disable remembering renames optimization;
+				 * rename/rename(1to1) is incredibly rare, and
+				 * just disabling the optimization is easier
+				 * than purging cached_pairs,
+				 * cached_target_names, and dir_rename_counts.
+				 */
+				for (j = 0; j < 3; j++)
+					ri->merge_trees[j] = NULL;
+
 				/* We handled both renames, i.e. i+1 handled */
 				i++;
 				/* Move to next rename */
@@ -3896,7 +3909,22 @@ static void merge_check_renames_reusable(struct merge_options *opt,
 
 	renames = &opti->renames;
 	merge_trees = renames->merge_trees;
-	/* merge_trees[0..2] will only be NULL if opti is */
+
+	/*
+	 * Handle case where previous merge operation did not want cache to
+	 * take effect, e.g. because rename/rename(1to1) makes it invalid.
+	 */
+	if (!merge_trees[0]) {
+		assert(!merge_trees[0] && !merge_trees[1] && !merge_trees[2]);
+		renames->cached_pairs_valid_side = 0; /* neither side valid */
+		return;
+	}
+
+	/*
+	 * Handle other cases; note that merge_trees[0..2] will only
+	 * be NULL if opti is, or if all three were manually set to
+	 * NULL by e.g. rename/rename(1to1) handling.
+	 */
 	assert(merge_trees[0] && merge_trees[1] && merge_trees[2]);
 
 	/* Check if we meet a condition for re-using cached_pairs */
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v2 13/13] merge-ort, diffcore-rename: employ cached renames when possible
  2021-05-04  2:12 ` [PATCH v2 00/13] " Elijah Newren via GitGitGadget
                     ` (11 preceding siblings ...)
  2021-05-04  2:12   ` [PATCH v2 12/13] merge-ort: handle interactions of caching and rename/rename(1to1) cases Elijah Newren via GitGitGadget
@ 2021-05-04  2:12   ` Elijah Newren via GitGitGadget
  2021-05-14 17:37   ` [PATCH v2 00/13] Optimization batch 11: avoid repeatedly detecting same renames Elijah Newren
  13 siblings, 0 replies; 32+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-05-04  2:12 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Elijah Newren,
	Derrick Stolee, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

When there are many renames between the old base of a series of commits
and the new base, the way sequencer.c, merge-recursive.c, and
diffcore-rename.c have traditionally split the work resulted in
redetecting the same renames with each and every commit being
transplanted.  To address this, the last several commits have been
creating a cache of rename detection results, determining when it was
safe to use such a cache in subsequent merge operations, adding helper
functions, and so on.  See the previous half dozen commit messages for
additional discussion of this optimization, particularly the message a
few commits ago entitled "add code to check for whether cached renames
can be reused".  This commit finally ties all of that work together,
modifying the merge algorithm to make use of these cached renames.

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:

                            Before                  After
    no-renames:        5.665 s ±  0.129 s     5.622 s ±  0.059 s
    mega-renames:     11.435 s ±  0.158 s    10.127 s ±  0.073 s
    just-one-mega:   494.2  ms ±  6.1  ms   500.3  ms ±  3.8  ms

That's a fairly small improvement, but mostly because the previous
optimizations were so effective for these particular testcases; this
optimization only kicks in when the others don't.  If we undid the
basename-guided rename detection and skip-irrelevant-renames
optimizations, then we'd see that this series by itself improved
performance as follows:

                   Before Basename Series   After Just This Series
    no-renames:      13.815 s ±  0.062 s      5.697 s ±  0.080 s
    mega-renames:  1799.937 s ±  0.493 s    205.709 s ±  0.457 s

Since this optimization kicks in to help accelerate cases where the
previous optimizations do not apply, this last comparison shows that
this cached-renames optimization has the potential to help signficantly
in cases that don't meet the requirements for the other optimizations to
be effective.

The changes made in this optimization also lay some important groundwork
for a future optimization around having collect_merge_info() avoid
recursing into subtrees in more cases.

However, for this optimization to be effective, merge_switch_to_result()
should only be called when the rebase or cherry-pick operation has
either completed or hit a case where the user needs to resolve a
conflict or edit the result.  If it is called after every commit, as
sequencer.c does, then the working tree and index are needlessly updated
with every commit and the cached metadata is tossed, defeating this
optimization.  Refactoring sequencer.c to only call
merge_switch_to_result() at the end of the operation is a bigger
undertaking, and the practical benefits of this optimization will not be
realized until that work is performed.  Since `test-tool fast-rebase`
only updates at the end of the operation, it was used to obtain the
timings above.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c                        | 22 +++++++++--
 diffcore.h                               |  3 +-
 merge-ort.c                              | 47 ++++++++++++++++++++---
 t/t6429-merge-sequence-rename-caching.sh | 48 ++++++++++++++----------
 4 files changed, 90 insertions(+), 30 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 963ca582216b..3375e24659ea 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -568,7 +568,8 @@ static void update_dir_rename_counts(struct dir_rename_info *info,
 static void initialize_dir_rename_info(struct dir_rename_info *info,
 				       struct strintmap *relevant_sources,
 				       struct strintmap *dirs_removed,
-				       struct strmap *dir_rename_count)
+				       struct strmap *dir_rename_count,
+				       struct strmap *cached_pairs)
 {
 	struct hashmap_iter iter;
 	struct strmap_entry *entry;
@@ -633,6 +634,17 @@ static void initialize_dir_rename_info(struct dir_rename_info *info,
 					 rename_dst[i].p->two->path);
 	}
 
+	/* Add cached_pairs to counts */
+	strmap_for_each_entry(cached_pairs, &iter, entry) {
+		const char *old_name = entry->key;
+		const char *new_name = entry->value;
+		if (!new_name)
+			/* known delete; ignore it */
+			continue;
+
+		update_dir_rename_counts(info, dirs_removed, old_name, new_name);
+	}
+
 	/*
 	 * Now we collapse
 	 *    dir_rename_count: old_directory -> {new_directory -> count}
@@ -1247,7 +1259,8 @@ static void handle_early_known_dir_renames(struct dir_rename_info *info,
 void diffcore_rename_extended(struct diff_options *options,
 			      struct strintmap *relevant_sources,
 			      struct strintmap *dirs_removed,
-			      struct strmap *dir_rename_count)
+			      struct strmap *dir_rename_count,
+			      struct strmap *cached_pairs)
 {
 	int detect_rename = options->detect_rename;
 	int minimum_score = options->rename_score;
@@ -1363,7 +1376,8 @@ void diffcore_rename_extended(struct diff_options *options,
 		/* Preparation for basename-driven matching. */
 		trace2_region_enter("diff", "dir rename setup", options->repo);
 		initialize_dir_rename_info(&info, relevant_sources,
-					   dirs_removed, dir_rename_count);
+					   dirs_removed, dir_rename_count,
+					   cached_pairs);
 		trace2_region_leave("diff", "dir rename setup", options->repo);
 
 		/* Utilize file basenames to quickly find renames. */
@@ -1560,5 +1574,5 @@ void diffcore_rename_extended(struct diff_options *options,
 
 void diffcore_rename(struct diff_options *options)
 {
-	diffcore_rename_extended(options, NULL, NULL, NULL);
+	diffcore_rename_extended(options, NULL, NULL, NULL, NULL);
 }
diff --git a/diffcore.h b/diffcore.h
index f5c6de4841ed..533b30e21e7f 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -181,7 +181,8 @@ void diffcore_rename(struct diff_options *);
 void diffcore_rename_extended(struct diff_options *options,
 			      struct strintmap *relevant_sources,
 			      struct strintmap *dirs_removed,
-			      struct strmap *dir_rename_count);
+			      struct strmap *dir_rename_count,
+			      struct strmap *cached_pairs);
 void diffcore_merge_broken(void);
 void diffcore_pickaxe(struct diff_options *);
 void diffcore_order(const char *orderfile);
diff --git a/merge-ort.c b/merge-ort.c
index 2fc98b803d1c..17dc3deb3c73 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -753,15 +753,48 @@ static void add_pair(struct merge_options *opt,
 	struct rename_info *renames = &opt->priv->renames;
 	int names_idx = is_add ? side : 0;
 
-	if (!is_add) {
+	if (is_add) {
+		if (strset_contains(&renames->cached_target_names[side],
+				    pathname))
+			return;
+	} else {
 		unsigned content_relevant = (match_mask == 0);
 		unsigned location_relevant = (dir_rename_mask == 0x07);
 
+		/*
+		 * If pathname is found in cached_irrelevant[side] due to
+		 * previous pick but for this commit content is relevant,
+		 * then we need to remove it from cached_irrelevant.
+		 */
+		if (content_relevant)
+			/* strset_remove is no-op if strset doesn't have key */
+			strset_remove(&renames->cached_irrelevant[side],
+				      pathname);
+
+		/*
+		 * We do not need to re-detect renames for paths that we already
+		 * know the pairing, i.e. for cached_pairs (or
+		 * cached_irrelevant).  However, handle_deferred_entries() needs
+		 * to loop over the union of keys from relevant_sources[side] and
+		 * cached_pairs[side], so for simplicity we set relevant_sources
+		 * for all the cached_pairs too and then strip them back out in
+		 * prune_cached_from_relevant() at the beginning of
+		 * detect_regular_renames().
+		 */
 		if (content_relevant || location_relevant) {
 			/* content_relevant trumps location_relevant */
 			strintmap_set(&renames->relevant_sources[side], pathname,
 				      content_relevant ? RELEVANT_CONTENT : RELEVANT_LOCATION);
 		}
+
+		/*
+		 * Avoid creating pair if we've already cached rename results.
+		 * Note that we do this after setting relevant_sources[side]
+		 * as noted in the comment above.
+		 */
+		if (strmap_contains(&renames->cached_pairs[side], pathname) ||
+		    strset_contains(&renames->cached_irrelevant[side], pathname))
+			return;
 	}
 
 	one = alloc_filespec(pathname);
@@ -2349,7 +2382,9 @@ static inline int possible_side_renames(struct rename_info *renames,
 static inline int possible_renames(struct rename_info *renames)
 {
 	return possible_side_renames(renames, 1) ||
-	       possible_side_renames(renames, 2);
+	       possible_side_renames(renames, 2) ||
+	       !strmap_empty(&renames->cached_pairs[1]) ||
+	       !strmap_empty(&renames->cached_pairs[2]);
 }
 
 static void resolve_diffpair_statuses(struct diff_queue_struct *q)
@@ -2373,7 +2408,6 @@ static void resolve_diffpair_statuses(struct diff_queue_struct *q)
 	}
 }
 
-MAYBE_UNUSED
 static void prune_cached_from_relevant(struct rename_info *renames,
 				       unsigned side)
 {
@@ -2393,7 +2427,6 @@ static void prune_cached_from_relevant(struct rename_info *renames,
 	}
 }
 
-MAYBE_UNUSED
 static void use_cached_pairs(struct merge_options *opt,
 			     struct strmap *cached_pairs,
 			     struct diff_queue_struct *pairs)
@@ -2494,6 +2527,7 @@ static void detect_regular_renames(struct merge_options *opt,
 	struct diff_options diff_opts;
 	struct rename_info *renames = &opt->priv->renames;
 
+	prune_cached_from_relevant(renames, side_index);
 	if (!possible_side_renames(renames, side_index)) {
 		/*
 		 * No rename detection needed for this side, but we still need
@@ -2522,7 +2556,8 @@ static void detect_regular_renames(struct merge_options *opt,
 	diffcore_rename_extended(&diff_opts,
 				 &renames->relevant_sources[side_index],
 				 &renames->dirs_removed[side_index],
-				 &renames->dir_rename_count[side_index]);
+				 &renames->dir_rename_count[side_index],
+				 &renames->cached_pairs[side_index]);
 	trace2_region_leave("diff", "diffcore_rename", opt->repo);
 	resolve_diffpair_statuses(&diff_queued_diff);
 
@@ -2629,6 +2664,8 @@ static int detect_and_process_renames(struct merge_options *opt,
 	trace2_region_enter("merge", "regular renames", opt->repo);
 	detect_regular_renames(opt, MERGE_SIDE1);
 	detect_regular_renames(opt, MERGE_SIDE2);
+	use_cached_pairs(opt, &renames->cached_pairs[1], &renames->pairs[1]);
+	use_cached_pairs(opt, &renames->cached_pairs[2], &renames->pairs[2]);
 	trace2_region_leave("merge", "regular renames", opt->repo);
 
 	trace2_region_enter("merge", "directory renames", opt->repo);
diff --git a/t/t6429-merge-sequence-rename-caching.sh b/t/t6429-merge-sequence-rename-caching.sh
index f47d8924ee73..035edc40b1eb 100755
--- a/t/t6429-merge-sequence-rename-caching.sh
+++ b/t/t6429-merge-sequence-rename-caching.sh
@@ -101,10 +101,10 @@ test_expect_success 'caching renames does not preclude finding new ones' '
 # dramatic change in size of the file, but remembering the rename and
 # reusing it is reasonable too.
 #
-# Rename detection (diffcore_rename_extended()) will run twice here; it is
-# not needed on the topic side of history for either of the two commits
-# being merged, but it is needed on the upstream side of history for each
-# commit being picked.
+# We do test here that we expect rename detection to only be run once total
+# (the topic side of history doesn't need renames, and with caching we
+# should be able to only run rename detection on the upstream side one
+# time.)
 test_expect_success 'cherry-pick both a commit and its immediate revert' '
 	test_create_repo pick-commit-and-its-immediate-revert &&
 	(
@@ -140,11 +140,11 @@ test_expect_success 'cherry-pick both a commit and its immediate revert' '
 		GIT_TRACE2_PERF="$(pwd)/trace.output" &&
 		export GIT_TRACE2_PERF &&
 
-		test_might_fail test-tool fast-rebase --onto HEAD upstream~1 topic &&
+		test-tool fast-rebase --onto HEAD upstream~1 topic &&
 		#git cherry-pick upstream~1..topic &&
 
 		grep region_enter.*diffcore_rename trace.output >calls &&
-		test_line_count = 2 calls
+		test_line_count = 1 calls
 	)
 '
 
@@ -304,9 +304,11 @@ test_expect_success 'rename same file identically, then add file to old dir' '
 # Here we are just concerned that cached renames might prevent us from seeing
 # the rename conflict, and we want to ensure that we do get a conflict.
 #
-# While at it, also test that we do rename detection three times.  We have to
-# detect renames on the upstream side of history once for each merge, plus
-# Topic_2 has renames.
+# While at it, though, we do test that we only try to detect renames 2
+# times and not three.  (The first merge needs to detect renames on the
+# upstream side.  Traditionally, the second merge would need to detect
+# renames on both sides of history, but our caching of upstream renames
+# should avoid the need to re-detect upstream renames.)
 #
 test_expect_success 'cached dir rename does not prevent noticing later conflict' '
 	test_create_repo dir-rename-cache-not-occluding-later-conflict &&
@@ -357,7 +359,7 @@ test_expect_success 'cached dir rename does not prevent noticing later conflict'
 		grep CONFLICT..rename/rename output &&
 
 		grep region_enter.*diffcore_rename trace.output >calls &&
-		test_line_count = 3 calls
+		test_line_count = 2 calls
 	)
 '
 
@@ -412,10 +414,17 @@ test_setup_upstream_rename () {
 # commit to mess up its location either.  We want to make sure that
 # olddir/newfile doesn't exist in the result and that newdir/newfile does.
 #
-# We also expect rename detection to occur three times.  Although it is
-# typically needed two times per commit, there are no deleted files on the
-# topic side of history, so we only need to detect renames on the upstream
-# side for each of the 3 commits we need to pick.
+# We also test that we only do rename detection twice.  We never need
+# rename detection on the topic side of history, but we do need it twice on
+# the upstream side of history.  For the first topic commit, we only need
+# the
+#   relevant-rename -> renamed
+# rename, because olddir is unmodified by Topic_1.  For Topic_2, however,
+# the new file being added to olddir means files that were previously
+# irrelevant for rename detection are now relevant, forcing us to repeat
+# rename detection for the paths we don't already have cached.  Topic_3 also
+# tweaks olddir/newfile, but the renames in olddir/ will have been cached
+# from the second rename detection run.
 #
 test_expect_success 'dir rename unneeded, then add new file to old dir' '
 	test_setup_upstream_rename dir-rename-unneeded-until-new-file &&
@@ -450,7 +459,7 @@ test_expect_success 'dir rename unneeded, then add new file to old dir' '
 		#git cherry-pick upstream..topic &&
 
 		grep region_enter.*diffcore_rename trace.output >calls &&
-		test_line_count = 3 calls &&
+		test_line_count = 2 calls &&
 
 		git ls-files >tracked &&
 		test_line_count = 5 tracked &&
@@ -516,7 +525,7 @@ test_expect_success 'dir rename unneeded, then rename existing file into old dir
 		#git cherry-pick upstream..topic &&
 
 		grep region_enter.*diffcore_rename trace.output >calls &&
-		test_line_count = 4 calls &&
+		test_line_count = 3 calls &&
 
 		test_path_is_missing somefile &&
 		test_path_is_missing olddir/newfile &&
@@ -648,9 +657,8 @@ test_expect_success 'caching renames only on upstream side, part 1' '
 # for the wrong side of history.
 #
 #
-# This testcase should only need three calls to diffcore_rename_extended(),
-# because there are no renames on the topic side of history for picking
-# Topic_2.
+# This testcase should only need two calls to diffcore_rename_extended(),
+# both for the first merge, one for each side of history.
 #
 test_expect_success 'caching renames only on upstream side, part 2' '
 	test_setup_topic_rename cache-renames-only-upstream-rename-file &&
@@ -677,7 +685,7 @@ test_expect_success 'caching renames only on upstream side, part 2' '
 		#git cherry-pick upstream..topic &&
 
 		grep region_enter.*diffcore_rename trace.output >calls &&
-		test_line_count = 3 calls &&
+		test_line_count = 2 calls &&
 
 		git ls-files >tracked &&
 		test_line_count = 4 tracked &&
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 00/13] Optimization batch 11: avoid repeatedly detecting same renames
  2021-05-04  2:12 ` [PATCH v2 00/13] " Elijah Newren via GitGitGadget
                     ` (12 preceding siblings ...)
  2021-05-04  2:12   ` [PATCH v2 13/13] merge-ort, diffcore-rename: employ cached renames when possible Elijah Newren via GitGitGadget
@ 2021-05-14 17:37   ` Elijah Newren
  2021-05-14 21:04     ` Derrick Stolee
  13 siblings, 1 reply; 32+ messages in thread
From: Elijah Newren @ 2021-05-14 17:37 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget
  Cc: Git Mailing List, Derrick Stolee, Jonathan Tan, Taylor Blau,
	Derrick Stolee

On Mon, May 3, 2021 at 7:12 PM Elijah Newren via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> This series avoids repeatedly detecting the same renames in a sequence of
> merges such as a rebase or cherry-pick of several commits. It's
> unfortunately become a bit lengthy, but much of the length (the first five
> patches) is owed to special testcases and documentation.

Since this obviously hasn't inspired much review, let me note that you
can cut out 90% of the review size by skipping patches 2 & 5.

Patch 2 is essentially written as something approaching a formal
proof, so yes it's dense and lengthy, but it's not at all required;
there's no code there.  Think of it as insurance for if someone wants
to introduce some new tricky optimizations or radically different
features to the merge machinery, because the remember-renames
optimization by its nature tends to interact with other optimizations.
I figured because of that interaction that documenting why and how the
remember renames optimization works at a much deeper level than is
typical for optimizations or code in git in general that it might help
with future maintenance...and it happened to help me catch two minor
bugs.

Patch 5 is very much related to patch 2; it's testcases inspired by
that document.  Most of those tests were just "what could possibly go
wrong in a new from-scratch implementation of this optimization?"
based on what's written in this proof-like document.  Most of the
tests didn't turn up anything, but a couple found some small issues in
my implementation.  I decided to just include all of them; it's nice
to be thorough.

You can get 95% of the whole idea behind this optimization skipping
those patches and reading Junio's great two-paragraph summary at
https://lore.kernel.org/git/xmqqzgyrukic.fsf@gitster.g/, and then just
read the other patches in this series.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 00/13] Optimization batch 11: avoid repeatedly detecting same renames
  2021-05-14 17:37   ` [PATCH v2 00/13] Optimization batch 11: avoid repeatedly detecting same renames Elijah Newren
@ 2021-05-14 21:04     ` Derrick Stolee
  0 siblings, 0 replies; 32+ messages in thread
From: Derrick Stolee @ 2021-05-14 21:04 UTC (permalink / raw)
  To: Elijah Newren, Elijah Newren via GitGitGadget
  Cc: Git Mailing List, Derrick Stolee, Jonathan Tan, Taylor Blau



On 5/14/2021 1:37 PM, Elijah Newren wrote:
> On Mon, May 3, 2021 at 7:12 PM Elijah Newren via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>>
>> This series avoids repeatedly detecting the same renames in a sequence of
>> merges such as a rebase or cherry-pick of several commits. It's
>> unfortunately become a bit lengthy, but much of the length (the first five
>> patches) is owed to special testcases and documentation.
> 
> Since this obviously hasn't inspired much review, let me note that you
> can cut out 90% of the review size by skipping patches 2 & 5.
> 
> Patch 2 is essentially written as something approaching a formal
> proof, so yes it's dense and lengthy, but it's not at all required;
> there's no code there.  Think of it as insurance for if someone wants
> to introduce some new tricky optimizations or radically different
> features to the merge machinery, because the remember-renames
> optimization by its nature tends to interact with other optimizations.
> I figured because of that interaction that documenting why and how the
> remember renames optimization works at a much deeper level than is
> typical for optimizations or code in git in general that it might help
> with future maintenance...and it happened to help me catch two minor
> bugs.
> 
> Patch 5 is very much related to patch 2; it's testcases inspired by
> that document.  Most of those tests were just "what could possibly go
> wrong in a new from-scratch implementation of this optimization?"
> based on what's written in this proof-like document.  Most of the
> tests didn't turn up anything, but a couple found some small issues in
> my implementation.  I decided to just include all of them; it's nice
> to be thorough.
> 
> You can get 95% of the whole idea behind this optimization skipping
> those patches and reading Junio's great two-paragraph summary at
> https://lore.kernel.org/git/xmqqzgyrukic.fsf@gitster.g/, and then just
> read the other patches in this series.

Sorry, yes. I've been reading this a bit but haven't commented yet.

Patch 2 was enlightening and I appreciate the attention to detail
there. The overall argument made sense to me.

I can promise a completed review on Monday.

-Stolee

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 04/13] fast-rebase: write conflict state to working tree, index, and HEAD
  2021-05-04  2:12   ` [PATCH v2 04/13] fast-rebase: write conflict state to working tree, index, and HEAD Elijah Newren via GitGitGadget
@ 2021-05-17 13:32     ` Derrick Stolee
  0 siblings, 0 replies; 32+ messages in thread
From: Derrick Stolee @ 2021-05-17 13:32 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget, git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Elijah Newren

On 5/3/21 10:12 PM, Elijah Newren via GitGitGadget wrote:
> From: Elijah Newren <newren@gmail.com>
> 
> Previously, when fast-rebase hit a conflict, it simply aborted and left
> HEAD, the index, and the working tree where they were before the
> operation started.  While fast-rebase does not support restarting from a
> conflicted state, write the conflicted state out anyway as it gives us a
> way to see what the conflicts are and write tests that check for them.
> 
> This will be important in the upcoming commits, because sequencer.c is
> only superficially integrated with merge-ort.c; in particular, it calls
> merge_switch_to_result() after EACH merge instead of only calling it at
> the end of all the sequence of merges (or when a conflict is hit).  This
> not only causes needless updates to the working copy and index, but also
> causes all intermediate data to be freed and tossed, preventing caching
> information from one merge to the next.  However, integrating
> sequencer.c more deeply with merge-ort.c is a big task, and making this
> small extension to fast-rebase.c provides us with a simple way to test
> the edge and corner cases that we want to make sure continue working.

This seems a noble goal. I'm worried about the ramifications of such
a change, specifically about the state an automated process would be in
after hitting a conflict.

If I understand correctly, then the old code would abort the rebase and
go back to the previous HEAD location. The new code will pause the
rebase at the previous commit and leave the conflict markers in the
working directory.

The critical change is here:

> -	strbuf_addf(&reflog_msg, "finish rebase %s onto %s",
> -		    oid_to_hex(&last_picked_commit->object.oid),
> -		    oid_to_hex(&last_commit->object.oid));
> -	if (update_ref(reflog_msg.buf, branch_name.buf,
> -		       &last_commit->object.oid,
> -		       &last_picked_commit->object.oid,
> -		       REF_NO_DEREF, UPDATE_REFS_MSG_ON_ERR)) {
> -		error(_("could not update %s"), argv[4]);
> -		die("Failed to update %s", argv[4]);
> +	if (result.clean) {
> +		fprintf(stderr, "\nDone.\n");
> +		strbuf_addf(&reflog_msg, "finish rebase %s onto %s",
> +			    oid_to_hex(&last_picked_commit->object.oid),
> +			    oid_to_hex(&last_commit->object.oid));
> +		if (update_ref(reflog_msg.buf, branch_name.buf,
> +			       &last_commit->object.oid,
> +			       &last_picked_commit->object.oid,
> +			       REF_NO_DEREF, UPDATE_REFS_MSG_ON_ERR)) {
> +			error(_("could not update %s"), argv[4]);
> +			die("Failed to update %s", argv[4]);
> +		}
> +		if (create_symref("HEAD", branch_name.buf, reflog_msg.buf) < 0)
> +			die(_("unable to update HEAD"));
> +		strbuf_release(&reflog_msg);
> +		strbuf_release(&branch_name);
> +
> +		prime_cache_tree(the_repository, the_repository->index,
> +				 result.tree);
> +	} else {
> +		fprintf(stderr, "\nAborting: Hit a conflict.\n");
> +		strbuf_addf(&reflog_msg, "rebase progress up to %s",
> +			    oid_to_hex(&last_picked_commit->object.oid));
> +		if (update_ref(reflog_msg.buf, "HEAD",
> +			       &last_commit->object.oid,
> +			       &head,
> +			       REF_NO_DEREF, UPDATE_REFS_MSG_ON_ERR)) {
> +			error(_("could not update %s"), argv[4]);
> +			die("Failed to update %s", argv[4]);
> +		}
>  	}
> -	if (create_symref("HEAD", branch_name.buf, reflog_msg.buf) < 0)
> -		die(_("unable to update HEAD"));
> -	strbuf_release(&reflog_msg);
> -	strbuf_release(&branch_name);
> -
> -	prime_cache_tree(the_repository, the_repository->index, result.tree);

So perhaps this could use a test case to demonstrate the change in
behavior? Such a test would want to be created before this commit, then
the behavior change is provided as an edit to the test in this commit.

But maybe I'm misunderstanding the change here and such a test is
inappropriate.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 06/13] merge-ort: add data structures for in-memory caching of rename detection
  2021-05-04  2:12   ` [PATCH v2 06/13] merge-ort: add data structures for in-memory caching of rename detection Elijah Newren via GitGitGadget
@ 2021-05-17 13:41     ` Derrick Stolee
  0 siblings, 0 replies; 32+ messages in thread
From: Derrick Stolee @ 2021-05-17 13:41 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget, git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Elijah Newren

On 5/3/21 10:12 PM, Elijah Newren via GitGitGadget wrote:
> From: Elijah Newren <newren@gmail.com>
> 
> When there are many renames between the old base of a series of commits
> and the new base for a series of commits, the sequence of merges
> employed to transplant those commits (from a cherry-pick or rebase
> operation) will repeatedly detect the exact same renames.  This is
> wasted effort.
> 
> Add data structures which will be used to cache rename detection
> results, along with the initialization and deallocation of these data
> structures.  Future commits will populate these caches, detect the
> appropriate circumstances when they can be used, and employ them to
> avoid re-detecting the same renames repeatedly.

I appreciate the definitions and boilerplate for these data
structures being isolated to their own patch.

> @@ -140,6 +140,37 @@ struct rename_info {
>  	int callback_data_nr, callback_data_alloc;
>  	char *callback_data_traverse_path;
>  
> +	/*
> +	 * cached_pairs: Caching of renames and deletions.
> +	 *
> +	 * These are mappings recording renames and deletions of individual
> +	 * files (not directories).  They are thus a map from an old
> +	 * filename to either NULL (for deletions) or a new filename (for
> +	 * renames).
> +	 */
> +	struct strmap cached_pairs[3];
> +
> +	/*
> +	 * cached_target_names: just the destinations from cached_pairs
> +	 *
> +	 * We sometimes want a fast lookup to determine if a given filename
> +	 * is one of the destinations in cached_pairs.  cached_target_names
> +	 * is thus duplicative information, but it provides a fast lookup.
> +	 */
> +	struct strset cached_target_names[3];

These two work well together. Very clear.

> +	/*
> +	 * cached_irrelevant: Caching of rename_sources that aren't relevant.
> +	 *
> +	 * cached_pairs records both renames and deletes.  Sometimes we
> +	 * do not know if a path is a rename or a delete because we pass
> +	 * RELEVANT_LOCATION to diffcore_rename_extended() and based on
> +	 * various optimizations it returns without detecting whether that
> +	 * path is actually a rename or a delete.  We need to cache such
> +	 * paths too, but separately from cached_pairs.
> +	 */
> +	struct strset cached_irrelevant[3];

I'm having a hard time parsing what these "irrelevant" paths will be.
It seems like diffcore_rename_extended() will report something other
than "rename" or "delete" for some paths. Could we explicitly mark
that state as "irrelevant"?

	/*
	 * cached_irrelevant: Caching of rename_sources that aren't relevant.
	 *
	 * cached_pairs records both renames and deletes.  Sometimes we
	 * do not know if a path is a rename or a delete because we pass
	 * RELEVANT_LOCATION to diffcore_rename_extended() which might
	 * describe a path as "irrelevant" instead of as a "rename" or "delete".
	 *  We need to cache such paths too, but separately from cached_pairs.
	 */

Does this make sense? diffcore_rename_extended() might need an update
to match this extra, explicit state.

The rest of the code looks good.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 07/13] merge-ort: populate caches of rename detection results
  2021-05-04  2:12   ` [PATCH v2 07/13] merge-ort: populate caches of rename detection results Elijah Newren via GitGitGadget
@ 2021-05-17 13:51     ` Derrick Stolee
  0 siblings, 0 replies; 32+ messages in thread
From: Derrick Stolee @ 2021-05-17 13:51 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget, git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Elijah Newren

On 5/3/21 10:12 PM, Elijah Newren via GitGitGadget wrote:
> From: Elijah Newren <newren@gmail.com>
> 
> Fill in cache_pairs, cached_target_names, and cached_irrelevant based on
> rename detection results.  Future commits will make use of these values.

Thank you for continuing to break this down into nice-sized pieces.

> +static void possibly_cache_new_pair(struct rename_info *renames,
> +				    struct diff_filepair *p,
> +				    unsigned side,
> +				    char *new_path)
> +{
> +	char *old_value;
> +	int dir_renamed_side = 0;
> +
> +	if (new_path) {
> +		/*
> +		 * Directory renames happen on the other side of history from
> +		 * the side that adds new files to the old directory.
> +		 */
> +		dir_renamed_side = 3 - side;

Neat trick. Side is in { 1, 2 } so this makes sense.

> +	} else {
> +		int val = strintmap_get(&renames->relevant_sources[side],
> +					p->one->path);
> +		if (val == RELEVANT_NO_MORE) {
> +			assert(p->status == 'D');
> +			strset_add(&renames->cached_irrelevant[side],
> +				   p->one->path);

Ok, I see a transition here from a relevant side to an
irrelevant one.

> +		}
> +		if (val <= 0)
> +			return;
> +	}
> +
> +	if (p->status == 'D') {
> +		/*
> +		 * If we already had this delete, we'll just set it's value
> +		 * to NULL again, so no harm.
> +		 */
> +		strmap_put(&renames->cached_pairs[side], p->one->path, NULL);
> +	} else if (p->status == 'R') {
> +		if (new_path) {
> +			new_path = xstrdup(new_path);
> +			old_value = strmap_put(&renames->cached_pairs[dir_renamed_side],
> +					       p->two->path, new_path);
> +			strset_add(&renames->cached_target_names[dir_renamed_side],
> +				   new_path);
> +			assert(!old_value);

This assert implies that p->status == 'R' only if this is the
first side (and first commit) to show a rename, right?

> +		}
> +		if (!new_path)
> +			new_path = p->two->path;
> +		new_path = xstrdup(new_path);

If new_path was provided as non-NULL, then this is the second
time we are dup-ing it. However, that seems correct because we
want a different copy or every time we add it to the cached_pairs
and cached_target_names data.

> +		old_value = strmap_put(&renames->cached_pairs[side],
> +				       p->one->path, new_path);
> +		strset_add(&renames->cached_target_names[side],
> +			   new_path);

Since we appear to be doing this in multiple places, would this
be a good place for a helper method? We could have it take a
`const char *new_path` and have the helper manage the `xstrdup()`
so we never forget to do that exactly once per insert to these
sets.

> +		free(old_value);
> +	} else if (p->status == 'A' && new_path) {
> +		new_path = xstrdup(new_path);
> +		old_value = strmap_put(&renames->cached_pairs[dir_renamed_side],
> +				       p->two->path, new_path);
> +		strset_add(&renames->cached_target_names[dir_renamed_side],
> +			   new_path);
> +		assert(!old_value);

And here's the third instance, making the "three is many" rule
kick in. A helper method would help make this easier. You can
also have a parameter corresponding to whether you need to
free() the old_value or assert it is NULL.

> +	}
> +}
> +
>  static int compare_pairs(const void *a_, const void *b_)
>  {
>  	const struct diff_filepair *a = *((const struct diff_filepair **)a_);
> @@ -2415,6 +2474,7 @@ static int collect_renames(struct merge_options *opt,
>  		char *new_path; /* non-NULL only with directory renames */
>  
>  		if (p->status != 'A' && p->status != 'R') {
> +			possibly_cache_new_pair(renames, p, side_index, NULL);
>  			diff_free_filepair(p);
>  			continue;
>  		}
> @@ -2426,11 +2486,11 @@ static int collect_renames(struct merge_options *opt,
>  						      &collisions,
>  						      &clean);
>  
> +		possibly_cache_new_pair(renames, p, side_index, new_path);
>  		if (p->status != 'R' && !new_path) {
>  			diff_free_filepair(p);
>  			continue;
>  		}
> -

nit: this deletion seems unnecessary.

>  		if (new_path)
>  			apply_directory_rename_modifications(opt, p, new_path);
>  
> @@ -3701,8 +3761,16 @@ static void merge_start(struct merge_options *opt, struct merge_result *result)
>  					 NULL, 1);
>  		strmap_init_with_options(&renames->dir_renames[i],
>  					 NULL, 0);
> +		/*
> +		 * relevant_sources uses -1 for the default, because we need
> +		 * to be able to distinguish not-in-strintmap from valid
> +		 * relevant_source values from enum file_rename_relevance.
> +		 * In particular, possibly_cache_new_pair() expects a negative
> +		 * value for not-found entries.
> +		 */
>  		strintmap_init_with_options(&renames->relevant_sources[i],
> -					    0, NULL, 0);
> +					    -1 /* explicitly invalid */,
> +					    NULL, 0);
>  		strmap_init_with_options(&renames->cached_pairs[i],
>  					 NULL, 1);
>  		strset_init_with_options(&renames->cached_irrelevant[i],
> 

Functionally looks good. I just had some nits about organization.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2021-05-17 13:51 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-03-24 21:32 [PATCH 0/7] Optimization batch 11: avoid repeatedly detecting same renames Elijah Newren via GitGitGadget
2021-03-24 21:32 ` [PATCH 1/7] merge-ort: add data structures for in-memory caching of rename detection Elijah Newren via GitGitGadget
2021-03-24 21:32 ` [PATCH 2/7] merge-ort: populate caches of rename detection results Elijah Newren via GitGitGadget
2021-03-24 21:32 ` [PATCH 3/7] merge-ort: add code to check for whether cached renames can be reused Elijah Newren via GitGitGadget
2021-03-24 21:32 ` [PATCH 4/7] merge-ort: avoid accidental API mis-use Elijah Newren via GitGitGadget
2021-03-24 21:32 ` [PATCH 5/7] merge-ort: preserve cached renames for the appropriate side Elijah Newren via GitGitGadget
2021-03-24 21:32 ` [PATCH 6/7] merge-ort: add helper functions for using cached renames Elijah Newren via GitGitGadget
2021-03-24 21:32 ` [PATCH 7/7] merge-ort, diffcore-rename: employ cached renames when possible Elijah Newren via GitGitGadget
2021-03-24 22:04 ` [PATCH 0/7] Optimization batch 11: avoid repeatedly detecting same renames Junio C Hamano
2021-03-24 23:25   ` Elijah Newren
2021-03-25 18:59     ` Junio C Hamano
2021-03-29 22:34       ` Elijah Newren
2021-03-30 12:07         ` Derrick Stolee
2021-05-04  2:12 ` [PATCH v2 00/13] " Elijah Newren via GitGitGadget
2021-05-04  2:12   ` [PATCH v2 01/13] t6423: rename file within directory that other side renamed Elijah Newren via GitGitGadget
2021-05-04  2:12   ` [PATCH v2 02/13] Documentation/technical: describe remembering renames optimization Elijah Newren via GitGitGadget
2021-05-04  2:12   ` [PATCH v2 03/13] fast-rebase: change assert() to BUG() Elijah Newren via GitGitGadget
2021-05-04  2:12   ` [PATCH v2 04/13] fast-rebase: write conflict state to working tree, index, and HEAD Elijah Newren via GitGitGadget
2021-05-17 13:32     ` Derrick Stolee
2021-05-04  2:12   ` [PATCH v2 05/13] t6429: testcases for remembering renames Elijah Newren via GitGitGadget
2021-05-04  2:12   ` [PATCH v2 06/13] merge-ort: add data structures for in-memory caching of rename detection Elijah Newren via GitGitGadget
2021-05-17 13:41     ` Derrick Stolee
2021-05-04  2:12   ` [PATCH v2 07/13] merge-ort: populate caches of rename detection results Elijah Newren via GitGitGadget
2021-05-17 13:51     ` Derrick Stolee
2021-05-04  2:12   ` [PATCH v2 08/13] merge-ort: add code to check for whether cached renames can be reused Elijah Newren via GitGitGadget
2021-05-04  2:12   ` [PATCH v2 09/13] merge-ort: avoid accidental API mis-use Elijah Newren via GitGitGadget
2021-05-04  2:12   ` [PATCH v2 10/13] merge-ort: preserve cached renames for the appropriate side Elijah Newren via GitGitGadget
2021-05-04  2:12   ` [PATCH v2 11/13] merge-ort: add helper functions for using cached renames Elijah Newren via GitGitGadget
2021-05-04  2:12   ` [PATCH v2 12/13] merge-ort: handle interactions of caching and rename/rename(1to1) cases Elijah Newren via GitGitGadget
2021-05-04  2:12   ` [PATCH v2 13/13] merge-ort, diffcore-rename: employ cached renames when possible Elijah Newren via GitGitGadget
2021-05-14 17:37   ` [PATCH v2 00/13] Optimization batch 11: avoid repeatedly detecting same renames Elijah Newren
2021-05-14 21:04     ` Derrick Stolee

git@vger.kernel.org list mirror (unofficial, one of many)

This inbox may be cloned and mirrored by anyone:

	git clone --mirror https://public-inbox.org/git
	git clone --mirror http://ou63pmih66umazou.onion/git
	git clone --mirror http://czquwvybam4bgbro.onion/git
	git clone --mirror http://hjrcffqmbrq6wope.onion/git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V1 git git/ https://public-inbox.org/git \
		git@vger.kernel.org
	public-inbox-index git

Example config snippet for mirrors.
Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.version-control.git
	nntp://7fh6tueqddpjyxjmgtdiueylzoqt6pt7hec3pukyptlmohoowvhde4yd.onion/inbox.comp.version-control.git
	nntp://ie5yzdi7fg72h7s4sdcztq5evakq23rdt33mfyfcddc5u3ndnw24ogqd.onion/inbox.comp.version-control.git
	nntp://4uok3hntl7oi7b4uf4rtfwefqeexfzil2w6kgk2jn5z2f764irre7byd.onion/inbox.comp.version-control.git
	nntp://news.gmane.io/gmane.comp.version-control.git
 note: .onion URLs require Tor: https://www.torproject.org/

code repositories for project(s) associated with this inbox:

	https://80x24.org/mirrors/git.git

AGPL code for this site: git clone https://public-inbox.org/public-inbox.git