git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH 0/8] Optimization batch 9: avoid detecting irrelevant renames
@ 2021-02-28  3:58 Elijah Newren via GitGitGadget
  2021-02-28  3:58 ` [PATCH 1/8] diffcore-rename: enable filtering possible rename sources Elijah Newren via GitGitGadget
                   ` (10 more replies)
  0 siblings, 11 replies; 39+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-28  3:58 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Elijah Newren

This series depends textually on ort-perf-batch-8, but semantically it's
almost completely unrelated and can be reviewed without any familiarity with
any of the previous patch series.

=== Basic Optimization idea ===

This series determines paths which meet special cases where detection of
renames is irrelevant, where the irrelevance is due to the fact that the
merge machinery will arrive at the same result regardless of whether a
rename is detected for any of those paths. This series represents
"Optimization #2" from my Git Merge 2020 talk[1], though this series has
some improvements on the optimization relative to what I had at that time.

The basic idea here is that if side A of history:

 * only modifies/adds/deletes a few files
 * adds new files to few if any of the directories that side B deleted or
   renamed

then when we do rename detection on side B we can avoid even looking at most
(and perhaps even all) paths that side B deleted. Since commits being
rebased or cherry-picked tend to only modify a few files, this optimization
tends to be particularly effective for rebases and cherry-picks.

Basing rename detection on what the other side of history did to a file
means that extra information needs to be fed from merge-ort to
diffcore-rename in order to take advantage of such an optimization.

=== Comparison to previous series ===

This series differs from my two previous optimizations[2][3] (focusing on
basename-guided rename detection) in two important aspects:

 * there are no behavioral changes (there is no heuristic involved)

 * this optimization is merge specific (it does not help the diff/status/log
   family of commands, just merge/rebase/cherry-pick and such)

=== Results ===

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28), the
changes in just this series improves the performance as follows:

                     Before Series           After Series
no-renames:       12.596 s ±  0.061 s     5.680 s ±  0.096 s
mega-renames:    130.465 s ±  0.259 s    13.812 s ±  0.162 s
just-one-mega:     3.958 s ±  0.010 s   506.0  ms ±  3.9  ms


However, interestingly, if we had ignored the basename-guided rename
detection optimizations[2][3], then this optimization series would have
improved the performance as follows:

               Before Basename Series   After Just This Series
no-renames:      13.815 s ±  0.062 s      5.728 s ±  0.104 s
mega-renames:  1799.937 s ±  0.493 s     18.213 s ±  0.139 s
just-one-mega    51.289 s ±  0.019 s    891.9  ms ±  7.0  ms


As a reminder, before any merge-ort/diffcore-rename performance work, the
performance results we started with (as noted in the same commit message)
were:

no-renames-am:      6.940 s ±  0.485 s
no-renames:        18.912 s ±  0.174 s
mega-renames:    5964.031 s ± 10.459 s
just-one-mega:    149.583 s ±  0.751 s


=== Competition between optimizations ===

We now have three major rename-related optimizations:

 * exact rename detection
 * basename-guided rename detection[2][3]
 * skip-because-unnecessary (this series)

It is possible for all three to potentially apply for specific paths (they
do for the majority of renamed paths in our testcases), but we cannot use
more than one for any given path. It turns out that the priority we give
each optimization is very important and can drastically affect performance.
We get best results by prioritizing them as follows:

 1. exact rename detection
 2. skip-because-unnecessary
 3. basename-guided rename detection

The third-to-last patch of this series also discusses this ordering and
another minor variant of the skip-because-unnecessary optimization that was
tried (and resulted in less effective performance gains than reported here),
as well as some of the preparatory work over the past few years that this
series relies on in order to enable this optimization.

=== Near optimal? ===

You may remember that there was a row labelled "everything else" from the
commit message of 557ac0350d that represented the maximum possible speed-up
from accelerating rename detection alone; as stated in that commit, those
rows represented how fast the code could be if we had somehow infinitely
parallelized the inexact rename detection. However, if you compare those
"maximum speedup" numbers to what we have above, you'll note that the
infinitely parallelized inexact rename detection would have been slightly
slower than the results we have now achieved. (The reason this is possible,
despite the fact that we still spend time in rename detection after our
optimizations, is because we implemented two optimizations outside of
diffcore_rename() along the way.) However, this good news does also come
with a downside -- it means that our remaining optimization potential is
somewhat limited, and subsequent optimization series will have to fight for
much smaller gains.

[1]
https://github.com/newren/presentations/blob/pdfs/merge-performance/merge-performance-slides.pdf
[2]
https://lore.kernel.org/git/pull.843.git.1612651937.gitgitgadget@gmail.com/
[3]
https://lore.kernel.org/git/pull.844.git.1613289544.gitgitgadget@gmail.com/

Elijah Newren (8):
  diffcore-rename: enable filtering possible rename sources
  merge-ort: precompute subset of sources for which we need rename
    detection
  merge-ort: add data structures for an alternate tree traversal
  merge-ort: introduce wrappers for alternate tree traversal
  merge-ort: precompute whether directory rename detection is needed
  merge-ort: use relevant_sources to filter possible rename sources
  merge-ort: skip rename detection entirely if possible
  diffcore-rename: avoid doing basename comparisons for irrelevant
    sources

 diffcore-rename.c |  63 ++++++++++---
 diffcore.h        |   1 +
 merge-ort.c       | 236 +++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 285 insertions(+), 15 deletions(-)


base-commit: 4be565c472088d4144063b736308bf2a57331f45
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-845%2Fnewren%2Fort-perf-batch-9-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-845/newren/ort-perf-batch-9-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/845
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 1/8] diffcore-rename: enable filtering possible rename sources
  2021-02-28  3:58 [PATCH 0/8] Optimization batch 9: avoid detecting irrelevant renames Elijah Newren via GitGitGadget
@ 2021-02-28  3:58 ` Elijah Newren via GitGitGadget
  2021-02-28  3:58 ` [PATCH 2/8] merge-ort: precompute subset of sources for which we need rename detection Elijah Newren via GitGitGadget
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 39+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-28  3:58 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

Add the ability to diffcore_rename_extended() to allow external callers
to declare that they only need renames detected for a subset of source
files, and use that information to skip detecting renames for them.

There are two important pieces to this optimization that may not be
obvious at first glance:

  * We do not require callers to just filter the filepairs out
    to remove the non-relevant sources, because exact rename detection
    is fast and when it finds a match it can remove both a source and a
    destination whereas the relevant_sources filter can only remove a
    source.

  * We need to filter out the source pairs in a preliminary pass instead
    of adding a
       strset_contains(relevant_sources, one->path)
    check within the nested matrix loop.  The reason for that is if we
    have 30k renames, doing 30k * 30k = 900M strset_contains() calls
    becomes extraordinarily expensive and defeats the performance gains
    from this change; we only want to do 30k such calls instead.

If callers pass NULL for relevant_sources, that is special cases to
treat all sources as relevant.  Since all callers currently pass NULL,
this optimization does not yet have any effect.  Subsequent commits will
have merge-ort compute a set of relevant_sources to restrict which
sources we detect renames for, and have merge-ort pass that set of
relevant_sources to diffcore_rename_extended().

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 26 +++++++++++++++++++-------
 diffcore.h        |  1 +
 merge-ort.c       |  1 +
 3 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 1fe902ed2af0..7f6115fd9018 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -991,11 +991,12 @@ static int find_renames(struct diff_score *mx,
 	return count;
 }
 
-static void remove_unneeded_paths_from_src(int detecting_copies)
+static void remove_unneeded_paths_from_src(int detecting_copies,
+					   struct strset *interesting)
 {
 	int i, new_num_src;
 
-	if (detecting_copies)
+	if (detecting_copies && !interesting)
 		return; /* nothing to remove */
 	if (break_idx)
 		return; /* culling incompatible with break detection */
@@ -1022,12 +1023,18 @@ static void remove_unneeded_paths_from_src(int detecting_copies)
 	 *      from rename_src here.
 	 */
 	for (i = 0, new_num_src = 0; i < rename_src_nr; i++) {
+		struct diff_filespec *one = rename_src[i].p->one;
+
 		/*
 		 * renames are stored in rename_dst, so if a rename has
 		 * already been detected using this source, we can just
 		 * remove the source knowing rename_dst has its info.
 		 */
-		if (rename_src[i].p->one->rename_used)
+		if (!detecting_copies && one->rename_used)
+			continue;
+
+		/* If we don't care about the source path, skip it */
+		if (interesting && !strset_contains(interesting, one->path))
 			continue;
 
 		if (new_num_src < i)
@@ -1040,6 +1047,7 @@ static void remove_unneeded_paths_from_src(int detecting_copies)
 }
 
 void diffcore_rename_extended(struct diff_options *options,
+			      struct strset *relevant_sources,
 			      struct strset *dirs_removed,
 			      struct strmap *dir_rename_count)
 {
@@ -1060,6 +1068,8 @@ void diffcore_rename_extended(struct diff_options *options,
 	want_copies = (detect_rename == DIFF_DETECT_COPY);
 	if (dirs_removed && (break_idx || want_copies))
 		BUG("dirs_removed incompatible with break/copy detection");
+	if (break_idx && relevant_sources)
+		BUG("break detection incompatible with source specification");
 	if (!minimum_score)
 		minimum_score = DEFAULT_RENAME_SCORE;
 
@@ -1127,9 +1137,10 @@ void diffcore_rename_extended(struct diff_options *options,
 		/*
 		 * Cull sources:
 		 *   - remove ones corresponding to exact renames
+		 *   - remove ones not found in relevant_sources
 		 */
 		trace2_region_enter("diff", "cull after exact", options->repo);
-		remove_unneeded_paths_from_src(want_copies);
+		remove_unneeded_paths_from_src(want_copies, relevant_sources);
 		trace2_region_leave("diff", "cull after exact", options->repo);
 	} else {
 		/* Determine minimum score to match basenames */
@@ -1148,7 +1159,7 @@ void diffcore_rename_extended(struct diff_options *options,
 		 *   - remove ones involved in renames (found via exact match)
 		 */
 		trace2_region_enter("diff", "cull after exact", options->repo);
-		remove_unneeded_paths_from_src(want_copies);
+		remove_unneeded_paths_from_src(want_copies, NULL);
 		trace2_region_leave("diff", "cull after exact", options->repo);
 
 		/* Preparation for basename-driven matching. */
@@ -1167,9 +1178,10 @@ void diffcore_rename_extended(struct diff_options *options,
 		/*
 		 * Cull sources, again:
 		 *   - remove ones involved in renames (found via basenames)
+		 *   - remove ones not found in relevant_sources
 		 */
 		trace2_region_enter("diff", "cull basename", options->repo);
-		remove_unneeded_paths_from_src(want_copies);
+		remove_unneeded_paths_from_src(want_copies, relevant_sources);
 		trace2_region_leave("diff", "cull basename", options->repo);
 	}
 
@@ -1342,5 +1354,5 @@ void diffcore_rename_extended(struct diff_options *options,
 
 void diffcore_rename(struct diff_options *options)
 {
-	diffcore_rename_extended(options, NULL, NULL);
+	diffcore_rename_extended(options, NULL, NULL, NULL);
 }
diff --git a/diffcore.h b/diffcore.h
index c6ba64abd198..737c93a6cc79 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -166,6 +166,7 @@ void partial_clear_dir_rename_count(struct strmap *dir_rename_count);
 void diffcore_break(struct repository *, int);
 void diffcore_rename(struct diff_options *);
 void diffcore_rename_extended(struct diff_options *options,
+			      struct strset *relevant_sources,
 			      struct strset *dirs_removed,
 			      struct strmap *dir_rename_count);
 void diffcore_merge_broken(void);
diff --git a/merge-ort.c b/merge-ort.c
index 467404cc0a35..aba0b9fa54c3 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -2029,6 +2029,7 @@ static void detect_regular_renames(struct merge_options *opt,
 	diff_queued_diff = renames->pairs[side_index];
 	trace2_region_enter("diff", "diffcore_rename", opt->repo);
 	diffcore_rename_extended(&diff_opts,
+				 NULL,
 				 &renames->dirs_removed[side_index],
 				 &renames->dir_rename_count[side_index]);
 	trace2_region_leave("diff", "diffcore_rename", opt->repo);
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 2/8] merge-ort: precompute subset of sources for which we need rename detection
  2021-02-28  3:58 [PATCH 0/8] Optimization batch 9: avoid detecting irrelevant renames Elijah Newren via GitGitGadget
  2021-02-28  3:58 ` [PATCH 1/8] diffcore-rename: enable filtering possible rename sources Elijah Newren via GitGitGadget
@ 2021-02-28  3:58 ` Elijah Newren via GitGitGadget
  2021-02-28  3:58 ` [PATCH 3/8] merge-ort: add data structures for an alternate tree traversal Elijah Newren via GitGitGadget
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 39+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-28  3:58 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

rename detection works by trying to pair all file deletions (or
"sources") with all file additions (or "destinations"), checking
similarity, and then marking the sufficiently similar ones as renames.
This can be expensive if there are many sources and destinations on a
given side of history as it results in an N x M comparison matrix.
However, there are many cases where we can compute in advance that
detecting renames for some of the sources provides no useful information
and thus that we can exclude those sources from the matrix.

To see why, first note that the merge machinery uses detected renames in
two ways:

   * directory rename detection: when one side of history renames a
       directory, and the other side of history adds new files to that
       directory, we want to be able to warn the user about the need to
       chose whether those new files stay in the old directory or move
       to the new one.

   * three-way content merging: in order to do three-way content merging
       of files, we need three different file versions.  If one side of
       history renamed a file, then some of the content for the file is
       found under a different path than in the merge base or on the
       other side of history.

This commit concentrates just on the three-way content merging; it will
punt and mark all sources as needed for directory rename detection, and
leave it to future commits to narrow that down more.

The point of three-way content merging is to reconcile changes made on
*both* sides of history.  What if the file wasn't modified on both
sides?  There are two possibilities:

   * If it wasn't modified on the renamed side:
       -> then we get to do exact rename detection, which is cheap.

   * If it wasn't modified on the unrenamed side:
       -> then detection of a rename for that source file is irrelevant

That latter claim might be surprising at first, so let's walk through a
case to show why rename detection for that source file is irrelevant.
Let's use two filenames, old.c & new.c, with the following abbreviated
object ids (and where the value '000000' is used to denote that the file
is missing in that commit):

                 old.c     new.c
   MERGE_BASE:   01d01d    000000
   MERGE_SIDE1:  01d01d    000000
   MERGE_SIDE2:  000000    5e1ec7

If the rename *isn't* detected:
   then old.c looks like it was unmodified on one side and deleted on
   the other and should thus be removed.  new.c looks like a new file we
   should keep as-is.

If the rename *is* detected:
   then a three-way content merge is done.  Since the version of the
   file in MERGE_BASE and MERGE_SIDE1 are identical, the three-way merge
   will produce exactly the version of the file whose abbreviated
   object id is 5e1ec7.  It will record that file at the path new.c,
   while removing old.c from the directory.

Note that these two results are identical -- a single file named 'new.c'
with object id 5e1ec7.  In other words, it doesn't matter if the rename
is detected in the case where the file is unmodified on the unrenamed
side.

Use this information to compute whether we need rename detection for
each source created in add_pair().

It's probably worth noting that there used to be a few other edge or
corner cases besides three-way content merges and directory rename
detection where lack of rename detection could have affected the result,
but those cases actually highlighted where conflict resolution methods
were not consistent with each other.  Fixing those inconsistencies were
thus critically important to enabling this optimization.  That work
involved the following:

 * bringing consistency to add/add, rename/add, and rename/rename
    conflict types, as done back in the topic merged at commit
    ac193e0e0a ("Merge branch 'en/merge-path-collision'", 2019-01-04),
    and further extended in commits 2a7c16c980 ("t6422, t6426: be more
    flexible for add/add conflicts involving renames", 2020-08-10) and
    e8eb99d4a6 ("t642[23]: be more flexible for add/add conflicts
    involving pair renames", 2020-08-10)

  * making rename/delete more consistent with modify/delete
    as done in commits 1f3c9ba707 ("t6425: be more flexible with
    rename/delete conflict messages", 2020-08-10) and 727c75b23f
    ("t6404, t6423: expect improved rename/delete handling in ort
    backend", 2020-10-26)

Since the set of relevant_sources we compute has not yet been narrowed
down for directory rename detection, we do not pass it to
diffcore_rename_extended() yet.  That will be done after subsequent
commits narrow down the list of relevant_sources needed for directory
rename detection reasons.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 35 ++++++++++++++++++++++++++++++++---
 1 file changed, 32 insertions(+), 3 deletions(-)

diff --git a/merge-ort.c b/merge-ort.c
index aba0b9fa54c3..83aa4c08121f 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -88,6 +88,20 @@ struct rename_info {
 	 */
 	struct strmap dir_renames[3];
 
+	/*
+	 * relevant_sources: deleted paths for which we need rename detection
+	 *
+	 * relevant_sources is a set of deleted paths on each side of
+	 * history for which we need rename detection.  If a path is deleted
+	 * on one side of history, we need to detect if it is part of a
+	 * rename if either
+	 *    * we need to detect renames for an ancestor directory
+	 *    * the file is modified/deleted on the other side of history
+	 * If neither of those are true, we can skip rename detection for
+	 * that path.
+	 */
+	struct strset relevant_sources[3];
+
 	/*
 	 * needed_limit: value needed for inexact rename detection to run
 	 *
@@ -358,6 +372,8 @@ static void clear_or_reinit_internal_opts(struct merge_options_internal *opti,
 			strmap_clear(&renames->dir_rename_count[i], 1);
 
 		strmap_func(&renames->dir_renames[i], 0);
+
+		strset_func(&renames->relevant_sources[i]);
 	}
 
 	if (!reinitialize) {
@@ -533,12 +549,21 @@ static void add_pair(struct merge_options *opt,
 		     struct name_entry *names,
 		     const char *pathname,
 		     unsigned side,
-		     unsigned is_add /* if false, is_delete */)
+		     unsigned is_add /* if false, is_delete */,
+		     unsigned match_mask)
 {
 	struct diff_filespec *one, *two;
 	struct rename_info *renames = &opt->priv->renames;
 	int names_idx = is_add ? side : 0;
 
+	if (!is_add) {
+		unsigned content_relevant = (match_mask == 0);
+		unsigned location_relevant = 1; /* FIXME: compute this */
+
+		if (content_relevant || location_relevant)
+			strset_add(&renames->relevant_sources[side], pathname);
+	}
+
 	one = alloc_filespec(pathname);
 	two = alloc_filespec(pathname);
 	fill_filespec(is_add ? two : one,
@@ -575,11 +600,13 @@ static void collect_rename_info(struct merge_options *opt,
 
 		/* Check for deletion on side */
 		if ((filemask & 1) && !(filemask & side_mask))
-			add_pair(opt, names, fullname, side, 0 /* delete */);
+			add_pair(opt, names, fullname, side, 0 /* delete */,
+				 match_mask & filemask);
 
 		/* Check for addition on side */
 		if (!(filemask & 1) && (filemask & side_mask))
-			add_pair(opt, names, fullname, side, 1 /* add */);
+			add_pair(opt, names, fullname, side, 1 /* add */,
+				 match_mask & filemask);
 	}
 }
 
@@ -3228,6 +3255,8 @@ static void merge_start(struct merge_options *opt, struct merge_result *result)
 					 NULL, 1);
 		strmap_init_with_options(&renames->dir_renames[i],
 					 NULL, 0);
+		strset_init_with_options(&renames->relevant_sources[i],
+					 NULL, 0);
 	}
 
 	/*
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 3/8] merge-ort: add data structures for an alternate tree traversal
  2021-02-28  3:58 [PATCH 0/8] Optimization batch 9: avoid detecting irrelevant renames Elijah Newren via GitGitGadget
  2021-02-28  3:58 ` [PATCH 1/8] diffcore-rename: enable filtering possible rename sources Elijah Newren via GitGitGadget
  2021-02-28  3:58 ` [PATCH 2/8] merge-ort: precompute subset of sources for which we need rename detection Elijah Newren via GitGitGadget
@ 2021-02-28  3:58 ` Elijah Newren via GitGitGadget
  2021-02-28  3:58 ` [PATCH 4/8] merge-ort: introduce wrappers for " Elijah Newren via GitGitGadget
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 39+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-28  3:58 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

In order to determine whether directory rename detection is needed, we
as a pre-requisite need a way to traverse through all the files in a
given tree before visiting any directories within that tree.
traverse_trees() only iterates through the entries in the order they
appear, so add some data structures that will store all the entries as
we iterate through them in traverse_trees(), which will allow us to
re-traverse them in our desired order.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/merge-ort.c b/merge-ort.c
index 83aa4c08121f..d49cfa8b030b 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -51,6 +51,12 @@ enum merge_side {
 	MERGE_SIDE2 = 2
 };
 
+struct traversal_callback_data {
+	unsigned long mask;
+	unsigned long dirmask;
+	struct name_entry names[3];
+};
+
 struct rename_info {
 	/*
 	 * All variables that are arrays of size 3 correspond to data tracked
@@ -102,6 +108,22 @@ struct rename_info {
 	 */
 	struct strset relevant_sources[3];
 
+	/*
+	 * callback_data_*: supporting data structures for alternate traversal
+	 *
+	 * We sometimes need to be able to traverse through all the files
+	 * in a given tree before all immediate subdirectories within that
+	 * tree.  Since traverse_trees() doesn't do that naturally, we have
+	 * a traverse_trees_wrapper() that stores any immediate
+	 * subdirectories while traversing files, then traverses the
+	 * immediate subdirectories later.  These callback_data* variables
+	 * store the information for the subdirectories so that we can do
+	 * that traversal order.
+	 */
+	struct traversal_callback_data *callback_data;
+	int callback_data_nr, callback_data_alloc;
+	char *callback_data_traverse_path;
+
 	/*
 	 * needed_limit: value needed for inexact rename detection to run
 	 *
@@ -396,6 +418,10 @@ static void clear_or_reinit_internal_opts(struct merge_options_internal *opti,
 		}
 		strmap_clear(&opti->output, 0);
 	}
+
+	/* Clean out callback_data as well. */
+	FREE_AND_NULL(renames->callback_data);
+	renames->callback_data_nr = renames->callback_data_alloc = 0;
 }
 
 static int err(struct merge_options *opt, const char *err, ...)
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 4/8] merge-ort: introduce wrappers for alternate tree traversal
  2021-02-28  3:58 [PATCH 0/8] Optimization batch 9: avoid detecting irrelevant renames Elijah Newren via GitGitGadget
                   ` (2 preceding siblings ...)
  2021-02-28  3:58 ` [PATCH 3/8] merge-ort: add data structures for an alternate tree traversal Elijah Newren via GitGitGadget
@ 2021-02-28  3:58 ` Elijah Newren via GitGitGadget
  2021-02-28  3:58 ` [PATCH 5/8] merge-ort: precompute whether directory rename detection is needed Elijah Newren via GitGitGadget
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 39+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-28  3:58 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

Add traverse_trees_wrapper() and traverse_trees_wrapper_callback()
functions.  The former runs traverse_trees() with info->fn set to
traverse_trees_wrapper_callback, in order to simply save all the entries
without processing or recursing into any of them.  This step allows
extra computation to be done (e.g. checking some condition across all
files) that can be used later.  Then, after that is completed, it
iterates over all the saved entries and calls the original info->fn
callback with the saved data.

Currently, this does nothing more than marginally slowing down the tree
traversal since we do not take advantage of the opportunity to compute
anything special in traverse_trees_wrapper_callback(), and thus the real
callback will be called identically as it would have been without this
extra wrapper.  However, a subsequent commit will add some special
computation of some values that the real callback will be able to use.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 72 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 72 insertions(+)

diff --git a/merge-ort.c b/merge-ort.c
index d49cfa8b030b..bd2b93a31141 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -512,6 +512,78 @@ static char *unique_path(struct strmap *existing_paths,
 
 /*** Function Grouping: functions related to collect_merge_info() ***/
 
+static int traverse_trees_wrapper_callback(int n,
+					   unsigned long mask,
+					   unsigned long dirmask,
+					   struct name_entry *names,
+					   struct traverse_info *info)
+{
+	struct merge_options *opt = info->data;
+	struct rename_info *renames = &opt->priv->renames;
+
+	assert(n==3);
+
+	if (!renames->callback_data_traverse_path)
+		renames->callback_data_traverse_path = xstrdup(info->traverse_path);
+
+	ALLOC_GROW(renames->callback_data, renames->callback_data_nr + 1,
+		   renames->callback_data_alloc);
+	renames->callback_data[renames->callback_data_nr].mask = mask;
+	renames->callback_data[renames->callback_data_nr].dirmask = dirmask;
+	COPY_ARRAY(renames->callback_data[renames->callback_data_nr].names,
+		   names, 3);
+	renames->callback_data_nr++;
+
+	return mask;
+}
+
+/*
+ * Much like traverse_trees(), BUT:
+ *   - read all the tree entries FIRST, saving them
+ *   - note that the above step provides an opportunity to compute necessary
+ *     additional details before the "real" traversal
+ *   - loop through the saved entries and call the original callback on them
+ */
+MAYBE_UNUSED
+static int traverse_trees_wrapper(struct index_state *istate,
+				  int n,
+				  struct tree_desc *t,
+				  struct traverse_info *info)
+{
+	int ret, i, old_offset;
+	traverse_callback_t old_fn;
+	char *old_callback_data_traverse_path;
+	struct merge_options *opt = info->data;
+	struct rename_info *renames = &opt->priv->renames;
+
+	old_callback_data_traverse_path = renames->callback_data_traverse_path;
+	old_fn = info->fn;
+	old_offset = renames->callback_data_nr;
+
+	renames->callback_data_traverse_path = NULL;
+	info->fn = traverse_trees_wrapper_callback;
+	ret = traverse_trees(istate, n, t, info);
+	if (ret < 0)
+		return ret;
+
+	info->traverse_path = renames->callback_data_traverse_path;
+	info->fn = old_fn;
+	for (i = old_offset; i < renames->callback_data_nr; ++i) {
+		info->fn(n,
+			 renames->callback_data[i].mask,
+			 renames->callback_data[i].dirmask,
+			 renames->callback_data[i].names,
+			 info);
+
+	}
+
+	renames->callback_data_nr = old_offset;
+	free(renames->callback_data_traverse_path);
+	renames->callback_data_traverse_path = old_callback_data_traverse_path;
+	info->traverse_path = NULL;
+	return 0;
+}
+
 static void setup_path_info(struct merge_options *opt,
 			    struct string_list_item *result,
 			    const char *current_dir_name,
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 5/8] merge-ort: precompute whether directory rename detection is needed
  2021-02-28  3:58 [PATCH 0/8] Optimization batch 9: avoid detecting irrelevant renames Elijah Newren via GitGitGadget
                   ` (3 preceding siblings ...)
  2021-02-28  3:58 ` [PATCH 4/8] merge-ort: introduce wrappers for " Elijah Newren via GitGitGadget
@ 2021-02-28  3:58 ` Elijah Newren via GitGitGadget
  2021-02-28  3:58 ` [PATCH 6/8] merge-ort: use relevant_sources to filter possible rename sources Elijah Newren via GitGitGadget
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 39+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-28  3:58 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

The point of directory rename detection is that if one side of history
renames a directory, and the other side adds new files under the old
directory, then the merge can move those new files into the new
directory.  This leads to the following important observation:

  * If the other side does not add any new files under the old
    directory, we do not need to detect any renames for that directory.

Similarly, directory rename detection had an important requirement:

  * If a directory still exists on one side of history, it has not been
    renamed on that side of history.  (See section 4 of t6423 or
    Documentation/technical/directory-rename-detection.txt for more
    details).

Using these two bits of information, we note that directory rename
detection is only needed in cases where (1) directories exist in the
merge base and on one side of history (i.e. dirmask == 3 or dirmask ==
5), and (2) where there is some new file added to that directory on the
side where it still exists (thus where the file has filemask == 2 or
filemask == 4, respectively).  This has to be done in two steps, because
we have the dirmask when we are first considering the directory, and
won't get the filemasks for the files within it until we recurse into
that directory.  So, we save
  dir_rename_mask = dirmask - 1
when we hit a directory that is missing on one side, and then later look
for cases of
  filemask == dir_rename_mask

One final note is that as soon as we hit a directory that needs
directory rename detection, we will need to detect renames in all
subdirectories of that directory as well due to the "majority rules"
decision when files are renamed into different directory hierarchies.
We arbitrarily use the special value of 0x07 to record when we've hit
such a directory.

The combination of all the above mean that we introduce a variable
named dir_rename_mask (couldn't think of a better name) which has one
of the following values as we traverse into a directory:
   * 0x00: directory rename detection not needed
   * 0x02 or 0x04: directory rename detection only needed if files added
   * 0x07: directory rename detection definitely needed

We then pass this value through to add_pairs() so that it can mark
location_relevant as true only when dir_rename_mask is 0x07.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 67 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 61 insertions(+), 6 deletions(-)

diff --git a/merge-ort.c b/merge-ort.c
index bd2b93a31141..27acaa7380be 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -108,6 +108,14 @@ struct rename_info {
 	 */
 	struct strset relevant_sources[3];
 
+	/*
+	 * dir_rename_mask:
+	 *   0: optimization removing unmodified potential rename source okay
+	 *   2 or 4: optimization okay, but must check for files added to dir
+	 *   7: optimization forbidden; need rename source in case of dir rename
+	 */
+	unsigned dir_rename_mask:3;
+
 	/*
 	 * callback_data_*: supporting data structures for alternate traversal
 	 *
@@ -419,6 +427,8 @@ static void clear_or_reinit_internal_opts(struct merge_options_internal *opti,
 		strmap_clear(&opti->output, 0);
 	}
 
+	renames->dir_rename_mask = 0;
+
 	/* Clean out callback_data as well. */
 	FREE_AND_NULL(renames->callback_data);
 	renames->callback_data_nr = renames->callback_data_alloc = 0;
@@ -520,12 +530,16 @@ static int traverse_trees_wrapper_callback(int n,
 {
 	struct merge_options *opt = info->data;
 	struct rename_info *renames = &opt->priv->renames;
+	unsigned filemask = mask & ~dirmask;
 
 	assert(n==3);
 
 	if (!renames->callback_data_traverse_path)
 		renames->callback_data_traverse_path = xstrdup(info->traverse_path);
 
+	if (filemask && filemask == renames->dir_rename_mask)
+		renames->dir_rename_mask = 0x07;
+
 	ALLOC_GROW(renames->callback_data, renames->callback_data_nr + 1,
 		   renames->callback_data_alloc);
 	renames->callback_data[renames->callback_data_nr].mask = mask;
@@ -544,7 +558,6 @@ static int traverse_trees_wrapper_callback(int n,
  *     additional details before the "real" traversal
  *   - loop through the saved entries and call the original callback on them
  */
-MAYBE_UNUSED
 static int traverse_trees_wrapper(struct index_state *istate,
 				  int n,
 				  struct tree_desc *t,
@@ -556,6 +569,8 @@ static int traverse_trees_wrapper(struct index_state *istate,
 	struct merge_options *opt = info->data;
 	struct rename_info *renames = &opt->priv->renames;
 
+	assert(renames->dir_rename_mask == 2 || renames->dir_rename_mask == 4);
+
 	old_callback_data_traverse_path = renames->callback_data_traverse_path;
 	old_fn = info->fn;
 	old_offset = renames->callback_data_nr;
@@ -648,7 +663,8 @@ static void add_pair(struct merge_options *opt,
 		     const char *pathname,
 		     unsigned side,
 		     unsigned is_add /* if false, is_delete */,
-		     unsigned match_mask)
+		     unsigned match_mask,
+		     unsigned dir_rename_mask)
 {
 	struct diff_filespec *one, *two;
 	struct rename_info *renames = &opt->priv->renames;
@@ -656,7 +672,7 @@ static void add_pair(struct merge_options *opt,
 
 	if (!is_add) {
 		unsigned content_relevant = (match_mask == 0);
-		unsigned location_relevant = 1; /* FIXME: compute this */
+		unsigned location_relevant = (dir_rename_mask == 0x07);
 
 		if (content_relevant || location_relevant)
 			strset_add(&renames->relevant_sources[side], pathname);
@@ -680,6 +696,36 @@ static void collect_rename_info(struct merge_options *opt,
 	struct rename_info *renames = &opt->priv->renames;
 	unsigned side;
 
+	/*
+	 * Update dir_rename_mask (determines ignore-rename-source validity)
+	 *
+	 * dir_rename_mask helps us keep track of when directory rename
+	 * detection may be relevant.  Basically, whenver a directory is
+	 * removed on one side of history, and a file is added to that
+	 * directory on the other side of history, directory rename
+	 * detection is relevant (meaning we have to detect renames for all
+	 * files within that directory to deduce where the directory
+	 * moved).  Also, whenever a directory needs directory rename
+	 * detection, due to the "majority rules" choice for where to move
+	 * it (see t6423 testcase 1f), we also need to detect renames for
+	 * all files within subdirectories of that directory as well.
+	 *
+	 * Here we haven't looked at files within the directory yet, we are
+	 * just looking at the directory itself.  So, if we aren't yet in
+	 * a case where a parent directory needed directory rename detection
+	 * (i.e. dir_rename_mask != 0x07), and if the directory was removed
+	 * on one side of history, record the mask of the other side of
+	 * history in dir_rename_mask.
+	 */
+	if (renames->dir_rename_mask != 0x07 &&
+	    (dirmask == 3 || dirmask == 5)) {
+		/* simple sanity check */
+		assert(renames->dir_rename_mask == 0 ||
+		       renames->dir_rename_mask == (dirmask & ~1));
+		/* update dir_rename_mask; have it record mask of new side */
+		renames->dir_rename_mask = (dirmask & ~1);
+	}
+
 	/* Update dirs_removed, as needed */
 	if (dirmask == 1 || dirmask == 3 || dirmask == 5) {
 		/* absent_mask = 0x07 - dirmask; sides = absent_mask/2 */
@@ -699,12 +745,14 @@ static void collect_rename_info(struct merge_options *opt,
 		/* Check for deletion on side */
 		if ((filemask & 1) && !(filemask & side_mask))
 			add_pair(opt, names, fullname, side, 0 /* delete */,
-				 match_mask & filemask);
+				 match_mask & filemask,
+				 renames->dir_rename_mask);
 
 		/* Check for addition on side */
 		if (!(filemask & 1) && (filemask & side_mask))
 			add_pair(opt, names, fullname, side, 1 /* add */,
-				 match_mask & filemask);
+				 match_mask & filemask,
+				 renames->dir_rename_mask);
 	}
 }
 
@@ -722,12 +770,14 @@ static int collect_merge_info_callback(int n,
 	 */
 	struct merge_options *opt = info->data;
 	struct merge_options_internal *opti = opt->priv;
+	struct rename_info *renames = &opt->priv->renames;
 	struct string_list_item pi;  /* Path Info */
 	struct conflict_info *ci; /* typed alias to pi.util (which is void*) */
 	struct name_entry *p;
 	size_t len;
 	char *fullpath;
 	const char *dirname = opti->current_dir_name;
+	unsigned prev_dir_rename_mask = renames->dir_rename_mask;
 	unsigned filemask = mask & ~dirmask;
 	unsigned match_mask = 0; /* will be updated below */
 	unsigned mbase_null = !(mask & 1);
@@ -868,8 +918,13 @@ static int collect_merge_info_callback(int n,
 
 		original_dir_name = opti->current_dir_name;
 		opti->current_dir_name = pi.string;
-		ret = traverse_trees(NULL, 3, t, &newinfo);
+		if (renames->dir_rename_mask == 0 ||
+		    renames->dir_rename_mask == 0x07)
+			ret = traverse_trees(NULL, 3, t, &newinfo);
+		else
+			ret = traverse_trees_wrapper(NULL, 3, t, &newinfo);
 		opti->current_dir_name = original_dir_name;
+		renames->dir_rename_mask = prev_dir_rename_mask;
 
 		for (i = MERGE_BASE; i <= MERGE_SIDE2; i++)
 			free(buf[i]);
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 6/8] merge-ort: use relevant_sources to filter possible rename sources
  2021-02-28  3:58 [PATCH 0/8] Optimization batch 9: avoid detecting irrelevant renames Elijah Newren via GitGitGadget
                   ` (4 preceding siblings ...)
  2021-02-28  3:58 ` [PATCH 5/8] merge-ort: precompute whether directory rename detection is needed Elijah Newren via GitGitGadget
@ 2021-02-28  3:58 ` Elijah Newren via GitGitGadget
  2021-02-28  3:58 ` [PATCH 7/8] merge-ort: skip rename detection entirely if possible Elijah Newren via GitGitGadget
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 39+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-28  3:58 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

The past several commits determined conditions when rename sources might
be needed, and filled a relevant_sources strset with those paths.  Pass
these along to diffcore_rename_extended() to use to limit the sources
that we need to detect renames for.

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:

                            Before                  After
    no-renames:       12.596 s ±  0.061 s     6.003 s ±  0.048 s
    mega-renames:    130.465 s ±  0.259 s   114.009 s ±  0.236 s
    just-one-mega:     3.958 s ±  0.010 s     3.489 s ±  0.017 s

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/merge-ort.c b/merge-ort.c
index 27acaa7380be..66892c63cee2 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -2209,7 +2209,7 @@ static void detect_regular_renames(struct merge_options *opt,
 	diff_queued_diff = renames->pairs[side_index];
 	trace2_region_enter("diff", "diffcore_rename", opt->repo);
 	diffcore_rename_extended(&diff_opts,
-				 NULL,
+				 &renames->relevant_sources[side_index],
 				 &renames->dirs_removed[side_index],
 				 &renames->dir_rename_count[side_index]);
 	trace2_region_leave("diff", "diffcore_rename", opt->repo);
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 7/8] merge-ort: skip rename detection entirely if possible
  2021-02-28  3:58 [PATCH 0/8] Optimization batch 9: avoid detecting irrelevant renames Elijah Newren via GitGitGadget
                   ` (5 preceding siblings ...)
  2021-02-28  3:58 ` [PATCH 6/8] merge-ort: use relevant_sources to filter possible rename sources Elijah Newren via GitGitGadget
@ 2021-02-28  3:58 ` Elijah Newren via GitGitGadget
  2021-02-28  3:58 ` [PATCH 8/8] diffcore-rename: avoid doing basename comparisons for irrelevant sources Elijah Newren via GitGitGadget
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 39+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-28  3:58 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

diffcore_rename_extended() will do a bunch of setup, then check for
exact renames, then abort before inexact rename detection if there are
no more sources or destinations that need to be matched.  It will
sometimes be the case, however, that either
  * we start with neither any sources or destinations
  * we start with no *relevant* sources
In the first of these two cases, the setup and exact rename detection
will be very cheap since there are 0 files to operate on.  In the second
case, it is quite possible to have thousands of files with none of the
source ones being relevant.  Avoid calling diffcore_rename_extended() or
even some of the setup before diffcore_rename_extended() when we can
determine that rename detection is unnecessary.

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:

                            Before                  After
    no-renames:        6.003 s ±  0.048 s     5.708 s ±  0.111 s
    mega-renames:    114.009 s ±  0.236 s   102.171 s ±  0.440 s
    just-one-mega:     3.489 s ±  0.017 s     3.471 s ±  0.015 s

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/merge-ort.c b/merge-ort.c
index 66892c63cee2..8602c7b8936f 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -2158,6 +2158,19 @@ static int process_renames(struct merge_options *opt,
 	return clean_merge;
 }
 
+static inline int possible_side_renames(struct rename_info *renames,
+					unsigned side_index)
+{
+	return renames->pairs[side_index].nr > 0 &&
+	       !strset_empty(&renames->relevant_sources[side_index]);
+}
+
+static inline int possible_renames(struct rename_info *renames)
+{
+	return possible_side_renames(renames, 1) ||
+	       possible_side_renames(renames, 2);
+}
+
 static void resolve_diffpair_statuses(struct diff_queue_struct *q)
 {
 	/*
@@ -2194,6 +2207,16 @@ static void detect_regular_renames(struct merge_options *opt,
 	struct diff_options diff_opts;
 	struct rename_info *renames = &opt->priv->renames;
 
+	if (!possible_side_renames(renames, side_index)) {
+		/*
+		 * No rename detection needed for this side, but we still need
+		 * to make sure 'adds' are marked correctly in case the other
+		 * side had directory renames.
+		 */
+		resolve_diffpair_statuses(&renames->pairs[side_index]);
+		return;
+	}
+
 	repo_diff_setup(opt->repo, &diff_opts);
 	diff_opts.flags.recursive = 1;
 	diff_opts.flags.rename_empty = 0;
@@ -2311,6 +2334,8 @@ static int detect_and_process_renames(struct merge_options *opt,
 	int need_dir_renames, s, clean = 1;
 
 	memset(&combined, 0, sizeof(combined));
+	if (!possible_renames(renames))
+		goto cleanup;
 
 	trace2_region_enter("merge", "regular renames", opt->repo);
 	detect_regular_renames(opt, MERGE_SIDE1);
@@ -2345,6 +2370,26 @@ static int detect_and_process_renames(struct merge_options *opt,
 	clean &= process_renames(opt, &combined);
 	trace2_region_leave("merge", "process renames", opt->repo);
 
+	goto simple_cleanup; /* collect_renames() handles some of cleanup */
+
+cleanup:
+	/*
+	 * Free now unneeded filepairs, which would have been handled
+	 * in collect_renames() normally but we're about to skip that
+	 * code...
+	 */
+	for (s = MERGE_SIDE1; s <= MERGE_SIDE2; s++) {
+		struct diff_queue_struct *side_pairs;
+		int i;
+
+		side_pairs = &renames->pairs[s];
+		for (i = 0; i < side_pairs->nr; ++i) {
+			struct diff_filepair *p = side_pairs->queue[i];
+			diff_free_filepair(p);
+		}
+	}
+
+simple_cleanup:
 	/* Free memory for renames->pairs[] and combined */
 	for (s = MERGE_SIDE1; s <= MERGE_SIDE2; s++) {
 		free(renames->pairs[s].queue);
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 8/8] diffcore-rename: avoid doing basename comparisons for irrelevant sources
  2021-02-28  3:58 [PATCH 0/8] Optimization batch 9: avoid detecting irrelevant renames Elijah Newren via GitGitGadget
                   ` (6 preceding siblings ...)
  2021-02-28  3:58 ` [PATCH 7/8] merge-ort: skip rename detection entirely if possible Elijah Newren via GitGitGadget
@ 2021-02-28  3:58 ` Elijah Newren via GitGitGadget
  2021-03-01 16:39 ` [PATCH 0/8] Optimization batch 9: avoid detecting irrelevant renames Elijah Newren
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 39+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-28  3:58 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

The basename comparison optimization implemented in
find_basename_matches() is very beneficial since it allows a source to
sometimes only be compared with one other file instead of N other files.
When a match is found, both a source and destination can be removed from
the matrix of inexact rename comparisons.  In contrast, the irrelevant
source optimization only allows us to remove a source from the matrix of
inexact rename comparisons...but it has the advantage of allowing a
source file to not even be loaded into memory at all and be compared to
0 other files.  Generally, not even comparing is a bigger performance
win, so when both optimizations could apply, prefer to use the
irrelevant-source optimization.

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:

                            Before                  After
    no-renames:        5.708 s ±  0.111 s     5.680 s ±  0.096 s
    mega-renames:    102.171 s ±  0.440 s    13.812 s ±  0.162 s
    just-one-mega:     3.471 s ±  0.015 s   506.0  ms ±  3.9  ms

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 37 +++++++++++++++++++++++++++++++++----
 1 file changed, 33 insertions(+), 4 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 7f6115fd9018..e8508541be14 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -527,6 +527,7 @@ static void update_dir_rename_counts(struct dir_rename_info *info,
 }
 
 static void initialize_dir_rename_info(struct dir_rename_info *info,
+				       struct strset *relevant_sources,
 				       struct strset *dirs_removed,
 				       struct strmap *dir_rename_count)
 {
@@ -534,7 +535,7 @@ static void initialize_dir_rename_info(struct dir_rename_info *info,
 	struct strmap_entry *entry;
 	int i;
 
-	if (!dirs_removed) {
+	if (!dirs_removed && !relevant_sources) {
 		info->setup = 0;
 		return;
 	}
@@ -549,7 +550,20 @@ static void initialize_dir_rename_info(struct dir_rename_info *info,
 	strmap_init_with_options(&info->dir_rename_guess, NULL, 0);
 
 	/* Setup info->relevant_source_dirs */
-	info->relevant_source_dirs = dirs_removed;
+	info->relevant_source_dirs = NULL;
+	if (dirs_removed || !relevant_sources) {
+		info->relevant_source_dirs = dirs_removed; /* might be NULL */
+	} else {
+		info->relevant_source_dirs = xmalloc(sizeof(struct strintmap));
+		strset_init(info->relevant_source_dirs);
+		strset_for_each_entry(relevant_sources, &iter, entry) {
+			char *dirname = get_dirname(entry->key);
+			if (!dirs_removed ||
+			    strset_contains(dirs_removed, dirname))
+				strset_add(info->relevant_source_dirs, dirname);
+			free(dirname);
+		}
+	}
 
 	/*
 	 * Loop setting up both info->idx_map, and doing setup of
@@ -627,6 +641,13 @@ static void cleanup_dir_rename_info(struct dir_rename_info *info,
 	/* dir_rename_guess */
 	strmap_clear(&info->dir_rename_guess, 1);
 
+	/* relevant_source_dirs */
+	if (info->relevant_source_dirs &&
+	    info->relevant_source_dirs != dirs_removed) {
+		strset_clear(info->relevant_source_dirs);
+		FREE_AND_NULL(info->relevant_source_dirs);
+	}
+
 	/* dir_rename_count */
 	if (!keep_dir_rename_count) {
 		partial_clear_dir_rename_count(info->dir_rename_count);
@@ -749,6 +770,7 @@ static int idx_possible_rename(char *filename, struct dir_rename_info *info)
 static int find_basename_matches(struct diff_options *options,
 				 int minimum_score,
 				 struct dir_rename_info *info,
+				 struct strset *relevant_sources,
 				 struct strset *dirs_removed)
 {
 	/*
@@ -839,6 +861,11 @@ static int find_basename_matches(struct diff_options *options,
 		intptr_t src_index;
 		intptr_t dst_index;
 
+		/* Skip irrelevant sources */
+		if (relevant_sources &&
+		    !strset_contains(relevant_sources, filename))
+			continue;
+
 		/*
 		 * If the basename is unique among remaining sources, then
 		 * src_index will equal 'i' and we can attempt to match it
@@ -1164,7 +1191,7 @@ void diffcore_rename_extended(struct diff_options *options,
 
 		/* Preparation for basename-driven matching. */
 		trace2_region_enter("diff", "dir rename setup", options->repo);
-		initialize_dir_rename_info(&info,
+		initialize_dir_rename_info(&info, relevant_sources,
 					   dirs_removed, dir_rename_count);
 		trace2_region_leave("diff", "dir rename setup", options->repo);
 
@@ -1172,7 +1199,9 @@ void diffcore_rename_extended(struct diff_options *options,
 		trace2_region_enter("diff", "basename matches", options->repo);
 		rename_count += find_basename_matches(options,
 						      min_basename_score,
-						      &info, dirs_removed);
+						      &info,
+						      relevant_sources,
+						      dirs_removed);
 		trace2_region_leave("diff", "basename matches", options->repo);
 
 		/*
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/8] Optimization batch 9: avoid detecting irrelevant renames
  2021-02-28  3:58 [PATCH 0/8] Optimization batch 9: avoid detecting irrelevant renames Elijah Newren via GitGitGadget
                   ` (7 preceding siblings ...)
  2021-02-28  3:58 ` [PATCH 8/8] diffcore-rename: avoid doing basename comparisons for irrelevant sources Elijah Newren via GitGitGadget
@ 2021-03-01 16:39 ` Elijah Newren
  2021-03-04  7:54 ` Elijah Newren
  2021-03-09  0:09 ` [PATCH v2 " Elijah Newren via GitGitGadget
  10 siblings, 0 replies; 39+ messages in thread
From: Elijah Newren @ 2021-03-01 16:39 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget
  Cc: Git Mailing List, Derrick Stolee, Jonathan Tan, Taylor Blau

On Sat, Feb 27, 2021 at 7:58 PM Elijah Newren via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> This series depends textually on ort-perf-batch-8, but semantically it's
> almost completely unrelated and can be reviewed without any familiarity with
> any of the previous patch series.
>
> === Basic Optimization idea ===
>
> This series determines paths which meet special cases where detection of
> renames is irrelevant, where the irrelevance is due to the fact that the
> merge machinery will arrive at the same result regardless of whether a
> rename is detected for any of those paths. This series represents
> "Optimization #2" from my Git Merge 2020 talk[1], though this series has
> some improvements on the optimization relative to what I had at that time.
>
> The basic idea here is that if side A of history:
>
>  * only modifies/adds/deletes a few files
>  * adds new files to few if any of the directories that side B deleted or
>    renamed
>
> then when we do rename detection on side B we can avoid even looking at most
> (and perhaps even all) paths that side B deleted. Since commits being
> rebased or cherry-picked tend to only modify a few files, this optimization
> tends to be particularly effective for rebases and cherry-picks.
>
> Basing rename detection on what the other side of history did to a file
> means that extra information needs to be fed from merge-ort to
> diffcore-rename in order to take advantage of such an optimization.
>
> === Comparison to previous series ===
>
> This series differs from my two previous optimizations[2][3] (focusing on
> basename-guided rename detection) in two important aspects:
>
>  * there are no behavioral changes (there is no heuristic involved)
>
>  * this optimization is merge specific (it does not help the diff/status/log
>    family of commands, just merge/rebase/cherry-pick and such)
>
> === Results ===
>
> For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
> performance work; instrument with trace2_region_* calls", 2020-10-28), the
> changes in just this series improves the performance as follows:
>
>                      Before Series           After Series
> no-renames:       12.596 s ±  0.061 s     5.680 s ±  0.096 s
> mega-renames:    130.465 s ±  0.259 s    13.812 s ±  0.162 s
> just-one-mega:     3.958 s ±  0.010 s   506.0  ms ±  3.9  ms
>
>
> However, interestingly, if we had ignored the basename-guided rename
> detection optimizations[2][3], then this optimization series would have
> improved the performance as follows:
>
>                Before Basename Series   After Just This Series
> no-renames:      13.815 s ±  0.062 s      5.728 s ±  0.104 s
> mega-renames:  1799.937 s ±  0.493 s     18.213 s ±  0.139 s
> just-one-mega    51.289 s ±  0.019 s    891.9  ms ±  7.0  ms
>
>
> As a reminder, before any merge-ort/diffcore-rename performance work, the
> performance results we started with (as noted in the same commit message)
> were:
>
> no-renames-am:      6.940 s ±  0.485 s
> no-renames:        18.912 s ±  0.174 s
> mega-renames:    5964.031 s ± 10.459 s
> just-one-mega:    149.583 s ±  0.751 s
>
>
> === Competition between optimizations ===
>
> We now have three major rename-related optimizations:
>
>  * exact rename detection
>  * basename-guided rename detection[2][3]
>  * skip-because-unnecessary (this series)
>
> It is possible for all three to potentially apply for specific paths (they
> do for the majority of renamed paths in our testcases), but we cannot use
> more than one for any given path. It turns out that the priority we give
> each optimization is very important and can drastically affect performance.
> We get best results by prioritizing them as follows:
>
>  1. exact rename detection
>  2. skip-because-unnecessary
>  3. basename-guided rename detection
>
> The third-to-last patch of this series also discusses this ordering and
> another minor variant of the skip-because-unnecessary optimization that was
> tried (and resulted in less effective performance gains than reported here),
> as well as some of the preparatory work over the past few years that this
> series relies on in order to enable this optimization.

Oops.  When I restructured the series I carefully re-read the commit
messages to make sure I didn't forget to update one...but I apparently
forgot to update the cover letter.  The discussion was actually split
across a few patches by the refactoring, and what is now the
third-to-last patch doesn't contain any of that discussion.

> === Near optimal? ===
>
> You may remember that there was a row labelled "everything else" from the
> commit message of 557ac0350d that represented the maximum possible speed-up
> from accelerating rename detection alone; as stated in that commit, those
> rows represented how fast the code could be if we had somehow infinitely
> parallelized the inexact rename detection. However, if you compare those
> "maximum speedup" numbers to what we have above, you'll note that the
> infinitely parallelized inexact rename detection would have been slightly
> slower than the results we have now achieved. (The reason this is possible,
> despite the fact that we still spend time in rename detection after our
> optimizations, is because we implemented two optimizations outside of
> diffcore_rename() along the way.) However, this good news does also come
> with a downside -- it means that our remaining optimization potential is
> somewhat limited, and subsequent optimization series will have to fight for
> much smaller gains.
>
> [1]
> https://github.com/newren/presentations/blob/pdfs/merge-performance/merge-performance-slides.pdf
> [2]
> https://lore.kernel.org/git/pull.843.git.1612651937.gitgitgadget@gmail.com/
> [3]
> https://lore.kernel.org/git/pull.844.git.1613289544.gitgitgadget@gmail.com/
>
> Elijah Newren (8):
>   diffcore-rename: enable filtering possible rename sources
>   merge-ort: precompute subset of sources for which we need rename
>     detection
>   merge-ort: add data structures for an alternate tree traversal
>   merge-ort: introduce wrappers for alternate tree traversal
>   merge-ort: precompute whether directory rename detection is needed
>   merge-ort: use relevant_sources to filter possible rename sources
>   merge-ort: skip rename detection entirely if possible
>   diffcore-rename: avoid doing basename comparisons for irrelevant
>     sources
>
>  diffcore-rename.c |  63 ++++++++++---
>  diffcore.h        |   1 +
>  merge-ort.c       | 236 +++++++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 285 insertions(+), 15 deletions(-)
>
>
> base-commit: 4be565c472088d4144063b736308bf2a57331f45
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-845%2Fnewren%2Fort-perf-batch-9-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-845/newren/ort-perf-batch-9-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/845
> --
> gitgitgadget

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 0/8] Optimization batch 9: avoid detecting irrelevant renames
  2021-02-28  3:58 [PATCH 0/8] Optimization batch 9: avoid detecting irrelevant renames Elijah Newren via GitGitGadget
                   ` (8 preceding siblings ...)
  2021-03-01 16:39 ` [PATCH 0/8] Optimization batch 9: avoid detecting irrelevant renames Elijah Newren
@ 2021-03-04  7:54 ` Elijah Newren
  2021-03-09  0:09 ` [PATCH v2 " Elijah Newren via GitGitGadget
  10 siblings, 0 replies; 39+ messages in thread
From: Elijah Newren @ 2021-03-04  7:54 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget
  Cc: Git Mailing List, Derrick Stolee, Jonathan Tan, Taylor Blau

Hi,

On Sat, Feb 27, 2021 at 7:58 PM Elijah Newren via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> This series depends textually on ort-perf-batch-8, but semantically it's
> almost completely unrelated and can be reviewed without any familiarity with
> any of the previous patch series.
>
> === Basic Optimization idea ===
>
> The basic idea here is that if side A of history:
>
>  * only modifies/adds/deletes a few files
>  * adds new files to few if any of the directories that side B deleted or
>    renamed
>
> then when we do rename detection on side B we can avoid even looking at most
> (and perhaps even all) paths that side B deleted. Since commits being
> rebased or cherry-picked tend to only modify a few files, this optimization
> tends to be particularly effective for rebases and cherry-picks.

After thinking it over a bit more, there's a much better way to put
this summary:

    We only need expensive rename detection on the subset of files
changed on *both* sides (for the most part).

This is because:

1. The primary reason for rename detection in merges is enabling
three-way content merges
2. The purpose of three-way content merges is reconciling changes when
*both* sides of history modified some file
3. If a file was only modified by the side that renamed the file, then
detecting the rename is irrelevant; we'll get the same answer without
knowing about the rename.
4. (Well...there are rare cases where we need the rename for reasons
other than three-way content merges.  Patch 5 explains those.)

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v2 0/8] Optimization batch 9: avoid detecting irrelevant renames
  2021-02-28  3:58 [PATCH 0/8] Optimization batch 9: avoid detecting irrelevant renames Elijah Newren via GitGitGadget
                   ` (9 preceding siblings ...)
  2021-03-04  7:54 ` Elijah Newren
@ 2021-03-09  0:09 ` Elijah Newren via GitGitGadget
  2021-03-09  0:09   ` [PATCH v2 1/8] diffcore-rename: enable filtering possible rename sources Elijah Newren via GitGitGadget
                     ` (11 more replies)
  10 siblings, 12 replies; 39+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-03-09  0:09 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Elijah Newren,
	Elijah Newren

This series depends textually on ort-perf-batch-8, but semantically it's
almost completely unrelated and can be reviewed without any familiarity with
any of the previous patch series.

There are no changes since v1; it's just a resend just over a week later to
bump it so it isn't lost.

=== Basic Optimization idea ===

This series determines paths which meet special cases where detection of
renames is irrelevant, where the irrelevance is due to the fact that the
merge machinery will arrive at the same result regardless of whether a
rename is detected for any of those paths. This series represents
"Optimization #2" from my Git Merge 2020 talk[1], though this series has
some improvements on the optimization relative to what I had at that time.

The basic idea here is:

We only need expensive rename detection on the subset of files changed on
both sides of history (for the most part).

This is because:

 1. The primary reason for rename detection in merges is enabling three-way
    content merges
 2. The purpose of three-way content merges is reconciling changes when

both sides of history modified some file 3. If a file was only modified by
the side that renamed the file, then detecting the rename is irrelevant;
we'll get the same answer without knowing about the rename. 4. (Well...there
are rare cases where we need the rename for reasons other than three-way
content merges. Patch 5 explains those.)

Since commits being rebased or cherry-picked tend to only modify a few
files, this optimization tends to be particularly effective for rebases and
cherry-picks.

Basing rename detection on what the other side of history did to a file
means that extra information needs to be fed from merge-ort to
diffcore-rename in order to take advantage of such an optimization.

=== Comparison to previous series ===

This series differs from my two previous optimizations[2][3] (focusing on
basename-guided rename detection) in two important aspects:

 * there are no behavioral changes (there is no heuristic involved)
 * this optimization is merge specific (it does not help the diff/status/log
   family of commands, just merge/rebase/cherry-pick and such)

=== Results ===

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28), the
changes in just this series improves the performance as follows:

                     Before Series           After Series
no-renames:       12.596 s ±  0.061 s     5.680 s ±  0.096 s
mega-renames:    130.465 s ±  0.259 s    13.812 s ±  0.162 s
just-one-mega:     3.958 s ±  0.010 s   506.0  ms ±  3.9  ms


However, interestingly, if we had ignored the basename-guided rename
detection optimizations[2][3], then this optimization series would have
improved the performance as follows:

               Before Basename Series   After Just This Series
no-renames:      13.815 s ±  0.062 s      5.728 s ±  0.104 s
mega-renames:  1799.937 s ±  0.493 s     18.213 s ±  0.139 s
just-one-mega    51.289 s ±  0.019 s    891.9  ms ±  7.0  ms


As a reminder, before any merge-ort/diffcore-rename performance work, the
performance results we started with (as noted in the same commit message)
were:

no-renames-am:      6.940 s ±  0.485 s
no-renames:        18.912 s ±  0.174 s
mega-renames:    5964.031 s ± 10.459 s
just-one-mega:    149.583 s ±  0.751 s


=== Competition between optimizations ===

We now have three major rename-related optimizations:

 * exact rename detection
 * basename-guided rename detection[2][3]
 * skip-because-unnecessary (this series)

It is possible for all three to potentially apply for specific paths (they
do for the majority of renamed paths in our testcases), but we cannot use
more than one for any given path. It turns out that the priority we give
each optimization is very important and can drastically affect performance.
We get best results by prioritizing them as follows:

 1. exact rename detection
 2. skip-because-unnecessary
 3. basename-guided rename detection

Some of the commit messages discuss this ordering and another minor variant
of the skip-because-unnecessary optimization that was tried (and resulted in
less effective performance gains than reported here), as well as some of the
preparatory work over the past few years that this series relies on in order
to enable this optimization.

=== Near optimal? ===

You may remember that there was a row labelled "everything else" from the
commit message of 557ac0350d that represented the maximum possible speed-up
from accelerating rename detection alone; as stated in that commit, those
rows represented how fast the code could be if we had somehow infinitely
parallelized the inexact rename detection. However, if you compare those
"maximum speedup" numbers to what we have above, you'll note that the
infinitely parallelized inexact rename detection would have been slightly
slower than the results we have now achieved. (The reason this is possible,
despite the fact that we still spend time in rename detection after our
optimizations, is because we implemented two optimizations outside of
diffcore_rename() along the way.) However, this good news does also come
with a downside -- it means that our remaining optimization potential is
somewhat limited, and subsequent optimization series will have to fight for
much smaller gains.

[1]
https://github.com/newren/presentations/blob/pdfs/merge-performance/merge-performance-slides.pdf
[2]
https://lore.kernel.org/git/pull.843.git.1612651937.gitgitgadget@gmail.com/
[3]
https://lore.kernel.org/git/pull.844.git.1613289544.gitgitgadget@gmail.com/

Elijah Newren (8):
  diffcore-rename: enable filtering possible rename sources
  merge-ort: precompute subset of sources for which we need rename
    detection
  merge-ort: add data structures for an alternate tree traversal
  merge-ort: introduce wrappers for alternate tree traversal
  merge-ort: precompute whether directory rename detection is needed
  merge-ort: use relevant_sources to filter possible rename sources
  merge-ort: skip rename detection entirely if possible
  diffcore-rename: avoid doing basename comparisons for irrelevant
    sources

 diffcore-rename.c |  63 ++++++++++---
 diffcore.h        |   1 +
 merge-ort.c       | 236 +++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 285 insertions(+), 15 deletions(-)


base-commit: 4be565c472088d4144063b736308bf2a57331f45
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-845%2Fnewren%2Fort-perf-batch-9-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-845/newren/ort-perf-batch-9-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/845

Range-diff vs v1:

 1:  064fa5de1e20 = 1:  dab8e3c6aee5 diffcore-rename: enable filtering possible rename sources
 2:  69b42a41e83a = 2:  33c231331744 merge-ort: precompute subset of sources for which we need rename detection
 3:  042ce66011ef = 3:  89b43c9f75a0 merge-ort: add data structures for an alternate tree traversal
 4:  7673e4c23bbb = 4:  6497050c0012 merge-ort: introduce wrappers for alternate tree traversal
 5:  8dbf0a452545 = 5:  608d5a4c6ad7 merge-ort: precompute whether directory rename detection is needed
 6:  6b20977a5a81 = 6:  d62fdee45ad3 merge-ort: use relevant_sources to filter possible rename sources
 7:  d5486ab28462 = 7:  cd931286f24d merge-ort: skip rename detection entirely if possible
 8:  8fed92b62f37 = 8:  c443ba8abb89 diffcore-rename: avoid doing basename comparisons for irrelevant sources

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v2 1/8] diffcore-rename: enable filtering possible rename sources
  2021-03-09  0:09 ` [PATCH v2 " Elijah Newren via GitGitGadget
@ 2021-03-09  0:09   ` Elijah Newren via GitGitGadget
  2021-03-09 22:21     ` Derrick Stolee
  2021-03-09  0:09   ` [PATCH v2 2/8] merge-ort: precompute subset of sources for which we need rename detection Elijah Newren via GitGitGadget
                     ` (10 subsequent siblings)
  11 siblings, 1 reply; 39+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-03-09  0:09 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Elijah Newren,
	Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

Add the ability to diffcore_rename_extended() to allow external callers
to declare that they only need renames detected for a subset of source
files, and use that information to skip detecting renames for them.

There are two important pieces to this optimization that may not be
obvious at first glance:

  * We do not require callers to just filter the filepairs out
    to remove the non-relevant sources, because exact rename detection
    is fast and when it finds a match it can remove both a source and a
    destination whereas the relevant_sources filter can only remove a
    source.

  * We need to filter out the source pairs in a preliminary pass instead
    of adding a
       strset_contains(relevant_sources, one->path)
    check within the nested matrix loop.  The reason for that is if we
    have 30k renames, doing 30k * 30k = 900M strset_contains() calls
    becomes extraordinarily expensive and defeats the performance gains
    from this change; we only want to do 30k such calls instead.

If callers pass NULL for relevant_sources, that is special cases to
treat all sources as relevant.  Since all callers currently pass NULL,
this optimization does not yet have any effect.  Subsequent commits will
have merge-ort compute a set of relevant_sources to restrict which
sources we detect renames for, and have merge-ort pass that set of
relevant_sources to diffcore_rename_extended().

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 26 +++++++++++++++++++-------
 diffcore.h        |  1 +
 merge-ort.c       |  1 +
 3 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 1fe902ed2af0..7f6115fd9018 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -991,11 +991,12 @@ static int find_renames(struct diff_score *mx,
 	return count;
 }
 
-static void remove_unneeded_paths_from_src(int detecting_copies)
+static void remove_unneeded_paths_from_src(int detecting_copies,
+					   struct strset *interesting)
 {
 	int i, new_num_src;
 
-	if (detecting_copies)
+	if (detecting_copies && !interesting)
 		return; /* nothing to remove */
 	if (break_idx)
 		return; /* culling incompatible with break detection */
@@ -1022,12 +1023,18 @@ static void remove_unneeded_paths_from_src(int detecting_copies)
 	 *      from rename_src here.
 	 */
 	for (i = 0, new_num_src = 0; i < rename_src_nr; i++) {
+		struct diff_filespec *one = rename_src[i].p->one;
+
 		/*
 		 * renames are stored in rename_dst, so if a rename has
 		 * already been detected using this source, we can just
 		 * remove the source knowing rename_dst has its info.
 		 */
-		if (rename_src[i].p->one->rename_used)
+		if (!detecting_copies && one->rename_used)
+			continue;
+
+		/* If we don't care about the source path, skip it */
+		if (interesting && !strset_contains(interesting, one->path))
 			continue;
 
 		if (new_num_src < i)
@@ -1040,6 +1047,7 @@ static void remove_unneeded_paths_from_src(int detecting_copies)
 }
 
 void diffcore_rename_extended(struct diff_options *options,
+			      struct strset *relevant_sources,
 			      struct strset *dirs_removed,
 			      struct strmap *dir_rename_count)
 {
@@ -1060,6 +1068,8 @@ void diffcore_rename_extended(struct diff_options *options,
 	want_copies = (detect_rename == DIFF_DETECT_COPY);
 	if (dirs_removed && (break_idx || want_copies))
 		BUG("dirs_removed incompatible with break/copy detection");
+	if (break_idx && relevant_sources)
+		BUG("break detection incompatible with source specification");
 	if (!minimum_score)
 		minimum_score = DEFAULT_RENAME_SCORE;
 
@@ -1127,9 +1137,10 @@ void diffcore_rename_extended(struct diff_options *options,
 		/*
 		 * Cull sources:
 		 *   - remove ones corresponding to exact renames
+		 *   - remove ones not found in relevant_sources
 		 */
 		trace2_region_enter("diff", "cull after exact", options->repo);
-		remove_unneeded_paths_from_src(want_copies);
+		remove_unneeded_paths_from_src(want_copies, relevant_sources);
 		trace2_region_leave("diff", "cull after exact", options->repo);
 	} else {
 		/* Determine minimum score to match basenames */
@@ -1148,7 +1159,7 @@ void diffcore_rename_extended(struct diff_options *options,
 		 *   - remove ones involved in renames (found via exact match)
 		 */
 		trace2_region_enter("diff", "cull after exact", options->repo);
-		remove_unneeded_paths_from_src(want_copies);
+		remove_unneeded_paths_from_src(want_copies, NULL);
 		trace2_region_leave("diff", "cull after exact", options->repo);
 
 		/* Preparation for basename-driven matching. */
@@ -1167,9 +1178,10 @@ void diffcore_rename_extended(struct diff_options *options,
 		/*
 		 * Cull sources, again:
 		 *   - remove ones involved in renames (found via basenames)
+		 *   - remove ones not found in relevant_sources
 		 */
 		trace2_region_enter("diff", "cull basename", options->repo);
-		remove_unneeded_paths_from_src(want_copies);
+		remove_unneeded_paths_from_src(want_copies, relevant_sources);
 		trace2_region_leave("diff", "cull basename", options->repo);
 	}
 
@@ -1342,5 +1354,5 @@ void diffcore_rename_extended(struct diff_options *options,
 
 void diffcore_rename(struct diff_options *options)
 {
-	diffcore_rename_extended(options, NULL, NULL);
+	diffcore_rename_extended(options, NULL, NULL, NULL);
 }
diff --git a/diffcore.h b/diffcore.h
index c6ba64abd198..737c93a6cc79 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -166,6 +166,7 @@ void partial_clear_dir_rename_count(struct strmap *dir_rename_count);
 void diffcore_break(struct repository *, int);
 void diffcore_rename(struct diff_options *);
 void diffcore_rename_extended(struct diff_options *options,
+			      struct strset *relevant_sources,
 			      struct strset *dirs_removed,
 			      struct strmap *dir_rename_count);
 void diffcore_merge_broken(void);
diff --git a/merge-ort.c b/merge-ort.c
index 467404cc0a35..aba0b9fa54c3 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -2029,6 +2029,7 @@ static void detect_regular_renames(struct merge_options *opt,
 	diff_queued_diff = renames->pairs[side_index];
 	trace2_region_enter("diff", "diffcore_rename", opt->repo);
 	diffcore_rename_extended(&diff_opts,
+				 NULL,
 				 &renames->dirs_removed[side_index],
 				 &renames->dir_rename_count[side_index]);
 	trace2_region_leave("diff", "diffcore_rename", opt->repo);
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v2 2/8] merge-ort: precompute subset of sources for which we need rename detection
  2021-03-09  0:09 ` [PATCH v2 " Elijah Newren via GitGitGadget
  2021-03-09  0:09   ` [PATCH v2 1/8] diffcore-rename: enable filtering possible rename sources Elijah Newren via GitGitGadget
@ 2021-03-09  0:09   ` Elijah Newren via GitGitGadget
  2021-03-09  0:09   ` [PATCH v2 3/8] merge-ort: add data structures for an alternate tree traversal Elijah Newren via GitGitGadget
                     ` (9 subsequent siblings)
  11 siblings, 0 replies; 39+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-03-09  0:09 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Elijah Newren,
	Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

rename detection works by trying to pair all file deletions (or
"sources") with all file additions (or "destinations"), checking
similarity, and then marking the sufficiently similar ones as renames.
This can be expensive if there are many sources and destinations on a
given side of history as it results in an N x M comparison matrix.
However, there are many cases where we can compute in advance that
detecting renames for some of the sources provides no useful information
and thus that we can exclude those sources from the matrix.

To see why, first note that the merge machinery uses detected renames in
two ways:

   * directory rename detection: when one side of history renames a
       directory, and the other side of history adds new files to that
       directory, we want to be able to warn the user about the need to
       chose whether those new files stay in the old directory or move
       to the new one.

   * three-way content merging: in order to do three-way content merging
       of files, we need three different file versions.  If one side of
       history renamed a file, then some of the content for the file is
       found under a different path than in the merge base or on the
       other side of history.

This commit concentrates just on the three-way content merging; it will
punt and mark all sources as needed for directory rename detection, and
leave it to future commits to narrow that down more.

The point of three-way content merging is to reconcile changes made on
*both* sides of history.  What if the file wasn't modified on both
sides?  There are two possibilities:

   * If it wasn't modified on the renamed side:
       -> then we get to do exact rename detection, which is cheap.

   * If it wasn't modified on the unrenamed side:
       -> then detection of a rename for that source file is irrelevant

That latter claim might be surprising at first, so let's walk through a
case to show why rename detection for that source file is irrelevant.
Let's use two filenames, old.c & new.c, with the following abbreviated
object ids (and where the value '000000' is used to denote that the file
is missing in that commit):

                 old.c     new.c
   MERGE_BASE:   01d01d    000000
   MERGE_SIDE1:  01d01d    000000
   MERGE_SIDE2:  000000    5e1ec7

If the rename *isn't* detected:
   then old.c looks like it was unmodified on one side and deleted on
   the other and should thus be removed.  new.c looks like a new file we
   should keep as-is.

If the rename *is* detected:
   then a three-way content merge is done.  Since the version of the
   file in MERGE_BASE and MERGE_SIDE1 are identical, the three-way merge
   will produce exactly the version of the file whose abbreviated
   object id is 5e1ec7.  It will record that file at the path new.c,
   while removing old.c from the directory.

Note that these two results are identical -- a single file named 'new.c'
with object id 5e1ec7.  In other words, it doesn't matter if the rename
is detected in the case where the file is unmodified on the unrenamed
side.

Use this information to compute whether we need rename detection for
each source created in add_pair().

It's probably worth noting that there used to be a few other edge or
corner cases besides three-way content merges and directory rename
detection where lack of rename detection could have affected the result,
but those cases actually highlighted where conflict resolution methods
were not consistent with each other.  Fixing those inconsistencies were
thus critically important to enabling this optimization.  That work
involved the following:

 * bringing consistency to add/add, rename/add, and rename/rename
    conflict types, as done back in the topic merged at commit
    ac193e0e0a ("Merge branch 'en/merge-path-collision'", 2019-01-04),
    and further extended in commits 2a7c16c980 ("t6422, t6426: be more
    flexible for add/add conflicts involving renames", 2020-08-10) and
    e8eb99d4a6 ("t642[23]: be more flexible for add/add conflicts
    involving pair renames", 2020-08-10)

  * making rename/delete more consistent with modify/delete
    as done in commits 1f3c9ba707 ("t6425: be more flexible with
    rename/delete conflict messages", 2020-08-10) and 727c75b23f
    ("t6404, t6423: expect improved rename/delete handling in ort
    backend", 2020-10-26)

Since the set of relevant_sources we compute has not yet been narrowed
down for directory rename detection, we do not pass it to
diffcore_rename_extended() yet.  That will be done after subsequent
commits narrow down the list of relevant_sources needed for directory
rename detection reasons.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 35 ++++++++++++++++++++++++++++++++---
 1 file changed, 32 insertions(+), 3 deletions(-)

diff --git a/merge-ort.c b/merge-ort.c
index aba0b9fa54c3..83aa4c08121f 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -88,6 +88,20 @@ struct rename_info {
 	 */
 	struct strmap dir_renames[3];
 
+	/*
+	 * relevant_sources: deleted paths for which we need rename detection
+	 *
+	 * relevant_sources is a set of deleted paths on each side of
+	 * history for which we need rename detection.  If a path is deleted
+	 * on one side of history, we need to detect if it is part of a
+	 * rename if either
+	 *    * we need to detect renames for an ancestor directory
+	 *    * the file is modified/deleted on the other side of history
+	 * If neither of those are true, we can skip rename detection for
+	 * that path.
+	 */
+	struct strset relevant_sources[3];
+
 	/*
 	 * needed_limit: value needed for inexact rename detection to run
 	 *
@@ -358,6 +372,8 @@ static void clear_or_reinit_internal_opts(struct merge_options_internal *opti,
 			strmap_clear(&renames->dir_rename_count[i], 1);
 
 		strmap_func(&renames->dir_renames[i], 0);
+
+		strset_func(&renames->relevant_sources[i]);
 	}
 
 	if (!reinitialize) {
@@ -533,12 +549,21 @@ static void add_pair(struct merge_options *opt,
 		     struct name_entry *names,
 		     const char *pathname,
 		     unsigned side,
-		     unsigned is_add /* if false, is_delete */)
+		     unsigned is_add /* if false, is_delete */,
+		     unsigned match_mask)
 {
 	struct diff_filespec *one, *two;
 	struct rename_info *renames = &opt->priv->renames;
 	int names_idx = is_add ? side : 0;
 
+	if (!is_add) {
+		unsigned content_relevant = (match_mask == 0);
+		unsigned location_relevant = 1; /* FIXME: compute this */
+
+		if (content_relevant || location_relevant)
+			strset_add(&renames->relevant_sources[side], pathname);
+	}
+
 	one = alloc_filespec(pathname);
 	two = alloc_filespec(pathname);
 	fill_filespec(is_add ? two : one,
@@ -575,11 +600,13 @@ static void collect_rename_info(struct merge_options *opt,
 
 		/* Check for deletion on side */
 		if ((filemask & 1) && !(filemask & side_mask))
-			add_pair(opt, names, fullname, side, 0 /* delete */);
+			add_pair(opt, names, fullname, side, 0 /* delete */,
+				 match_mask & filemask);
 
 		/* Check for addition on side */
 		if (!(filemask & 1) && (filemask & side_mask))
-			add_pair(opt, names, fullname, side, 1 /* add */);
+			add_pair(opt, names, fullname, side, 1 /* add */,
+				 match_mask & filemask);
 	}
 }
 
@@ -3228,6 +3255,8 @@ static void merge_start(struct merge_options *opt, struct merge_result *result)
 					 NULL, 1);
 		strmap_init_with_options(&renames->dir_renames[i],
 					 NULL, 0);
+		strset_init_with_options(&renames->relevant_sources[i],
+					 NULL, 0);
 	}
 
 	/*
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v2 3/8] merge-ort: add data structures for an alternate tree traversal
  2021-03-09  0:09 ` [PATCH v2 " Elijah Newren via GitGitGadget
  2021-03-09  0:09   ` [PATCH v2 1/8] diffcore-rename: enable filtering possible rename sources Elijah Newren via GitGitGadget
  2021-03-09  0:09   ` [PATCH v2 2/8] merge-ort: precompute subset of sources for which we need rename detection Elijah Newren via GitGitGadget
@ 2021-03-09  0:09   ` Elijah Newren via GitGitGadget
  2021-03-09  0:09   ` [PATCH v2 4/8] merge-ort: introduce wrappers for " Elijah Newren via GitGitGadget
                     ` (8 subsequent siblings)
  11 siblings, 0 replies; 39+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-03-09  0:09 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Elijah Newren,
	Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

In order to determine whether directory rename detection is needed, we
as a pre-requisite need a way to traverse through all the files in a
given tree before visiting any directories within that tree.
traverse_trees() only iterates through the entries in the order they
appear, so add some data structures that will store all the entries as
we iterate through them in traverse_trees(), which will allow us to
re-traverse them in our desired order.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/merge-ort.c b/merge-ort.c
index 83aa4c08121f..d49cfa8b030b 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -51,6 +51,12 @@ enum merge_side {
 	MERGE_SIDE2 = 2
 };
 
+struct traversal_callback_data {
+	unsigned long mask;
+	unsigned long dirmask;
+	struct name_entry names[3];
+};
+
 struct rename_info {
 	/*
 	 * All variables that are arrays of size 3 correspond to data tracked
@@ -102,6 +108,22 @@ struct rename_info {
 	 */
 	struct strset relevant_sources[3];
 
+	/*
+	 * callback_data_*: supporting data structures for alternate traversal
+	 *
+	 * We sometimes need to be able to traverse through all the files
+	 * in a given tree before all immediate subdirectories within that
+	 * tree.  Since traverse_trees() doesn't do that naturally, we have
+	 * a traverse_trees_wrapper() that stores any immediate
+	 * subdirectories while traversing files, then traverses the
+	 * immediate subdirectories later.  These callback_data* variables
+	 * store the information for the subdirectories so that we can do
+	 * that traversal order.
+	 */
+	struct traversal_callback_data *callback_data;
+	int callback_data_nr, callback_data_alloc;
+	char *callback_data_traverse_path;
+
 	/*
 	 * needed_limit: value needed for inexact rename detection to run
 	 *
@@ -396,6 +418,10 @@ static void clear_or_reinit_internal_opts(struct merge_options_internal *opti,
 		}
 		strmap_clear(&opti->output, 0);
 	}
+
+	/* Clean out callback_data as well. */
+	FREE_AND_NULL(renames->callback_data);
+	renames->callback_data_nr = renames->callback_data_alloc = 0;
 }
 
 static int err(struct merge_options *opt, const char *err, ...)
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v2 4/8] merge-ort: introduce wrappers for alternate tree traversal
  2021-03-09  0:09 ` [PATCH v2 " Elijah Newren via GitGitGadget
                     ` (2 preceding siblings ...)
  2021-03-09  0:09   ` [PATCH v2 3/8] merge-ort: add data structures for an alternate tree traversal Elijah Newren via GitGitGadget
@ 2021-03-09  0:09   ` Elijah Newren via GitGitGadget
  2021-03-09 23:06     ` Derrick Stolee
  2021-03-09  0:09   ` [PATCH v2 5/8] merge-ort: precompute whether directory rename detection is needed Elijah Newren via GitGitGadget
                     ` (7 subsequent siblings)
  11 siblings, 1 reply; 39+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-03-09  0:09 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Elijah Newren,
	Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

Add traverse_trees_wrapper() and traverse_trees_wrapper_callback()
functions.  The former runs traverse_trees() with info->fn set to
traverse_trees_wrapper_callback, in order to simply save all the entries
without processing or recursing into any of them.  This step allows
extra computation to be done (e.g. checking some condition across all
files) that can be used later.  Then, after that is completed, it
iterates over all the saved entries and calls the original info->fn
callback with the saved data.

Currently, this does nothing more than marginally slowing down the tree
traversal since we do not take advantage of the opportunity to compute
anything special in traverse_trees_wrapper_callback(), and thus the real
callback will be called identically as it would have been without this
extra wrapper.  However, a subsequent commit will add some special
computation of some values that the real callback will be able to use.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 72 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 72 insertions(+)

diff --git a/merge-ort.c b/merge-ort.c
index d49cfa8b030b..bd2b93a31141 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -512,6 +512,78 @@ static char *unique_path(struct strmap *existing_paths,
 
 /*** Function Grouping: functions related to collect_merge_info() ***/
 
+static int traverse_trees_wrapper_callback(int n,
+					   unsigned long mask,
+					   unsigned long dirmask,
+					   struct name_entry *names,
+					   struct traverse_info *info)
+{
+	struct merge_options *opt = info->data;
+	struct rename_info *renames = &opt->priv->renames;
+
+	assert(n==3);
+
+	if (!renames->callback_data_traverse_path)
+		renames->callback_data_traverse_path = xstrdup(info->traverse_path);
+
+	ALLOC_GROW(renames->callback_data, renames->callback_data_nr + 1,
+		   renames->callback_data_alloc);
+	renames->callback_data[renames->callback_data_nr].mask = mask;
+	renames->callback_data[renames->callback_data_nr].dirmask = dirmask;
+	COPY_ARRAY(renames->callback_data[renames->callback_data_nr].names,
+		   names, 3);
+	renames->callback_data_nr++;
+
+	return mask;
+}
+
+/*
+ * Much like traverse_trees(), BUT:
+ *   - read all the tree entries FIRST, saving them
+ *   - note that the above step provides an opportunity to compute necessary
+ *     additional details before the "real" traversal
+ *   - loop through the saved entries and call the original callback on them
+ */
+MAYBE_UNUSED
+static int traverse_trees_wrapper(struct index_state *istate,
+				  int n,
+				  struct tree_desc *t,
+				  struct traverse_info *info)
+{
+	int ret, i, old_offset;
+	traverse_callback_t old_fn;
+	char *old_callback_data_traverse_path;
+	struct merge_options *opt = info->data;
+	struct rename_info *renames = &opt->priv->renames;
+
+	old_callback_data_traverse_path = renames->callback_data_traverse_path;
+	old_fn = info->fn;
+	old_offset = renames->callback_data_nr;
+
+	renames->callback_data_traverse_path = NULL;
+	info->fn = traverse_trees_wrapper_callback;
+	ret = traverse_trees(istate, n, t, info);
+	if (ret < 0)
+		return ret;
+
+	info->traverse_path = renames->callback_data_traverse_path;
+	info->fn = old_fn;
+	for (i = old_offset; i < renames->callback_data_nr; ++i) {
+		info->fn(n,
+			 renames->callback_data[i].mask,
+			 renames->callback_data[i].dirmask,
+			 renames->callback_data[i].names,
+			 info);
+
+	}
+
+	renames->callback_data_nr = old_offset;
+	free(renames->callback_data_traverse_path);
+	renames->callback_data_traverse_path = old_callback_data_traverse_path;
+	info->traverse_path = NULL;
+	return 0;
+}
+
 static void setup_path_info(struct merge_options *opt,
 			    struct string_list_item *result,
 			    const char *current_dir_name,
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v2 5/8] merge-ort: precompute whether directory rename detection is needed
  2021-03-09  0:09 ` [PATCH v2 " Elijah Newren via GitGitGadget
                     ` (3 preceding siblings ...)
  2021-03-09  0:09   ` [PATCH v2 4/8] merge-ort: introduce wrappers for " Elijah Newren via GitGitGadget
@ 2021-03-09  0:09   ` Elijah Newren via GitGitGadget
  2021-03-09  0:09   ` [PATCH v2 6/8] merge-ort: use relevant_sources to filter possible rename sources Elijah Newren via GitGitGadget
                     ` (6 subsequent siblings)
  11 siblings, 0 replies; 39+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-03-09  0:09 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Elijah Newren,
	Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

The point of directory rename detection is that if one side of history
renames a directory, and the other side adds new files under the old
directory, then the merge can move those new files into the new
directory.  This leads to the following important observation:

  * If the other side does not add any new files under the old
    directory, we do not need to detect any renames for that directory.

Similarly, directory rename detection had an important requirement:

  * If a directory still exists on one side of history, it has not been
    renamed on that side of history.  (See section 4 of t6423 or
    Documentation/technical/directory-rename-detection.txt for more
    details).

Using these two bits of information, we note that directory rename
detection is only needed in cases where (1) directories exist in the
merge base and on one side of history (i.e. dirmask == 3 or dirmask ==
5), and (2) where there is some new file added to that directory on the
side where it still exists (thus where the file has filemask == 2 or
filemask == 4, respectively).  This has to be done in two steps, because
we have the dirmask when we are first considering the directory, and
won't get the filemasks for the files within it until we recurse into
that directory.  So, we save
  dir_rename_mask = dirmask - 1
when we hit a directory that is missing on one side, and then later look
for cases of
  filemask == dir_rename_mask

One final note is that as soon as we hit a directory that needs
directory rename detection, we will need to detect renames in all
subdirectories of that directory as well due to the "majority rules"
decision when files are renamed into different directory hierarchies.
We arbitrarily use the special value of 0x07 to record when we've hit
such a directory.

The combination of all the above mean that we introduce a variable
named dir_rename_mask (couldn't think of a better name) which has one
of the following values as we traverse into a directory:
   * 0x00: directory rename detection not needed
   * 0x02 or 0x04: directory rename detection only needed if files added
   * 0x07: directory rename detection definitely needed

We then pass this value through to add_pairs() so that it can mark
location_relevant as true only when dir_rename_mask is 0x07.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 67 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 61 insertions(+), 6 deletions(-)

diff --git a/merge-ort.c b/merge-ort.c
index bd2b93a31141..27acaa7380be 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -108,6 +108,14 @@ struct rename_info {
 	 */
 	struct strset relevant_sources[3];
 
+	/*
+	 * dir_rename_mask:
+	 *   0: optimization removing unmodified potential rename source okay
+	 *   2 or 4: optimization okay, but must check for files added to dir
+	 *   7: optimization forbidden; need rename source in case of dir rename
+	 */
+	unsigned dir_rename_mask:3;
+
 	/*
 	 * callback_data_*: supporting data structures for alternate traversal
 	 *
@@ -419,6 +427,8 @@ static void clear_or_reinit_internal_opts(struct merge_options_internal *opti,
 		strmap_clear(&opti->output, 0);
 	}
 
+	renames->dir_rename_mask = 0;
+
 	/* Clean out callback_data as well. */
 	FREE_AND_NULL(renames->callback_data);
 	renames->callback_data_nr = renames->callback_data_alloc = 0;
@@ -520,12 +530,16 @@ static int traverse_trees_wrapper_callback(int n,
 {
 	struct merge_options *opt = info->data;
 	struct rename_info *renames = &opt->priv->renames;
+	unsigned filemask = mask & ~dirmask;
 
 	assert(n==3);
 
 	if (!renames->callback_data_traverse_path)
 		renames->callback_data_traverse_path = xstrdup(info->traverse_path);
 
+	if (filemask && filemask == renames->dir_rename_mask)
+		renames->dir_rename_mask = 0x07;
+
 	ALLOC_GROW(renames->callback_data, renames->callback_data_nr + 1,
 		   renames->callback_data_alloc);
 	renames->callback_data[renames->callback_data_nr].mask = mask;
@@ -544,7 +558,6 @@ static int traverse_trees_wrapper_callback(int n,
  *     additional details before the "real" traversal
  *   - loop through the saved entries and call the original callback on them
  */
-MAYBE_UNUSED
 static int traverse_trees_wrapper(struct index_state *istate,
 				  int n,
 				  struct tree_desc *t,
@@ -556,6 +569,8 @@ static int traverse_trees_wrapper(struct index_state *istate,
 	struct merge_options *opt = info->data;
 	struct rename_info *renames = &opt->priv->renames;
 
+	assert(renames->dir_rename_mask == 2 || renames->dir_rename_mask == 4);
+
 	old_callback_data_traverse_path = renames->callback_data_traverse_path;
 	old_fn = info->fn;
 	old_offset = renames->callback_data_nr;
@@ -648,7 +663,8 @@ static void add_pair(struct merge_options *opt,
 		     const char *pathname,
 		     unsigned side,
 		     unsigned is_add /* if false, is_delete */,
-		     unsigned match_mask)
+		     unsigned match_mask,
+		     unsigned dir_rename_mask)
 {
 	struct diff_filespec *one, *two;
 	struct rename_info *renames = &opt->priv->renames;
@@ -656,7 +672,7 @@ static void add_pair(struct merge_options *opt,
 
 	if (!is_add) {
 		unsigned content_relevant = (match_mask == 0);
-		unsigned location_relevant = 1; /* FIXME: compute this */
+		unsigned location_relevant = (dir_rename_mask == 0x07);
 
 		if (content_relevant || location_relevant)
 			strset_add(&renames->relevant_sources[side], pathname);
@@ -680,6 +696,36 @@ static void collect_rename_info(struct merge_options *opt,
 	struct rename_info *renames = &opt->priv->renames;
 	unsigned side;
 
+	/*
+	 * Update dir_rename_mask (determines ignore-rename-source validity)
+	 *
+	 * dir_rename_mask helps us keep track of when directory rename
+	 * detection may be relevant.  Basically, whenver a directory is
+	 * removed on one side of history, and a file is added to that
+	 * directory on the other side of history, directory rename
+	 * detection is relevant (meaning we have to detect renames for all
+	 * files within that directory to deduce where the directory
+	 * moved).  Also, whenever a directory needs directory rename
+	 * detection, due to the "majority rules" choice for where to move
+	 * it (see t6423 testcase 1f), we also need to detect renames for
+	 * all files within subdirectories of that directory as well.
+	 *
+	 * Here we haven't looked at files within the directory yet, we are
+	 * just looking at the directory itself.  So, if we aren't yet in
+	 * a case where a parent directory needed directory rename detection
+	 * (i.e. dir_rename_mask != 0x07), and if the directory was removed
+	 * on one side of history, record the mask of the other side of
+	 * history in dir_rename_mask.
+	 */
+	if (renames->dir_rename_mask != 0x07 &&
+	    (dirmask == 3 || dirmask == 5)) {
+		/* simple sanity check */
+		assert(renames->dir_rename_mask == 0 ||
+		       renames->dir_rename_mask == (dirmask & ~1));
+		/* update dir_rename_mask; have it record mask of new side */
+		renames->dir_rename_mask = (dirmask & ~1);
+	}
+
 	/* Update dirs_removed, as needed */
 	if (dirmask == 1 || dirmask == 3 || dirmask == 5) {
 		/* absent_mask = 0x07 - dirmask; sides = absent_mask/2 */
@@ -699,12 +745,14 @@ static void collect_rename_info(struct merge_options *opt,
 		/* Check for deletion on side */
 		if ((filemask & 1) && !(filemask & side_mask))
 			add_pair(opt, names, fullname, side, 0 /* delete */,
-				 match_mask & filemask);
+				 match_mask & filemask,
+				 renames->dir_rename_mask);
 
 		/* Check for addition on side */
 		if (!(filemask & 1) && (filemask & side_mask))
 			add_pair(opt, names, fullname, side, 1 /* add */,
-				 match_mask & filemask);
+				 match_mask & filemask,
+				 renames->dir_rename_mask);
 	}
 }
 
@@ -722,12 +770,14 @@ static int collect_merge_info_callback(int n,
 	 */
 	struct merge_options *opt = info->data;
 	struct merge_options_internal *opti = opt->priv;
+	struct rename_info *renames = &opt->priv->renames;
 	struct string_list_item pi;  /* Path Info */
 	struct conflict_info *ci; /* typed alias to pi.util (which is void*) */
 	struct name_entry *p;
 	size_t len;
 	char *fullpath;
 	const char *dirname = opti->current_dir_name;
+	unsigned prev_dir_rename_mask = renames->dir_rename_mask;
 	unsigned filemask = mask & ~dirmask;
 	unsigned match_mask = 0; /* will be updated below */
 	unsigned mbase_null = !(mask & 1);
@@ -868,8 +918,13 @@ static int collect_merge_info_callback(int n,
 
 		original_dir_name = opti->current_dir_name;
 		opti->current_dir_name = pi.string;
-		ret = traverse_trees(NULL, 3, t, &newinfo);
+		if (renames->dir_rename_mask == 0 ||
+		    renames->dir_rename_mask == 0x07)
+			ret = traverse_trees(NULL, 3, t, &newinfo);
+		else
+			ret = traverse_trees_wrapper(NULL, 3, t, &newinfo);
 		opti->current_dir_name = original_dir_name;
+		renames->dir_rename_mask = prev_dir_rename_mask;
 
 		for (i = MERGE_BASE; i <= MERGE_SIDE2; i++)
 			free(buf[i]);
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v2 6/8] merge-ort: use relevant_sources to filter possible rename sources
  2021-03-09  0:09 ` [PATCH v2 " Elijah Newren via GitGitGadget
                     ` (4 preceding siblings ...)
  2021-03-09  0:09   ` [PATCH v2 5/8] merge-ort: precompute whether directory rename detection is needed Elijah Newren via GitGitGadget
@ 2021-03-09  0:09   ` Elijah Newren via GitGitGadget
  2021-03-09  0:09   ` [PATCH v2 7/8] merge-ort: skip rename detection entirely if possible Elijah Newren via GitGitGadget
                     ` (5 subsequent siblings)
  11 siblings, 0 replies; 39+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-03-09  0:09 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Elijah Newren,
	Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

The past several commits determined conditions when rename sources might
be needed, and filled a relevant_sources strset with those paths.  Pass
these along to diffcore_rename_extended() to use to limit the sources
that we need to detect renames for.

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:

                            Before                  After
    no-renames:       12.596 s ±  0.061 s     6.003 s ±  0.048 s
    mega-renames:    130.465 s ±  0.259 s   114.009 s ±  0.236 s
    just-one-mega:     3.958 s ±  0.010 s     3.489 s ±  0.017 s

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/merge-ort.c b/merge-ort.c
index 27acaa7380be..66892c63cee2 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -2209,7 +2209,7 @@ static void detect_regular_renames(struct merge_options *opt,
 	diff_queued_diff = renames->pairs[side_index];
 	trace2_region_enter("diff", "diffcore_rename", opt->repo);
 	diffcore_rename_extended(&diff_opts,
-				 NULL,
+				 &renames->relevant_sources[side_index],
 				 &renames->dirs_removed[side_index],
 				 &renames->dir_rename_count[side_index]);
 	trace2_region_leave("diff", "diffcore_rename", opt->repo);
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v2 7/8] merge-ort: skip rename detection entirely if possible
  2021-03-09  0:09 ` [PATCH v2 " Elijah Newren via GitGitGadget
                     ` (5 preceding siblings ...)
  2021-03-09  0:09   ` [PATCH v2 6/8] merge-ort: use relevant_sources to filter possible rename sources Elijah Newren via GitGitGadget
@ 2021-03-09  0:09   ` Elijah Newren via GitGitGadget
  2021-03-09 22:51     ` Derrick Stolee
  2021-03-09  0:09   ` [PATCH v2 8/8] diffcore-rename: avoid doing basename comparisons for irrelevant sources Elijah Newren via GitGitGadget
                     ` (4 subsequent siblings)
  11 siblings, 1 reply; 39+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-03-09  0:09 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Elijah Newren,
	Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

diffcore_rename_extended() will do a bunch of setup, then check for
exact renames, then abort before inexact rename detection if there are
no more sources or destinations that need to be matched.  It will
sometimes be the case, however, that either
  * we start with neither any sources or destinations
  * we start with no *relevant* sources
In the first of these two cases, the setup and exact rename detection
will be very cheap since there are 0 files to operate on.  In the second
case, it is quite possible to have thousands of files with none of the
source ones being relevant.  Avoid calling diffcore_rename_extended() or
even some of the setup before diffcore_rename_extended() when we can
determine that rename detection is unnecessary.

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:

                            Before                  After
    no-renames:        6.003 s ±  0.048 s     5.708 s ±  0.111 s
    mega-renames:    114.009 s ±  0.236 s   102.171 s ±  0.440 s
    just-one-mega:     3.489 s ±  0.017 s     3.471 s ±  0.015 s

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/merge-ort.c b/merge-ort.c
index 66892c63cee2..8602c7b8936f 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -2158,6 +2158,19 @@ static int process_renames(struct merge_options *opt,
 	return clean_merge;
 }
 
+static inline int possible_side_renames(struct rename_info *renames,
+					unsigned side_index)
+{
+	return renames->pairs[side_index].nr > 0 &&
+	       !strset_empty(&renames->relevant_sources[side_index]);
+}
+
+static inline int possible_renames(struct rename_info *renames)
+{
+	return possible_side_renames(renames, 1) ||
+	       possible_side_renames(renames, 2);
+}
+
 static void resolve_diffpair_statuses(struct diff_queue_struct *q)
 {
 	/*
@@ -2194,6 +2207,16 @@ static void detect_regular_renames(struct merge_options *opt,
 	struct diff_options diff_opts;
 	struct rename_info *renames = &opt->priv->renames;
 
+	if (!possible_side_renames(renames, side_index)) {
+		/*
+		 * No rename detection needed for this side, but we still need
+		 * to make sure 'adds' are marked correctly in case the other
+		 * side had directory renames.
+		 */
+		resolve_diffpair_statuses(&renames->pairs[side_index]);
+		return;
+	}
+
 	repo_diff_setup(opt->repo, &diff_opts);
 	diff_opts.flags.recursive = 1;
 	diff_opts.flags.rename_empty = 0;
@@ -2311,6 +2334,8 @@ static int detect_and_process_renames(struct merge_options *opt,
 	int need_dir_renames, s, clean = 1;
 
 	memset(&combined, 0, sizeof(combined));
+	if (!possible_renames(renames))
+		goto cleanup;
 
 	trace2_region_enter("merge", "regular renames", opt->repo);
 	detect_regular_renames(opt, MERGE_SIDE1);
@@ -2345,6 +2370,26 @@ static int detect_and_process_renames(struct merge_options *opt,
 	clean &= process_renames(opt, &combined);
 	trace2_region_leave("merge", "process renames", opt->repo);
 
+	goto simple_cleanup; /* collect_renames() handles some of cleanup */
+
+cleanup:
+	/*
+	 * Free now unneeded filepairs, which would have been handled
+	 * in collect_renames() normally but we're about to skip that
+	 * code...
+	 */
+	for (s = MERGE_SIDE1; s <= MERGE_SIDE2; s++) {
+		struct diff_queue_struct *side_pairs;
+		int i;
+
+		side_pairs = &renames->pairs[s];
+		for (i = 0; i < side_pairs->nr; ++i) {
+			struct diff_filepair *p = side_pairs->queue[i];
+			diff_free_filepair(p);
+		}
+	}
+
+simple_cleanup:
 	/* Free memory for renames->pairs[] and combined */
 	for (s = MERGE_SIDE1; s <= MERGE_SIDE2; s++) {
 		free(renames->pairs[s].queue);
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v2 8/8] diffcore-rename: avoid doing basename comparisons for irrelevant sources
  2021-03-09  0:09 ` [PATCH v2 " Elijah Newren via GitGitGadget
                     ` (6 preceding siblings ...)
  2021-03-09  0:09   ` [PATCH v2 7/8] merge-ort: skip rename detection entirely if possible Elijah Newren via GitGitGadget
@ 2021-03-09  0:09   ` Elijah Newren via GitGitGadget
  2021-03-09 22:08   ` [PATCH v2 0/8] Optimization batch 9: avoid detecting irrelevant renames Derrick Stolee
                     ` (3 subsequent siblings)
  11 siblings, 0 replies; 39+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-03-09  0:09 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Elijah Newren,
	Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

The basename comparison optimization implemented in
find_basename_matches() is very beneficial since it allows a source to
sometimes only be compared with one other file instead of N other files.
When a match is found, both a source and destination can be removed from
the matrix of inexact rename comparisons.  In contrast, the irrelevant
source optimization only allows us to remove a source from the matrix of
inexact rename comparisons...but it has the advantage of allowing a
source file to not even be loaded into memory at all and be compared to
0 other files.  Generally, not even comparing is a bigger performance
win, so when both optimizations could apply, prefer to use the
irrelevant-source optimization.

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:

                            Before                  After
    no-renames:        5.708 s ±  0.111 s     5.680 s ±  0.096 s
    mega-renames:    102.171 s ±  0.440 s    13.812 s ±  0.162 s
    just-one-mega:     3.471 s ±  0.015 s   506.0  ms ±  3.9  ms

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 37 +++++++++++++++++++++++++++++++++----
 1 file changed, 33 insertions(+), 4 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 7f6115fd9018..e8508541be14 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -527,6 +527,7 @@ static void update_dir_rename_counts(struct dir_rename_info *info,
 }
 
 static void initialize_dir_rename_info(struct dir_rename_info *info,
+				       struct strset *relevant_sources,
 				       struct strset *dirs_removed,
 				       struct strmap *dir_rename_count)
 {
@@ -534,7 +535,7 @@ static void initialize_dir_rename_info(struct dir_rename_info *info,
 	struct strmap_entry *entry;
 	int i;
 
-	if (!dirs_removed) {
+	if (!dirs_removed && !relevant_sources) {
 		info->setup = 0;
 		return;
 	}
@@ -549,7 +550,20 @@ static void initialize_dir_rename_info(struct dir_rename_info *info,
 	strmap_init_with_options(&info->dir_rename_guess, NULL, 0);
 
 	/* Setup info->relevant_source_dirs */
-	info->relevant_source_dirs = dirs_removed;
+	info->relevant_source_dirs = NULL;
+	if (dirs_removed || !relevant_sources) {
+		info->relevant_source_dirs = dirs_removed; /* might be NULL */
+	} else {
+		info->relevant_source_dirs = xmalloc(sizeof(struct strintmap));
+		strset_init(info->relevant_source_dirs);
+		strset_for_each_entry(relevant_sources, &iter, entry) {
+			char *dirname = get_dirname(entry->key);
+			if (!dirs_removed ||
+			    strset_contains(dirs_removed, dirname))
+				strset_add(info->relevant_source_dirs, dirname);
+			free(dirname);
+		}
+	}
 
 	/*
 	 * Loop setting up both info->idx_map, and doing setup of
@@ -627,6 +641,13 @@ static void cleanup_dir_rename_info(struct dir_rename_info *info,
 	/* dir_rename_guess */
 	strmap_clear(&info->dir_rename_guess, 1);
 
+	/* relevant_source_dirs */
+	if (info->relevant_source_dirs &&
+	    info->relevant_source_dirs != dirs_removed) {
+		strset_clear(info->relevant_source_dirs);
+		FREE_AND_NULL(info->relevant_source_dirs);
+	}
+
 	/* dir_rename_count */
 	if (!keep_dir_rename_count) {
 		partial_clear_dir_rename_count(info->dir_rename_count);
@@ -749,6 +770,7 @@ static int idx_possible_rename(char *filename, struct dir_rename_info *info)
 static int find_basename_matches(struct diff_options *options,
 				 int minimum_score,
 				 struct dir_rename_info *info,
+				 struct strset *relevant_sources,
 				 struct strset *dirs_removed)
 {
 	/*
@@ -839,6 +861,11 @@ static int find_basename_matches(struct diff_options *options,
 		intptr_t src_index;
 		intptr_t dst_index;
 
+		/* Skip irrelevant sources */
+		if (relevant_sources &&
+		    !strset_contains(relevant_sources, filename))
+			continue;
+
 		/*
 		 * If the basename is unique among remaining sources, then
 		 * src_index will equal 'i' and we can attempt to match it
@@ -1164,7 +1191,7 @@ void diffcore_rename_extended(struct diff_options *options,
 
 		/* Preparation for basename-driven matching. */
 		trace2_region_enter("diff", "dir rename setup", options->repo);
-		initialize_dir_rename_info(&info,
+		initialize_dir_rename_info(&info, relevant_sources,
 					   dirs_removed, dir_rename_count);
 		trace2_region_leave("diff", "dir rename setup", options->repo);
 
@@ -1172,7 +1199,9 @@ void diffcore_rename_extended(struct diff_options *options,
 		trace2_region_enter("diff", "basename matches", options->repo);
 		rename_count += find_basename_matches(options,
 						      min_basename_score,
-						      &info, dirs_removed);
+						      &info,
+						      relevant_sources,
+						      dirs_removed);
 		trace2_region_leave("diff", "basename matches", options->repo);
 
 		/*
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 0/8] Optimization batch 9: avoid detecting irrelevant renames
  2021-03-09  0:09 ` [PATCH v2 " Elijah Newren via GitGitGadget
                     ` (7 preceding siblings ...)
  2021-03-09  0:09   ` [PATCH v2 8/8] diffcore-rename: avoid doing basename comparisons for irrelevant sources Elijah Newren via GitGitGadget
@ 2021-03-09 22:08   ` Derrick Stolee
  2021-03-10 15:08   ` Derrick Stolee
                     ` (2 subsequent siblings)
  11 siblings, 0 replies; 39+ messages in thread
From: Derrick Stolee @ 2021-03-09 22:08 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget, git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Elijah Newren

On 3/8/2021 7:09 PM, Elijah Newren via GitGitGadget wrote:> The basic idea here is:
> 
> We only need expensive rename detection on the subset of files changed on
> both sides of history (for the most part).
> 
> This is because:
> 
>  1. The primary reason for rename detection in merges is enabling three-way
>     content merges
>  2. The purpose of three-way content merges is reconciling changes when
> 
> both sides of history modified some file 3. If a file was only modified by
> the side that renamed the file, then detecting the rename is irrelevant;
> we'll get the same answer without knowing about the rename. 4. (Well...there
> are rare cases where we need the rename for reasons other than three-way
> content merges. Patch 5 explains those.)

Makes sense. Don't compute information you won't need. I look forward to
trying to figure out the special cases here.
 
>                      Before Series           After Series
> no-renames:       12.596 s ±  0.061 s     5.680 s ±  0.096 s
> mega-renames:    130.465 s ±  0.259 s    13.812 s ±  0.162 s
> just-one-mega:     3.958 s ±  0.010 s   506.0  ms ±  3.9  ms

These are _very_ impressive numbers for such a "simple" idea.
 
> However, interestingly, if we had ignored the basename-guided rename
> detection optimizations[2][3], then this optimization series would have
> improved the performance as follows:
> 
>                Before Basename Series   After Just This Series
> no-renames:      13.815 s ±  0.062 s      5.728 s ±  0.104 s
> mega-renames:  1799.937 s ±  0.493 s     18.213 s ±  0.139 s
> just-one-mega    51.289 s ±  0.019 s    891.9  ms ±  7.0  ms

And here it is even more impressive. I see that your optimizations are
having combined effects but also are doing valuable things on their
own.

> We get best results by prioritizing them as follows:
> 
>  1. exact rename detection
>  2. skip-because-unnecessary
>  3. basename-guided rename detection

This makes sense to me, since even the basename-guided rename is
doing some non-trivial work. It would be good to reduce that
effort.

> it means that our remaining optimization potential is
> somewhat limited, and subsequent optimization series will have to fight for
> much smaller gains.

This is a good place to end up. Let the code rest for a bit after
we are done here, and maybe we'll find new cases to care about
later. We could chase the long tail forever, but these steps are
a huge accomplishment!

Getting to reading now.

-Stolee

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 1/8] diffcore-rename: enable filtering possible rename sources
  2021-03-09  0:09   ` [PATCH v2 1/8] diffcore-rename: enable filtering possible rename sources Elijah Newren via GitGitGadget
@ 2021-03-09 22:21     ` Derrick Stolee
  2021-03-09 22:40       ` Elijah Newren
  0 siblings, 1 reply; 39+ messages in thread
From: Derrick Stolee @ 2021-03-09 22:21 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget, git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Elijah Newren

On 3/8/2021 7:09 PM, Elijah Newren via GitGitGadget wrote:> @@ -1127,9 +1137,10 @@ void diffcore_rename_extended(struct diff_options *options,
>  		/*
>  		 * Cull sources:
>  		 *   - remove ones corresponding to exact renames
> +		 *   - remove ones not found in relevant_sources
>  		 */
>  		trace2_region_enter("diff", "cull after exact", options->repo);
> -		remove_unneeded_paths_from_src(want_copies);
> +		remove_unneeded_paths_from_src(want_copies, relevant_sources);
>  		trace2_region_leave("diff", "cull after exact", options->repo);

In this case, we are checking for copies. Perhaps there is a reason
why we want to include relevant_sources in that case, so I'll look
for it later.

>  	} else {
>  		/* Determine minimum score to match basenames */
> @@ -1148,7 +1159,7 @@ void diffcore_rename_extended(struct diff_options *options,
>  		 *   - remove ones involved in renames (found via exact match)
>  		 */
>  		trace2_region_enter("diff", "cull after exact", options->repo);
> -		remove_unneeded_paths_from_src(want_copies);
> +		remove_unneeded_paths_from_src(want_copies, NULL);
>  		trace2_region_leave("diff", "cull after exact", options->repo);
>  
>  		/* Preparation for basename-driven matching. */
> @@ -1167,9 +1178,10 @@ void diffcore_rename_extended(struct diff_options *options,
>  		/*
>  		 * Cull sources, again:
>  		 *   - remove ones involved in renames (found via basenames)
> +		 *   - remove ones not found in relevant_sources
>  		 */
>  		trace2_region_enter("diff", "cull basename", options->repo);
> -		remove_unneeded_paths_from_src(want_copies);
> +		remove_unneeded_paths_from_src(want_copies, relevant_sources);
>  		trace2_region_leave("diff", "cull basename", options->repo);

This seems backwards from your cover letter. You are using the exact renames
_and_ basename matches to remove the unneeded paths. Why are we not stripping
out the relevant_sources in the call further up, before we call
find_basename_matches()?

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 1/8] diffcore-rename: enable filtering possible rename sources
  2021-03-09 22:21     ` Derrick Stolee
@ 2021-03-09 22:40       ` Elijah Newren
  2021-03-09 22:45         ` Derrick Stolee
  0 siblings, 1 reply; 39+ messages in thread
From: Elijah Newren @ 2021-03-09 22:40 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee,
	Jonathan Tan, Taylor Blau, Junio C Hamano,
	Ævar Arnfjörð Bjarmason

On Tue, Mar 9, 2021 at 2:21 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 3/8/2021 7:09 PM, Elijah Newren via GitGitGadget wrote:> @@ -1127,9 +1137,10 @@ void diffcore_rename_extended(struct diff_options *options,
> >               /*
> >                * Cull sources:
> >                *   - remove ones corresponding to exact renames
> > +              *   - remove ones not found in relevant_sources
> >                */
> >               trace2_region_enter("diff", "cull after exact", options->repo);
> > -             remove_unneeded_paths_from_src(want_copies);
> > +             remove_unneeded_paths_from_src(want_copies, relevant_sources);
> >               trace2_region_leave("diff", "cull after exact", options->repo);
>
> In this case, we are checking for copies. Perhaps there is a reason
> why we want to include relevant_sources in that case, so I'll look
> for it later.

There is no current caller making use of this.  I could have just
marked the two options as incompatible and passed NULL here for
relevant sources, but...I didn't see a compelling reason to.  If some
caller wanted copy detection and were only interested in copies from a
subset of sources, they could take advantage of relevant_sources in
the same way to limit to those paths that are relevant.  I guess it
comes down to whether you view this as designing the code for future
theoretical cases, or explicitly taking steps to limit future
possibilities.  I don't have a strong opinion here, but tried to
enable other future uses since diffcore-rename.c is used by many other
callers too.

> >       } else {
> >               /* Determine minimum score to match basenames */
> > @@ -1148,7 +1159,7 @@ void diffcore_rename_extended(struct diff_options *options,
> >                *   - remove ones involved in renames (found via exact match)
> >                */
> >               trace2_region_enter("diff", "cull after exact", options->repo);
> > -             remove_unneeded_paths_from_src(want_copies);
> > +             remove_unneeded_paths_from_src(want_copies, NULL);
> >               trace2_region_leave("diff", "cull after exact", options->repo);
> >
> >               /* Preparation for basename-driven matching. */
> > @@ -1167,9 +1178,10 @@ void diffcore_rename_extended(struct diff_options *options,
> >               /*
> >                * Cull sources, again:
> >                *   - remove ones involved in renames (found via basenames)
> > +              *   - remove ones not found in relevant_sources
> >                */
> >               trace2_region_enter("diff", "cull basename", options->repo);
> > -             remove_unneeded_paths_from_src(want_copies);
> > +             remove_unneeded_paths_from_src(want_copies, relevant_sources);
> >               trace2_region_leave("diff", "cull basename", options->repo);
>
> This seems backwards from your cover letter. You are using the exact renames
> _and_ basename matches to remove the unneeded paths. Why are we not stripping
> out the relevant_sources in the call further up, before we call
> find_basename_matches()?

Yeah, good flag.  I should add a comment to the commit message about
how these are still needed in initialize_dir_rename_info() for
basename-guided directory rename detection...and will also play a role
in some of the special cases where renames are needed for more than
three-way content merges.  And add a comment about how by the end of
the series, relevant_sources will be passed to find_basename_matches()
so that it can skip over those paths.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 1/8] diffcore-rename: enable filtering possible rename sources
  2021-03-09 22:40       ` Elijah Newren
@ 2021-03-09 22:45         ` Derrick Stolee
  0 siblings, 0 replies; 39+ messages in thread
From: Derrick Stolee @ 2021-03-09 22:45 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee,
	Jonathan Tan, Taylor Blau, Junio C Hamano,
	Ævar Arnfjörð Bjarmason

On 3/9/2021 5:40 PM, Elijah Newren wrote:
> On Tue, Mar 9, 2021 at 2:21 PM Derrick Stolee <stolee@gmail.com> wrote:
>>
>> On 3/8/2021 7:09 PM, Elijah Newren via GitGitGadget wrote:
>>> @@ -1167,9 +1178,10 @@ void diffcore_rename_extended(struct diff_options *options,
>>>               /*
>>>                * Cull sources, again:
>>>                *   - remove ones involved in renames (found via basenames)
>>> +              *   - remove ones not found in relevant_sources
>>>                */
>>>               trace2_region_enter("diff", "cull basename", options->repo);
>>> -             remove_unneeded_paths_from_src(want_copies);
>>> +             remove_unneeded_paths_from_src(want_copies, relevant_sources);
>>>               trace2_region_leave("diff", "cull basename", options->repo);
>>
>> This seems backwards from your cover letter. You are using the exact renames
>> _and_ basename matches to remove the unneeded paths. Why are we not stripping
>> out the relevant_sources in the call further up, before we call
>> find_basename_matches()?
> 
> Yeah, good flag.  I should add a comment to the commit message about
> how these are still needed in initialize_dir_rename_info() for
> basename-guided directory rename detection...and will also play a role
> in some of the special cases where renames are needed for more than
> three-way content merges.  And add a comment about how by the end of
> the series, relevant_sources will be passed to find_basename_matches()
> so that it can skip over those paths.
 
Ah, so by the end, find_basename_matches() will be coupled to the
relevant_sources set directory instead of relying on the culling
happening earlier? Interesting. It seems to me like it would be
better to have them be less coupled by relying on the culling
before calling find_basename_matches(), but I'll keep reading to
see your good reason for not doing that.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 7/8] merge-ort: skip rename detection entirely if possible
  2021-03-09  0:09   ` [PATCH v2 7/8] merge-ort: skip rename detection entirely if possible Elijah Newren via GitGitGadget
@ 2021-03-09 22:51     ` Derrick Stolee
  2021-03-09 22:57       ` Elijah Newren
  0 siblings, 1 reply; 39+ messages in thread
From: Derrick Stolee @ 2021-03-09 22:51 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget, git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Elijah Newren

On 3/8/2021 7:09 PM, Elijah Newren via GitGitGadget wrote:> +	goto simple_cleanup; /* collect_renames() handles some of cleanup */
> +
> +cleanup:
> +	/*
> +	 * Free now unneeded filepairs, which would have been handled
> +	 * in collect_renames() normally but we're about to skip that
> +	 * code...
> +	 */

"but we're about to skip that code"

Haven't you skipped it already, earlier in the above part? This should
instead say "but we skipped that code", right?

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 7/8] merge-ort: skip rename detection entirely if possible
  2021-03-09 22:51     ` Derrick Stolee
@ 2021-03-09 22:57       ` Elijah Newren
  0 siblings, 0 replies; 39+ messages in thread
From: Elijah Newren @ 2021-03-09 22:57 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee,
	Jonathan Tan, Taylor Blau, Junio C Hamano,
	Ævar Arnfjörð Bjarmason

On Tue, Mar 9, 2021 at 2:51 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 3/8/2021 7:09 PM, Elijah Newren via GitGitGadget wrote:> +   goto simple_cleanup; /* collect_renames() handles some of cleanup */
> > +
> > +cleanup:
> > +     /*
> > +      * Free now unneeded filepairs, which would have been handled
> > +      * in collect_renames() normally but we're about to skip that
> > +      * code...
> > +      */
>
> "but we're about to skip that code"
>
> Haven't you skipped it already, earlier in the above part? This should
> instead say "but we skipped that code", right?

Oh, indeed.  I wonder if I moved that comment around and didn't update
it.  Anyway yes, I'll fix it.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 4/8] merge-ort: introduce wrappers for alternate tree traversal
  2021-03-09  0:09   ` [PATCH v2 4/8] merge-ort: introduce wrappers for " Elijah Newren via GitGitGadget
@ 2021-03-09 23:06     ` Derrick Stolee
  0 siblings, 0 replies; 39+ messages in thread
From: Derrick Stolee @ 2021-03-09 23:06 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget, git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Elijah Newren

On 3/8/2021 7:09 PM, Elijah Newren via GitGitGadget wrote:
> From: Elijah Newren <newren@gmail.com>
> 
> Add traverse_trees_wrapper() and traverse_trees_wrapper_callback()
> functions.  The former runs traverse_trees() with info->fn set to
> traverse_trees_wrapper_callback, in order to simply save all the entries
> without processing or recursing into any of them.  This step allows
> extra computation to be done (e.g. checking some condition across all
> files) that can be used later.  Then, after that is completed, it
> iterates over all the saved entries and calls the original info->fn
> callback with the saved data.
> 
> Currently, this does nothing more than marginally slowing down the tree
> traversal since we do not take advantage of the opportunity to compute
> anything special in traverse_trees_wrapper_callback(), and thus the real
> callback will be called identically as it would have been without this
> extra wrapper.  However, a subsequent commit will add some special
> computation of some values that the real callback will be able to use.

I'm glad that you use the normal callback in the appropriate case.

It took me a couple reads to understand this last sentence, but I think
I'm with you now.

> +	info->traverse_path = renames->callback_data_traverse_path;
> +	info->fn = old_fn;
> +	for (i = old_offset; i < renames->callback_data_nr; ++i) {
> +		info->fn(n,
> +			 renames->callback_data[i].mask,
> +			 renames->callback_data[i].dirmask,
> +			 renames->callback_data[i].names,
> +			 info);
> +

nit: extraneous newline.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 0/8] Optimization batch 9: avoid detecting irrelevant renames
  2021-03-09  0:09 ` [PATCH v2 " Elijah Newren via GitGitGadget
                     ` (8 preceding siblings ...)
  2021-03-09 22:08   ` [PATCH v2 0/8] Optimization batch 9: avoid detecting irrelevant renames Derrick Stolee
@ 2021-03-10 15:08   ` Derrick Stolee
  2021-03-11  0:38   ` [PATCH v3 " Elijah Newren via GitGitGadget
  2021-03-15 17:10   ` [PATCH v2 " Elijah Newren
  11 siblings, 0 replies; 39+ messages in thread
From: Derrick Stolee @ 2021-03-10 15:08 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget, git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Elijah Newren

On 3/8/2021 7:09 PM, Elijah Newren via GitGitGadget wrote:
> This series depends textually on ort-perf-batch-8, but semantically it's
> almost completely unrelated and can be reviewed without any familiarity with
> any of the previous patch series.
> 
> There are no changes since v1; it's just a resend just over a week later to
> bump it so it isn't lost.
> 
> === Basic Optimization idea ===
> 
> This series determines paths which meet special cases where detection of
> renames is irrelevant, where the irrelevance is due to the fact that the
> merge machinery will arrive at the same result regardless of whether a
> rename is detected for any of those paths. This series represents
> "Optimization #2" from my Git Merge 2020 talk[1], though this series has
> some improvements on the optimization relative to what I had at that time.
> 
> The basic idea here is:
> 
> We only need expensive rename detection on the subset of files changed on
> both sides of history (for the most part).

I've taken time this morning to re-read some of the patches. I have a
grasp on the idea of the optimization and the code looks well presented
and correct.

The only issue I have is that there are no additional tests to ensure that
these scenarios are being tested and are checked to return the correct
results. Naturally, it seems we are not testing the ORT strategy by default,
and if I do enable it, that causes test failures still.

I wonder how much we should be merging these optimizations before the full
test suite can pass with ORT as the default. Then, we can check to see if
small mutations can be caught by the test suite. In particular, everything
in this optimization seems to revolve around this condition in add_pair():

		if (content_relevant || location_relevant)
			strset_add(&renames->relevant_sources[side], pathname);

I'd prefer to have test cases that cover all four options for the two boolean
values on this operator. Mostly, I'd like to know that if I delete either side
of the || operator (or change it to &&) then we would have a test failure.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v3 0/8] Optimization batch 9: avoid detecting irrelevant renames
  2021-03-09  0:09 ` [PATCH v2 " Elijah Newren via GitGitGadget
                     ` (9 preceding siblings ...)
  2021-03-10 15:08   ` Derrick Stolee
@ 2021-03-11  0:38   ` Elijah Newren via GitGitGadget
  2021-03-11  0:38     ` [PATCH v3 1/8] diffcore-rename: enable filtering possible rename sources Elijah Newren via GitGitGadget
                       ` (8 more replies)
  2021-03-15 17:10   ` [PATCH v2 " Elijah Newren
  11 siblings, 9 replies; 39+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-03-11  0:38 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Elijah Newren,
	Derrick Stolee, Elijah Newren

This series determines paths which meet special cases where detection of
renames is irrelevant, where the irrelevance is due to the fact that the
merge machinery will arrive at the same result regardless of whether a
rename is detected for any of those paths.

Changes since v2:

 * Added an extended comment about filtering order to the first patch, to
   address Stolee's question
 * Added a new testcase that will fail if the critical "if (content_relevant
   || location_relevant)" check only checks for one of those two or the
   intersection of those two, as suggested by Stolee
 * Fixed the other minor problems Stolee pointed out in his review (stray
   newline, wording of comment)

Elijah Newren (8):
  diffcore-rename: enable filtering possible rename sources
  merge-ort: precompute subset of sources for which we need rename
    detection
  merge-ort: add data structures for an alternate tree traversal
  merge-ort: introduce wrappers for alternate tree traversal
  merge-ort: precompute whether directory rename detection is needed
  merge-ort: use relevant_sources to filter possible rename sources
  merge-ort: skip rename detection entirely if possible
  diffcore-rename: avoid doing basename comparisons for irrelevant
    sources

 diffcore-rename.c                   |  63 ++++++--
 diffcore.h                          |   1 +
 merge-ort.c                         | 234 +++++++++++++++++++++++++++-
 t/t6423-merge-rename-directories.sh |  71 +++++++++
 4 files changed, 354 insertions(+), 15 deletions(-)


base-commit: 4be565c472088d4144063b736308bf2a57331f45
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-845%2Fnewren%2Fort-perf-batch-9-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-845/newren/ort-perf-batch-9-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/845

Range-diff vs v2:

 1:  dab8e3c6aee5 ! 1:  3b253f5a63ed diffcore-rename: enable filtering possible rename sources
     @@ Commit message
          sources we detect renames for, and have merge-ort pass that set of
          relevant_sources to diffcore_rename_extended().
      
     +    A note about filtering order:
     +
     +    Some may be curious why we don't filter out irrelevant sources at the
     +    same time we filter out exact renames.  While that technically could be
     +    done at this point, there are two reasons to defer it:
     +
     +    First, was to reinforce a lesson that was too easy to forget.  As I
     +    mentioned above, in the past I filtered irrelevant sources out before
     +    exact rename checking, and then discovered that exact renames' ability
     +    to remove both sources and destinations was an important consideration
     +    and thus doing the filtering after exact rename checking would speed
     +    things up.  Then at some point I realized that basename matching could
     +    also remove both sources and destinations, and decided to put irrelevant
     +    source filtering after basename filtering.  That slowed things down a
     +    lot.  But, despite learning about this important ordering, in later
     +    restructuring I forgot and made the same mistake of putting the
     +    filtering after basename guided rename detection again.  So, I have this
     +    series of patches structured to do the irrelevant filtering last to
     +    start to show how much extra it costs, and then add relevant filtering
     +    in to find_basename_matches() to show how much it speeds things up.
     +    Basically, it's a way to reinforce something that apparently was too
     +    easy to forget, and make sure the commit messages record this lesson.
     +
     +    Second, the items in the "relevant_sources" in this patch series will
     +    include all sources that *might be* relevant.  It has to be conservative
     +    and catch anything that might need a rename, but in the patch series
     +    after this one we'll find ways to weed out more of the *might be*
     +    relevant ones.  Unfortunately, merge-ort does not have sufficient
     +    information to weed those ones out, and there isn't enough information
     +    at the time of filtering exact renames out to remove the extra ones
     +    either.  It has to be deferred.  So the deferral is in part to simplify
     +    some later additions.
     +
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
       ## diffcore-rename.c ##
 2:  33c231331744 ! 2:  d8378b3dde6c merge-ort: precompute subset of sources for which we need rename detection
     @@ Commit message
                 found under a different path than in the merge base or on the
                 other side of history.
      
     -    This commit concentrates just on the three-way content merging; it will
     -    punt and mark all sources as needed for directory rename detection, and
     -    leave it to future commits to narrow that down more.
     +    Add a simple testcase showing the two kinds of reasons renames are
     +    relevant; it's a testcase that will only pass if we detect both kinds of
     +    needed renames.
     +
     +    Other than the testcase added above, this commit concentrates just on
     +    the three-way content merging; it will punt and mark all sources as
     +    needed for directory rename detection, and leave it to future commits to
     +    narrow that down more.
      
          The point of three-way content merging is to reconcile changes made on
          *both* sides of history.  What if the file wasn't modified on both
     @@ merge-ort.c: static void merge_start(struct merge_options *opt, struct merge_res
       	}
       
       	/*
     +
     + ## t/t6423-merge-rename-directories.sh ##
     +@@ t/t6423-merge-rename-directories.sh: test_expect_merge_algorithm failure success '12f: Trivial directory resolve, cac
     + 	)
     + '
     + 
     ++# Testcase 12g, Testcase with two kinds of "relevant" renames
     ++#   Commit O: somefile_O, subdir/{a_O,b_O}
     ++#   Commit A: somefile_A, subdir/{a_O,b_O,c_A}
     ++#   Commit B: newfile_B,  newdir/{a_B,b_B}
     ++#   Expected: newfile_{merged}, newdir/{a_B,b_B,c_A}
     ++
     ++test_setup_12g () {
     ++	test_create_repo 12g &&
     ++	(
     ++		cd 12g &&
     ++
     ++		mkdir -p subdir &&
     ++		test_write_lines upon a time there was a >somefile &&
     ++		test_write_lines 1 2 3 4 5 6 7 8 9 10 >subdir/a &&
     ++		test_write_lines one two three four five six >subdir/b &&
     ++		git add . &&
     ++		test_tick &&
     ++		git commit -m "O" &&
     ++
     ++		git branch O &&
     ++		git branch A &&
     ++		git branch B &&
     ++
     ++		git switch A &&
     ++		test_write_lines once upon a time there was a >somefile &&
     ++		> subdir/c &&
     ++		git add somefile subdir/c &&
     ++		test_tick &&
     ++		git commit -m "A" &&
     ++
     ++		git checkout B &&
     ++		git mv somefile newfile &&
     ++		git mv subdir newdir &&
     ++		echo repo >>newfile &&
     ++		test_write_lines 1 2 3 4 5 6 7 8 9 10 11 >newdir/a &&
     ++		test_write_lines one two three four five six seven >newdir/b &&
     ++		git add newfile newdir &&
     ++		test_tick &&
     ++		git commit -m "B"
     ++	)
     ++}
     ++
     ++test_expect_success '12g: Testcase with two kinds of "relevant" renames' '
     ++	test_setup_12g &&
     ++	(
     ++		cd 12g &&
     ++
     ++		git checkout A^0 &&
     ++
     ++		git -c merge.directoryRenames=true merge -s recursive B^0 &&
     ++
     ++		test_write_lines once upon a time there was a repo >expect &&
     ++		test_cmp expect newfile &&
     ++
     ++		git ls-files -s >out &&
     ++		test_line_count = 4 out &&
     ++
     ++		git rev-parse >actual \
     ++			HEAD:newdir/a  HEAD:newdir/b   HEAD:newdir/c &&
     ++		git rev-parse >expect \
     ++			B:newdir/a     B:newdir/b      A:subdir/c &&
     ++		test_cmp expect actual &&
     ++
     ++		test_must_fail git rev-parse HEAD:subdir/a &&
     ++		test_must_fail git rev-parse HEAD:subdir/b &&
     ++		test_must_fail git rev-parse HEAD:subdir/c &&
     ++		test_path_is_missing subdir/ &&
     ++		test_path_is_file newdir/c
     ++	)
     ++'
     ++
     + ###########################################################################
     + # SECTION 13: Checking informational and conflict messages
     + #
 3:  89b43c9f75a0 = 3:  a57cc981608c merge-ort: add data structures for an alternate tree traversal
 4:  6497050c0012 ! 4:  b595245fe902 merge-ort: introduce wrappers for alternate tree traversal
     @@ merge-ort.c: static char *unique_path(struct strmap *existing_paths,
      +			 renames->callback_data[i].dirmask,
      +			 renames->callback_data[i].names,
      +			 info);
     -+
      +	}
      +
      +	renames->callback_data_nr = old_offset;
 5:  608d5a4c6ad7 = 5:  dc146a867b16 merge-ort: precompute whether directory rename detection is needed
 6:  d62fdee45ad3 = 6:  375c9b36861b merge-ort: use relevant_sources to filter possible rename sources
 7:  cd931286f24d ! 7:  80a0c27a74ad merge-ort: skip rename detection entirely if possible
     @@ merge-ort.c: static int detect_and_process_renames(struct merge_options *opt,
      +cleanup:
      +	/*
      +	 * Free now unneeded filepairs, which would have been handled
     -+	 * in collect_renames() normally but we're about to skip that
     -+	 * code...
     ++	 * in collect_renames() normally but we skipped that code.
      +	 */
      +	for (s = MERGE_SIDE1; s <= MERGE_SIDE2; s++) {
      +		struct diff_queue_struct *side_pairs;
 8:  c443ba8abb89 = 8:  98b0c7de5e70 diffcore-rename: avoid doing basename comparisons for irrelevant sources

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v3 1/8] diffcore-rename: enable filtering possible rename sources
  2021-03-11  0:38   ` [PATCH v3 " Elijah Newren via GitGitGadget
@ 2021-03-11  0:38     ` Elijah Newren via GitGitGadget
  2021-03-11  0:38     ` [PATCH v3 2/8] merge-ort: precompute subset of sources for which we need rename detection Elijah Newren via GitGitGadget
                       ` (7 subsequent siblings)
  8 siblings, 0 replies; 39+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-03-11  0:38 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Elijah Newren,
	Derrick Stolee, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

Add the ability to diffcore_rename_extended() to allow external callers
to declare that they only need renames detected for a subset of source
files, and use that information to skip detecting renames for them.

There are two important pieces to this optimization that may not be
obvious at first glance:

  * We do not require callers to just filter the filepairs out
    to remove the non-relevant sources, because exact rename detection
    is fast and when it finds a match it can remove both a source and a
    destination whereas the relevant_sources filter can only remove a
    source.

  * We need to filter out the source pairs in a preliminary pass instead
    of adding a
       strset_contains(relevant_sources, one->path)
    check within the nested matrix loop.  The reason for that is if we
    have 30k renames, doing 30k * 30k = 900M strset_contains() calls
    becomes extraordinarily expensive and defeats the performance gains
    from this change; we only want to do 30k such calls instead.

If callers pass NULL for relevant_sources, that is special cases to
treat all sources as relevant.  Since all callers currently pass NULL,
this optimization does not yet have any effect.  Subsequent commits will
have merge-ort compute a set of relevant_sources to restrict which
sources we detect renames for, and have merge-ort pass that set of
relevant_sources to diffcore_rename_extended().

A note about filtering order:

Some may be curious why we don't filter out irrelevant sources at the
same time we filter out exact renames.  While that technically could be
done at this point, there are two reasons to defer it:

First, was to reinforce a lesson that was too easy to forget.  As I
mentioned above, in the past I filtered irrelevant sources out before
exact rename checking, and then discovered that exact renames' ability
to remove both sources and destinations was an important consideration
and thus doing the filtering after exact rename checking would speed
things up.  Then at some point I realized that basename matching could
also remove both sources and destinations, and decided to put irrelevant
source filtering after basename filtering.  That slowed things down a
lot.  But, despite learning about this important ordering, in later
restructuring I forgot and made the same mistake of putting the
filtering after basename guided rename detection again.  So, I have this
series of patches structured to do the irrelevant filtering last to
start to show how much extra it costs, and then add relevant filtering
in to find_basename_matches() to show how much it speeds things up.
Basically, it's a way to reinforce something that apparently was too
easy to forget, and make sure the commit messages record this lesson.

Second, the items in the "relevant_sources" in this patch series will
include all sources that *might be* relevant.  It has to be conservative
and catch anything that might need a rename, but in the patch series
after this one we'll find ways to weed out more of the *might be*
relevant ones.  Unfortunately, merge-ort does not have sufficient
information to weed those ones out, and there isn't enough information
at the time of filtering exact renames out to remove the extra ones
either.  It has to be deferred.  So the deferral is in part to simplify
some later additions.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 26 +++++++++++++++++++-------
 diffcore.h        |  1 +
 merge-ort.c       |  1 +
 3 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 1fe902ed2af0..7f6115fd9018 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -991,11 +991,12 @@ static int find_renames(struct diff_score *mx,
 	return count;
 }
 
-static void remove_unneeded_paths_from_src(int detecting_copies)
+static void remove_unneeded_paths_from_src(int detecting_copies,
+					   struct strset *interesting)
 {
 	int i, new_num_src;
 
-	if (detecting_copies)
+	if (detecting_copies && !interesting)
 		return; /* nothing to remove */
 	if (break_idx)
 		return; /* culling incompatible with break detection */
@@ -1022,12 +1023,18 @@ static void remove_unneeded_paths_from_src(int detecting_copies)
 	 *      from rename_src here.
 	 */
 	for (i = 0, new_num_src = 0; i < rename_src_nr; i++) {
+		struct diff_filespec *one = rename_src[i].p->one;
+
 		/*
 		 * renames are stored in rename_dst, so if a rename has
 		 * already been detected using this source, we can just
 		 * remove the source knowing rename_dst has its info.
 		 */
-		if (rename_src[i].p->one->rename_used)
+		if (!detecting_copies && one->rename_used)
+			continue;
+
+		/* If we don't care about the source path, skip it */
+		if (interesting && !strset_contains(interesting, one->path))
 			continue;
 
 		if (new_num_src < i)
@@ -1040,6 +1047,7 @@ static void remove_unneeded_paths_from_src(int detecting_copies)
 }
 
 void diffcore_rename_extended(struct diff_options *options,
+			      struct strset *relevant_sources,
 			      struct strset *dirs_removed,
 			      struct strmap *dir_rename_count)
 {
@@ -1060,6 +1068,8 @@ void diffcore_rename_extended(struct diff_options *options,
 	want_copies = (detect_rename == DIFF_DETECT_COPY);
 	if (dirs_removed && (break_idx || want_copies))
 		BUG("dirs_removed incompatible with break/copy detection");
+	if (break_idx && relevant_sources)
+		BUG("break detection incompatible with source specification");
 	if (!minimum_score)
 		minimum_score = DEFAULT_RENAME_SCORE;
 
@@ -1127,9 +1137,10 @@ void diffcore_rename_extended(struct diff_options *options,
 		/*
 		 * Cull sources:
 		 *   - remove ones corresponding to exact renames
+		 *   - remove ones not found in relevant_sources
 		 */
 		trace2_region_enter("diff", "cull after exact", options->repo);
-		remove_unneeded_paths_from_src(want_copies);
+		remove_unneeded_paths_from_src(want_copies, relevant_sources);
 		trace2_region_leave("diff", "cull after exact", options->repo);
 	} else {
 		/* Determine minimum score to match basenames */
@@ -1148,7 +1159,7 @@ void diffcore_rename_extended(struct diff_options *options,
 		 *   - remove ones involved in renames (found via exact match)
 		 */
 		trace2_region_enter("diff", "cull after exact", options->repo);
-		remove_unneeded_paths_from_src(want_copies);
+		remove_unneeded_paths_from_src(want_copies, NULL);
 		trace2_region_leave("diff", "cull after exact", options->repo);
 
 		/* Preparation for basename-driven matching. */
@@ -1167,9 +1178,10 @@ void diffcore_rename_extended(struct diff_options *options,
 		/*
 		 * Cull sources, again:
 		 *   - remove ones involved in renames (found via basenames)
+		 *   - remove ones not found in relevant_sources
 		 */
 		trace2_region_enter("diff", "cull basename", options->repo);
-		remove_unneeded_paths_from_src(want_copies);
+		remove_unneeded_paths_from_src(want_copies, relevant_sources);
 		trace2_region_leave("diff", "cull basename", options->repo);
 	}
 
@@ -1342,5 +1354,5 @@ void diffcore_rename_extended(struct diff_options *options,
 
 void diffcore_rename(struct diff_options *options)
 {
-	diffcore_rename_extended(options, NULL, NULL);
+	diffcore_rename_extended(options, NULL, NULL, NULL);
 }
diff --git a/diffcore.h b/diffcore.h
index c6ba64abd198..737c93a6cc79 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -166,6 +166,7 @@ void partial_clear_dir_rename_count(struct strmap *dir_rename_count);
 void diffcore_break(struct repository *, int);
 void diffcore_rename(struct diff_options *);
 void diffcore_rename_extended(struct diff_options *options,
+			      struct strset *relevant_sources,
 			      struct strset *dirs_removed,
 			      struct strmap *dir_rename_count);
 void diffcore_merge_broken(void);
diff --git a/merge-ort.c b/merge-ort.c
index 467404cc0a35..aba0b9fa54c3 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -2029,6 +2029,7 @@ static void detect_regular_renames(struct merge_options *opt,
 	diff_queued_diff = renames->pairs[side_index];
 	trace2_region_enter("diff", "diffcore_rename", opt->repo);
 	diffcore_rename_extended(&diff_opts,
+				 NULL,
 				 &renames->dirs_removed[side_index],
 				 &renames->dir_rename_count[side_index]);
 	trace2_region_leave("diff", "diffcore_rename", opt->repo);
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v3 2/8] merge-ort: precompute subset of sources for which we need rename detection
  2021-03-11  0:38   ` [PATCH v3 " Elijah Newren via GitGitGadget
  2021-03-11  0:38     ` [PATCH v3 1/8] diffcore-rename: enable filtering possible rename sources Elijah Newren via GitGitGadget
@ 2021-03-11  0:38     ` Elijah Newren via GitGitGadget
  2021-03-11  0:38     ` [PATCH v3 3/8] merge-ort: add data structures for an alternate tree traversal Elijah Newren via GitGitGadget
                       ` (6 subsequent siblings)
  8 siblings, 0 replies; 39+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-03-11  0:38 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Elijah Newren,
	Derrick Stolee, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

rename detection works by trying to pair all file deletions (or
"sources") with all file additions (or "destinations"), checking
similarity, and then marking the sufficiently similar ones as renames.
This can be expensive if there are many sources and destinations on a
given side of history as it results in an N x M comparison matrix.
However, there are many cases where we can compute in advance that
detecting renames for some of the sources provides no useful information
and thus that we can exclude those sources from the matrix.

To see why, first note that the merge machinery uses detected renames in
two ways:

   * directory rename detection: when one side of history renames a
       directory, and the other side of history adds new files to that
       directory, we want to be able to warn the user about the need to
       chose whether those new files stay in the old directory or move
       to the new one.

   * three-way content merging: in order to do three-way content merging
       of files, we need three different file versions.  If one side of
       history renamed a file, then some of the content for the file is
       found under a different path than in the merge base or on the
       other side of history.

Add a simple testcase showing the two kinds of reasons renames are
relevant; it's a testcase that will only pass if we detect both kinds of
needed renames.

Other than the testcase added above, this commit concentrates just on
the three-way content merging; it will punt and mark all sources as
needed for directory rename detection, and leave it to future commits to
narrow that down more.

The point of three-way content merging is to reconcile changes made on
*both* sides of history.  What if the file wasn't modified on both
sides?  There are two possibilities:

   * If it wasn't modified on the renamed side:
       -> then we get to do exact rename detection, which is cheap.

   * If it wasn't modified on the unrenamed side:
       -> then detection of a rename for that source file is irrelevant

That latter claim might be surprising at first, so let's walk through a
case to show why rename detection for that source file is irrelevant.
Let's use two filenames, old.c & new.c, with the following abbreviated
object ids (and where the value '000000' is used to denote that the file
is missing in that commit):

                 old.c     new.c
   MERGE_BASE:   01d01d    000000
   MERGE_SIDE1:  01d01d    000000
   MERGE_SIDE2:  000000    5e1ec7

If the rename *isn't* detected:
   then old.c looks like it was unmodified on one side and deleted on
   the other and should thus be removed.  new.c looks like a new file we
   should keep as-is.

If the rename *is* detected:
   then a three-way content merge is done.  Since the version of the
   file in MERGE_BASE and MERGE_SIDE1 are identical, the three-way merge
   will produce exactly the version of the file whose abbreviated
   object id is 5e1ec7.  It will record that file at the path new.c,
   while removing old.c from the directory.

Note that these two results are identical -- a single file named 'new.c'
with object id 5e1ec7.  In other words, it doesn't matter if the rename
is detected in the case where the file is unmodified on the unrenamed
side.

Use this information to compute whether we need rename detection for
each source created in add_pair().

It's probably worth noting that there used to be a few other edge or
corner cases besides three-way content merges and directory rename
detection where lack of rename detection could have affected the result,
but those cases actually highlighted where conflict resolution methods
were not consistent with each other.  Fixing those inconsistencies were
thus critically important to enabling this optimization.  That work
involved the following:

 * bringing consistency to add/add, rename/add, and rename/rename
    conflict types, as done back in the topic merged at commit
    ac193e0e0a ("Merge branch 'en/merge-path-collision'", 2019-01-04),
    and further extended in commits 2a7c16c980 ("t6422, t6426: be more
    flexible for add/add conflicts involving renames", 2020-08-10) and
    e8eb99d4a6 ("t642[23]: be more flexible for add/add conflicts
    involving pair renames", 2020-08-10)

  * making rename/delete more consistent with modify/delete
    as done in commits 1f3c9ba707 ("t6425: be more flexible with
    rename/delete conflict messages", 2020-08-10) and 727c75b23f
    ("t6404, t6423: expect improved rename/delete handling in ort
    backend", 2020-10-26)

Since the set of relevant_sources we compute has not yet been narrowed
down for directory rename detection, we do not pass it to
diffcore_rename_extended() yet.  That will be done after subsequent
commits narrow down the list of relevant_sources needed for directory
rename detection reasons.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c                         | 35 ++++++++++++--
 t/t6423-merge-rename-directories.sh | 71 +++++++++++++++++++++++++++++
 2 files changed, 103 insertions(+), 3 deletions(-)

diff --git a/merge-ort.c b/merge-ort.c
index aba0b9fa54c3..83aa4c08121f 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -88,6 +88,20 @@ struct rename_info {
 	 */
 	struct strmap dir_renames[3];
 
+	/*
+	 * relevant_sources: deleted paths for which we need rename detection
+	 *
+	 * relevant_sources is a set of deleted paths on each side of
+	 * history for which we need rename detection.  If a path is deleted
+	 * on one side of history, we need to detect if it is part of a
+	 * rename if either
+	 *    * we need to detect renames for an ancestor directory
+	 *    * the file is modified/deleted on the other side of history
+	 * If neither of those are true, we can skip rename detection for
+	 * that path.
+	 */
+	struct strset relevant_sources[3];
+
 	/*
 	 * needed_limit: value needed for inexact rename detection to run
 	 *
@@ -358,6 +372,8 @@ static void clear_or_reinit_internal_opts(struct merge_options_internal *opti,
 			strmap_clear(&renames->dir_rename_count[i], 1);
 
 		strmap_func(&renames->dir_renames[i], 0);
+
+		strset_func(&renames->relevant_sources[i]);
 	}
 
 	if (!reinitialize) {
@@ -533,12 +549,21 @@ static void add_pair(struct merge_options *opt,
 		     struct name_entry *names,
 		     const char *pathname,
 		     unsigned side,
-		     unsigned is_add /* if false, is_delete */)
+		     unsigned is_add /* if false, is_delete */,
+		     unsigned match_mask)
 {
 	struct diff_filespec *one, *two;
 	struct rename_info *renames = &opt->priv->renames;
 	int names_idx = is_add ? side : 0;
 
+	if (!is_add) {
+		unsigned content_relevant = (match_mask == 0);
+		unsigned location_relevant = 1; /* FIXME: compute this */
+
+		if (content_relevant || location_relevant)
+			strset_add(&renames->relevant_sources[side], pathname);
+	}
+
 	one = alloc_filespec(pathname);
 	two = alloc_filespec(pathname);
 	fill_filespec(is_add ? two : one,
@@ -575,11 +600,13 @@ static void collect_rename_info(struct merge_options *opt,
 
 		/* Check for deletion on side */
 		if ((filemask & 1) && !(filemask & side_mask))
-			add_pair(opt, names, fullname, side, 0 /* delete */);
+			add_pair(opt, names, fullname, side, 0 /* delete */,
+				 match_mask & filemask);
 
 		/* Check for addition on side */
 		if (!(filemask & 1) && (filemask & side_mask))
-			add_pair(opt, names, fullname, side, 1 /* add */);
+			add_pair(opt, names, fullname, side, 1 /* add */,
+				 match_mask & filemask);
 	}
 }
 
@@ -3228,6 +3255,8 @@ static void merge_start(struct merge_options *opt, struct merge_result *result)
 					 NULL, 1);
 		strmap_init_with_options(&renames->dir_renames[i],
 					 NULL, 0);
+		strset_init_with_options(&renames->relevant_sources[i],
+					 NULL, 0);
 	}
 
 	/*
diff --git a/t/t6423-merge-rename-directories.sh b/t/t6423-merge-rename-directories.sh
index 4ab133f489ca..4c568050dd27 100755
--- a/t/t6423-merge-rename-directories.sh
+++ b/t/t6423-merge-rename-directories.sh
@@ -4895,6 +4895,77 @@ test_expect_merge_algorithm failure success '12f: Trivial directory resolve, cac
 	)
 '
 
+# Testcase 12g, Testcase with two kinds of "relevant" renames
+#   Commit O: somefile_O, subdir/{a_O,b_O}
+#   Commit A: somefile_A, subdir/{a_O,b_O,c_A}
+#   Commit B: newfile_B,  newdir/{a_B,b_B}
+#   Expected: newfile_{merged}, newdir/{a_B,b_B,c_A}
+
+test_setup_12g () {
+	test_create_repo 12g &&
+	(
+		cd 12g &&
+
+		mkdir -p subdir &&
+		test_write_lines upon a time there was a >somefile &&
+		test_write_lines 1 2 3 4 5 6 7 8 9 10 >subdir/a &&
+		test_write_lines one two three four five six >subdir/b &&
+		git add . &&
+		test_tick &&
+		git commit -m "O" &&
+
+		git branch O &&
+		git branch A &&
+		git branch B &&
+
+		git switch A &&
+		test_write_lines once upon a time there was a >somefile &&
+		> subdir/c &&
+		git add somefile subdir/c &&
+		test_tick &&
+		git commit -m "A" &&
+
+		git checkout B &&
+		git mv somefile newfile &&
+		git mv subdir newdir &&
+		echo repo >>newfile &&
+		test_write_lines 1 2 3 4 5 6 7 8 9 10 11 >newdir/a &&
+		test_write_lines one two three four five six seven >newdir/b &&
+		git add newfile newdir &&
+		test_tick &&
+		git commit -m "B"
+	)
+}
+
+test_expect_success '12g: Testcase with two kinds of "relevant" renames' '
+	test_setup_12g &&
+	(
+		cd 12g &&
+
+		git checkout A^0 &&
+
+		git -c merge.directoryRenames=true merge -s recursive B^0 &&
+
+		test_write_lines once upon a time there was a repo >expect &&
+		test_cmp expect newfile &&
+
+		git ls-files -s >out &&
+		test_line_count = 4 out &&
+
+		git rev-parse >actual \
+			HEAD:newdir/a  HEAD:newdir/b   HEAD:newdir/c &&
+		git rev-parse >expect \
+			B:newdir/a     B:newdir/b      A:subdir/c &&
+		test_cmp expect actual &&
+
+		test_must_fail git rev-parse HEAD:subdir/a &&
+		test_must_fail git rev-parse HEAD:subdir/b &&
+		test_must_fail git rev-parse HEAD:subdir/c &&
+		test_path_is_missing subdir/ &&
+		test_path_is_file newdir/c
+	)
+'
+
 ###########################################################################
 # SECTION 13: Checking informational and conflict messages
 #
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v3 3/8] merge-ort: add data structures for an alternate tree traversal
  2021-03-11  0:38   ` [PATCH v3 " Elijah Newren via GitGitGadget
  2021-03-11  0:38     ` [PATCH v3 1/8] diffcore-rename: enable filtering possible rename sources Elijah Newren via GitGitGadget
  2021-03-11  0:38     ` [PATCH v3 2/8] merge-ort: precompute subset of sources for which we need rename detection Elijah Newren via GitGitGadget
@ 2021-03-11  0:38     ` Elijah Newren via GitGitGadget
  2021-03-11  0:38     ` [PATCH v3 4/8] merge-ort: introduce wrappers for " Elijah Newren via GitGitGadget
                       ` (5 subsequent siblings)
  8 siblings, 0 replies; 39+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-03-11  0:38 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Elijah Newren,
	Derrick Stolee, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

In order to determine whether directory rename detection is needed, we
as a pre-requisite need a way to traverse through all the files in a
given tree before visiting any directories within that tree.
traverse_trees() only iterates through the entries in the order they
appear, so add some data structures that will store all the entries as
we iterate through them in traverse_trees(), which will allow us to
re-traverse them in our desired order.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/merge-ort.c b/merge-ort.c
index 83aa4c08121f..d49cfa8b030b 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -51,6 +51,12 @@ enum merge_side {
 	MERGE_SIDE2 = 2
 };
 
+struct traversal_callback_data {
+	unsigned long mask;
+	unsigned long dirmask;
+	struct name_entry names[3];
+};
+
 struct rename_info {
 	/*
 	 * All variables that are arrays of size 3 correspond to data tracked
@@ -102,6 +108,22 @@ struct rename_info {
 	 */
 	struct strset relevant_sources[3];
 
+	/*
+	 * callback_data_*: supporting data structures for alternate traversal
+	 *
+	 * We sometimes need to be able to traverse through all the files
+	 * in a given tree before all immediate subdirectories within that
+	 * tree.  Since traverse_trees() doesn't do that naturally, we have
+	 * a traverse_trees_wrapper() that stores any immediate
+	 * subdirectories while traversing files, then traverses the
+	 * immediate subdirectories later.  These callback_data* variables
+	 * store the information for the subdirectories so that we can do
+	 * that traversal order.
+	 */
+	struct traversal_callback_data *callback_data;
+	int callback_data_nr, callback_data_alloc;
+	char *callback_data_traverse_path;
+
 	/*
 	 * needed_limit: value needed for inexact rename detection to run
 	 *
@@ -396,6 +418,10 @@ static void clear_or_reinit_internal_opts(struct merge_options_internal *opti,
 		}
 		strmap_clear(&opti->output, 0);
 	}
+
+	/* Clean out callback_data as well. */
+	FREE_AND_NULL(renames->callback_data);
+	renames->callback_data_nr = renames->callback_data_alloc = 0;
 }
 
 static int err(struct merge_options *opt, const char *err, ...)
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v3 4/8] merge-ort: introduce wrappers for alternate tree traversal
  2021-03-11  0:38   ` [PATCH v3 " Elijah Newren via GitGitGadget
                       ` (2 preceding siblings ...)
  2021-03-11  0:38     ` [PATCH v3 3/8] merge-ort: add data structures for an alternate tree traversal Elijah Newren via GitGitGadget
@ 2021-03-11  0:38     ` Elijah Newren via GitGitGadget
  2021-03-11  0:38     ` [PATCH v3 5/8] merge-ort: precompute whether directory rename detection is needed Elijah Newren via GitGitGadget
                       ` (4 subsequent siblings)
  8 siblings, 0 replies; 39+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-03-11  0:38 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Elijah Newren,
	Derrick Stolee, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

Add traverse_trees_wrapper() and traverse_trees_wrapper_callback()
functions.  The former runs traverse_trees() with info->fn set to
traverse_trees_wrapper_callback, in order to simply save all the entries
without processing or recursing into any of them.  This step allows
extra computation to be done (e.g. checking some condition across all
files) that can be used later.  Then, after that is completed, it
iterates over all the saved entries and calls the original info->fn
callback with the saved data.

Currently, this does nothing more than marginally slowing down the tree
traversal since we do not take advantage of the opportunity to compute
anything special in traverse_trees_wrapper_callback(), and thus the real
callback will be called identically as it would have been without this
extra wrapper.  However, a subsequent commit will add some special
computation of some values that the real callback will be able to use.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 71 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 71 insertions(+)

diff --git a/merge-ort.c b/merge-ort.c
index d49cfa8b030b..f8f7d06d481a 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -512,6 +512,77 @@ static char *unique_path(struct strmap *existing_paths,
 
 /*** Function Grouping: functions related to collect_merge_info() ***/
 
+static int traverse_trees_wrapper_callback(int n,
+					   unsigned long mask,
+					   unsigned long dirmask,
+					   struct name_entry *names,
+					   struct traverse_info *info)
+{
+	struct merge_options *opt = info->data;
+	struct rename_info *renames = &opt->priv->renames;
+
+	assert(n==3);
+
+	if (!renames->callback_data_traverse_path)
+		renames->callback_data_traverse_path = xstrdup(info->traverse_path);
+
+	ALLOC_GROW(renames->callback_data, renames->callback_data_nr + 1,
+		   renames->callback_data_alloc);
+	renames->callback_data[renames->callback_data_nr].mask = mask;
+	renames->callback_data[renames->callback_data_nr].dirmask = dirmask;
+	COPY_ARRAY(renames->callback_data[renames->callback_data_nr].names,
+		   names, 3);
+	renames->callback_data_nr++;
+
+	return mask;
+}
+
+/*
+ * Much like traverse_trees(), BUT:
+ *   - read all the tree entries FIRST, saving them
+ *   - note that the above step provides an opportunity to compute necessary
+ *     additional details before the "real" traversal
+ *   - loop through the saved entries and call the original callback on them
+ */
+MAYBE_UNUSED
+static int traverse_trees_wrapper(struct index_state *istate,
+				  int n,
+				  struct tree_desc *t,
+				  struct traverse_info *info)
+{
+	int ret, i, old_offset;
+	traverse_callback_t old_fn;
+	char *old_callback_data_traverse_path;
+	struct merge_options *opt = info->data;
+	struct rename_info *renames = &opt->priv->renames;
+
+	old_callback_data_traverse_path = renames->callback_data_traverse_path;
+	old_fn = info->fn;
+	old_offset = renames->callback_data_nr;
+
+	renames->callback_data_traverse_path = NULL;
+	info->fn = traverse_trees_wrapper_callback;
+	ret = traverse_trees(istate, n, t, info);
+	if (ret < 0)
+		return ret;
+
+	info->traverse_path = renames->callback_data_traverse_path;
+	info->fn = old_fn;
+	for (i = old_offset; i < renames->callback_data_nr; ++i) {
+		info->fn(n,
+			 renames->callback_data[i].mask,
+			 renames->callback_data[i].dirmask,
+			 renames->callback_data[i].names,
+			 info);
+	}
+
+	renames->callback_data_nr = old_offset;
+	free(renames->callback_data_traverse_path);
+	renames->callback_data_traverse_path = old_callback_data_traverse_path;
+	info->traverse_path = NULL;
+	return 0;
+}
+
 static void setup_path_info(struct merge_options *opt,
 			    struct string_list_item *result,
 			    const char *current_dir_name,
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v3 5/8] merge-ort: precompute whether directory rename detection is needed
  2021-03-11  0:38   ` [PATCH v3 " Elijah Newren via GitGitGadget
                       ` (3 preceding siblings ...)
  2021-03-11  0:38     ` [PATCH v3 4/8] merge-ort: introduce wrappers for " Elijah Newren via GitGitGadget
@ 2021-03-11  0:38     ` Elijah Newren via GitGitGadget
  2021-03-11  0:38     ` [PATCH v3 6/8] merge-ort: use relevant_sources to filter possible rename sources Elijah Newren via GitGitGadget
                       ` (3 subsequent siblings)
  8 siblings, 0 replies; 39+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-03-11  0:38 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Elijah Newren,
	Derrick Stolee, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

The point of directory rename detection is that if one side of history
renames a directory, and the other side adds new files under the old
directory, then the merge can move those new files into the new
directory.  This leads to the following important observation:

  * If the other side does not add any new files under the old
    directory, we do not need to detect any renames for that directory.

Similarly, directory rename detection had an important requirement:

  * If a directory still exists on one side of history, it has not been
    renamed on that side of history.  (See section 4 of t6423 or
    Documentation/technical/directory-rename-detection.txt for more
    details).

Using these two bits of information, we note that directory rename
detection is only needed in cases where (1) directories exist in the
merge base and on one side of history (i.e. dirmask == 3 or dirmask ==
5), and (2) where there is some new file added to that directory on the
side where it still exists (thus where the file has filemask == 2 or
filemask == 4, respectively).  This has to be done in two steps, because
we have the dirmask when we are first considering the directory, and
won't get the filemasks for the files within it until we recurse into
that directory.  So, we save
  dir_rename_mask = dirmask - 1
when we hit a directory that is missing on one side, and then later look
for cases of
  filemask == dir_rename_mask

One final note is that as soon as we hit a directory that needs
directory rename detection, we will need to detect renames in all
subdirectories of that directory as well due to the "majority rules"
decision when files are renamed into different directory hierarchies.
We arbitrarily use the special value of 0x07 to record when we've hit
such a directory.

The combination of all the above mean that we introduce a variable
named dir_rename_mask (couldn't think of a better name) which has one
of the following values as we traverse into a directory:
   * 0x00: directory rename detection not needed
   * 0x02 or 0x04: directory rename detection only needed if files added
   * 0x07: directory rename detection definitely needed

We then pass this value through to add_pairs() so that it can mark
location_relevant as true only when dir_rename_mask is 0x07.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 67 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 61 insertions(+), 6 deletions(-)

diff --git a/merge-ort.c b/merge-ort.c
index f8f7d06d481a..5840832cf3ed 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -108,6 +108,14 @@ struct rename_info {
 	 */
 	struct strset relevant_sources[3];
 
+	/*
+	 * dir_rename_mask:
+	 *   0: optimization removing unmodified potential rename source okay
+	 *   2 or 4: optimization okay, but must check for files added to dir
+	 *   7: optimization forbidden; need rename source in case of dir rename
+	 */
+	unsigned dir_rename_mask:3;
+
 	/*
 	 * callback_data_*: supporting data structures for alternate traversal
 	 *
@@ -419,6 +427,8 @@ static void clear_or_reinit_internal_opts(struct merge_options_internal *opti,
 		strmap_clear(&opti->output, 0);
 	}
 
+	renames->dir_rename_mask = 0;
+
 	/* Clean out callback_data as well. */
 	FREE_AND_NULL(renames->callback_data);
 	renames->callback_data_nr = renames->callback_data_alloc = 0;
@@ -520,12 +530,16 @@ static int traverse_trees_wrapper_callback(int n,
 {
 	struct merge_options *opt = info->data;
 	struct rename_info *renames = &opt->priv->renames;
+	unsigned filemask = mask & ~dirmask;
 
 	assert(n==3);
 
 	if (!renames->callback_data_traverse_path)
 		renames->callback_data_traverse_path = xstrdup(info->traverse_path);
 
+	if (filemask && filemask == renames->dir_rename_mask)
+		renames->dir_rename_mask = 0x07;
+
 	ALLOC_GROW(renames->callback_data, renames->callback_data_nr + 1,
 		   renames->callback_data_alloc);
 	renames->callback_data[renames->callback_data_nr].mask = mask;
@@ -544,7 +558,6 @@ static int traverse_trees_wrapper_callback(int n,
  *     additional details before the "real" traversal
  *   - loop through the saved entries and call the original callback on them
  */
-MAYBE_UNUSED
 static int traverse_trees_wrapper(struct index_state *istate,
 				  int n,
 				  struct tree_desc *t,
@@ -556,6 +569,8 @@ static int traverse_trees_wrapper(struct index_state *istate,
 	struct merge_options *opt = info->data;
 	struct rename_info *renames = &opt->priv->renames;
 
+	assert(renames->dir_rename_mask == 2 || renames->dir_rename_mask == 4);
+
 	old_callback_data_traverse_path = renames->callback_data_traverse_path;
 	old_fn = info->fn;
 	old_offset = renames->callback_data_nr;
@@ -647,7 +662,8 @@ static void add_pair(struct merge_options *opt,
 		     const char *pathname,
 		     unsigned side,
 		     unsigned is_add /* if false, is_delete */,
-		     unsigned match_mask)
+		     unsigned match_mask,
+		     unsigned dir_rename_mask)
 {
 	struct diff_filespec *one, *two;
 	struct rename_info *renames = &opt->priv->renames;
@@ -655,7 +671,7 @@ static void add_pair(struct merge_options *opt,
 
 	if (!is_add) {
 		unsigned content_relevant = (match_mask == 0);
-		unsigned location_relevant = 1; /* FIXME: compute this */
+		unsigned location_relevant = (dir_rename_mask == 0x07);
 
 		if (content_relevant || location_relevant)
 			strset_add(&renames->relevant_sources[side], pathname);
@@ -679,6 +695,36 @@ static void collect_rename_info(struct merge_options *opt,
 	struct rename_info *renames = &opt->priv->renames;
 	unsigned side;
 
+	/*
+	 * Update dir_rename_mask (determines ignore-rename-source validity)
+	 *
+	 * dir_rename_mask helps us keep track of when directory rename
+	 * detection may be relevant.  Basically, whenver a directory is
+	 * removed on one side of history, and a file is added to that
+	 * directory on the other side of history, directory rename
+	 * detection is relevant (meaning we have to detect renames for all
+	 * files within that directory to deduce where the directory
+	 * moved).  Also, whenever a directory needs directory rename
+	 * detection, due to the "majority rules" choice for where to move
+	 * it (see t6423 testcase 1f), we also need to detect renames for
+	 * all files within subdirectories of that directory as well.
+	 *
+	 * Here we haven't looked at files within the directory yet, we are
+	 * just looking at the directory itself.  So, if we aren't yet in
+	 * a case where a parent directory needed directory rename detection
+	 * (i.e. dir_rename_mask != 0x07), and if the directory was removed
+	 * on one side of history, record the mask of the other side of
+	 * history in dir_rename_mask.
+	 */
+	if (renames->dir_rename_mask != 0x07 &&
+	    (dirmask == 3 || dirmask == 5)) {
+		/* simple sanity check */
+		assert(renames->dir_rename_mask == 0 ||
+		       renames->dir_rename_mask == (dirmask & ~1));
+		/* update dir_rename_mask; have it record mask of new side */
+		renames->dir_rename_mask = (dirmask & ~1);
+	}
+
 	/* Update dirs_removed, as needed */
 	if (dirmask == 1 || dirmask == 3 || dirmask == 5) {
 		/* absent_mask = 0x07 - dirmask; sides = absent_mask/2 */
@@ -698,12 +744,14 @@ static void collect_rename_info(struct merge_options *opt,
 		/* Check for deletion on side */
 		if ((filemask & 1) && !(filemask & side_mask))
 			add_pair(opt, names, fullname, side, 0 /* delete */,
-				 match_mask & filemask);
+				 match_mask & filemask,
+				 renames->dir_rename_mask);
 
 		/* Check for addition on side */
 		if (!(filemask & 1) && (filemask & side_mask))
 			add_pair(opt, names, fullname, side, 1 /* add */,
-				 match_mask & filemask);
+				 match_mask & filemask,
+				 renames->dir_rename_mask);
 	}
 }
 
@@ -721,12 +769,14 @@ static int collect_merge_info_callback(int n,
 	 */
 	struct merge_options *opt = info->data;
 	struct merge_options_internal *opti = opt->priv;
+	struct rename_info *renames = &opt->priv->renames;
 	struct string_list_item pi;  /* Path Info */
 	struct conflict_info *ci; /* typed alias to pi.util (which is void*) */
 	struct name_entry *p;
 	size_t len;
 	char *fullpath;
 	const char *dirname = opti->current_dir_name;
+	unsigned prev_dir_rename_mask = renames->dir_rename_mask;
 	unsigned filemask = mask & ~dirmask;
 	unsigned match_mask = 0; /* will be updated below */
 	unsigned mbase_null = !(mask & 1);
@@ -867,8 +917,13 @@ static int collect_merge_info_callback(int n,
 
 		original_dir_name = opti->current_dir_name;
 		opti->current_dir_name = pi.string;
-		ret = traverse_trees(NULL, 3, t, &newinfo);
+		if (renames->dir_rename_mask == 0 ||
+		    renames->dir_rename_mask == 0x07)
+			ret = traverse_trees(NULL, 3, t, &newinfo);
+		else
+			ret = traverse_trees_wrapper(NULL, 3, t, &newinfo);
 		opti->current_dir_name = original_dir_name;
+		renames->dir_rename_mask = prev_dir_rename_mask;
 
 		for (i = MERGE_BASE; i <= MERGE_SIDE2; i++)
 			free(buf[i]);
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v3 6/8] merge-ort: use relevant_sources to filter possible rename sources
  2021-03-11  0:38   ` [PATCH v3 " Elijah Newren via GitGitGadget
                       ` (4 preceding siblings ...)
  2021-03-11  0:38     ` [PATCH v3 5/8] merge-ort: precompute whether directory rename detection is needed Elijah Newren via GitGitGadget
@ 2021-03-11  0:38     ` Elijah Newren via GitGitGadget
  2021-03-11  0:38     ` [PATCH v3 7/8] merge-ort: skip rename detection entirely if possible Elijah Newren via GitGitGadget
                       ` (2 subsequent siblings)
  8 siblings, 0 replies; 39+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-03-11  0:38 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Elijah Newren,
	Derrick Stolee, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

The past several commits determined conditions when rename sources might
be needed, and filled a relevant_sources strset with those paths.  Pass
these along to diffcore_rename_extended() to use to limit the sources
that we need to detect renames for.

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:

                            Before                  After
    no-renames:       12.596 s ±  0.061 s     6.003 s ±  0.048 s
    mega-renames:    130.465 s ±  0.259 s   114.009 s ±  0.236 s
    just-one-mega:     3.958 s ±  0.010 s     3.489 s ±  0.017 s

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/merge-ort.c b/merge-ort.c
index 5840832cf3ed..eea14024c657 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -2208,7 +2208,7 @@ static void detect_regular_renames(struct merge_options *opt,
 	diff_queued_diff = renames->pairs[side_index];
 	trace2_region_enter("diff", "diffcore_rename", opt->repo);
 	diffcore_rename_extended(&diff_opts,
-				 NULL,
+				 &renames->relevant_sources[side_index],
 				 &renames->dirs_removed[side_index],
 				 &renames->dir_rename_count[side_index]);
 	trace2_region_leave("diff", "diffcore_rename", opt->repo);
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v3 7/8] merge-ort: skip rename detection entirely if possible
  2021-03-11  0:38   ` [PATCH v3 " Elijah Newren via GitGitGadget
                       ` (5 preceding siblings ...)
  2021-03-11  0:38     ` [PATCH v3 6/8] merge-ort: use relevant_sources to filter possible rename sources Elijah Newren via GitGitGadget
@ 2021-03-11  0:38     ` Elijah Newren via GitGitGadget
  2021-03-11  0:38     ` [PATCH v3 8/8] diffcore-rename: avoid doing basename comparisons for irrelevant sources Elijah Newren via GitGitGadget
  2021-03-15 13:57     ` [PATCH v3 0/8] Optimization batch 9: avoid detecting irrelevant renames Derrick Stolee
  8 siblings, 0 replies; 39+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-03-11  0:38 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Elijah Newren,
	Derrick Stolee, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

diffcore_rename_extended() will do a bunch of setup, then check for
exact renames, then abort before inexact rename detection if there are
no more sources or destinations that need to be matched.  It will
sometimes be the case, however, that either
  * we start with neither any sources or destinations
  * we start with no *relevant* sources
In the first of these two cases, the setup and exact rename detection
will be very cheap since there are 0 files to operate on.  In the second
case, it is quite possible to have thousands of files with none of the
source ones being relevant.  Avoid calling diffcore_rename_extended() or
even some of the setup before diffcore_rename_extended() when we can
determine that rename detection is unnecessary.

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:

                            Before                  After
    no-renames:        6.003 s ±  0.048 s     5.708 s ±  0.111 s
    mega-renames:    114.009 s ±  0.236 s   102.171 s ±  0.440 s
    just-one-mega:     3.489 s ±  0.017 s     3.471 s ±  0.015 s

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)

diff --git a/merge-ort.c b/merge-ort.c
index eea14024c657..bd089cb9a76f 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -2157,6 +2157,19 @@ static int process_renames(struct merge_options *opt,
 	return clean_merge;
 }
 
+static inline int possible_side_renames(struct rename_info *renames,
+					unsigned side_index)
+{
+	return renames->pairs[side_index].nr > 0 &&
+	       !strset_empty(&renames->relevant_sources[side_index]);
+}
+
+static inline int possible_renames(struct rename_info *renames)
+{
+	return possible_side_renames(renames, 1) ||
+	       possible_side_renames(renames, 2);
+}
+
 static void resolve_diffpair_statuses(struct diff_queue_struct *q)
 {
 	/*
@@ -2193,6 +2206,16 @@ static void detect_regular_renames(struct merge_options *opt,
 	struct diff_options diff_opts;
 	struct rename_info *renames = &opt->priv->renames;
 
+	if (!possible_side_renames(renames, side_index)) {
+		/*
+		 * No rename detection needed for this side, but we still need
+		 * to make sure 'adds' are marked correctly in case the other
+		 * side had directory renames.
+		 */
+		resolve_diffpair_statuses(&renames->pairs[side_index]);
+		return;
+	}
+
 	repo_diff_setup(opt->repo, &diff_opts);
 	diff_opts.flags.recursive = 1;
 	diff_opts.flags.rename_empty = 0;
@@ -2310,6 +2333,8 @@ static int detect_and_process_renames(struct merge_options *opt,
 	int need_dir_renames, s, clean = 1;
 
 	memset(&combined, 0, sizeof(combined));
+	if (!possible_renames(renames))
+		goto cleanup;
 
 	trace2_region_enter("merge", "regular renames", opt->repo);
 	detect_regular_renames(opt, MERGE_SIDE1);
@@ -2344,6 +2369,25 @@ static int detect_and_process_renames(struct merge_options *opt,
 	clean &= process_renames(opt, &combined);
 	trace2_region_leave("merge", "process renames", opt->repo);
 
+	goto simple_cleanup; /* collect_renames() handles some of cleanup */
+
+cleanup:
+	/*
+	 * Free now unneeded filepairs, which would have been handled
+	 * in collect_renames() normally but we skipped that code.
+	 */
+	for (s = MERGE_SIDE1; s <= MERGE_SIDE2; s++) {
+		struct diff_queue_struct *side_pairs;
+		int i;
+
+		side_pairs = &renames->pairs[s];
+		for (i = 0; i < side_pairs->nr; ++i) {
+			struct diff_filepair *p = side_pairs->queue[i];
+			diff_free_filepair(p);
+		}
+	}
+
+simple_cleanup:
 	/* Free memory for renames->pairs[] and combined */
 	for (s = MERGE_SIDE1; s <= MERGE_SIDE2; s++) {
 		free(renames->pairs[s].queue);
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v3 8/8] diffcore-rename: avoid doing basename comparisons for irrelevant sources
  2021-03-11  0:38   ` [PATCH v3 " Elijah Newren via GitGitGadget
                       ` (6 preceding siblings ...)
  2021-03-11  0:38     ` [PATCH v3 7/8] merge-ort: skip rename detection entirely if possible Elijah Newren via GitGitGadget
@ 2021-03-11  0:38     ` Elijah Newren via GitGitGadget
  2021-03-15 13:57     ` [PATCH v3 0/8] Optimization batch 9: avoid detecting irrelevant renames Derrick Stolee
  8 siblings, 0 replies; 39+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-03-11  0:38 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Elijah Newren,
	Derrick Stolee, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

The basename comparison optimization implemented in
find_basename_matches() is very beneficial since it allows a source to
sometimes only be compared with one other file instead of N other files.
When a match is found, both a source and destination can be removed from
the matrix of inexact rename comparisons.  In contrast, the irrelevant
source optimization only allows us to remove a source from the matrix of
inexact rename comparisons...but it has the advantage of allowing a
source file to not even be loaded into memory at all and be compared to
0 other files.  Generally, not even comparing is a bigger performance
win, so when both optimizations could apply, prefer to use the
irrelevant-source optimization.

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:

                            Before                  After
    no-renames:        5.708 s ±  0.111 s     5.680 s ±  0.096 s
    mega-renames:    102.171 s ±  0.440 s    13.812 s ±  0.162 s
    just-one-mega:     3.471 s ±  0.015 s   506.0  ms ±  3.9  ms

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 37 +++++++++++++++++++++++++++++++++----
 1 file changed, 33 insertions(+), 4 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 7f6115fd9018..e8508541be14 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -527,6 +527,7 @@ static void update_dir_rename_counts(struct dir_rename_info *info,
 }
 
 static void initialize_dir_rename_info(struct dir_rename_info *info,
+				       struct strset *relevant_sources,
 				       struct strset *dirs_removed,
 				       struct strmap *dir_rename_count)
 {
@@ -534,7 +535,7 @@ static void initialize_dir_rename_info(struct dir_rename_info *info,
 	struct strmap_entry *entry;
 	int i;
 
-	if (!dirs_removed) {
+	if (!dirs_removed && !relevant_sources) {
 		info->setup = 0;
 		return;
 	}
@@ -549,7 +550,20 @@ static void initialize_dir_rename_info(struct dir_rename_info *info,
 	strmap_init_with_options(&info->dir_rename_guess, NULL, 0);
 
 	/* Setup info->relevant_source_dirs */
-	info->relevant_source_dirs = dirs_removed;
+	info->relevant_source_dirs = NULL;
+	if (dirs_removed || !relevant_sources) {
+		info->relevant_source_dirs = dirs_removed; /* might be NULL */
+	} else {
+		info->relevant_source_dirs = xmalloc(sizeof(struct strintmap));
+		strset_init(info->relevant_source_dirs);
+		strset_for_each_entry(relevant_sources, &iter, entry) {
+			char *dirname = get_dirname(entry->key);
+			if (!dirs_removed ||
+			    strset_contains(dirs_removed, dirname))
+				strset_add(info->relevant_source_dirs, dirname);
+			free(dirname);
+		}
+	}
 
 	/*
 	 * Loop setting up both info->idx_map, and doing setup of
@@ -627,6 +641,13 @@ static void cleanup_dir_rename_info(struct dir_rename_info *info,
 	/* dir_rename_guess */
 	strmap_clear(&info->dir_rename_guess, 1);
 
+	/* relevant_source_dirs */
+	if (info->relevant_source_dirs &&
+	    info->relevant_source_dirs != dirs_removed) {
+		strset_clear(info->relevant_source_dirs);
+		FREE_AND_NULL(info->relevant_source_dirs);
+	}
+
 	/* dir_rename_count */
 	if (!keep_dir_rename_count) {
 		partial_clear_dir_rename_count(info->dir_rename_count);
@@ -749,6 +770,7 @@ static int idx_possible_rename(char *filename, struct dir_rename_info *info)
 static int find_basename_matches(struct diff_options *options,
 				 int minimum_score,
 				 struct dir_rename_info *info,
+				 struct strset *relevant_sources,
 				 struct strset *dirs_removed)
 {
 	/*
@@ -839,6 +861,11 @@ static int find_basename_matches(struct diff_options *options,
 		intptr_t src_index;
 		intptr_t dst_index;
 
+		/* Skip irrelevant sources */
+		if (relevant_sources &&
+		    !strset_contains(relevant_sources, filename))
+			continue;
+
 		/*
 		 * If the basename is unique among remaining sources, then
 		 * src_index will equal 'i' and we can attempt to match it
@@ -1164,7 +1191,7 @@ void diffcore_rename_extended(struct diff_options *options,
 
 		/* Preparation for basename-driven matching. */
 		trace2_region_enter("diff", "dir rename setup", options->repo);
-		initialize_dir_rename_info(&info,
+		initialize_dir_rename_info(&info, relevant_sources,
 					   dirs_removed, dir_rename_count);
 		trace2_region_leave("diff", "dir rename setup", options->repo);
 
@@ -1172,7 +1199,9 @@ void diffcore_rename_extended(struct diff_options *options,
 		trace2_region_enter("diff", "basename matches", options->repo);
 		rename_count += find_basename_matches(options,
 						      min_basename_score,
-						      &info, dirs_removed);
+						      &info,
+						      relevant_sources,
+						      dirs_removed);
 		trace2_region_leave("diff", "basename matches", options->repo);
 
 		/*
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH v3 0/8] Optimization batch 9: avoid detecting irrelevant renames
  2021-03-11  0:38   ` [PATCH v3 " Elijah Newren via GitGitGadget
                       ` (7 preceding siblings ...)
  2021-03-11  0:38     ` [PATCH v3 8/8] diffcore-rename: avoid doing basename comparisons for irrelevant sources Elijah Newren via GitGitGadget
@ 2021-03-15 13:57     ` Derrick Stolee
  8 siblings, 0 replies; 39+ messages in thread
From: Derrick Stolee @ 2021-03-15 13:57 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget, git
  Cc: Derrick Stolee, Jonathan Tan, Taylor Blau, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Elijah Newren

On 3/10/2021 7:38 PM, Elijah Newren via GitGitGadget wrote:
> This series determines paths which meet special cases where detection of
> renames is irrelevant, where the irrelevance is due to the fact that the
> merge machinery will arrive at the same result regardless of whether a
> rename is detected for any of those paths.
> 
> Changes since v2:
> 
>  * Added an extended comment about filtering order to the first patch, to
>    address Stolee's question
>  * Added a new testcase that will fail if the critical "if (content_relevant
>    || location_relevant)" check only checks for one of those two or the
>    intersection of those two, as suggested by Stolee
>  * Fixed the other minor problems Stolee pointed out in his review (stray
>    newline, wording of comment)

Thanks for these updates. This version works for me.

-Stolee

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v2 0/8] Optimization batch 9: avoid detecting irrelevant renames
  2021-03-09  0:09 ` [PATCH v2 " Elijah Newren via GitGitGadget
                     ` (10 preceding siblings ...)
  2021-03-11  0:38   ` [PATCH v3 " Elijah Newren via GitGitGadget
@ 2021-03-15 17:10   ` Elijah Newren
  11 siblings, 0 replies; 39+ messages in thread
From: Elijah Newren @ 2021-03-15 17:10 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget
  Cc: Git Mailing List, Derrick Stolee, Jonathan Tan, Taylor Blau,
	Junio C Hamano, Ævar Arnfjörð Bjarmason

On Mon, Mar 8, 2021 at 4:10 PM Elijah Newren via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> This series depends textually on ort-perf-batch-8, but semantically it's
> almost completely unrelated and can be reviewed without any familiarity with
> any of the previous patch series.
>
> There are no changes since v1; it's just a resend just over a week later to
> bump it so it isn't lost.
>
> === Basic Optimization idea ===
>
> This series determines paths which meet special cases where detection of
> renames is irrelevant, where the irrelevance is due to the fact that the
> merge machinery will arrive at the same result regardless of whether a
> rename is detected for any of those paths. This series represents
> "Optimization #2" from my Git Merge 2020 talk[1], though this series has
> some improvements on the optimization relative to what I had at that time.
>
> The basic idea here is:
>
> We only need expensive rename detection on the subset of files changed on
> both sides of history (for the most part).

I know this series was already reviewed and even a subsequent series
was reviewed, but I thought I'd insert a bit of history trivia:

I first submitted this optimization in late 2017 as an RFC at
https://lore.kernel.org/git/20171110222156.23221-9-newren@gmail.com/.
Unfortunately I had only handled the "for the most part" piece, not
the other special cases.  I knew not handling those other cases caused
bugs, but didn't readily find a solution for them at the time.  I kept
mulling it over periodically despite being switched onto other
projects; eventually I weaseled my way into being able to work on it
again and with some effort was able to work out the other necessary
conditions and audit the code to verify I had all the cases covered.
Those "other cases" were much more easily expressed in the context of
merge-ort's data structures than merge-recursive's, in part because
merge-ort's data structures were picked to help me solve this
optimization problem and the various known failing testcases that I
wanted to fix.

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2021-03-15 17:12 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-28  3:58 [PATCH 0/8] Optimization batch 9: avoid detecting irrelevant renames Elijah Newren via GitGitGadget
2021-02-28  3:58 ` [PATCH 1/8] diffcore-rename: enable filtering possible rename sources Elijah Newren via GitGitGadget
2021-02-28  3:58 ` [PATCH 2/8] merge-ort: precompute subset of sources for which we need rename detection Elijah Newren via GitGitGadget
2021-02-28  3:58 ` [PATCH 3/8] merge-ort: add data structures for an alternate tree traversal Elijah Newren via GitGitGadget
2021-02-28  3:58 ` [PATCH 4/8] merge-ort: introduce wrappers for " Elijah Newren via GitGitGadget
2021-02-28  3:58 ` [PATCH 5/8] merge-ort: precompute whether directory rename detection is needed Elijah Newren via GitGitGadget
2021-02-28  3:58 ` [PATCH 6/8] merge-ort: use relevant_sources to filter possible rename sources Elijah Newren via GitGitGadget
2021-02-28  3:58 ` [PATCH 7/8] merge-ort: skip rename detection entirely if possible Elijah Newren via GitGitGadget
2021-02-28  3:58 ` [PATCH 8/8] diffcore-rename: avoid doing basename comparisons for irrelevant sources Elijah Newren via GitGitGadget
2021-03-01 16:39 ` [PATCH 0/8] Optimization batch 9: avoid detecting irrelevant renames Elijah Newren
2021-03-04  7:54 ` Elijah Newren
2021-03-09  0:09 ` [PATCH v2 " Elijah Newren via GitGitGadget
2021-03-09  0:09   ` [PATCH v2 1/8] diffcore-rename: enable filtering possible rename sources Elijah Newren via GitGitGadget
2021-03-09 22:21     ` Derrick Stolee
2021-03-09 22:40       ` Elijah Newren
2021-03-09 22:45         ` Derrick Stolee
2021-03-09  0:09   ` [PATCH v2 2/8] merge-ort: precompute subset of sources for which we need rename detection Elijah Newren via GitGitGadget
2021-03-09  0:09   ` [PATCH v2 3/8] merge-ort: add data structures for an alternate tree traversal Elijah Newren via GitGitGadget
2021-03-09  0:09   ` [PATCH v2 4/8] merge-ort: introduce wrappers for " Elijah Newren via GitGitGadget
2021-03-09 23:06     ` Derrick Stolee
2021-03-09  0:09   ` [PATCH v2 5/8] merge-ort: precompute whether directory rename detection is needed Elijah Newren via GitGitGadget
2021-03-09  0:09   ` [PATCH v2 6/8] merge-ort: use relevant_sources to filter possible rename sources Elijah Newren via GitGitGadget
2021-03-09  0:09   ` [PATCH v2 7/8] merge-ort: skip rename detection entirely if possible Elijah Newren via GitGitGadget
2021-03-09 22:51     ` Derrick Stolee
2021-03-09 22:57       ` Elijah Newren
2021-03-09  0:09   ` [PATCH v2 8/8] diffcore-rename: avoid doing basename comparisons for irrelevant sources Elijah Newren via GitGitGadget
2021-03-09 22:08   ` [PATCH v2 0/8] Optimization batch 9: avoid detecting irrelevant renames Derrick Stolee
2021-03-10 15:08   ` Derrick Stolee
2021-03-11  0:38   ` [PATCH v3 " Elijah Newren via GitGitGadget
2021-03-11  0:38     ` [PATCH v3 1/8] diffcore-rename: enable filtering possible rename sources Elijah Newren via GitGitGadget
2021-03-11  0:38     ` [PATCH v3 2/8] merge-ort: precompute subset of sources for which we need rename detection Elijah Newren via GitGitGadget
2021-03-11  0:38     ` [PATCH v3 3/8] merge-ort: add data structures for an alternate tree traversal Elijah Newren via GitGitGadget
2021-03-11  0:38     ` [PATCH v3 4/8] merge-ort: introduce wrappers for " Elijah Newren via GitGitGadget
2021-03-11  0:38     ` [PATCH v3 5/8] merge-ort: precompute whether directory rename detection is needed Elijah Newren via GitGitGadget
2021-03-11  0:38     ` [PATCH v3 6/8] merge-ort: use relevant_sources to filter possible rename sources Elijah Newren via GitGitGadget
2021-03-11  0:38     ` [PATCH v3 7/8] merge-ort: skip rename detection entirely if possible Elijah Newren via GitGitGadget
2021-03-11  0:38     ` [PATCH v3 8/8] diffcore-rename: avoid doing basename comparisons for irrelevant sources Elijah Newren via GitGitGadget
2021-03-15 13:57     ` [PATCH v3 0/8] Optimization batch 9: avoid detecting irrelevant renames Derrick Stolee
2021-03-15 17:10   ` [PATCH v2 " Elijah Newren

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).