git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / mirror / code / Atom feed
* [PATCH 0/7] Optimization batch 14: trivial directory resolution
@ 2021-07-01  3:46 Elijah Newren via GitGitGadget
  2021-07-01  3:46 ` [PATCH 1/7] merge-ort: resolve paths early when we have sufficient information Elijah Newren via GitGitGadget
                   ` (8 more replies)
  0 siblings, 9 replies; 52+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-07-01  3:46 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Elijah Newren

This series depends textually on ort-perf-batch-12, but is semantically
independent. (It is both semantically and textually independent of
ort-perf-batch-13.)

Most of my previous series dramatically accelerated cases with lots of
renames, while providing comparatively minor benefits for cases with few or
no renames. This series is the opposite; it provides huge benefits when
there are few or no renames, and comparatively smaller (though still quite
decent) benefits for cases with many uncached renames.

=== Basic Optimization idea ===

unpack_trees has had a concept of trivial merges for individual files (see
Documentation/technical/trivial-merge.txt). The same idea can be applied in
merge-ort. It'd be really nice to extend that idea to trees as well, as it
could provide a huge performance boost; sadly however, applying it in
general would wreck both regular rename detection (the unmatched side can
have new files that serve as potential destinations in rename detection) and
directory rename detection (the unmatched side could have a new directory
that was moved into it).

If we somehow knew rename detection wasn't needed, we could do trivial
directory resolution. In the past, this wasn't possible. However...

With recent optimizations we have created a possibility to do trivial
directory resolutions in some cases. These came from the addition of the
"skipping irrelevant renames" optimizations (from ort-perf-batch-9 and
ort-perf-batch-10), and in particular noting that we added an ability to
entirely skip rename detection in commit f89b4f2bee ("merge-ort: skip rename
detection entirely if possible", 2021-03-11) when there are no relevant
sources. We can detect if there are no relevant sources without recursing
into the directories in question.

As a cherry on top, the caching of renames (from ort-perf-batch-11) allows
us to cover additional cases.

This series is all about adding all the special checks needed to safely
perform trival directory resolutions.

=== Results ===

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28), the
changes in just this series improves the performance as follows:

                     Before Series           After Series
no-renames:        5.235 s ±  0.042 s   204.2  ms ±  3.0  ms
mega-renames:      9.419 s ±  0.107 s     1.076 s ±  0.015 s
just-one-mega:   480.1  ms ±  3.9  ms   364.1  ms ±  7.0  ms


As a reminder, before any merge-ort/diffcore-rename performance work, the
performance results we started with were:

no-renames-am:      6.940 s ±  0.485 s
no-renames:        18.912 s ±  0.174 s
mega-renames:    5964.031 s ± 10.459 s
just-one-mega:    149.583 s ±  0.751 s


Elijah Newren (7):
  merge-ort: resolve paths early when we have sufficient information
  merge-ort: add some more explanations in collect_merge_info_callback()
  merge-ort: add data structures for allowable trivial directory
    resolves
  merge-ort: add a handle_deferred_entries() helper function
  merge-ort: defer recursing into directories when merge base is matched
  merge-ort: avoid recursing into directories when we don't need to
  merge-ort: restart merge with cached renames to reduce process entry
    cost

 merge-ort.c                         | 403 +++++++++++++++++++++++++++-
 t/t6423-merge-rename-directories.sh |   2 +-
 2 files changed, 393 insertions(+), 12 deletions(-)


base-commit: 2eeee12b02e441ac05054a5a5ecbcea6964a1e6b
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-988%2Fnewren%2Fort-perf-batch-14-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-988/newren/ort-perf-batch-14-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/988
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH 1/7] merge-ort: resolve paths early when we have sufficient information
  2021-07-01  3:46 [PATCH 0/7] Optimization batch 14: trivial directory resolution Elijah Newren via GitGitGadget
@ 2021-07-01  3:46 ` Elijah Newren via GitGitGadget
  2021-07-01  3:46 ` [PATCH 2/7] merge-ort: add some more explanations in collect_merge_info_callback() Elijah Newren via GitGitGadget
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 52+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-07-01  3:46 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

When there are no directories involved at a given path, and all three
sides have a file at that path, and two of the three sides of history
match, we can immediately resolve the merge of that path in
collect_merge_info() and do not need to wait until process_entries().

This is actually a very minor improvement: half the time when I run it,
I see an improvement; the other half a slowdown.  It seems to be in the
range of noise.  However, this idea serves as the beginning of some
bigger optimizations coming in the following patches.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 37 +++++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/merge-ort.c b/merge-ort.c
index e3a5dfc7b31..6299b4f9413 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -1023,6 +1023,43 @@ static int collect_merge_info_callback(int n,
 		return mask;
 	}
 
+	/*
+	 * If the sides match, and all three paths are present and are
+	 * files, then we can take either as the resolution.  We can't do
+	 * this with trees, because there may be rename sources from the
+	 * merge_base.
+	 */
+	if (sides_match && filemask == 0x07) {
+		/* use side1 (== side2) version as resolution */
+		setup_path_info(opt, &pi, dirname, info->pathlen, fullpath,
+				names, names+1, side1_null, 0,
+				filemask, dirmask, 1);
+		return mask;
+	}
+
+	/*
+	 * If side1 matches mbase and all three paths are present and are
+	 * files, then we can use side2 as the resolution.  We cannot
+	 * necessarily do so this for trees, because there may be rename
+	 * destinations within side2.
+	 */
+	if (side1_matches_mbase && filemask == 0x07) {
+		/* use side2 version as resolution */
+		setup_path_info(opt, &pi, dirname, info->pathlen, fullpath,
+				names, names+2, side2_null, 0,
+				filemask, dirmask, 1);
+		return mask;
+	}
+
+	/* Similar to above but swapping sides 1 and 2 */
+	if (side2_matches_mbase && filemask == 0x07) {
+		/* use side1 version as resolution */
+		setup_path_info(opt, &pi, dirname, info->pathlen, fullpath,
+				names, names+1, side1_null, 0,
+				filemask, dirmask, 1);
+		return mask;
+	}
+
 	/*
 	 * Gather additional information used in rename detection.
 	 */
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH 2/7] merge-ort: add some more explanations in collect_merge_info_callback()
  2021-07-01  3:46 [PATCH 0/7] Optimization batch 14: trivial directory resolution Elijah Newren via GitGitGadget
  2021-07-01  3:46 ` [PATCH 1/7] merge-ort: resolve paths early when we have sufficient information Elijah Newren via GitGitGadget
@ 2021-07-01  3:46 ` Elijah Newren via GitGitGadget
  2021-07-01  3:46 ` [PATCH 3/7] merge-ort: add data structures for allowable trivial directory resolves Elijah Newren via GitGitGadget
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 52+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-07-01  3:46 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

The previous patch possibly raises some questions about whether
additional cases in collect_merge_info_callback() can be handled early.
Add some explanations in the form of comments to help explain these
better.  While we're at it, add a few comments to denote what a few
boolean '0' or '1' values stand for.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/merge-ort.c b/merge-ort.c
index 6299b4f9413..843fa693145 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -1018,8 +1018,8 @@ static int collect_merge_info_callback(int n,
 	if (side1_matches_mbase && side2_matches_mbase) {
 		/* mbase, side1, & side2 all match; use mbase as resolution */
 		setup_path_info(opt, &pi, dirname, info->pathlen, fullpath,
-				names, names+0, mbase_null, 0,
-				filemask, dirmask, 1);
+				names, names+0, mbase_null, 0 /* df_conflict */,
+				filemask, dirmask, 1 /* resolved */);
 		return mask;
 	}
 
@@ -1061,14 +1061,24 @@ static int collect_merge_info_callback(int n,
 	}
 
 	/*
-	 * Gather additional information used in rename detection.
+	 * Sometimes we can tell that a source path need not be included in
+	 * rename detection -- namely, whenever either
+	 *    side1_matches_mbase && side2_null
+	 * or
+	 *    side2_matches_mbase && side1_null
+	 * However, we call collect_rename_info() even in those cases,
+	 * because exact renames are cheap and would let us remove both a
+	 * source and destination path.  We'll cull the unneeded sources
+	 * later.
 	 */
 	collect_rename_info(opt, names, dirname, fullpath,
 			    filemask, dirmask, match_mask);
 
 	/*
-	 * Record information about the path so we can resolve later in
-	 * process_entries.
+	 * None of the special cases above matched, so we have a
+	 * provisional conflict.  (Rename detection might allow us to
+	 * unconflict some more cases, but that comes later so all we can
+	 * do now is record the different non-null file hashes.)
 	 */
 	setup_path_info(opt, &pi, dirname, info->pathlen, fullpath,
 			names, NULL, 0, df_conflict, filemask, dirmask, 0);
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH 3/7] merge-ort: add data structures for allowable trivial directory resolves
  2021-07-01  3:46 [PATCH 0/7] Optimization batch 14: trivial directory resolution Elijah Newren via GitGitGadget
  2021-07-01  3:46 ` [PATCH 1/7] merge-ort: resolve paths early when we have sufficient information Elijah Newren via GitGitGadget
  2021-07-01  3:46 ` [PATCH 2/7] merge-ort: add some more explanations in collect_merge_info_callback() Elijah Newren via GitGitGadget
@ 2021-07-01  3:46 ` Elijah Newren via GitGitGadget
  2021-07-01  3:46 ` [PATCH 4/7] merge-ort: add a handle_deferred_entries() helper function Elijah Newren via GitGitGadget
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 52+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-07-01  3:46 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

As noted a few commits ago, we can resolve individual files early if all
three sides of the merge have a file at the path and two of the three
sides match.  We would really like to do the same thing with
directories, because being able to do a trivial directory resolve means
we don't have to recurse into the directory, potentially saving us a
huge amount of time in both collect_merge_info() and process_entries().
Unfortunately, resolving directories early would mean missing any
renames whose source or destination is underneath that directory.

If we somehow knew there weren't any renames under the directory in
question, then we could resolve it early.  Sadly, it is impossible to
determine whether there are renames under the directory in question
without recursing into it, and this has traditionally kept us from ever
implementing such an optimization.

In commit f89b4f2bee ("merge-ort: skip rename detection entirely if
possible", 2021-03-11), we added an additional reason that rename
detection could be skipped entirely -- namely, if no *relevant* sources
were present.  Without completing collect_merge_info_callback(), we do
not yet know if there are no relevant sources.  However, we do know that
if the current directory on one side matches the merge base, then every
source file within that directory will not be RELEVANT_CONTENT, and a
few simple checks can often let us rule out RELEVANT_LOCATION as well.
This suggests we can just defer recursing into such directories until
the end of collect_merge_info.

Since the deferred directories are known to not add any relevant sources
due to the above properties, then if there are no relevant sources after
we've traversed all paths other than the deferred ones, then we know
there are not any relevant sources.  Under those conditions, rename
detection is unnecessary, and that means we can resolve the deferred
directories without recursing into them.

Note that the logic for skipping rename detection was also modified
further in commit 76e253793c ("merge-ort, diffcore-rename: employ cached
renames when possible", 2021-01-30); in particular rename detection can
be skipped if we already have cached renames for each relevant source.
We can take advantage of this information as well with our deferral of
recursing into directories where one side matches the merge base.

Add some data structures that we will use to do these deferrals, with
some lengthy comments explaining their purpose.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 53 insertions(+)

diff --git a/merge-ort.c b/merge-ort.c
index 843fa693145..3d3f00b3b45 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -119,6 +119,51 @@ struct rename_info {
 	 */
 	struct strintmap relevant_sources[3];
 
+	/*
+	 * possible_trivial_merges: directories we defer recursing into
+	 *
+	 * possible_trivial_merges is a map of directory names to
+	 * dir_rename_mask.  When we detect that a directory is unchanged on
+	 * one side, we can sometimes resolve the directory without recursing
+	 * into it.  Renames are the only things that can prevent such an
+	 * optimization.  However, for rename sources:
+	 *   - If no parent directory needed directory rename detection, then
+	 *     no path under such a directory can be a relevant_source.
+	 * and for rename destinations:
+	 *   - If no cached rename has a target path under the directory AND
+	 *   - If there are no unpaired relevant_sources elsewhere in the
+	 *     repository
+	 * then we don't need any path under this directory for a rename
+	 * destination.  The only way to know the last item above is to defer
+	 * handling such directories until the end of collect_merge_info(),
+	 * in handle_deferred_entries().
+	 *
+	 * For each we store dir_rename_mask, since that's the only bit of
+	 * information we need, other than the path, to resume the recursive
+	 * traversal.
+	 */
+	struct strintmap possible_trivial_merges[3];
+
+	/*
+	 * trivial_merges_okay: if trivial directory merges are okay
+	 *
+	 * See possible_trivial_merges above.  The "no unpaired
+	 * relevant_sources elsewhere in the repository" is a single boolean
+	 * per merge side, which we store here.  Note that while 0 means no,
+	 * 1 only means "maybe" rather than "yes"; we optimistically set it
+	 * to 1 initially and only clear when we determine it is unsafe to
+	 * do trivial directory merges.
+	 */
+	unsigned trivial_merges_okay[3];
+
+	/*
+	 * target_dirs: ancestor directories of rename targets
+	 *
+	 * target_dirs contains all directory names that are an ancestor of
+	 * any rename destination.
+	 */
+	struct strset target_dirs[3];
+
 	/*
 	 * dir_rename_mask:
 	 *   0: optimization removing unmodified potential rename source okay
@@ -490,6 +535,9 @@ static void clear_or_reinit_internal_opts(struct merge_options_internal *opti,
 		strintmap_func(&renames->dirs_removed[i]);
 		strmap_func(&renames->dir_renames[i], 0);
 		strintmap_func(&renames->relevant_sources[i]);
+		strintmap_func(&renames->possible_trivial_merges[i]);
+		strset_func(&renames->target_dirs[i]);
+		renames->trivial_merges_okay[i] = 1; /* 1 == maybe */
 		if (!reinitialize)
 			assert(renames->cached_pairs_valid_side == 0);
 		if (i != renames->cached_pairs_valid_side) {
@@ -4045,12 +4093,17 @@ static void merge_start(struct merge_options *opt, struct merge_result *result)
 		strintmap_init_with_options(&renames->relevant_sources[i],
 					    -1 /* explicitly invalid */,
 					    NULL, 0);
+		strintmap_init_with_options(&renames->possible_trivial_merges[i],
+					    0, NULL, 0);
+		strset_init_with_options(&renames->target_dirs[i],
+					 NULL, 1);
 		strmap_init_with_options(&renames->cached_pairs[i],
 					 NULL, 1);
 		strset_init_with_options(&renames->cached_irrelevant[i],
 					 NULL, 1);
 		strset_init_with_options(&renames->cached_target_names[i],
 					 NULL, 0);
+		renames->trivial_merges_okay[i] = 1; /* 1 == maybe */
 	}
 
 	/*
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH 4/7] merge-ort: add a handle_deferred_entries() helper function
  2021-07-01  3:46 [PATCH 0/7] Optimization batch 14: trivial directory resolution Elijah Newren via GitGitGadget
                   ` (2 preceding siblings ...)
  2021-07-01  3:46 ` [PATCH 3/7] merge-ort: add data structures for allowable trivial directory resolves Elijah Newren via GitGitGadget
@ 2021-07-01  3:46 ` Elijah Newren via GitGitGadget
  2021-07-01  3:46 ` [PATCH 5/7] merge-ort: defer recursing into directories when merge base is matched Elijah Newren via GitGitGadget
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 52+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-07-01  3:46 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

In order to allow trivial directory resolution, we first need to be able
to gather more information to determine if the optimization is safe.  To
enable that, we need a way of deferring the recursion into the directory
until a later time.  Naturally, deferring the entry into a subtree means
that we need some function that will later recurse into the subdirectory
exactly the same way that collect_merge_info_callback() would have done.

Add a helper function that does this.  For now this function is not used
but a subsequent commit will change that.  Future commits will also make
the function sometimes resolve directories instead of traversing inside.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 64 insertions(+)

diff --git a/merge-ort.c b/merge-ort.c
index 3d3f00b3b45..eb0e18d7546 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -1196,6 +1196,70 @@ static int collect_merge_info_callback(int n,
 	return mask;
 }
 
+MAYBE_UNUSED
+static int handle_deferred_entries(struct merge_options *opt,
+				   struct traverse_info *info)
+{
+	struct rename_info *renames = &opt->priv->renames;
+	struct hashmap_iter iter;
+	struct strmap_entry *entry;
+	int side, ret = 0;
+
+	for (side = MERGE_SIDE1; side <= MERGE_SIDE2; side++) {
+		renames->trivial_merges_okay[side] = 0;
+		strintmap_for_each_entry(&renames->possible_trivial_merges[side],
+					 &iter, entry) {
+			const char *path = entry->key;
+			unsigned dir_rename_mask = (intptr_t)entry->value;
+			struct conflict_info *ci;
+			unsigned dirmask;
+			struct tree_desc t[3];
+			void *buf[3] = {NULL,};
+			int i;
+
+			ci = strmap_get(&opt->priv->paths, path);
+			VERIFY_CI(ci);
+			dirmask = ci->dirmask;
+
+			info->name = path;
+			info->namelen = strlen(path);
+			info->pathlen = info->namelen + 1;
+
+			for (i = 0; i < 3; i++, dirmask >>= 1) {
+				if (i == 1 && ci->match_mask == 3)
+					t[1] = t[0];
+				else if (i == 2 && ci->match_mask == 5)
+					t[2] = t[0];
+				else if (i == 2 && ci->match_mask == 6)
+					t[2] = t[1];
+				else {
+					const struct object_id *oid = NULL;
+					if (dirmask & 1)
+						oid = &ci->stages[i].oid;
+					buf[i] = fill_tree_descriptor(opt->repo,
+								      t+i, oid);
+				}
+			}
+
+			ci->match_mask &= ci->filemask;
+			opt->priv->current_dir_name = path;
+			renames->dir_rename_mask = dir_rename_mask;
+			if (renames->dir_rename_mask == 0 ||
+			    renames->dir_rename_mask == 0x07)
+				ret = traverse_trees(NULL, 3, t, info);
+			else
+				ret = traverse_trees_wrapper(NULL, 3, t, info);
+
+			for (i = MERGE_BASE; i <= MERGE_SIDE2; i++)
+				free(buf[i]);
+
+			if (ret < 0)
+				return ret;
+		}
+	}
+	return ret;
+}
+
 static int collect_merge_info(struct merge_options *opt,
 			      struct tree *merge_base,
 			      struct tree *side1,
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH 5/7] merge-ort: defer recursing into directories when merge base is matched
  2021-07-01  3:46 [PATCH 0/7] Optimization batch 14: trivial directory resolution Elijah Newren via GitGitGadget
                   ` (3 preceding siblings ...)
  2021-07-01  3:46 ` [PATCH 4/7] merge-ort: add a handle_deferred_entries() helper function Elijah Newren via GitGitGadget
@ 2021-07-01  3:46 ` Elijah Newren via GitGitGadget
  2021-07-01  3:46 ` [PATCH 6/7] merge-ort: avoid recursing into directories when we don't need to Elijah Newren via GitGitGadget
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 52+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-07-01  3:46 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

When one side of history matches the merge base (including when the
merge base has no entry for the given directory), have
collect_merge_info_callback() defer recursing into the directory.  To
ensure those entries are eventually handled, add a call to
handled_deferred_entries() in collect_merge_info() after
traverse_trees() returns.

Note that the condition in collect_merge_info_callback() may look more
complicated than necessary at first glance;
renames->trivial_merges_okay[side] is always true until
handle_deferred_entries() is called, and possible_trivial_merges[side]
is always empty right now (and in the future won't be filled until
handle_deferred_entries() is called).  However, when
handle_deferred_entries() calls traverse_trees() for the relevant
deferred directories, those traverse_trees() calls will once again end
up in collect_merge_info_callback() for all the entries under those
subdirectories.  The extra conditions are there for such deferred cases
and will be used more as we do more with those variables.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 32 ++++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/merge-ort.c b/merge-ort.c
index eb0e18d7546..bf0712d18a0 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -1141,8 +1141,35 @@ static int collect_merge_info_callback(int n,
 		struct tree_desc t[3];
 		void *buf[3] = {NULL, NULL, NULL};
 		const char *original_dir_name;
-		int i, ret;
+		int i, ret, side;
 
+		/*
+		 * Check for whether we can avoid recursing due to one side
+		 * matching the merge base.  The side that does NOT match is
+		 * the one that might have a rename destination we need.
+		 */
+		assert(!side1_matches_mbase || !side2_matches_mbase);
+		side = side1_matches_mbase ? MERGE_SIDE2 :
+			side2_matches_mbase ? MERGE_SIDE1 : MERGE_BASE;
+		if (filemask == 0 && (dirmask == 2 || dirmask == 4)) {
+			/*
+			 * Also defer recursing into new directories; set up a
+			 * few variables to let us do so.
+			 */
+			ci->match_mask = (7 - dirmask);
+			side = dirmask / 2;
+		}
+		if (renames->dir_rename_mask != 0x07 &&
+		    (side != MERGE_BASE) &&
+		    renames->trivial_merges_okay[side] &&
+		    !strset_contains(&renames->target_dirs[side], pi.string)) {
+			strintmap_set(&renames->possible_trivial_merges[side],
+				      pi.string, renames->dir_rename_mask);
+			renames->dir_rename_mask = prev_dir_rename_mask;
+			return mask;
+		}
+
+		/* We need to recurse */
 		ci->match_mask &= filemask;
 		newinfo = *info;
 		newinfo.prev = info;
@@ -1196,7 +1223,6 @@ static int collect_merge_info_callback(int n,
 	return mask;
 }
 
-MAYBE_UNUSED
 static int handle_deferred_entries(struct merge_options *opt,
 				   struct traverse_info *info)
 {
@@ -1285,6 +1311,8 @@ static int collect_merge_info(struct merge_options *opt,
 
 	trace2_region_enter("merge", "traverse_trees", opt->repo);
 	ret = traverse_trees(NULL, 3, t, &info);
+	if (ret == 0)
+		ret = handle_deferred_entries(opt, &info);
 	trace2_region_leave("merge", "traverse_trees", opt->repo);
 
 	return ret;
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH 6/7] merge-ort: avoid recursing into directories when we don't need to
  2021-07-01  3:46 [PATCH 0/7] Optimization batch 14: trivial directory resolution Elijah Newren via GitGitGadget
                   ` (4 preceding siblings ...)
  2021-07-01  3:46 ` [PATCH 5/7] merge-ort: defer recursing into directories when merge base is matched Elijah Newren via GitGitGadget
@ 2021-07-01  3:46 ` Elijah Newren via GitGitGadget
  2021-07-01  3:46 ` [PATCH 7/7] merge-ort: restart merge with cached renames to reduce process entry cost Elijah Newren via GitGitGadget
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 52+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-07-01  3:46 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

This combines the work of the last several patches, and implements the
conditions when we don't need to recurse into directories.  It's perhaps
easiest to see the logic by separating the fact that a directory might
have both rename sources and rename destinations:

  * rename sources: only files present in the merge base can serve as
    rename sources, and only when one side deletes that file.  When the
    tree on one side matches the merge base, that means every file
    within the subtree matches the merge base.  This means that the
    skip-irrelevant-rename-detection optimization from before kicks in
    and we don't need any of these files as rename sources.

  * rename destinations: the tree that does not match the merge base
    might have newly added and hence unmatched destination files.
    This is what usually prevents us from doing trivial directory
    resolutions in the merge machinery.  However, the fact that we have
    deferred recursing into this directory until the end means we know
    whether there are any unmatched relevant potential rename sources
    elsewhere in this merge.  If there are no unmatched such relevant
    sources anywhere, then there is no need to look for unmatched
    potential rename destinations to match them with.

This informs our algorithm:
  * Search through relevant_sources; if we have entries, they better all
    be reflected in cached_pairs or cached_irrelevant, otherwise they
    represent an unmatched potential rename source (causing the
    optimization to be disallowed).
  * For any relevant_source represented in cached_pairs, we do need to
    to make sure to get the destination for each source, meaning we need
    to recurse into any ancestor directories of those destinations.
  * Once we've recursed into all the rename destinations for any
    relevant_sources in cached_pairs, we can then do the trivial
    directory resolution for the remaining directories.

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:

                            Before                  After
    no-renames:        5.235 s ±  0.042 s   205.1  ms ±  3.8  ms
    mega-renames:      9.419 s ±  0.107 s     1.564 s ±  0.010 s
    just-one-mega:   480.1  ms ±  3.9  ms   479.5  ms ±  3.9  ms

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 101 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 98 insertions(+), 3 deletions(-)

diff --git a/merge-ort.c b/merge-ort.c
index bf0712d18a0..c9cf7a158c8 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -1223,6 +1223,18 @@ static int collect_merge_info_callback(int n,
 	return mask;
 }
 
+static void resolve_trivial_directory_merge(struct conflict_info *ci, int side)
+{
+	VERIFY_CI(ci);
+	assert((side == 1 && ci->match_mask == 5) ||
+	       (side == 2 && ci->match_mask == 3));
+	oidcpy(&ci->merged.result.oid, &ci->stages[side].oid);
+	ci->merged.result.mode = ci->stages[side].mode;
+	ci->merged.is_null = is_null_oid(&ci->stages[side].oid);
+	ci->match_mask = 0;
+	ci->merged.clean = 1; /* (ci->filemask == 0); */
+}
+
 static int handle_deferred_entries(struct merge_options *opt,
 				   struct traverse_info *info)
 {
@@ -1232,9 +1244,71 @@ static int handle_deferred_entries(struct merge_options *opt,
 	int side, ret = 0;
 
 	for (side = MERGE_SIDE1; side <= MERGE_SIDE2; side++) {
-		renames->trivial_merges_okay[side] = 0;
-		strintmap_for_each_entry(&renames->possible_trivial_merges[side],
-					 &iter, entry) {
+		unsigned optimization_okay = 1;
+		struct strintmap copy;
+
+		/* Loop over the set of paths we need to know rename info for */
+		strset_for_each_entry(&renames->relevant_sources[side],
+				      &iter, entry) {
+			char *rename_target, *dir, *dir_marker;
+			struct strmap_entry *e;
+
+			/*
+			 * if we don't know delete/rename info for this path,
+			 * then we need to recurse into all trees to get all
+			 * adds to make sure we have it.
+			 */
+			if (strset_contains(&renames->cached_irrelevant[side],
+					    entry->key))
+				continue;
+			e = strmap_get_entry(&renames->cached_pairs[side],
+					     entry->key);
+			if (!e) {
+				optimization_okay = 0;
+				break;
+			}
+
+			/* If this is a delete, we have enough info already */
+			rename_target = e->value;
+			if (!rename_target)
+				continue;
+
+			/* If we already walked the rename target, we're good */
+			if (strmap_contains(&opt->priv->paths, rename_target))
+				continue;
+
+			/*
+			 * Otherwise, we need to get a list of directories that
+			 * will need to be recursed into to get this
+			 * rename_target.
+			 */
+			dir = xstrdup(rename_target);
+			while ((dir_marker = strrchr(dir, '/'))) {
+				*dir_marker = '\0';
+				if (strset_contains(&renames->target_dirs[side],
+						    dir))
+					break;
+				strset_add(&renames->target_dirs[side], dir);
+			}
+			free(dir);
+		}
+		renames->trivial_merges_okay[side] = optimization_okay;
+		/*
+		 * We need to recurse into any directories in
+		 * possible_trivial_merges[side] found in target_dirs[side].
+		 * But when we recurse, we may need to queue up some of the
+		 * subdirectories for possible_trivial_merges[side].  Since
+		 * we can't safely iterate through a hashmap while also adding
+		 * entries, move the entries into 'copy', iterate over 'copy',
+		 * and then we'll also iterate anything added into
+		 * possible_trivial_merges[side] once this loop is done.
+		 */
+		copy = renames->possible_trivial_merges[side];
+		strintmap_init_with_options(&renames->possible_trivial_merges[side],
+					    0,
+					    NULL,
+					    0);
+		strintmap_for_each_entry(&copy, &iter, entry) {
 			const char *path = entry->key;
 			unsigned dir_rename_mask = (intptr_t)entry->value;
 			struct conflict_info *ci;
@@ -1247,6 +1321,13 @@ static int handle_deferred_entries(struct merge_options *opt,
 			VERIFY_CI(ci);
 			dirmask = ci->dirmask;
 
+			if (optimization_okay &&
+			    !strset_contains(&renames->target_dirs[side],
+					     path)) {
+				resolve_trivial_directory_merge(ci, side);
+				continue;
+			}
+
 			info->name = path;
 			info->namelen = strlen(path);
 			info->pathlen = info->namelen + 1;
@@ -1282,6 +1363,20 @@ static int handle_deferred_entries(struct merge_options *opt,
 			if (ret < 0)
 				return ret;
 		}
+		strintmap_clear(&copy);
+		strintmap_for_each_entry(&renames->possible_trivial_merges[side],
+					 &iter, entry) {
+			const char *path = entry->key;
+			struct conflict_info *ci;
+
+			ci = strmap_get(&opt->priv->paths, path);
+			VERIFY_CI(ci);
+
+			assert(renames->trivial_merges_okay[side] &&
+			       !strset_contains(&renames->target_dirs[side],
+						path));
+			resolve_trivial_directory_merge(ci, side);
+		}
 	}
 	return ret;
 }
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH 7/7] merge-ort: restart merge with cached renames to reduce process entry cost
  2021-07-01  3:46 [PATCH 0/7] Optimization batch 14: trivial directory resolution Elijah Newren via GitGitGadget
                   ` (5 preceding siblings ...)
  2021-07-01  3:46 ` [PATCH 6/7] merge-ort: avoid recursing into directories when we don't need to Elijah Newren via GitGitGadget
@ 2021-07-01  3:46 ` Elijah Newren via GitGitGadget
  2021-07-01 13:21 ` [PATCH 0/7] Optimization batch 14: trivial directory resolution Ævar Arnfjörð Bjarmason
  2021-07-13 19:32 ` [PATCH v2 " Elijah Newren via GitGitGadget
  8 siblings, 0 replies; 52+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-07-01  3:46 UTC (permalink / raw)
  To: git; +Cc: Derrick Stolee, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

The merge algorithm mostly consists of the following three functions:
   collect_merge_info()
   detect_and_process_renames()
   process_entries()
Prior to the trivial directory resolution optimization of the last half
dozen commits, process_entries() was consistently the slowest, followed
by collect_merge_info(), then detect_and_process_renames().  When the
trivial directory resolution applies, it often dramatically decreases
the amount of time spent in the two slower functions.

Looking at the performance results in the previous commit, the trivial
directory resolution optimization helps amazingly well when there are no
relevant renames.  It also helps really well when reapplying a long
series of linear commits (such as in a rebase or cherry-pick), since the
relevant renames may well be cached from the first reapplied commit.
But when there are any relevant renames that are not cached (represented
by the just-one-mega testcase), then the optimization does not help at
all.

Often, I noticed that when the optimization does not apply, it is
because there are a handful of relevant sources -- maybe even only one.
It felt frustrating to need to recurse into potentially hundreds or even
thousands of directories just for a single rename, but it was needed for
correctness.

However, staring at this list of functions and noticing that
process_entries() is the most expensive and knowing I could avoid it if
I had cached renames suggested a simple idea: change
   collect_merge_info()
   detect_and_process_renames()
   process_entries()
into
   collect_merge_info()
   detect_and_process_renames()
   <cache all the renames, and restart>
   collect_merge_info()
   detect_and_process_renames()
   process_entries()

This may seem odd and look like more work.  However, note that although
we run collect_merge_info() twice, the second time we get to employ
trivial directory resolves, which makes it much faster, so the increased
time in collect_merge_info() is small.  While we run
detect_and_process_renames() again, all renames are cached so it's
nearly a no-op (we don't call into diffcore_rename_extended() but we do
have a little bit of data structure checking and fixing up).  And the
big payoff comes from the fact that process_entries(), will be much
faster due to having far fewer entries to process.

This restarting only makes sense if we can save recursing into enough
directories to make it worth our while.  Introduce a simple heuristic to
guide this.  Note that this heuristic uses a "wanted_factor" that I have
virtually no actual real world data for, just some back-of-the-envelope
quasi-scientific calculations that I included in some comments and then
plucked a simple round number out of thin air.  It could be that
tweaking this number to make it either higher or lower improves the
optimization.  (There's slightly more here; when I first introduced this
optimization, I used a factor of 10, because I was completely confident
it was big enough to not cause slowdowns in special cases.  I was
certain it was higher than needed.  Several months later, I added the
rough calculations which make me think the optimal number is close to 2;
but instead of pushing to the limit, I just bumped it to 3 to reduce the
risk that there are special cases where this optimization can result in
slowing down the code a little.  If the ratio of path counts is below 3,
we probably will only see minor performance improvements at best
anyway.)

Also, note that while the diffstat looks kind of long (nearly 100
lines), more than half of it is in two comments explaining how things
work.

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:

                            Before                  After
    no-renames:      205.1  ms ±  3.8  ms   204.2  ms ±  3.0  ms
    mega-renames:      1.564 s ±  0.010 s     1.076 s ±  0.015 s
    just-one-mega:   479.5  ms ±  3.9  ms   364.1  ms ±  7.0  ms

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c                         | 106 ++++++++++++++++++++++++++--
 t/t6423-merge-rename-directories.sh |   2 +-
 2 files changed, 101 insertions(+), 7 deletions(-)

diff --git a/merge-ort.c b/merge-ort.c
index c9cf7a158c8..51274f40059 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -209,6 +209,7 @@ struct rename_info {
 	 *   MERGE_SIDE2: cached data from side2 can be reused
 	 *   MERGE_SIDE1: cached data from side1 can be reused
 	 *   0:           no cached data can be reused
+	 *   -1:          See redo_after_renames; both sides can be reused.
 	 */
 	int cached_pairs_valid_side;
 
@@ -254,6 +255,28 @@ struct rename_info {
 	 */
 	struct strset cached_irrelevant[3];
 
+	/*
+	 * redo_after_renames: optimization flag for "restarting" the merge
+	 *
+	 * Sometimes it pays to detect renames, cache them, and then
+	 * restart the merge operation from the beginning.  The reason for
+	 * this is that when we know where all the renames are, we know
+	 * whether a certain directory has any paths under it affected --
+	 * and if a directory is not affected then it permits us to do
+	 * trivial tree merging in more cases.  Doing trivial tree merging
+	 * prevents the need to run process_entry() on every path
+	 * underneath trees that can be trivially merged, and
+	 * process_entry() is more expensive than collect_merge_info() --
+	 * plus, the second collect_merge_info() will be much faster since
+	 * it doesn't have to recurse into the relevant trees.
+	 *
+	 * Values for this flag:
+	 *   0 = don't bother, not worth it (or conditions not yet checked)
+	 *   1 = conditions for optimization met, optimization worthwhile
+	 *   2 = we already did it (don't restart merge yet again)
+	 */
+	unsigned redo_after_renames;
+
 	/*
 	 * needed_limit: value needed for inexact rename detection to run
 	 *
@@ -540,7 +563,8 @@ static void clear_or_reinit_internal_opts(struct merge_options_internal *opti,
 		renames->trivial_merges_okay[i] = 1; /* 1 == maybe */
 		if (!reinitialize)
 			assert(renames->cached_pairs_valid_side == 0);
-		if (i != renames->cached_pairs_valid_side) {
+		if (i != renames->cached_pairs_valid_side &&
+		    -1 != renames->cached_pairs_valid_side) {
 			strset_func(&renames->cached_target_names[i]);
 			strmap_func(&renames->cached_pairs[i], 1);
 			strset_func(&renames->cached_irrelevant[i]);
@@ -1242,7 +1266,9 @@ static int handle_deferred_entries(struct merge_options *opt,
 	struct hashmap_iter iter;
 	struct strmap_entry *entry;
 	int side, ret = 0;
+	int path_count_before, path_count_after = 0;
 
+	path_count_before = strmap_get_size(&opt->priv->paths);
 	for (side = MERGE_SIDE1; side <= MERGE_SIDE2; side++) {
 		unsigned optimization_okay = 1;
 		struct strintmap copy;
@@ -1377,7 +1403,51 @@ static int handle_deferred_entries(struct merge_options *opt,
 						path));
 			resolve_trivial_directory_merge(ci, side);
 		}
+		if (!optimization_okay || path_count_after)
+			path_count_after = strmap_get_size(&opt->priv->paths);
 	}
+	if (path_count_after) {
+		/*
+		 * Not sure were the right cut-off is for the optimization
+		 * to redo collect_merge_info after we've cached the
+		 * regular renames is.  Basically, collect_merge_info(),
+		 * detect_regular_renames(), and process_entries() are
+		 * similar costs and all big tent poles.  Caching the
+		 * result of detect_regular_renames() means that redoing
+		 * that one function will cost us virtually 0 extra, so it
+		 * depends on the other two functions, which are both O(N)
+		 * cost in the number of paths.  Thus, it makes sense that
+		 * if we can cut the number of paths in half, then redoing
+		 * collect_merge_info() at half cost in order to get
+		 * process_entries() at half cost should be about equal
+		 * cost.  If we can cut by more than half, then we would
+		 * win.  The fact that process_entries() is about 10%-20%
+		 * more expensive than collect_merge_info() suggests we
+		 * could make the factor be less than two.  The fact that
+		 * even when we have renames cached, we still have to
+		 * traverse down to the individual (relevant) renames,
+		 * which suggests we should perhaps use a bigger factor.
+		 *
+		 * The exact number isn't critical, since the code will
+		 * work even if we get the factor wrong -- it just might be
+		 * slightly slower if we're a bit off.  For now, just error
+		 * on the side of a bigger fudge.  For the linux kernel
+		 * testcases I was looking at with massive renames, the
+		 * ratio came in around 50 to 250, which clearly would
+		 * trigger this optimization and provided some *very* nice
+		 * speedups.
+		 */
+		int wanted_factor = 3;
+
+		/* We should only redo collect_merge_info one time */
+		assert(renames->redo_after_renames == 0);
+
+		if (path_count_after / path_count_before > wanted_factor) {
+			renames->redo_after_renames = 1;
+			renames->cached_pairs_valid_side = -1;
+		}
+	} else if (renames->redo_after_renames == 2)
+		renames->redo_after_renames = 0;
 	return ret;
 }
 
@@ -2820,8 +2890,8 @@ static int compare_pairs(const void *a_, const void *b_)
 }
 
 /* Call diffcore_rename() to update deleted/added pairs into rename pairs */
-static void detect_regular_renames(struct merge_options *opt,
-				   unsigned side_index)
+static int detect_regular_renames(struct merge_options *opt,
+				  unsigned side_index)
 {
 	struct diff_options diff_opts;
 	struct rename_info *renames = &opt->priv->renames;
@@ -2834,7 +2904,7 @@ static void detect_regular_renames(struct merge_options *opt,
 		 * side had directory renames.
 		 */
 		resolve_diffpair_statuses(&renames->pairs[side_index]);
-		return;
+		return 0;
 	}
 
 	partial_clear_dir_rename_count(&renames->dir_rename_count[side_index]);
@@ -2869,6 +2939,18 @@ static void detect_regular_renames(struct merge_options *opt,
 	diff_queued_diff.nr = 0;
 	diff_queued_diff.queue = NULL;
 	diff_flush(&diff_opts);
+
+	if (renames->redo_after_renames) {
+		int i;
+		struct diff_filepair *p;
+
+		renames->redo_after_renames = 2;
+		for (i = 0; i < renames->pairs[side_index].nr; ++i) {
+			p = renames->pairs[side_index].queue[i];
+			possibly_cache_new_pair(renames, p, side_index, NULL);
+		}
+	}
+	return 1;
 }
 
 /*
@@ -2958,14 +3040,19 @@ static int detect_and_process_renames(struct merge_options *opt,
 	struct diff_queue_struct combined;
 	struct rename_info *renames = &opt->priv->renames;
 	int need_dir_renames, s, clean = 1;
+	unsigned detection_run = 0;
 
 	memset(&combined, 0, sizeof(combined));
 	if (!possible_renames(renames))
 		goto cleanup;
 
 	trace2_region_enter("merge", "regular renames", opt->repo);
-	detect_regular_renames(opt, MERGE_SIDE1);
-	detect_regular_renames(opt, MERGE_SIDE2);
+	detection_run |= detect_regular_renames(opt, MERGE_SIDE1);
+	detection_run |= detect_regular_renames(opt, MERGE_SIDE2);
+	if (renames->redo_after_renames && detection_run) {
+		trace2_region_leave("merge", "regular renames", opt->repo);
+		goto cleanup;
+	}
 	use_cached_pairs(opt, &renames->cached_pairs[1], &renames->pairs[1]);
 	use_cached_pairs(opt, &renames->cached_pairs[2], &renames->pairs[2]);
 	trace2_region_leave("merge", "regular renames", opt->repo);
@@ -4380,6 +4467,7 @@ static void merge_ort_nonrecursive_internal(struct merge_options *opt,
 					       opt->subtree_shift);
 	}
 
+redo:
 	trace2_region_enter("merge", "collect_merge_info", opt->repo);
 	if (collect_merge_info(opt, merge_base, side1, side2) != 0) {
 		/*
@@ -4399,6 +4487,12 @@ static void merge_ort_nonrecursive_internal(struct merge_options *opt,
 	result->clean = detect_and_process_renames(opt, merge_base,
 						   side1, side2);
 	trace2_region_leave("merge", "renames", opt->repo);
+	if (opt->priv->renames.redo_after_renames == 2) {
+		trace2_region_enter("merge", "reset_maps", opt->repo);
+		clear_or_reinit_internal_opts(opt->priv, 1);
+		trace2_region_leave("merge", "reset_maps", opt->repo);
+		goto redo;
+	}
 
 	trace2_region_enter("merge", "process_entries", opt->repo);
 	process_entries(opt, &working_tree_oid);
diff --git a/t/t6423-merge-rename-directories.sh b/t/t6423-merge-rename-directories.sh
index e834b7e6efe..d8919d276a1 100755
--- a/t/t6423-merge-rename-directories.sh
+++ b/t/t6423-merge-rename-directories.sh
@@ -4797,7 +4797,7 @@ test_setup_12f () {
 	)
 }
 
-test_expect_merge_algorithm failure failure '12f: Trivial directory resolve, caching, all kinds of fun' '
+test_expect_merge_algorithm failure success '12f: Trivial directory resolve, caching, all kinds of fun' '
 	test_setup_12f &&
 	(
 		cd 12f &&
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/7] Optimization batch 14: trivial directory resolution
  2021-07-01  3:46 [PATCH 0/7] Optimization batch 14: trivial directory resolution Elijah Newren via GitGitGadget
                   ` (6 preceding siblings ...)
  2021-07-01  3:46 ` [PATCH 7/7] merge-ort: restart merge with cached renames to reduce process entry cost Elijah Newren via GitGitGadget
@ 2021-07-01 13:21 ` Ævar Arnfjörð Bjarmason
  2021-07-01 15:04   ` Elijah Newren
  2021-07-13 19:32 ` [PATCH v2 " Elijah Newren via GitGitGadget
  8 siblings, 1 reply; 52+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-07-01 13:21 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget; +Cc: git, Derrick Stolee, Elijah Newren


On Thu, Jul 01 2021, Elijah Newren via GitGitGadget wrote:

> This series depends textually on ort-perf-batch-12, but is semantically
> independent. (It is both semantically and textually independent of
> ort-perf-batch-13.)

For others following along, that ort-perf-batch-12 is at
https://lore.kernel.org/git/pull.962.v4.git.1623168703.gitgitgadget@gmail.com/#t
& currently marked as 'will merge to next' in what's cooking.

> Most of my previous series dramatically accelerated cases with lots of
> renames, while providing comparatively minor benefits for cases with few or
> no renames. This series is the opposite; it provides huge benefits when
> there are few or no renames, and comparatively smaller (though still quite
> decent) benefits for cases with many uncached renames.

Sounds good, one thing I haven't seen at a glance is how these
performance numbers compare to the merge-recursive backend. Are we in a
state of reaching parity with it, or pulling ahead?

> [...]
> For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
> performance work; instrument with trace2_region_* calls", 2020-10-28), the
> changes in just this series improves the performance as follows:
>
>                      Before Series           After Series
> no-renames:        5.235 s ±  0.042 s   204.2  ms ±  3.0  ms
> mega-renames:      9.419 s ±  0.107 s     1.076 s ±  0.015 s
> just-one-mega:   480.1  ms ±  3.9  ms   364.1  ms ±  7.0  ms
>
>
> As a reminder, before any merge-ort/diffcore-rename performance work, the
> performance results we started with were:
>
> no-renames-am:      6.940 s ±  0.485 s
> no-renames:        18.912 s ±  0.174 s
> mega-renames:    5964.031 s ± 10.459 s
> just-one-mega:    149.583 s ±  0.751 s

I haven't given any of this a detailed look, just a note/question that
(depending on the answer to the "v.s. merge-recursive above") we may
want to consider bumping the default for the diff.renamelimit at some
point along with any major optimizations.

<random musings follow, the tl;dr is above this line :)>

As an aside that we have diff.renamelimit is one of the most "dangerous"
landmines/fork-in-eye/shotgun-to-foot edge cases we have in using diff
as plumbing IMO.

E.g. I somewhat recently had to deal with some 3rd party Go-language
lint plugin that can be configured to enforce lints "as of a commit".
I.e. it does a diff from that commit, sees in any introduced "issues"
are "new", and complains accordingly. The idea is that it allows you to
enforce lints on "only new code", say ignoring the return value of
os.Write(), without insisting that all existing code must be
whitelisted/fixed first.

The problem being two-fold, one that the thing will get slower over time
as we grow history (can't be avoided), but the more subtle one that at
some point we'll bump into the diff.renamelimit, and whatever unlucky
sob does so will find that the lint is now complaining about ALL THE
THINGS, since "old" code is now ending up as "new" to a naïve diff
parser relying on not bumping into the diff.renamelimit.

Arguably bumping the diff.renamelimit would make that sort of problem
worse for plumbing consumers, since they'd have more rope with which to
hang themselves, maybe it's better to step on that landmine early.

Sorry about the digression somewhat pointless but perhaps amusing
digression in the last 4 paragraphs :)

P.S.: I ended up dealing with the Go plugin by not using the "diff"
      feature, but just a one-off giant whitelist of all existing
      instances of stuff it would complain about.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/7] Optimization batch 14: trivial directory resolution
  2021-07-01 13:21 ` [PATCH 0/7] Optimization batch 14: trivial directory resolution Ævar Arnfjörð Bjarmason
@ 2021-07-01 15:04   ` Elijah Newren
  2021-07-01 19:22     ` Elijah Newren
  0 siblings, 1 reply; 52+ messages in thread
From: Elijah Newren @ 2021-07-01 15:04 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee

On Thu, Jul 1, 2021 at 6:31 AM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>
> On Thu, Jul 01 2021, Elijah Newren via GitGitGadget wrote:
>
> > This series depends textually on ort-perf-batch-12, but is semantically
> > independent. (It is both semantically and textually independent of
> > ort-perf-batch-13.)
>
> For others following along, that ort-perf-batch-12 is at
> https://lore.kernel.org/git/pull.962.v4.git.1623168703.gitgitgadget@gmail.com/#t
> & currently marked as 'will merge to next' in what's cooking.
>
> > Most of my previous series dramatically accelerated cases with lots of
> > renames, while providing comparatively minor benefits for cases with few or
> > no renames. This series is the opposite; it provides huge benefits when
> > there are few or no renames, and comparatively smaller (though still quite
> > decent) benefits for cases with many uncached renames.
>
> Sounds good, one thing I haven't seen at a glance is how these
> performance numbers compare to the merge-recursive backend. Are we in a
> state of reaching parity with it, or pulling ahead?

Oh, wow, I really need to improve my wording if that's a question.
When I referred to "the performance results we started with" at the
end, that's referring to the numbers I got with merge-recursive before
I started this journey.  So these were the merge-recursive numbers
(with git-2.29.0 or git-2.30.0):

no-renames-am:      6.940 s ±  0.485 s
no-renames:        18.912 s ±  0.174 s
mega-renames:    5964.031 s ± 10.459 s
just-one-mega:    149.583 s ±  0.751 s

As per commit 557ac0350d ("merge-ort: begin performance work;
instrument with trace2_region_* calls", 2021-01-23), merge-ort was
faster than merge-recursive at the beginning.  But I wasn't content
with that....

Also in the cover letter, I showed the merge-ort numbers before and
after this series:

no-renames:        5.235 s ±  0.042 s   204.2  ms ±  3.0  ms
mega-renames:      9.419 s ±  0.107 s     1.076 s ±  0.015 s
just-one-mega:   480.1  ms ±  3.9  ms   364.1  ms ±  7.0  ms

So yeah, we've pulled a little ahead.  (Like my understatement there?)

Granted, merge-recursive doesn't take quite as long as it used to; it
also benefited from some of my optimizations[1].  Nowhere near as much
as merge-ort benefited, but it still would be about 20x faster on the
cases with "mega" in their name.  So, although today's merge-ort is
~5542.7x faster than git-2.29.0's merge-recursive for a massive set of
renames in a really long rebase sequence, it is probably only a mere
277x faster than today's merge-recursive on the same case.


If you'd like to catch up on the various optimizations, you can find
most of them with this:

git log --reverse --grep=mega-renames --author=newren


[1] In particular, merge-recursive would benefit from these ones:
12bd17521c ("Merge branch 'en/diffcore-rename'", 2021-03-01),
d3a035b055 ("Merge branch 'en/merge-ort-perf'", 2021-02-11), and
a5ac31b5b1 ("Merge branch 'en/diffcore-rename'", 2021-01-25)

> > [...]
> > For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
> > performance work; instrument with trace2_region_* calls", 2020-10-28), the
> > changes in just this series improves the performance as follows:
> >
> >                      Before Series           After Series
> > no-renames:        5.235 s ±  0.042 s   204.2  ms ±  3.0  ms
> > mega-renames:      9.419 s ±  0.107 s     1.076 s ±  0.015 s
> > just-one-mega:   480.1  ms ±  3.9  ms   364.1  ms ±  7.0  ms
> >
> >
> > As a reminder, before any merge-ort/diffcore-rename performance work, the
> > performance results we started with were:
> >
> > no-renames-am:      6.940 s ±  0.485 s
> > no-renames:        18.912 s ±  0.174 s
> > mega-renames:    5964.031 s ± 10.459 s
> > just-one-mega:    149.583 s ±  0.751 s
>
> I haven't given any of this a detailed look, just a note/question that
> (depending on the answer to the "v.s. merge-recursive above") we may
> want to consider bumping the default for the diff.renamelimit at some
> point along with any major optimizations.

I do think we should bump the diff.renamelimit, but not for these
reasons.  We should bump them for the same reasons we bumped them last
time at commit 92c57e5c1d ("bump rename limit defaults (again)",
2011-02-19).

In particular, none of my optimizations made the quadratic portion of
rename detection any faster.  It just added multiple ways for things
to either skip rename detection or be handled by a linear algorithm
instead and not be included in the quadratic loop.  Since
diff.renamelimit only applies to the stuff that goes into the
quadratic portion, most folks will notice that newer versions of git
detect renames with the same limits that older git would have given up
with.

It perhaps also is worth bearing that some of my rename detection
optimizations were specific to three-way merges (and thus help
merge/cherry-pick/rebase/etc. but not diff/log), some of my
optimizations were specific to reapplying long sequences of linear
commits (and thus help cherry-pick or rebase but not diff, log, or
even merge), while others help out all uses of the diff machinery
(diff/log/merge/cherry-pick/rebase/revert/etc.).

> <random musings follow, the tl;dr is above this line :)>
>
> As an aside that we have diff.renamelimit is one of the most "dangerous"
> landmines/fork-in-eye/shotgun-to-foot edge cases we have in using diff
> as plumbing IMO.
>
> E.g. I somewhat recently had to deal with some 3rd party Go-language
> lint plugin that can be configured to enforce lints "as of a commit".
> I.e. it does a diff from that commit, sees in any introduced "issues"
> are "new", and complains accordingly. The idea is that it allows you to
> enforce lints on "only new code", say ignoring the return value of
> os.Write(), without insisting that all existing code must be
> whitelisted/fixed first.
>
> The problem being two-fold, one that the thing will get slower over time
> as we grow history (can't be avoided), but the more subtle one that at
> some point we'll bump into the diff.renamelimit, and whatever unlucky
> sob does so will find that the lint is now complaining about ALL THE
> THINGS, since "old" code is now ending up as "new" to a naïve diff
> parser relying on not bumping into the diff.renamelimit.
>
> Arguably bumping the diff.renamelimit would make that sort of problem
> worse for plumbing consumers, since they'd have more rope with which to
> hang themselves, maybe it's better to step on that landmine early.
>
> Sorry about the digression somewhat pointless but perhaps amusing
> digression in the last 4 paragraphs :)

Certainly a digression, but yes it's amusing.  :-)

Let me add a digression of my own: I got started on this because
people complained they couldn't cherry-pick patches to maintenance
branches, git just messed it all up when it gave up on renames.  I
told them to set diff.renameLimit higher, and they told me that didn't
work.

I didn't believe them.

So, I started on a bit of a journey, somewhere around commit
9f7e4bfa3b ("diff: remove silent clamp of renameLimit", 2017-11-13).
Then after that commit, git could detect renames but took forever
doing so (we needed a renameLimit of 48941 for the very first testcase
I was pointed at).  I dug into that, trying to figure out why
cherry-picking a simple patch that modified 7 files (and not with
particularly big edits) would take something approaching 10 minutes to
complete.  Then I went and found optimizations and rewrote the entire
merge machinery.  Now that same cherry-pick can complete without
bumping diff.renameLimit (1000 is more than enough), and completes in
less than a second (well, with all my optimizations, including the
series not yet submitted).

So, as I said, I didn't believe them when they reported git couldn't
detect renames for their case.  Look at all the work I've done in
order to continue to not believe them.  ;-)

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH 0/7] Optimization batch 14: trivial directory resolution
  2021-07-01 15:04   ` Elijah Newren
@ 2021-07-01 19:22     ` Elijah Newren
  0 siblings, 0 replies; 52+ messages in thread
From: Elijah Newren @ 2021-07-01 19:22 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee

A small correction/clarification...

On Thu, Jul 1, 2021 at 8:04 AM Elijah Newren <newren@gmail.com> wrote:
>
> On Thu, Jul 1, 2021 at 6:31 AM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> >
> > On Thu, Jul 01 2021, Elijah Newren via GitGitGadget wrote:
> >
> > > This series depends textually on ort-perf-batch-12, but is semantically
> > > independent. (It is both semantically and textually independent of
> > > ort-perf-batch-13.)
> >
> > For others following along, that ort-perf-batch-12 is at
> > https://lore.kernel.org/git/pull.962.v4.git.1623168703.gitgitgadget@gmail.com/#t
> > & currently marked as 'will merge to next' in what's cooking.

Yeah, I should have referred to it as en/ort-perf-batch-12, the branch
name Junio used.

> Granted, merge-recursive doesn't take quite as long as it used to; it
> also benefited from some of my optimizations[1].  Nowhere near as much
> as merge-ort benefited, but it still would be about 20x faster on the
> cases with "mega" in their name.  So, although today's merge-ort is
> ~5542.7x faster than git-2.29.0's merge-recursive for a massive set of
> renames in a really long rebase sequence, it is probably only a mere
> 277x faster than today's merge-recursive on the same case.

The 20x number here I was just spit-balling, whereas the other numbers
were measured.  I think 20x may have been a bit too high.  Regardless,
though, to me the important takeaway is not the performance difference
between merge-ort and merge-recursive now (though that's also huge in
merge-ort's favor), but between merge-ort now and merge-recursive when
I started.  If someone really wants me to get the difference between
the two now, I can dig in, but it's annoyingly painful waiting for
merge-recursive to finish tests several times.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v2 0/7] Optimization batch 14: trivial directory resolution
  2021-07-01  3:46 [PATCH 0/7] Optimization batch 14: trivial directory resolution Elijah Newren via GitGitGadget
                   ` (7 preceding siblings ...)
  2021-07-01 13:21 ` [PATCH 0/7] Optimization batch 14: trivial directory resolution Ævar Arnfjörð Bjarmason
@ 2021-07-13 19:32 ` Elijah Newren via GitGitGadget
  2021-07-13 19:32   ` [PATCH v2 1/7] merge-ort: resolve paths early when we have sufficient information Elijah Newren via GitGitGadget
                     ` (7 more replies)
  8 siblings, 8 replies; 52+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-07-13 19:32 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Ævar Arnfjörð Bjarmason,
	Elijah Newren, Elijah Newren

[This is mostly unchanged since v1; I'm primarily resending since it's been
two weeks and I want to bump it so folks have a chance to notice and review
it. But there was one change, noted below.]

This series depends textually on ort-perf-batch-12, but is semantically
independent. (It is both semantically and textually independent of
ort-perf-batch-13.)

Most of my previous series dramatically accelerated cases with lots of
renames, while providing comparatively minor benefits for cases with few or
no renames. This series is the opposite; it provides huge benefits when
there are few or no renames, and comparatively smaller (though still quite
decent) benefits for cases with many uncached renames.

Changes since v1:

 * Minor tweak to the final patch to correct implicit assumption that rename
   detection running implies all renames were found (rename limits could
   have been exceeded and prevented finding renames)

=== Basic Optimization idea ===

unpack_trees has had a concept of trivial merges for individual files (see
Documentation/technical/trivial-merge.txt). The same idea can be applied in
merge-ort. It'd be really nice to extend that idea to trees as well, as it
could provide a huge performance boost; sadly however, applying it in
general would wreck both regular rename detection (the unmatched side can
have new files that serve as potential destinations in rename detection) and
directory rename detection (the unmatched side could have a new directory
that was moved into it).

If we somehow knew rename detection wasn't needed, we could do trivial
directory resolution. In the past, this wasn't possible. However...

With recent optimizations we have created a possibility to do trivial
directory resolutions in some cases. These came from the addition of the
"skipping irrelevant renames" optimizations (from ort-perf-batch-9 and
ort-perf-batch-10), and in particular noting that we added an ability to
entirely skip rename detection in commit f89b4f2bee ("merge-ort: skip rename
detection entirely if possible", 2021-03-11) when there are no relevant
sources. We can detect if there are no relevant sources without recursing
into the directories in question.

As a cherry on top, the caching of renames (from ort-perf-batch-11) allows
us to cover additional cases.

This series is all about adding all the special checks needed to safely
perform trival directory resolutions.

=== Results ===

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28), the
changes in just this series improves the performance as follows:

                     Before Series           After Series
no-renames:        5.235 s ±  0.042 s   204.2  ms ±  3.0  ms
mega-renames:      9.419 s ±  0.107 s     1.076 s ±  0.015 s
just-one-mega:   480.1  ms ±  3.9  ms   364.1  ms ±  7.0  ms


As a reminder, before any merge-ort/diffcore-rename performance work, the
performance results we started with (for merge-recursive as of git-2.30.0)
were:

no-renames-am:      6.940 s ±  0.485 s
no-renames:        18.912 s ±  0.174 s
mega-renames:    5964.031 s ± 10.459 s
just-one-mega:    149.583 s ±  0.751 s


Elijah Newren (7):
  merge-ort: resolve paths early when we have sufficient information
  merge-ort: add some more explanations in collect_merge_info_callback()
  merge-ort: add data structures for allowable trivial directory
    resolves
  merge-ort: add a handle_deferred_entries() helper function
  merge-ort: defer recursing into directories when merge base is matched
  merge-ort: avoid recursing into directories when we don't need to
  merge-ort: restart merge with cached renames to reduce process entry
    cost

 merge-ort.c                         | 408 +++++++++++++++++++++++++++-
 t/t6423-merge-rename-directories.sh |   2 +-
 2 files changed, 398 insertions(+), 12 deletions(-)


base-commit: 2eeee12b02e441ac05054a5a5ecbcea6964a1e6b
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-988%2Fnewren%2Fort-perf-batch-14-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-988/newren/ort-perf-batch-14-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/988

Range-diff vs v1:

 1:  5dca982c0b = 1:  5dca982c0b merge-ort: resolve paths early when we have sufficient information
 2:  8aea371390 = 2:  8aea371390 merge-ort: add some more explanations in collect_merge_info_callback()
 3:  f7ac01055d = 3:  f7ac01055d merge-ort: add data structures for allowable trivial directory resolves
 4:  7e28323b62 = 4:  7e28323b62 merge-ort: add a handle_deferred_entries() helper function
 5:  317553eadb = 5:  317553eadb merge-ort: defer recursing into directories when merge base is matched
 6:  3409a6cd63 = 6:  3409a6cd63 merge-ort: avoid recursing into directories when we don't need to
 7:  76bc73262c ! 7:  7133f0efa5 merge-ort: restart merge with cached renames to reduce process entry cost
     @@ merge-ort.c: static void detect_regular_renames(struct merge_options *opt,
       	}
       
       	partial_clear_dir_rename_count(&renames->dir_rename_count[side_index]);
     +@@ merge-ort.c: static void detect_regular_renames(struct merge_options *opt,
     + 	trace2_region_leave("diff", "diffcore_rename", opt->repo);
     + 	resolve_diffpair_statuses(&diff_queued_diff);
     + 
     ++	if (diff_opts.needed_rename_limit > 0)
     ++		renames->redo_after_renames = 0;
     + 	if (diff_opts.needed_rename_limit > renames->needed_limit)
     + 		renames->needed_limit = diff_opts.needed_rename_limit;
     + 
      @@ merge-ort.c: static void detect_regular_renames(struct merge_options *opt,
       	diff_queued_diff.nr = 0;
       	diff_queued_diff.queue = NULL;
       	diff_flush(&diff_opts);
      +
     -+	if (renames->redo_after_renames) {
     -+		int i;
     -+		struct diff_filepair *p;
     -+
     -+		renames->redo_after_renames = 2;
     -+		for (i = 0; i < renames->pairs[side_index].nr; ++i) {
     -+			p = renames->pairs[side_index].queue[i];
     -+			possibly_cache_new_pair(renames, p, side_index, NULL);
     -+		}
     -+	}
      +	return 1;
       }
       
     @@ merge-ort.c: static int detect_and_process_renames(struct merge_options *opt,
      +	detection_run |= detect_regular_renames(opt, MERGE_SIDE1);
      +	detection_run |= detect_regular_renames(opt, MERGE_SIDE2);
      +	if (renames->redo_after_renames && detection_run) {
     ++		int i, side;
     ++		struct diff_filepair *p;
     ++
     ++		/* Cache the renames, we found */
     ++		for (side = MERGE_SIDE1; side <= MERGE_SIDE2; side++) {
     ++			for (i = 0; i < renames->pairs[side].nr; ++i) {
     ++				p = renames->pairs[side].queue[i];
     ++				possibly_cache_new_pair(renames, p, side, NULL);
     ++			}
     ++		}
     ++
     ++		/* Restart the merge with the cached renames */
     ++		renames->redo_after_renames = 2;
      +		trace2_region_leave("merge", "regular renames", opt->repo);
      +		goto cleanup;
      +	}

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v2 1/7] merge-ort: resolve paths early when we have sufficient information
  2021-07-13 19:32 ` [PATCH v2 " Elijah Newren via GitGitGadget
@ 2021-07-13 19:32   ` Elijah Newren via GitGitGadget
  2021-07-13 19:32   ` [PATCH v2 2/7] merge-ort: add some more explanations in collect_merge_info_callback() Elijah Newren via GitGitGadget
                     ` (6 subsequent siblings)
  7 siblings, 0 replies; 52+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-07-13 19:32 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Ævar Arnfjörð Bjarmason,
	Elijah Newren, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

When there are no directories involved at a given path, and all three
sides have a file at that path, and two of the three sides of history
match, we can immediately resolve the merge of that path in
collect_merge_info() and do not need to wait until process_entries().

This is actually a very minor improvement: half the time when I run it,
I see an improvement; the other half a slowdown.  It seems to be in the
range of noise.  However, this idea serves as the beginning of some
bigger optimizations coming in the following patches.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 37 +++++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/merge-ort.c b/merge-ort.c
index e3a5dfc7b31..6299b4f9413 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -1023,6 +1023,43 @@ static int collect_merge_info_callback(int n,
 		return mask;
 	}
 
+	/*
+	 * If the sides match, and all three paths are present and are
+	 * files, then we can take either as the resolution.  We can't do
+	 * this with trees, because there may be rename sources from the
+	 * merge_base.
+	 */
+	if (sides_match && filemask == 0x07) {
+		/* use side1 (== side2) version as resolution */
+		setup_path_info(opt, &pi, dirname, info->pathlen, fullpath,
+				names, names+1, side1_null, 0,
+				filemask, dirmask, 1);
+		return mask;
+	}
+
+	/*
+	 * If side1 matches mbase and all three paths are present and are
+	 * files, then we can use side2 as the resolution.  We cannot
+	 * necessarily do so this for trees, because there may be rename
+	 * destinations within side2.
+	 */
+	if (side1_matches_mbase && filemask == 0x07) {
+		/* use side2 version as resolution */
+		setup_path_info(opt, &pi, dirname, info->pathlen, fullpath,
+				names, names+2, side2_null, 0,
+				filemask, dirmask, 1);
+		return mask;
+	}
+
+	/* Similar to above but swapping sides 1 and 2 */
+	if (side2_matches_mbase && filemask == 0x07) {
+		/* use side1 version as resolution */
+		setup_path_info(opt, &pi, dirname, info->pathlen, fullpath,
+				names, names+1, side1_null, 0,
+				filemask, dirmask, 1);
+		return mask;
+	}
+
 	/*
 	 * Gather additional information used in rename detection.
 	 */
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v2 2/7] merge-ort: add some more explanations in collect_merge_info_callback()
  2021-07-13 19:32 ` [PATCH v2 " Elijah Newren via GitGitGadget
  2021-07-13 19:32   ` [PATCH v2 1/7] merge-ort: resolve paths early when we have sufficient information Elijah Newren via GitGitGadget
@ 2021-07-13 19:32   ` Elijah Newren via GitGitGadget
  2021-07-13 23:34     ` Bagas Sanjaya
  2021-07-13 19:32   ` [PATCH v2 3/7] merge-ort: add data structures for allowable trivial directory resolves Elijah Newren via GitGitGadget
                     ` (5 subsequent siblings)
  7 siblings, 1 reply; 52+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-07-13 19:32 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Ævar Arnfjörð Bjarmason,
	Elijah Newren, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

The previous patch possibly raises some questions about whether
additional cases in collect_merge_info_callback() can be handled early.
Add some explanations in the form of comments to help explain these
better.  While we're at it, add a few comments to denote what a few
boolean '0' or '1' values stand for.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/merge-ort.c b/merge-ort.c
index 6299b4f9413..843fa693145 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -1018,8 +1018,8 @@ static int collect_merge_info_callback(int n,
 	if (side1_matches_mbase && side2_matches_mbase) {
 		/* mbase, side1, & side2 all match; use mbase as resolution */
 		setup_path_info(opt, &pi, dirname, info->pathlen, fullpath,
-				names, names+0, mbase_null, 0,
-				filemask, dirmask, 1);
+				names, names+0, mbase_null, 0 /* df_conflict */,
+				filemask, dirmask, 1 /* resolved */);
 		return mask;
 	}
 
@@ -1061,14 +1061,24 @@ static int collect_merge_info_callback(int n,
 	}
 
 	/*
-	 * Gather additional information used in rename detection.
+	 * Sometimes we can tell that a source path need not be included in
+	 * rename detection -- namely, whenever either
+	 *    side1_matches_mbase && side2_null
+	 * or
+	 *    side2_matches_mbase && side1_null
+	 * However, we call collect_rename_info() even in those cases,
+	 * because exact renames are cheap and would let us remove both a
+	 * source and destination path.  We'll cull the unneeded sources
+	 * later.
 	 */
 	collect_rename_info(opt, names, dirname, fullpath,
 			    filemask, dirmask, match_mask);
 
 	/*
-	 * Record information about the path so we can resolve later in
-	 * process_entries.
+	 * None of the special cases above matched, so we have a
+	 * provisional conflict.  (Rename detection might allow us to
+	 * unconflict some more cases, but that comes later so all we can
+	 * do now is record the different non-null file hashes.)
 	 */
 	setup_path_info(opt, &pi, dirname, info->pathlen, fullpath,
 			names, NULL, 0, df_conflict, filemask, dirmask, 0);
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v2 3/7] merge-ort: add data structures for allowable trivial directory resolves
  2021-07-13 19:32 ` [PATCH v2 " Elijah Newren via GitGitGadget
  2021-07-13 19:32   ` [PATCH v2 1/7] merge-ort: resolve paths early when we have sufficient information Elijah Newren via GitGitGadget
  2021-07-13 19:32   ` [PATCH v2 2/7] merge-ort: add some more explanations in collect_merge_info_callback() Elijah Newren via GitGitGadget
@ 2021-07-13 19:32   ` Elijah Newren via GitGitGadget
  2021-07-15 13:54     ` Derrick Stolee
  2021-07-13 19:33   ` [PATCH v2 4/7] merge-ort: add a handle_deferred_entries() helper function Elijah Newren via GitGitGadget
                     ` (4 subsequent siblings)
  7 siblings, 1 reply; 52+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-07-13 19:32 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Ævar Arnfjörð Bjarmason,
	Elijah Newren, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

As noted a few commits ago, we can resolve individual files early if all
three sides of the merge have a file at the path and two of the three
sides match.  We would really like to do the same thing with
directories, because being able to do a trivial directory resolve means
we don't have to recurse into the directory, potentially saving us a
huge amount of time in both collect_merge_info() and process_entries().
Unfortunately, resolving directories early would mean missing any
renames whose source or destination is underneath that directory.

If we somehow knew there weren't any renames under the directory in
question, then we could resolve it early.  Sadly, it is impossible to
determine whether there are renames under the directory in question
without recursing into it, and this has traditionally kept us from ever
implementing such an optimization.

In commit f89b4f2bee ("merge-ort: skip rename detection entirely if
possible", 2021-03-11), we added an additional reason that rename
detection could be skipped entirely -- namely, if no *relevant* sources
were present.  Without completing collect_merge_info_callback(), we do
not yet know if there are no relevant sources.  However, we do know that
if the current directory on one side matches the merge base, then every
source file within that directory will not be RELEVANT_CONTENT, and a
few simple checks can often let us rule out RELEVANT_LOCATION as well.
This suggests we can just defer recursing into such directories until
the end of collect_merge_info.

Since the deferred directories are known to not add any relevant sources
due to the above properties, then if there are no relevant sources after
we've traversed all paths other than the deferred ones, then we know
there are not any relevant sources.  Under those conditions, rename
detection is unnecessary, and that means we can resolve the deferred
directories without recursing into them.

Note that the logic for skipping rename detection was also modified
further in commit 76e253793c ("merge-ort, diffcore-rename: employ cached
renames when possible", 2021-01-30); in particular rename detection can
be skipped if we already have cached renames for each relevant source.
We can take advantage of this information as well with our deferral of
recursing into directories where one side matches the merge base.

Add some data structures that we will use to do these deferrals, with
some lengthy comments explaining their purpose.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 53 insertions(+)

diff --git a/merge-ort.c b/merge-ort.c
index 843fa693145..3d3f00b3b45 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -119,6 +119,51 @@ struct rename_info {
 	 */
 	struct strintmap relevant_sources[3];
 
+	/*
+	 * possible_trivial_merges: directories we defer recursing into
+	 *
+	 * possible_trivial_merges is a map of directory names to
+	 * dir_rename_mask.  When we detect that a directory is unchanged on
+	 * one side, we can sometimes resolve the directory without recursing
+	 * into it.  Renames are the only things that can prevent such an
+	 * optimization.  However, for rename sources:
+	 *   - If no parent directory needed directory rename detection, then
+	 *     no path under such a directory can be a relevant_source.
+	 * and for rename destinations:
+	 *   - If no cached rename has a target path under the directory AND
+	 *   - If there are no unpaired relevant_sources elsewhere in the
+	 *     repository
+	 * then we don't need any path under this directory for a rename
+	 * destination.  The only way to know the last item above is to defer
+	 * handling such directories until the end of collect_merge_info(),
+	 * in handle_deferred_entries().
+	 *
+	 * For each we store dir_rename_mask, since that's the only bit of
+	 * information we need, other than the path, to resume the recursive
+	 * traversal.
+	 */
+	struct strintmap possible_trivial_merges[3];
+
+	/*
+	 * trivial_merges_okay: if trivial directory merges are okay
+	 *
+	 * See possible_trivial_merges above.  The "no unpaired
+	 * relevant_sources elsewhere in the repository" is a single boolean
+	 * per merge side, which we store here.  Note that while 0 means no,
+	 * 1 only means "maybe" rather than "yes"; we optimistically set it
+	 * to 1 initially and only clear when we determine it is unsafe to
+	 * do trivial directory merges.
+	 */
+	unsigned trivial_merges_okay[3];
+
+	/*
+	 * target_dirs: ancestor directories of rename targets
+	 *
+	 * target_dirs contains all directory names that are an ancestor of
+	 * any rename destination.
+	 */
+	struct strset target_dirs[3];
+
 	/*
 	 * dir_rename_mask:
 	 *   0: optimization removing unmodified potential rename source okay
@@ -490,6 +535,9 @@ static void clear_or_reinit_internal_opts(struct merge_options_internal *opti,
 		strintmap_func(&renames->dirs_removed[i]);
 		strmap_func(&renames->dir_renames[i], 0);
 		strintmap_func(&renames->relevant_sources[i]);
+		strintmap_func(&renames->possible_trivial_merges[i]);
+		strset_func(&renames->target_dirs[i]);
+		renames->trivial_merges_okay[i] = 1; /* 1 == maybe */
 		if (!reinitialize)
 			assert(renames->cached_pairs_valid_side == 0);
 		if (i != renames->cached_pairs_valid_side) {
@@ -4045,12 +4093,17 @@ static void merge_start(struct merge_options *opt, struct merge_result *result)
 		strintmap_init_with_options(&renames->relevant_sources[i],
 					    -1 /* explicitly invalid */,
 					    NULL, 0);
+		strintmap_init_with_options(&renames->possible_trivial_merges[i],
+					    0, NULL, 0);
+		strset_init_with_options(&renames->target_dirs[i],
+					 NULL, 1);
 		strmap_init_with_options(&renames->cached_pairs[i],
 					 NULL, 1);
 		strset_init_with_options(&renames->cached_irrelevant[i],
 					 NULL, 1);
 		strset_init_with_options(&renames->cached_target_names[i],
 					 NULL, 0);
+		renames->trivial_merges_okay[i] = 1; /* 1 == maybe */
 	}
 
 	/*
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v2 4/7] merge-ort: add a handle_deferred_entries() helper function
  2021-07-13 19:32 ` [PATCH v2 " Elijah Newren via GitGitGadget
                     ` (2 preceding siblings ...)
  2021-07-13 19:32   ` [PATCH v2 3/7] merge-ort: add data structures for allowable trivial directory resolves Elijah Newren via GitGitGadget
@ 2021-07-13 19:33   ` Elijah Newren via GitGitGadget
  2021-07-15 14:32     ` Derrick Stolee
  2021-07-13 19:33   ` [PATCH v2 5/7] merge-ort: defer recursing into directories when merge base is matched Elijah Newren via GitGitGadget
                     ` (3 subsequent siblings)
  7 siblings, 1 reply; 52+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-07-13 19:33 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Ævar Arnfjörð Bjarmason,
	Elijah Newren, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

In order to allow trivial directory resolution, we first need to be able
to gather more information to determine if the optimization is safe.  To
enable that, we need a way of deferring the recursion into the directory
until a later time.  Naturally, deferring the entry into a subtree means
that we need some function that will later recurse into the subdirectory
exactly the same way that collect_merge_info_callback() would have done.

Add a helper function that does this.  For now this function is not used
but a subsequent commit will change that.  Future commits will also make
the function sometimes resolve directories instead of traversing inside.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 64 insertions(+)

diff --git a/merge-ort.c b/merge-ort.c
index 3d3f00b3b45..eb0e18d7546 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -1196,6 +1196,70 @@ static int collect_merge_info_callback(int n,
 	return mask;
 }
 
+MAYBE_UNUSED
+static int handle_deferred_entries(struct merge_options *opt,
+				   struct traverse_info *info)
+{
+	struct rename_info *renames = &opt->priv->renames;
+	struct hashmap_iter iter;
+	struct strmap_entry *entry;
+	int side, ret = 0;
+
+	for (side = MERGE_SIDE1; side <= MERGE_SIDE2; side++) {
+		renames->trivial_merges_okay[side] = 0;
+		strintmap_for_each_entry(&renames->possible_trivial_merges[side],
+					 &iter, entry) {
+			const char *path = entry->key;
+			unsigned dir_rename_mask = (intptr_t)entry->value;
+			struct conflict_info *ci;
+			unsigned dirmask;
+			struct tree_desc t[3];
+			void *buf[3] = {NULL,};
+			int i;
+
+			ci = strmap_get(&opt->priv->paths, path);
+			VERIFY_CI(ci);
+			dirmask = ci->dirmask;
+
+			info->name = path;
+			info->namelen = strlen(path);
+			info->pathlen = info->namelen + 1;
+
+			for (i = 0; i < 3; i++, dirmask >>= 1) {
+				if (i == 1 && ci->match_mask == 3)
+					t[1] = t[0];
+				else if (i == 2 && ci->match_mask == 5)
+					t[2] = t[0];
+				else if (i == 2 && ci->match_mask == 6)
+					t[2] = t[1];
+				else {
+					const struct object_id *oid = NULL;
+					if (dirmask & 1)
+						oid = &ci->stages[i].oid;
+					buf[i] = fill_tree_descriptor(opt->repo,
+								      t+i, oid);
+				}
+			}
+
+			ci->match_mask &= ci->filemask;
+			opt->priv->current_dir_name = path;
+			renames->dir_rename_mask = dir_rename_mask;
+			if (renames->dir_rename_mask == 0 ||
+			    renames->dir_rename_mask == 0x07)
+				ret = traverse_trees(NULL, 3, t, info);
+			else
+				ret = traverse_trees_wrapper(NULL, 3, t, info);
+
+			for (i = MERGE_BASE; i <= MERGE_SIDE2; i++)
+				free(buf[i]);
+
+			if (ret < 0)
+				return ret;
+		}
+	}
+	return ret;
+}
+
 static int collect_merge_info(struct merge_options *opt,
 			      struct tree *merge_base,
 			      struct tree *side1,
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v2 5/7] merge-ort: defer recursing into directories when merge base is matched
  2021-07-13 19:32 ` [PATCH v2 " Elijah Newren via GitGitGadget
                     ` (3 preceding siblings ...)
  2021-07-13 19:33   ` [PATCH v2 4/7] merge-ort: add a handle_deferred_entries() helper function Elijah Newren via GitGitGadget
@ 2021-07-13 19:33   ` Elijah Newren via GitGitGadget
  2021-07-15 14:43     ` Derrick Stolee
  2021-07-13 19:33   ` [PATCH v2 6/7] merge-ort: avoid recursing into directories when we don't need to Elijah Newren via GitGitGadget
                     ` (2 subsequent siblings)
  7 siblings, 1 reply; 52+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-07-13 19:33 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Ævar Arnfjörð Bjarmason,
	Elijah Newren, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

When one side of history matches the merge base (including when the
merge base has no entry for the given directory), have
collect_merge_info_callback() defer recursing into the directory.  To
ensure those entries are eventually handled, add a call to
handled_deferred_entries() in collect_merge_info() after
traverse_trees() returns.

Note that the condition in collect_merge_info_callback() may look more
complicated than necessary at first glance;
renames->trivial_merges_okay[side] is always true until
handle_deferred_entries() is called, and possible_trivial_merges[side]
is always empty right now (and in the future won't be filled until
handle_deferred_entries() is called).  However, when
handle_deferred_entries() calls traverse_trees() for the relevant
deferred directories, those traverse_trees() calls will once again end
up in collect_merge_info_callback() for all the entries under those
subdirectories.  The extra conditions are there for such deferred cases
and will be used more as we do more with those variables.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 32 ++++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/merge-ort.c b/merge-ort.c
index eb0e18d7546..bf0712d18a0 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -1141,8 +1141,35 @@ static int collect_merge_info_callback(int n,
 		struct tree_desc t[3];
 		void *buf[3] = {NULL, NULL, NULL};
 		const char *original_dir_name;
-		int i, ret;
+		int i, ret, side;
 
+		/*
+		 * Check for whether we can avoid recursing due to one side
+		 * matching the merge base.  The side that does NOT match is
+		 * the one that might have a rename destination we need.
+		 */
+		assert(!side1_matches_mbase || !side2_matches_mbase);
+		side = side1_matches_mbase ? MERGE_SIDE2 :
+			side2_matches_mbase ? MERGE_SIDE1 : MERGE_BASE;
+		if (filemask == 0 && (dirmask == 2 || dirmask == 4)) {
+			/*
+			 * Also defer recursing into new directories; set up a
+			 * few variables to let us do so.
+			 */
+			ci->match_mask = (7 - dirmask);
+			side = dirmask / 2;
+		}
+		if (renames->dir_rename_mask != 0x07 &&
+		    (side != MERGE_BASE) &&
+		    renames->trivial_merges_okay[side] &&
+		    !strset_contains(&renames->target_dirs[side], pi.string)) {
+			strintmap_set(&renames->possible_trivial_merges[side],
+				      pi.string, renames->dir_rename_mask);
+			renames->dir_rename_mask = prev_dir_rename_mask;
+			return mask;
+		}
+
+		/* We need to recurse */
 		ci->match_mask &= filemask;
 		newinfo = *info;
 		newinfo.prev = info;
@@ -1196,7 +1223,6 @@ static int collect_merge_info_callback(int n,
 	return mask;
 }
 
-MAYBE_UNUSED
 static int handle_deferred_entries(struct merge_options *opt,
 				   struct traverse_info *info)
 {
@@ -1285,6 +1311,8 @@ static int collect_merge_info(struct merge_options *opt,
 
 	trace2_region_enter("merge", "traverse_trees", opt->repo);
 	ret = traverse_trees(NULL, 3, t, &info);
+	if (ret == 0)
+		ret = handle_deferred_entries(opt, &info);
 	trace2_region_leave("merge", "traverse_trees", opt->repo);
 
 	return ret;
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v2 6/7] merge-ort: avoid recursing into directories when we don't need to
  2021-07-13 19:32 ` [PATCH v2 " Elijah Newren via GitGitGadget
                     ` (4 preceding siblings ...)
  2021-07-13 19:33   ` [PATCH v2 5/7] merge-ort: defer recursing into directories when merge base is matched Elijah Newren via GitGitGadget
@ 2021-07-13 19:33   ` Elijah Newren via GitGitGadget
  2021-07-15 14:55     ` Derrick Stolee
  2021-07-13 19:33   ` [PATCH v2 7/7] merge-ort: restart merge with cached renames to reduce process entry cost Elijah Newren via GitGitGadget
  2021-07-16  5:22   ` [PATCH v3 0/7] Optimization batch 14: trivial directory resolution Elijah Newren via GitGitGadget
  7 siblings, 1 reply; 52+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-07-13 19:33 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Ævar Arnfjörð Bjarmason,
	Elijah Newren, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

This combines the work of the last several patches, and implements the
conditions when we don't need to recurse into directories.  It's perhaps
easiest to see the logic by separating the fact that a directory might
have both rename sources and rename destinations:

  * rename sources: only files present in the merge base can serve as
    rename sources, and only when one side deletes that file.  When the
    tree on one side matches the merge base, that means every file
    within the subtree matches the merge base.  This means that the
    skip-irrelevant-rename-detection optimization from before kicks in
    and we don't need any of these files as rename sources.

  * rename destinations: the tree that does not match the merge base
    might have newly added and hence unmatched destination files.
    This is what usually prevents us from doing trivial directory
    resolutions in the merge machinery.  However, the fact that we have
    deferred recursing into this directory until the end means we know
    whether there are any unmatched relevant potential rename sources
    elsewhere in this merge.  If there are no unmatched such relevant
    sources anywhere, then there is no need to look for unmatched
    potential rename destinations to match them with.

This informs our algorithm:
  * Search through relevant_sources; if we have entries, they better all
    be reflected in cached_pairs or cached_irrelevant, otherwise they
    represent an unmatched potential rename source (causing the
    optimization to be disallowed).
  * For any relevant_source represented in cached_pairs, we do need to
    to make sure to get the destination for each source, meaning we need
    to recurse into any ancestor directories of those destinations.
  * Once we've recursed into all the rename destinations for any
    relevant_sources in cached_pairs, we can then do the trivial
    directory resolution for the remaining directories.

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:

                            Before                  After
    no-renames:        5.235 s ±  0.042 s   205.1  ms ±  3.8  ms
    mega-renames:      9.419 s ±  0.107 s     1.564 s ±  0.010 s
    just-one-mega:   480.1  ms ±  3.9  ms   479.5  ms ±  3.9  ms

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 101 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 98 insertions(+), 3 deletions(-)

diff --git a/merge-ort.c b/merge-ort.c
index bf0712d18a0..c9cf7a158c8 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -1223,6 +1223,18 @@ static int collect_merge_info_callback(int n,
 	return mask;
 }
 
+static void resolve_trivial_directory_merge(struct conflict_info *ci, int side)
+{
+	VERIFY_CI(ci);
+	assert((side == 1 && ci->match_mask == 5) ||
+	       (side == 2 && ci->match_mask == 3));
+	oidcpy(&ci->merged.result.oid, &ci->stages[side].oid);
+	ci->merged.result.mode = ci->stages[side].mode;
+	ci->merged.is_null = is_null_oid(&ci->stages[side].oid);
+	ci->match_mask = 0;
+	ci->merged.clean = 1; /* (ci->filemask == 0); */
+}
+
 static int handle_deferred_entries(struct merge_options *opt,
 				   struct traverse_info *info)
 {
@@ -1232,9 +1244,71 @@ static int handle_deferred_entries(struct merge_options *opt,
 	int side, ret = 0;
 
 	for (side = MERGE_SIDE1; side <= MERGE_SIDE2; side++) {
-		renames->trivial_merges_okay[side] = 0;
-		strintmap_for_each_entry(&renames->possible_trivial_merges[side],
-					 &iter, entry) {
+		unsigned optimization_okay = 1;
+		struct strintmap copy;
+
+		/* Loop over the set of paths we need to know rename info for */
+		strset_for_each_entry(&renames->relevant_sources[side],
+				      &iter, entry) {
+			char *rename_target, *dir, *dir_marker;
+			struct strmap_entry *e;
+
+			/*
+			 * if we don't know delete/rename info for this path,
+			 * then we need to recurse into all trees to get all
+			 * adds to make sure we have it.
+			 */
+			if (strset_contains(&renames->cached_irrelevant[side],
+					    entry->key))
+				continue;
+			e = strmap_get_entry(&renames->cached_pairs[side],
+					     entry->key);
+			if (!e) {
+				optimization_okay = 0;
+				break;
+			}
+
+			/* If this is a delete, we have enough info already */
+			rename_target = e->value;
+			if (!rename_target)
+				continue;
+
+			/* If we already walked the rename target, we're good */
+			if (strmap_contains(&opt->priv->paths, rename_target))
+				continue;
+
+			/*
+			 * Otherwise, we need to get a list of directories that
+			 * will need to be recursed into to get this
+			 * rename_target.
+			 */
+			dir = xstrdup(rename_target);
+			while ((dir_marker = strrchr(dir, '/'))) {
+				*dir_marker = '\0';
+				if (strset_contains(&renames->target_dirs[side],
+						    dir))
+					break;
+				strset_add(&renames->target_dirs[side], dir);
+			}
+			free(dir);
+		}
+		renames->trivial_merges_okay[side] = optimization_okay;
+		/*
+		 * We need to recurse into any directories in
+		 * possible_trivial_merges[side] found in target_dirs[side].
+		 * But when we recurse, we may need to queue up some of the
+		 * subdirectories for possible_trivial_merges[side].  Since
+		 * we can't safely iterate through a hashmap while also adding
+		 * entries, move the entries into 'copy', iterate over 'copy',
+		 * and then we'll also iterate anything added into
+		 * possible_trivial_merges[side] once this loop is done.
+		 */
+		copy = renames->possible_trivial_merges[side];
+		strintmap_init_with_options(&renames->possible_trivial_merges[side],
+					    0,
+					    NULL,
+					    0);
+		strintmap_for_each_entry(&copy, &iter, entry) {
 			const char *path = entry->key;
 			unsigned dir_rename_mask = (intptr_t)entry->value;
 			struct conflict_info *ci;
@@ -1247,6 +1321,13 @@ static int handle_deferred_entries(struct merge_options *opt,
 			VERIFY_CI(ci);
 			dirmask = ci->dirmask;
 
+			if (optimization_okay &&
+			    !strset_contains(&renames->target_dirs[side],
+					     path)) {
+				resolve_trivial_directory_merge(ci, side);
+				continue;
+			}
+
 			info->name = path;
 			info->namelen = strlen(path);
 			info->pathlen = info->namelen + 1;
@@ -1282,6 +1363,20 @@ static int handle_deferred_entries(struct merge_options *opt,
 			if (ret < 0)
 				return ret;
 		}
+		strintmap_clear(&copy);
+		strintmap_for_each_entry(&renames->possible_trivial_merges[side],
+					 &iter, entry) {
+			const char *path = entry->key;
+			struct conflict_info *ci;
+
+			ci = strmap_get(&opt->priv->paths, path);
+			VERIFY_CI(ci);
+
+			assert(renames->trivial_merges_okay[side] &&
+			       !strset_contains(&renames->target_dirs[side],
+						path));
+			resolve_trivial_directory_merge(ci, side);
+		}
 	}
 	return ret;
 }
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v2 7/7] merge-ort: restart merge with cached renames to reduce process entry cost
  2021-07-13 19:32 ` [PATCH v2 " Elijah Newren via GitGitGadget
                     ` (5 preceding siblings ...)
  2021-07-13 19:33   ` [PATCH v2 6/7] merge-ort: avoid recursing into directories when we don't need to Elijah Newren via GitGitGadget
@ 2021-07-13 19:33   ` Elijah Newren via GitGitGadget
  2021-07-15 15:09     ` Derrick Stolee
  2021-07-16  5:22   ` [PATCH v3 0/7] Optimization batch 14: trivial directory resolution Elijah Newren via GitGitGadget
  7 siblings, 1 reply; 52+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-07-13 19:33 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Ævar Arnfjörð Bjarmason,
	Elijah Newren, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

The merge algorithm mostly consists of the following three functions:
   collect_merge_info()
   detect_and_process_renames()
   process_entries()
Prior to the trivial directory resolution optimization of the last half
dozen commits, process_entries() was consistently the slowest, followed
by collect_merge_info(), then detect_and_process_renames().  When the
trivial directory resolution applies, it often dramatically decreases
the amount of time spent in the two slower functions.

Looking at the performance results in the previous commit, the trivial
directory resolution optimization helps amazingly well when there are no
relevant renames.  It also helps really well when reapplying a long
series of linear commits (such as in a rebase or cherry-pick), since the
relevant renames may well be cached from the first reapplied commit.
But when there are any relevant renames that are not cached (represented
by the just-one-mega testcase), then the optimization does not help at
all.

Often, I noticed that when the optimization does not apply, it is
because there are a handful of relevant sources -- maybe even only one.
It felt frustrating to need to recurse into potentially hundreds or even
thousands of directories just for a single rename, but it was needed for
correctness.

However, staring at this list of functions and noticing that
process_entries() is the most expensive and knowing I could avoid it if
I had cached renames suggested a simple idea: change
   collect_merge_info()
   detect_and_process_renames()
   process_entries()
into
   collect_merge_info()
   detect_and_process_renames()
   <cache all the renames, and restart>
   collect_merge_info()
   detect_and_process_renames()
   process_entries()

This may seem odd and look like more work.  However, note that although
we run collect_merge_info() twice, the second time we get to employ
trivial directory resolves, which makes it much faster, so the increased
time in collect_merge_info() is small.  While we run
detect_and_process_renames() again, all renames are cached so it's
nearly a no-op (we don't call into diffcore_rename_extended() but we do
have a little bit of data structure checking and fixing up).  And the
big payoff comes from the fact that process_entries(), will be much
faster due to having far fewer entries to process.

This restarting only makes sense if we can save recursing into enough
directories to make it worth our while.  Introduce a simple heuristic to
guide this.  Note that this heuristic uses a "wanted_factor" that I have
virtually no actual real world data for, just some back-of-the-envelope
quasi-scientific calculations that I included in some comments and then
plucked a simple round number out of thin air.  It could be that
tweaking this number to make it either higher or lower improves the
optimization.  (There's slightly more here; when I first introduced this
optimization, I used a factor of 10, because I was completely confident
it was big enough to not cause slowdowns in special cases.  I was
certain it was higher than needed.  Several months later, I added the
rough calculations which make me think the optimal number is close to 2;
but instead of pushing to the limit, I just bumped it to 3 to reduce the
risk that there are special cases where this optimization can result in
slowing down the code a little.  If the ratio of path counts is below 3,
we probably will only see minor performance improvements at best
anyway.)

Also, note that while the diffstat looks kind of long (nearly 100
lines), more than half of it is in two comments explaining how things
work.

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:

                            Before                  After
    no-renames:      205.1  ms ±  3.8  ms   204.2  ms ±  3.0  ms
    mega-renames:      1.564 s ±  0.010 s     1.076 s ±  0.015 s
    just-one-mega:   479.5  ms ±  3.9  ms   364.1  ms ±  7.0  ms

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c                         | 111 ++++++++++++++++++++++++++--
 t/t6423-merge-rename-directories.sh |   2 +-
 2 files changed, 106 insertions(+), 7 deletions(-)

diff --git a/merge-ort.c b/merge-ort.c
index c9cf7a158c8..99af64fce7d 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -209,6 +209,7 @@ struct rename_info {
 	 *   MERGE_SIDE2: cached data from side2 can be reused
 	 *   MERGE_SIDE1: cached data from side1 can be reused
 	 *   0:           no cached data can be reused
+	 *   -1:          See redo_after_renames; both sides can be reused.
 	 */
 	int cached_pairs_valid_side;
 
@@ -254,6 +255,28 @@ struct rename_info {
 	 */
 	struct strset cached_irrelevant[3];
 
+	/*
+	 * redo_after_renames: optimization flag for "restarting" the merge
+	 *
+	 * Sometimes it pays to detect renames, cache them, and then
+	 * restart the merge operation from the beginning.  The reason for
+	 * this is that when we know where all the renames are, we know
+	 * whether a certain directory has any paths under it affected --
+	 * and if a directory is not affected then it permits us to do
+	 * trivial tree merging in more cases.  Doing trivial tree merging
+	 * prevents the need to run process_entry() on every path
+	 * underneath trees that can be trivially merged, and
+	 * process_entry() is more expensive than collect_merge_info() --
+	 * plus, the second collect_merge_info() will be much faster since
+	 * it doesn't have to recurse into the relevant trees.
+	 *
+	 * Values for this flag:
+	 *   0 = don't bother, not worth it (or conditions not yet checked)
+	 *   1 = conditions for optimization met, optimization worthwhile
+	 *   2 = we already did it (don't restart merge yet again)
+	 */
+	unsigned redo_after_renames;
+
 	/*
 	 * needed_limit: value needed for inexact rename detection to run
 	 *
@@ -540,7 +563,8 @@ static void clear_or_reinit_internal_opts(struct merge_options_internal *opti,
 		renames->trivial_merges_okay[i] = 1; /* 1 == maybe */
 		if (!reinitialize)
 			assert(renames->cached_pairs_valid_side == 0);
-		if (i != renames->cached_pairs_valid_side) {
+		if (i != renames->cached_pairs_valid_side &&
+		    -1 != renames->cached_pairs_valid_side) {
 			strset_func(&renames->cached_target_names[i]);
 			strmap_func(&renames->cached_pairs[i], 1);
 			strset_func(&renames->cached_irrelevant[i]);
@@ -1242,7 +1266,9 @@ static int handle_deferred_entries(struct merge_options *opt,
 	struct hashmap_iter iter;
 	struct strmap_entry *entry;
 	int side, ret = 0;
+	int path_count_before, path_count_after = 0;
 
+	path_count_before = strmap_get_size(&opt->priv->paths);
 	for (side = MERGE_SIDE1; side <= MERGE_SIDE2; side++) {
 		unsigned optimization_okay = 1;
 		struct strintmap copy;
@@ -1377,7 +1403,51 @@ static int handle_deferred_entries(struct merge_options *opt,
 						path));
 			resolve_trivial_directory_merge(ci, side);
 		}
+		if (!optimization_okay || path_count_after)
+			path_count_after = strmap_get_size(&opt->priv->paths);
 	}
+	if (path_count_after) {
+		/*
+		 * Not sure were the right cut-off is for the optimization
+		 * to redo collect_merge_info after we've cached the
+		 * regular renames is.  Basically, collect_merge_info(),
+		 * detect_regular_renames(), and process_entries() are
+		 * similar costs and all big tent poles.  Caching the
+		 * result of detect_regular_renames() means that redoing
+		 * that one function will cost us virtually 0 extra, so it
+		 * depends on the other two functions, which are both O(N)
+		 * cost in the number of paths.  Thus, it makes sense that
+		 * if we can cut the number of paths in half, then redoing
+		 * collect_merge_info() at half cost in order to get
+		 * process_entries() at half cost should be about equal
+		 * cost.  If we can cut by more than half, then we would
+		 * win.  The fact that process_entries() is about 10%-20%
+		 * more expensive than collect_merge_info() suggests we
+		 * could make the factor be less than two.  The fact that
+		 * even when we have renames cached, we still have to
+		 * traverse down to the individual (relevant) renames,
+		 * which suggests we should perhaps use a bigger factor.
+		 *
+		 * The exact number isn't critical, since the code will
+		 * work even if we get the factor wrong -- it just might be
+		 * slightly slower if we're a bit off.  For now, just error
+		 * on the side of a bigger fudge.  For the linux kernel
+		 * testcases I was looking at with massive renames, the
+		 * ratio came in around 50 to 250, which clearly would
+		 * trigger this optimization and provided some *very* nice
+		 * speedups.
+		 */
+		int wanted_factor = 3;
+
+		/* We should only redo collect_merge_info one time */
+		assert(renames->redo_after_renames == 0);
+
+		if (path_count_after / path_count_before > wanted_factor) {
+			renames->redo_after_renames = 1;
+			renames->cached_pairs_valid_side = -1;
+		}
+	} else if (renames->redo_after_renames == 2)
+		renames->redo_after_renames = 0;
 	return ret;
 }
 
@@ -2820,8 +2890,8 @@ static int compare_pairs(const void *a_, const void *b_)
 }
 
 /* Call diffcore_rename() to update deleted/added pairs into rename pairs */
-static void detect_regular_renames(struct merge_options *opt,
-				   unsigned side_index)
+static int detect_regular_renames(struct merge_options *opt,
+				  unsigned side_index)
 {
 	struct diff_options diff_opts;
 	struct rename_info *renames = &opt->priv->renames;
@@ -2834,7 +2904,7 @@ static void detect_regular_renames(struct merge_options *opt,
 		 * side had directory renames.
 		 */
 		resolve_diffpair_statuses(&renames->pairs[side_index]);
-		return;
+		return 0;
 	}
 
 	partial_clear_dir_rename_count(&renames->dir_rename_count[side_index]);
@@ -2860,6 +2930,8 @@ static void detect_regular_renames(struct merge_options *opt,
 	trace2_region_leave("diff", "diffcore_rename", opt->repo);
 	resolve_diffpair_statuses(&diff_queued_diff);
 
+	if (diff_opts.needed_rename_limit > 0)
+		renames->redo_after_renames = 0;
 	if (diff_opts.needed_rename_limit > renames->needed_limit)
 		renames->needed_limit = diff_opts.needed_rename_limit;
 
@@ -2869,6 +2941,8 @@ static void detect_regular_renames(struct merge_options *opt,
 	diff_queued_diff.nr = 0;
 	diff_queued_diff.queue = NULL;
 	diff_flush(&diff_opts);
+
+	return 1;
 }
 
 /*
@@ -2958,14 +3032,32 @@ static int detect_and_process_renames(struct merge_options *opt,
 	struct diff_queue_struct combined;
 	struct rename_info *renames = &opt->priv->renames;
 	int need_dir_renames, s, clean = 1;
+	unsigned detection_run = 0;
 
 	memset(&combined, 0, sizeof(combined));
 	if (!possible_renames(renames))
 		goto cleanup;
 
 	trace2_region_enter("merge", "regular renames", opt->repo);
-	detect_regular_renames(opt, MERGE_SIDE1);
-	detect_regular_renames(opt, MERGE_SIDE2);
+	detection_run |= detect_regular_renames(opt, MERGE_SIDE1);
+	detection_run |= detect_regular_renames(opt, MERGE_SIDE2);
+	if (renames->redo_after_renames && detection_run) {
+		int i, side;
+		struct diff_filepair *p;
+
+		/* Cache the renames, we found */
+		for (side = MERGE_SIDE1; side <= MERGE_SIDE2; side++) {
+			for (i = 0; i < renames->pairs[side].nr; ++i) {
+				p = renames->pairs[side].queue[i];
+				possibly_cache_new_pair(renames, p, side, NULL);
+			}
+		}
+
+		/* Restart the merge with the cached renames */
+		renames->redo_after_renames = 2;
+		trace2_region_leave("merge", "regular renames", opt->repo);
+		goto cleanup;
+	}
 	use_cached_pairs(opt, &renames->cached_pairs[1], &renames->pairs[1]);
 	use_cached_pairs(opt, &renames->cached_pairs[2], &renames->pairs[2]);
 	trace2_region_leave("merge", "regular renames", opt->repo);
@@ -4380,6 +4472,7 @@ static void merge_ort_nonrecursive_internal(struct merge_options *opt,
 					       opt->subtree_shift);
 	}
 
+redo:
 	trace2_region_enter("merge", "collect_merge_info", opt->repo);
 	if (collect_merge_info(opt, merge_base, side1, side2) != 0) {
 		/*
@@ -4399,6 +4492,12 @@ static void merge_ort_nonrecursive_internal(struct merge_options *opt,
 	result->clean = detect_and_process_renames(opt, merge_base,
 						   side1, side2);
 	trace2_region_leave("merge", "renames", opt->repo);
+	if (opt->priv->renames.redo_after_renames == 2) {
+		trace2_region_enter("merge", "reset_maps", opt->repo);
+		clear_or_reinit_internal_opts(opt->priv, 1);
+		trace2_region_leave("merge", "reset_maps", opt->repo);
+		goto redo;
+	}
 
 	trace2_region_enter("merge", "process_entries", opt->repo);
 	process_entries(opt, &working_tree_oid);
diff --git a/t/t6423-merge-rename-directories.sh b/t/t6423-merge-rename-directories.sh
index e834b7e6efe..d8919d276a1 100755
--- a/t/t6423-merge-rename-directories.sh
+++ b/t/t6423-merge-rename-directories.sh
@@ -4797,7 +4797,7 @@ test_setup_12f () {
 	)
 }
 
-test_expect_merge_algorithm failure failure '12f: Trivial directory resolve, caching, all kinds of fun' '
+test_expect_merge_algorithm failure success '12f: Trivial directory resolve, caching, all kinds of fun' '
 	test_setup_12f &&
 	(
 		cd 12f &&
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 2/7] merge-ort: add some more explanations in collect_merge_info_callback()
  2021-07-13 19:32   ` [PATCH v2 2/7] merge-ort: add some more explanations in collect_merge_info_callback() Elijah Newren via GitGitGadget
@ 2021-07-13 23:34     ` Bagas Sanjaya
  2021-07-14  0:19       ` Elijah Newren
  0 siblings, 1 reply; 52+ messages in thread
From: Bagas Sanjaya @ 2021-07-13 23:34 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget, git
  Cc: Derrick Stolee, Ævar Arnfjörð Bjarmason, Elijah Newren

On 14/07/21 02.32, Elijah Newren via GitGitGadget wrote:
> @@ -1018,8 +1018,8 @@ static int collect_merge_info_callback(int n,
>   	if (side1_matches_mbase && side2_matches_mbase) {
>   		/* mbase, side1, & side2 all match; use mbase as resolution */
>   		setup_path_info(opt, &pi, dirname, info->pathlen, fullpath,
> -				names, names+0, mbase_null, 0,
> -				filemask, dirmask, 1);
> +				names, names+0, mbase_null, 0 /* df_conflict */,
> +				filemask, dirmask, 1 /* resolved */);
>   		return mask;
>   	}
>


Is df_conflict stands for directory-file conflict?

>   	/*
> -	 * Record information about the path so we can resolve later in
> -	 * process_entries.
> +	 * None of the special cases above matched, so we have a
> +	 * provisional conflict.  (Rename detection might allow us to
> +	 * unconflict some more cases, but that comes later so all we can
> +	 * do now is record the different non-null file hashes.)
>   	 */
>   	setup_path_info(opt, &pi, dirname, info->pathlen, fullpath,
>   			names, NULL, 0, df_conflict, filemask, dirmask, 0);
> 

So when none of special cases matched, we assumed there's conflict 
(although provisional), right?

-- 
An old man doll... just what I always wanted! - Clara

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 2/7] merge-ort: add some more explanations in collect_merge_info_callback()
  2021-07-13 23:34     ` Bagas Sanjaya
@ 2021-07-14  0:19       ` Elijah Newren
  0 siblings, 0 replies; 52+ messages in thread
From: Elijah Newren @ 2021-07-14  0:19 UTC (permalink / raw)
  To: Bagas Sanjaya
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Derrick Stolee,
	Ævar Arnfjörð Bjarmason

On Tue, Jul 13, 2021 at 4:34 PM Bagas Sanjaya <bagasdotme@gmail.com> wrote:
>
> On 14/07/21 02.32, Elijah Newren via GitGitGadget wrote:
> > @@ -1018,8 +1018,8 @@ static int collect_merge_info_callback(int n,
> >       if (side1_matches_mbase && side2_matches_mbase) {
> >               /* mbase, side1, & side2 all match; use mbase as resolution */
> >               setup_path_info(opt, &pi, dirname, info->pathlen, fullpath,
> > -                             names, names+0, mbase_null, 0,
> > -                             filemask, dirmask, 1);
> > +                             names, names+0, mbase_null, 0 /* df_conflict */,
> > +                             filemask, dirmask, 1 /* resolved */);
> >               return mask;
> >       }
> >
>
>
> Is df_conflict stands for directory-file conflict?

Yes, eventually propagating up to conflict_info->df_conflict, defined as:

    /* Whether this path is/was involved in a directory/file conflict */
    unsigned df_conflict:1;

> >       /*
> > -      * Record information about the path so we can resolve later in
> > -      * process_entries.
> > +      * None of the special cases above matched, so we have a
> > +      * provisional conflict.  (Rename detection might allow us to
> > +      * unconflict some more cases, but that comes later so all we can
> > +      * do now is record the different non-null file hashes.)
> >        */
> >       setup_path_info(opt, &pi, dirname, info->pathlen, fullpath,
> >                       names, NULL, 0, df_conflict, filemask, dirmask, 0);
> >
>
> So when none of special cases matched, we assumed there's conflict
> (although provisional), right?

Yes, the "provisional" adjective in particular is there because we
revisit each case in process_entries(), and resolve those that can be
via content merges, renormalization, etc.  The content merges need to
wait until after rename detection allows us to pair up different
files.  If the different file versions are still not resolved after
process_entries() then it becomes an actual conflict rather than just
a provisional conflict.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 3/7] merge-ort: add data structures for allowable trivial directory resolves
  2021-07-13 19:32   ` [PATCH v2 3/7] merge-ort: add data structures for allowable trivial directory resolves Elijah Newren via GitGitGadget
@ 2021-07-15 13:54     ` Derrick Stolee
  2021-07-15 15:54       ` Elijah Newren
  0 siblings, 1 reply; 52+ messages in thread
From: Derrick Stolee @ 2021-07-15 13:54 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget, git
  Cc: Ævar Arnfjörð Bjarmason, Elijah Newren

On 7/13/2021 3:32 PM, Elijah Newren via GitGitGadget wrote:
> From: Elijah Newren <newren@gmail.com>

The first two patches were easy to follow. This one requires a more
careful read, so here is my attempt to double-check that I am
reading it right.

> In commit f89b4f2bee ("merge-ort: skip rename detection entirely if
> possible", 2021-03-11), we added an additional reason that rename
> detection could be skipped entirely -- namely, if no *relevant* sources
> were present.  Without completing collect_merge_info_callback(), we do
> not yet know if there are no relevant sources.  However, we do know that
> if the current directory on one side matches the merge base, then every
> source file within that directory will not be RELEVANT_CONTENT, and a
> few simple checks can often let us rule out RELEVANT_LOCATION as well.
> This suggests we can just defer recursing into such directories until
> the end of collect_merge_info.

We are not making early decisions, but collecting a list of directories
that cannot be relevant sources. These directories could still be
relevant locations (or content), but we don't want to pay the cost of
recursing right away.

> Since the deferred directories are known to not add any relevant sources
> due to the above properties, then if there are no relevant sources after
> we've traversed all paths other than the deferred ones, then we know
> there are not any relevant sources.  Under those conditions, rename
> detection is unnecessary, and that means we can resolve the deferred
> directories without recursing into them.

Only in the case where there are no relevant sources at the end of
scanning all directories (possibly non-recursively), then we can skip
recursing into these directories because finding relevant locations
is wasted effort now.

This means the performance benefit will be limited to cases where
rename detection isn't needed at all, but that is a frequent enough
case to merit the overhead of tracking this information differently.

> Note that the logic for skipping rename detection was also modified
> further in commit 76e253793c ("merge-ort, diffcore-rename: employ cached
> renames when possible", 2021-01-30); in particular rename detection can
> be skipped if we already have cached renames for each relevant source.
> We can take advantage of this information as well with our deferral of
> recursing into directories where one side matches the merge base.
> 
> Add some data structures that we will use to do these deferrals, with
> some lengthy comments explaining their purpose.

And now that the idea is understood, we are just laying the groundwork
for the process by creating the tracking data structures, not actually
the algorithm to use them. Makes sense to split them up.

> +	/*
> +	 * possible_trivial_merges: directories we defer recursing into

Not only are we deferring the recursion, we might be able to avoid
it entirely. Perhaps "directories to be explored only when needed"
would emphasize this point differently?

> +	 *
> +	 * possible_trivial_merges is a map of directory names to
> +	 * dir_rename_mask.  When we detect that a directory is unchanged on
> +	 * one side, we can sometimes resolve the directory without recursing
> +	 * into it.  Renames are the only things that can prevent such an
> +	 * optimization.  However, for rename sources:
> +	 *   - If no parent directory needed directory rename detection, then
> +	 *     no path under such a directory can be a relevant_source.
> +	 * and for rename destinations:
> +	 *   - If no cached rename has a target path under the directory AND
> +	 *   - If there are no unpaired relevant_sources elsewhere in the
> +	 *     repository
> +	 * then we don't need any path under this directory for a rename
> +	 * destination.  The only way to know the last item above is to defer
> +	 * handling such directories until the end of collect_merge_info(),
> +	 * in handle_deferred_entries().
> +	 *
> +	 * For each we store dir_rename_mask, since that's the only bit of
> +	 * information we need, other than the path, to resume the recursive
> +	 * traversal.
> +	 */
> +	struct strintmap possible_trivial_merges[3];
> +
> +	/*
> +	 * trivial_merges_okay: if trivial directory merges are okay
> +	 *
> +	 * See possible_trivial_merges above.  The "no unpaired
> +	 * relevant_sources elsewhere in the repository" is a single boolean
> +	 * per merge side, which we store here.  Note that while 0 means no,
> +	 * 1 only means "maybe" rather than "yes"; we optimistically set it
> +	 * to 1 initially and only clear when we determine it is unsafe to
> +	 * do trivial directory merges.
> +	 */
> +	unsigned trivial_merges_okay[3];
> +
> +	/*
> +	 * target_dirs: ancestor directories of rename targets
> +	 *
> +	 * target_dirs contains all directory names that are an ancestor of
> +	 * any rename destination.
> +	 */
> +	struct strset target_dirs[3];

These three members seem tightly coupled. Would it be beneficial to
group the data into a 'struct trivial_merge_data' that could then
have an array of three here?

>  	/*
>  	 * dir_rename_mask:
>  	 *   0: optimization removing unmodified potential rename source okay
> @@ -490,6 +535,9 @@ static void clear_or_reinit_internal_opts(struct merge_options_internal *opti,
>  		strintmap_func(&renames->dirs_removed[i]);
>  		strmap_func(&renames->dir_renames[i], 0);
>  		strintmap_func(&renames->relevant_sources[i]);
> +		strintmap_func(&renames->possible_trivial_merges[i]);
> +		strset_func(&renames->target_dirs[i]);
> +		renames->trivial_merges_okay[i] = 1; /* 1 == maybe */

This could be collected into an initializer method for the data
required by this optimization instead of three initializations
among many.

>  		if (!reinitialize)
>  			assert(renames->cached_pairs_valid_side == 0);
>  		if (i != renames->cached_pairs_valid_side) {
> @@ -4045,12 +4093,17 @@ static void merge_start(struct merge_options *opt, struct merge_result *result)
>  		strintmap_init_with_options(&renames->relevant_sources[i],
>  					    -1 /* explicitly invalid */,
>  					    NULL, 0);
> +		strintmap_init_with_options(&renames->possible_trivial_merges[i],
> +					    0, NULL, 0);
> +		strset_init_with_options(&renames->target_dirs[i],
> +					 NULL, 1);
>  		strmap_init_with_options(&renames->cached_pairs[i],
>  					 NULL, 1);
>  		strset_init_with_options(&renames->cached_irrelevant[i],
>  					 NULL, 1);
>  		strset_init_with_options(&renames->cached_target_names[i],
>  					 NULL, 0);
> +		renames->trivial_merges_okay[i] = 1; /* 1 == maybe */

Here, there is even a separation between the grouped data. If you
don't like the idea of creating helper methods, then at least these
new lines could be grouped together. Bonus points for adding
whitespace between these lines and the others.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 4/7] merge-ort: add a handle_deferred_entries() helper function
  2021-07-13 19:33   ` [PATCH v2 4/7] merge-ort: add a handle_deferred_entries() helper function Elijah Newren via GitGitGadget
@ 2021-07-15 14:32     ` Derrick Stolee
  2021-07-15 15:59       ` Elijah Newren
  0 siblings, 1 reply; 52+ messages in thread
From: Derrick Stolee @ 2021-07-15 14:32 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget, git
  Cc: Ævar Arnfjörð Bjarmason, Elijah Newren

On 7/13/2021 3:33 PM, Elijah Newren via GitGitGadget wrote:
> From: Elijah Newren <newren@gmail.com>
> 
> In order to allow trivial directory resolution, we first need to be able
> to gather more information to determine if the optimization is safe.  To
> enable that, we need a way of deferring the recursion into the directory
> until a later time.  Naturally, deferring the entry into a subtree means
> that we need some function that will later recurse into the subdirectory
> exactly the same way that collect_merge_info_callback() would have done.
> 
> Add a helper function that does this.  For now this function is not used
> but a subsequent commit will change that.  Future commits will also make
> the function sometimes resolve directories instead of traversing inside.
...> +MAYBE_UNUSED
> +static int handle_deferred_entries(struct merge_options *opt,
> +				   struct traverse_info *info)
> +{
> +	struct rename_info *renames = &opt->priv->renames;
> +	struct hashmap_iter iter;
> +	struct strmap_entry *entry;
> +	int side, ret = 0;
> +
> +	for (side = MERGE_SIDE1; side <= MERGE_SIDE2; side++) {
> +		renames->trivial_merges_okay[side] = 0;

We are unconditionally setting this to zero? I will await the
use of the method to see how that affects things.

> +		strintmap_for_each_entry(&renames->possible_trivial_merges[side],
> +					 &iter, entry) {

The rest of this loop seems like standard logic for handling
these tree walks.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 5/7] merge-ort: defer recursing into directories when merge base is matched
  2021-07-13 19:33   ` [PATCH v2 5/7] merge-ort: defer recursing into directories when merge base is matched Elijah Newren via GitGitGadget
@ 2021-07-15 14:43     ` Derrick Stolee
  2021-07-15 16:03       ` Elijah Newren
  0 siblings, 1 reply; 52+ messages in thread
From: Derrick Stolee @ 2021-07-15 14:43 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget, git
  Cc: Ævar Arnfjörð Bjarmason, Elijah Newren

On 7/13/2021 3:33 PM, Elijah Newren via GitGitGadget wrote:
> From: Elijah Newren <newren@gmail.com>
...
> +		/*
> +		 * Check for whether we can avoid recursing due to one side
> +		 * matching the merge base.  The side that does NOT match is
> +		 * the one that might have a rename destination we need.
> +		 */
> +		assert(!side1_matches_mbase || !side2_matches_mbase);
> +		side = side1_matches_mbase ? MERGE_SIDE2 :
> +			side2_matches_mbase ? MERGE_SIDE1 : MERGE_BASE;
> +		if (filemask == 0 && (dirmask == 2 || dirmask == 4)) {
> +			/*
> +			 * Also defer recursing into new directories; set up a
> +			 * few variables to let us do so.
> +			 */
> +			ci->match_mask = (7 - dirmask);
> +			side = dirmask / 2;

Clever.

> +		}
> +		if (renames->dir_rename_mask != 0x07 &&
> +		    (side != MERGE_BASE) &&

nit: these parens are not necessary?

> +		    renames->trivial_merges_okay[side] &&
> +		    !strset_contains(&renames->target_dirs[side], pi.string)) {

Does this last condition mean "this directory is not already a parent of a
chosen rename target"?

> +			strintmap_set(&renames->possible_trivial_merges[side],
> +				      pi.string, renames->dir_rename_mask);
> +			renames->dir_rename_mask = prev_dir_rename_mask;
> +			return mask;
> +		}
> +
> +		/* We need to recurse */
>  		ci->match_mask &= filemask;
Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 6/7] merge-ort: avoid recursing into directories when we don't need to
  2021-07-13 19:33   ` [PATCH v2 6/7] merge-ort: avoid recursing into directories when we don't need to Elijah Newren via GitGitGadget
@ 2021-07-15 14:55     ` Derrick Stolee
  2021-07-15 16:28       ` Elijah Newren
  0 siblings, 1 reply; 52+ messages in thread
From: Derrick Stolee @ 2021-07-15 14:55 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget, git
  Cc: Ævar Arnfjörð Bjarmason, Elijah Newren

On 7/13/2021 3:33 PM, Elijah Newren via GitGitGadget wrote:
> From: Elijah Newren <newren@gmail.com>
...
> For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
> performance work; instrument with trace2_region_* calls", 2020-10-28),
> this change improves the performance as follows:
> 
>                             Before                  After
>     no-renames:        5.235 s ±  0.042 s   205.1  ms ±  3.8  ms

Wow! This is quite the savings, and reinforces that when no renames
exist we are likely to hit this optimization.

>     mega-renames:      9.419 s ±  0.107 s     1.564 s ±  0.010 s

I'm surprised that this one works so well, too. Clearly, there must
be a lot of directories that can be skipped despite many renames
existing.

>     just-one-mega:   480.1  ms ±  3.9  ms   479.5  ms ±  3.9  ms

And no overhead added to this case. Good.

> +		/* Loop over the set of paths we need to know rename info for */
> +		strset_for_each_entry(&renames->relevant_sources[side],
> +				      &iter, entry) {
> +			char *rename_target, *dir, *dir_marker;
> +			struct strmap_entry *e;
> +
> +			/*
> +			 * if we don't know delete/rename info for this path,
> +			 * then we need to recurse into all trees to get all
> +			 * adds to make sure we have it.
> +			 */

super-nit: s/if/If/

Otherwise, I can't speak to the code being 100% correct because
this area is so dense. I find the code to be well organized,
which will help finding and fixing any potential bugs that might
show up in strange corner cases later.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 7/7] merge-ort: restart merge with cached renames to reduce process entry cost
  2021-07-13 19:33   ` [PATCH v2 7/7] merge-ort: restart merge with cached renames to reduce process entry cost Elijah Newren via GitGitGadget
@ 2021-07-15 15:09     ` Derrick Stolee
  2021-07-15 16:53       ` Elijah Newren
  0 siblings, 1 reply; 52+ messages in thread
From: Derrick Stolee @ 2021-07-15 15:09 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget, git
  Cc: Ævar Arnfjörð Bjarmason, Elijah Newren

On 7/13/2021 3:33 PM, Elijah Newren via GitGitGadget wrote:
> From: Elijah Newren <newren@gmail.com>
...
> Often, I noticed that when the optimization does not apply, it is
> because there are a handful of relevant sources -- maybe even only one.
> It felt frustrating to need to recurse into potentially hundreds or even
> thousands of directories just for a single rename, but it was needed for
> correctness.
> 
> However, staring at this list of functions and noticing that
> process_entries() is the most expensive and knowing I could avoid it if
> I had cached renames suggested a simple idea: change
>    collect_merge_info()
>    detect_and_process_renames()
>    process_entries()
> into
>    collect_merge_info()
>    detect_and_process_renames()
>    <cache all the renames, and restart>
>    collect_merge_info()
>    detect_and_process_renames()
>    process_entries()
> 
> This may seem odd and look like more work.  However, note that although
> we run collect_merge_info() twice, the second time we get to employ
> trivial directory resolves, which makes it much faster, so the increased
> time in collect_merge_info() is small.  While we run
> detect_and_process_renames() again, all renames are cached so it's
> nearly a no-op (we don't call into diffcore_rename_extended() but we do
> have a little bit of data structure checking and fixing up).  And the
> big payoff comes from the fact that process_entries(), will be much
> faster due to having far fewer entries to process.

I enjoyed the story you tell here.

> +	if (path_count_after) {
> +		/*
> +		 * Not sure were the right cut-off is for the optimization
> +		 * to redo collect_merge_info after we've cached the
> +		 * regular renames is.  Basically, collect_merge_info(),
> +		 * detect_regular_renames(), and process_entries() are
> +		 * similar costs and all big tent poles.  Caching the
> +		 * result of detect_regular_renames() means that redoing
> +		 * that one function will cost us virtually 0 extra, so it
> +		 * depends on the other two functions, which are both O(N)
> +		 * cost in the number of paths.  Thus, it makes sense that
> +		 * if we can cut the number of paths in half, then redoing
> +		 * collect_merge_info() at half cost in order to get
> +		 * process_entries() at half cost should be about equal
> +		 * cost.  If we can cut by more than half, then we would
> +		 * win.  The fact that process_entries() is about 10%-20%
> +		 * more expensive than collect_merge_info() suggests we
> +		 * could make the factor be less than two.  The fact that
> +		 * even when we have renames cached, we still have to
> +		 * traverse down to the individual (relevant) renames,
> +		 * which suggests we should perhaps use a bigger factor.
> +		 *
> +		 * The exact number isn't critical, since the code will
> +		 * work even if we get the factor wrong -- it just might be
> +		 * slightly slower if we're a bit off.  For now, just error
> +		 * on the side of a bigger fudge.  For the linux kernel

super-nit: s/linux/Linux/

> +		 * testcases I was looking at with massive renames, the
> +		 * ratio came in around 50 to 250, which clearly would
> +		 * trigger this optimization and provided some *very* nice
> +		 * speedups.

This bit of your testing might be more appropriate for your commit
message. This discussion of a test made at a certain point in time
is more likely to go stale than the description of how this factor
does not change correctness, only performance.

> +		 */
> +		int wanted_factor = 3;

and perhaps make it 'const'?

> +
> +		/* We should only redo collect_merge_info one time */
> +		assert(renames->redo_after_renames == 0);
> +
> +		if (path_count_after / path_count_before > wanted_factor) {

With truncation from integer division, this condition is equivalent* to

	path_count_after >= 4 * path_count_before

Or, do you want to change this to a ">=" so that the factor of 3 seems
more accurate?

*I understand the intention of using division to avoid (unlikely)
overflow via multiplication. The truncation is causing some confusion.
  
> -test_expect_merge_algorithm failure failure '12f: Trivial directory resolve, caching, all kinds of fun' '
> +test_expect_merge_algorithm failure success '12f: Trivial directory resolve, caching, all kinds of fun' '
>  	test_setup_12f &&
>  	(
>  		cd 12f &&
> 

Oh, and a bonus test success! Excellent.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 3/7] merge-ort: add data structures for allowable trivial directory resolves
  2021-07-15 13:54     ` Derrick Stolee
@ 2021-07-15 15:54       ` Elijah Newren
  0 siblings, 0 replies; 52+ messages in thread
From: Elijah Newren @ 2021-07-15 15:54 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Elijah Newren via GitGitGadget, Git Mailing List,
	Ævar Arnfjörð Bjarmason

On Thu, Jul 15, 2021 at 6:54 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 7/13/2021 3:32 PM, Elijah Newren via GitGitGadget wrote:
> > From: Elijah Newren <newren@gmail.com>
>
> The first two patches were easy to follow. This one requires a more
> careful read, so here is my attempt to double-check that I am
> reading it right.

Yep, you seem to be understanding all of it just fine (does that mean
I did a good job explaining?), and made some good suggestions to
improve it further.

> > In commit f89b4f2bee ("merge-ort: skip rename detection entirely if
> > possible", 2021-03-11), we added an additional reason that rename
> > detection could be skipped entirely -- namely, if no *relevant* sources
> > were present.  Without completing collect_merge_info_callback(), we do
> > not yet know if there are no relevant sources.  However, we do know that
> > if the current directory on one side matches the merge base, then every
> > source file within that directory will not be RELEVANT_CONTENT, and a
> > few simple checks can often let us rule out RELEVANT_LOCATION as well.
> > This suggests we can just defer recursing into such directories until
> > the end of collect_merge_info.
>
> We are not making early decisions, but collecting a list of directories
> that cannot be relevant sources. These directories could still be
> relevant locations (or content), but we don't want to pay the cost of
> recursing right away.
>
> > Since the deferred directories are known to not add any relevant sources
> > due to the above properties, then if there are no relevant sources after
> > we've traversed all paths other than the deferred ones, then we know
> > there are not any relevant sources.  Under those conditions, rename
> > detection is unnecessary, and that means we can resolve the deferred
> > directories without recursing into them.
>
> Only in the case where there are no relevant sources at the end of
> scanning all directories (possibly non-recursively), then we can skip
> recursing into these directories because finding relevant locations
> is wasted effort now.
>
> This means the performance benefit will be limited to cases where
> rename detection isn't needed at all, but that is a frequent enough
> case to merit the overhead of tracking this information differently.
>
> > Note that the logic for skipping rename detection was also modified
> > further in commit 76e253793c ("merge-ort, diffcore-rename: employ cached
> > renames when possible", 2021-01-30); in particular rename detection can
> > be skipped if we already have cached renames for each relevant source.
> > We can take advantage of this information as well with our deferral of
> > recursing into directories where one side matches the merge base.
> >
> > Add some data structures that we will use to do these deferrals, with
> > some lengthy comments explaining their purpose.
>
> And now that the idea is understood, we are just laying the groundwork
> for the process by creating the tracking data structures, not actually
> the algorithm to use them. Makes sense to split them up.
>
> > +     /*
> > +      * possible_trivial_merges: directories we defer recursing into
>
> Not only are we deferring the recursion, we might be able to avoid
> it entirely. Perhaps "directories to be explored only when needed"
> would emphasize this point differently?

I like the sound of that; seems clearer.  Will update.

> > +      *
> > +      * possible_trivial_merges is a map of directory names to
> > +      * dir_rename_mask.  When we detect that a directory is unchanged on
> > +      * one side, we can sometimes resolve the directory without recursing
> > +      * into it.  Renames are the only things that can prevent such an
> > +      * optimization.  However, for rename sources:
> > +      *   - If no parent directory needed directory rename detection, then
> > +      *     no path under such a directory can be a relevant_source.
> > +      * and for rename destinations:
> > +      *   - If no cached rename has a target path under the directory AND
> > +      *   - If there are no unpaired relevant_sources elsewhere in the
> > +      *     repository
> > +      * then we don't need any path under this directory for a rename
> > +      * destination.  The only way to know the last item above is to defer
> > +      * handling such directories until the end of collect_merge_info(),
> > +      * in handle_deferred_entries().
> > +      *
> > +      * For each we store dir_rename_mask, since that's the only bit of
> > +      * information we need, other than the path, to resume the recursive
> > +      * traversal.
> > +      */
> > +     struct strintmap possible_trivial_merges[3];
> > +
> > +     /*
> > +      * trivial_merges_okay: if trivial directory merges are okay
> > +      *
> > +      * See possible_trivial_merges above.  The "no unpaired
> > +      * relevant_sources elsewhere in the repository" is a single boolean
> > +      * per merge side, which we store here.  Note that while 0 means no,
> > +      * 1 only means "maybe" rather than "yes"; we optimistically set it
> > +      * to 1 initially and only clear when we determine it is unsafe to
> > +      * do trivial directory merges.
> > +      */
> > +     unsigned trivial_merges_okay[3];
> > +
> > +     /*
> > +      * target_dirs: ancestor directories of rename targets
> > +      *
> > +      * target_dirs contains all directory names that are an ancestor of
> > +      * any rename destination.
> > +      */
> > +     struct strset target_dirs[3];
>
> These three members seem tightly coupled. Would it be beneficial to
> group the data into a 'struct trivial_merge_data' that could then
> have an array of three here?

Ooh, I like that idea; makes sense.  That'll naturally flow with your
last two comments too.

> >       /*
> >        * dir_rename_mask:
> >        *   0: optimization removing unmodified potential rename source okay
> > @@ -490,6 +535,9 @@ static void clear_or_reinit_internal_opts(struct merge_options_internal *opti,
> >               strintmap_func(&renames->dirs_removed[i]);
> >               strmap_func(&renames->dir_renames[i], 0);
> >               strintmap_func(&renames->relevant_sources[i]);
> > +             strintmap_func(&renames->possible_trivial_merges[i]);
> > +             strset_func(&renames->target_dirs[i]);
> > +             renames->trivial_merges_okay[i] = 1; /* 1 == maybe */
>
> This could be collected into an initializer method for the data
> required by this optimization instead of three initializations
> among many.
>
> >               if (!reinitialize)
> >                       assert(renames->cached_pairs_valid_side == 0);
> >               if (i != renames->cached_pairs_valid_side) {
> > @@ -4045,12 +4093,17 @@ static void merge_start(struct merge_options *opt, struct merge_result *result)
> >               strintmap_init_with_options(&renames->relevant_sources[i],
> >                                           -1 /* explicitly invalid */,
> >                                           NULL, 0);
> > +             strintmap_init_with_options(&renames->possible_trivial_merges[i],
> > +                                         0, NULL, 0);
> > +             strset_init_with_options(&renames->target_dirs[i],
> > +                                      NULL, 1);
> >               strmap_init_with_options(&renames->cached_pairs[i],
> >                                        NULL, 1);
> >               strset_init_with_options(&renames->cached_irrelevant[i],
> >                                        NULL, 1);
> >               strset_init_with_options(&renames->cached_target_names[i],
> >                                        NULL, 0);
> > +             renames->trivial_merges_okay[i] = 1; /* 1 == maybe */
>
> Here, there is even a separation between the grouped data. If you
> don't like the idea of creating helper methods, then at least these
> new lines could be grouped together. Bonus points for adding
> whitespace between these lines and the others.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 4/7] merge-ort: add a handle_deferred_entries() helper function
  2021-07-15 14:32     ` Derrick Stolee
@ 2021-07-15 15:59       ` Elijah Newren
  0 siblings, 0 replies; 52+ messages in thread
From: Elijah Newren @ 2021-07-15 15:59 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Elijah Newren via GitGitGadget, Git Mailing List,
	Ævar Arnfjörð Bjarmason

On Thu, Jul 15, 2021 at 7:32 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 7/13/2021 3:33 PM, Elijah Newren via GitGitGadget wrote:
> > From: Elijah Newren <newren@gmail.com>
> >
> > In order to allow trivial directory resolution, we first need to be able
> > to gather more information to determine if the optimization is safe.  To
> > enable that, we need a way of deferring the recursion into the directory
> > until a later time.  Naturally, deferring the entry into a subtree means
> > that we need some function that will later recurse into the subdirectory
> > exactly the same way that collect_merge_info_callback() would have done.
> >
> > Add a helper function that does this.  For now this function is not used
> > but a subsequent commit will change that.  Future commits will also make
> > the function sometimes resolve directories instead of traversing inside.
> ...> +MAYBE_UNUSED
> > +static int handle_deferred_entries(struct merge_options *opt,
> > +                                struct traverse_info *info)
> > +{
> > +     struct rename_info *renames = &opt->priv->renames;
> > +     struct hashmap_iter iter;
> > +     struct strmap_entry *entry;
> > +     int side, ret = 0;
> > +
> > +     for (side = MERGE_SIDE1; side <= MERGE_SIDE2; side++) {
> > +             renames->trivial_merges_okay[side] = 0;
>
> We are unconditionally setting this to zero? I will await the
> use of the method to see how that affects things.

Unconditionally set to 0 because at this point in the series we don't
yet have the logic to handle it.  This line will be replaced later
once we add the necessary logic.

>
> > +             strintmap_for_each_entry(&renames->possible_trivial_merges[side],
> > +                                      &iter, entry) {
>
> The rest of this loop seems like standard logic for handling
> these tree walks.
>
> Thanks,
> -Stolee

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 5/7] merge-ort: defer recursing into directories when merge base is matched
  2021-07-15 14:43     ` Derrick Stolee
@ 2021-07-15 16:03       ` Elijah Newren
  2021-07-15 17:14         ` Derrick Stolee
  0 siblings, 1 reply; 52+ messages in thread
From: Elijah Newren @ 2021-07-15 16:03 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Elijah Newren via GitGitGadget, Git Mailing List,
	Ævar Arnfjörð Bjarmason

On Thu, Jul 15, 2021 at 7:43 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 7/13/2021 3:33 PM, Elijah Newren via GitGitGadget wrote:
> > From: Elijah Newren <newren@gmail.com>
> ...
> > +             /*
> > +              * Check for whether we can avoid recursing due to one side
> > +              * matching the merge base.  The side that does NOT match is
> > +              * the one that might have a rename destination we need.
> > +              */
> > +             assert(!side1_matches_mbase || !side2_matches_mbase);
> > +             side = side1_matches_mbase ? MERGE_SIDE2 :
> > +                     side2_matches_mbase ? MERGE_SIDE1 : MERGE_BASE;
> > +             if (filemask == 0 && (dirmask == 2 || dirmask == 4)) {
> > +                     /*
> > +                      * Also defer recursing into new directories; set up a
> > +                      * few variables to let us do so.
> > +                      */
> > +                     ci->match_mask = (7 - dirmask);
> > +                     side = dirmask / 2;
>
> Clever.
>
> > +             }
> > +             if (renames->dir_rename_mask != 0x07 &&
> > +                 (side != MERGE_BASE) &&
>
> nit: these parens are not necessary?

Indeed; I'll remove them.

> > +                 renames->trivial_merges_okay[side] &&
> > +                 !strset_contains(&renames->target_dirs[side], pi.string)) {
>
> Does this last condition mean "this directory is not already a parent of a
> chosen rename target"?

I'm not sure what you mean by "chosen" here.  If by "chosen" you mean
"cached" (which comes from ort-perf-batch-11's caching of upstream
renames from previously cherry-picked commits), then yes.

Perhaps it's worth noting that rename detection has not yet been run
for this merge by the time we hit this logic, so the only renames we
can know are the cached ones from a previous pick.

> > +                     strintmap_set(&renames->possible_trivial_merges[side],
> > +                                   pi.string, renames->dir_rename_mask);
> > +                     renames->dir_rename_mask = prev_dir_rename_mask;
> > +                     return mask;
> > +             }
> > +
> > +             /* We need to recurse */
> >               ci->match_mask &= filemask;
> Thanks,
> -Stolee

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 6/7] merge-ort: avoid recursing into directories when we don't need to
  2021-07-15 14:55     ` Derrick Stolee
@ 2021-07-15 16:28       ` Elijah Newren
  0 siblings, 0 replies; 52+ messages in thread
From: Elijah Newren @ 2021-07-15 16:28 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Elijah Newren via GitGitGadget, Git Mailing List,
	Ævar Arnfjörð Bjarmason

On Thu, Jul 15, 2021 at 7:55 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 7/13/2021 3:33 PM, Elijah Newren via GitGitGadget wrote:
> > From: Elijah Newren <newren@gmail.com>
> ...
> > For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
> > performance work; instrument with trace2_region_* calls", 2020-10-28),
> > this change improves the performance as follows:
> >
> >                             Before                  After
> >     no-renames:        5.235 s ±  0.042 s   205.1  ms ±  3.8  ms
>
> Wow! This is quite the savings, and reinforces that when no renames
> exist we are likely to hit this optimization.

Just to clarify, "no-renames" was an unfortunate misnomer.  I should
have called it "few-renames".

But yeah, I was pretty amazed a year ago (when I initially implemented
this optimization) seeing the speedups on this testcase -- a testcase
that I hadn't even been paying much attention to beyond trying to
avoid regressing it.

> >     mega-renames:      9.419 s ±  0.107 s     1.564 s ±  0.010 s
>
> I'm surprised that this one works so well, too. Clearly, there must
> be a lot of directories that can be skipped despite many renames
> existing.

One small reason it works is that some of the commits don't touch any
paths involved in the big directory rename.  That means that for those
commits, all the renames under that big directory are irrelevant.
This fact makes the optimization work for 4 of the 35 patches.  (Of
course, I did pick that particular series of commits for rebasing
because it touched the directory being renamed a lot, including adding
files to it, which made it a better stress test case for my
optimizations.  But that does make it less likely for the
irrelevant-renames criteria to help us out on this particular
optimization.)

The bigger reason this optimization works so well on this testcase, is
the mere fact that this rebase involved a sequence of 35 commits.  The
optimization doesn't work for the first commit (as noted by the
just-one-mega testcase below).  However, once the first commit is
picked, renames on the upstream side that were relevant in the first
commit are cached, so the optimization does trigger for many of the
subsequent commits.  And whenever the optimization doesn't work on a
subsequent commit -- which happens because the new commit modifies
different files that make other renames become relevant -- then after
detecting the new renames those also get cached.  More cached renames
means a higher likelihood that we can skip recursing into directories
when rebasing subsequent commits.

> >     just-one-mega:   480.1  ms ±  3.9  ms   479.5  ms ±  3.9  ms
>
> And no overhead added to this case. Good.
>
> > +             /* Loop over the set of paths we need to know rename info for */
> > +             strset_for_each_entry(&renames->relevant_sources[side],
> > +                                   &iter, entry) {
> > +                     char *rename_target, *dir, *dir_marker;
> > +                     struct strmap_entry *e;
> > +
> > +                     /*
> > +                      * if we don't know delete/rename info for this path,
> > +                      * then we need to recurse into all trees to get all
> > +                      * adds to make sure we have it.
> > +                      */
>
> super-nit: s/if/If/

Will fix.

> Otherwise, I can't speak to the code being 100% correct because
> this area is so dense. I find the code to be well organized,
> which will help finding and fixing any potential bugs that might
> show up in strange corner cases later.
>
> Thanks,
> -Stolee

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 7/7] merge-ort: restart merge with cached renames to reduce process entry cost
  2021-07-15 15:09     ` Derrick Stolee
@ 2021-07-15 16:53       ` Elijah Newren
  2021-07-15 17:19         ` Derrick Stolee
  0 siblings, 1 reply; 52+ messages in thread
From: Elijah Newren @ 2021-07-15 16:53 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Elijah Newren via GitGitGadget, Git Mailing List,
	Ævar Arnfjörð Bjarmason

On Thu, Jul 15, 2021 at 8:10 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 7/13/2021 3:33 PM, Elijah Newren via GitGitGadget wrote:
> > From: Elijah Newren <newren@gmail.com>
> ...
> > Often, I noticed that when the optimization does not apply, it is
> > because there are a handful of relevant sources -- maybe even only one.
> > It felt frustrating to need to recurse into potentially hundreds or even
> > thousands of directories just for a single rename, but it was needed for
> > correctness.
> >
> > However, staring at this list of functions and noticing that
> > process_entries() is the most expensive and knowing I could avoid it if
> > I had cached renames suggested a simple idea: change
> >    collect_merge_info()
> >    detect_and_process_renames()
> >    process_entries()
> > into
> >    collect_merge_info()
> >    detect_and_process_renames()
> >    <cache all the renames, and restart>
> >    collect_merge_info()
> >    detect_and_process_renames()
> >    process_entries()
> >
> > This may seem odd and look like more work.  However, note that although
> > we run collect_merge_info() twice, the second time we get to employ
> > trivial directory resolves, which makes it much faster, so the increased
> > time in collect_merge_info() is small.  While we run
> > detect_and_process_renames() again, all renames are cached so it's
> > nearly a no-op (we don't call into diffcore_rename_extended() but we do
> > have a little bit of data structure checking and fixing up).  And the
> > big payoff comes from the fact that process_entries(), will be much
> > faster due to having far fewer entries to process.
>
> I enjoyed the story you tell here.

:-)

> > +     if (path_count_after) {
> > +             /*
> > +              * Not sure were the right cut-off is for the optimization
> > +              * to redo collect_merge_info after we've cached the
> > +              * regular renames is.  Basically, collect_merge_info(),
> > +              * detect_regular_renames(), and process_entries() are
> > +              * similar costs and all big tent poles.  Caching the
> > +              * result of detect_regular_renames() means that redoing
> > +              * that one function will cost us virtually 0 extra, so it
> > +              * depends on the other two functions, which are both O(N)
> > +              * cost in the number of paths.  Thus, it makes sense that
> > +              * if we can cut the number of paths in half, then redoing
> > +              * collect_merge_info() at half cost in order to get
> > +              * process_entries() at half cost should be about equal
> > +              * cost.  If we can cut by more than half, then we would
> > +              * win.  The fact that process_entries() is about 10%-20%
> > +              * more expensive than collect_merge_info() suggests we
> > +              * could make the factor be less than two.  The fact that
> > +              * even when we have renames cached, we still have to
> > +              * traverse down to the individual (relevant) renames,
> > +              * which suggests we should perhaps use a bigger factor.
> > +              *
> > +              * The exact number isn't critical, since the code will
> > +              * work even if we get the factor wrong -- it just might be
> > +              * slightly slower if we're a bit off.  For now, just error
> > +              * on the side of a bigger fudge.  For the linux kernel
>
> super-nit: s/linux/Linux/

Yeah, I tend to refer to projects by the name of their repository
instead of their proper name.  (I do it with git too.)  Bad habit.
Will fix.  That is, I will fix this instance.  Not sure I can fix the
habit.

> > +              * testcases I was looking at with massive renames, the
> > +              * ratio came in around 50 to 250, which clearly would
> > +              * trigger this optimization and provided some *very* nice
> > +              * speedups.
>
> This bit of your testing might be more appropriate for your commit
> message. This discussion of a test made at a certain point in time
> is more likely to go stale than the description of how this factor
> does not change correctness, only performance.

The commit message does include discussion on how this factor only
changes performance, not correctness.  I left this comment in the code
because I figured it looked weird and magic and deserved an
explanation without resorting to git-log or git-blame.  Granted, I
wrote this comment and the commit message at much different times (I
wrote the comment first, then the commit message many months later)
and thus summarized a bit differently.  But I thought I had the same
relevant content in both places.

Are there pieces you feel are missing from the commit message?  Should
I trim this comment down in the code and just let people look for the
commit message for more details?

> > +              */
> > +             int wanted_factor = 3;
>
> and perhaps make it 'const'?

Sure, will do.

> > +
> > +             /* We should only redo collect_merge_info one time */
> > +             assert(renames->redo_after_renames == 0);
> > +
> > +             if (path_count_after / path_count_before > wanted_factor) {
>
> With truncation from integer division, this condition is equivalent* to
>
>         path_count_after >= 4 * path_count_before
>
> Or, do you want to change this to a ">=" so that the factor of 3 seems
> more accurate?
>
> *I understand the intention of using division to avoid (unlikely)
> overflow via multiplication. The truncation is causing some confusion.

Good catch; I'll fix it up to use ">=".

> > -test_expect_merge_algorithm failure failure '12f: Trivial directory resolve, caching, all kinds of fun' '
> > +test_expect_merge_algorithm failure success '12f: Trivial directory resolve, caching, all kinds of fun' '
> >       test_setup_12f &&
> >       (
> >               cd 12f &&
> >
>
> Oh, and a bonus test success! Excellent.

Yeah, this testcase was slightly weird in the order I sent it
upstream.  12f was written specifically with this optimization in mind
as a way of ensuring code coverage of the restart logic.  I would have
waited to submit the 12f testcase with this series, but the testcase
also demonstrated an important directory rename detection bug that I
found existed in both merge-recursive and merge-ort at the time.  See
commit 902c521a35 ("t6423: more involved directory rename test",
2020-10-15)

The merge-ort bug was fixed with commit 203c872c4f ("merge-ort: fix a
directory rename detection bug", 2021-01-19).  The merge-recursive bug
still exists.

So, this series just fixed up the final thing that 12f was testing for
-- namely that it included two collect_merge_info() region_enter
trace2 calls per commit instead of just one.

Perhaps I could have split the test, but both things require a
relatively big set of files which makes the test a bit more expensive
so I didn't want to duplicate it.  Besides, having both factors
involved makes it a better stress test of the restart logic.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 5/7] merge-ort: defer recursing into directories when merge base is matched
  2021-07-15 16:03       ` Elijah Newren
@ 2021-07-15 17:14         ` Derrick Stolee
  0 siblings, 0 replies; 52+ messages in thread
From: Derrick Stolee @ 2021-07-15 17:14 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Elijah Newren via GitGitGadget, Git Mailing List,
	Ævar Arnfjörð Bjarmason

On 7/15/2021 12:03 PM, Elijah Newren wrote:
> On Thu, Jul 15, 2021 at 7:43 AM Derrick Stolee <stolee@gmail.com> wrote:
>>
>> On 7/13/2021 3:33 PM, Elijah Newren via GitGitGadget wrote:
>>> From: Elijah Newren <newren@gmail.com>
...
>>> +                 renames->trivial_merges_okay[side] &&
>>> +                 !strset_contains(&renames->target_dirs[side], pi.string)) {
>>
>> Does this last condition mean "this directory is not already a parent of a
>> chosen rename target"?
> 
> I'm not sure what you mean by "chosen" here.  If by "chosen" you mean
> "cached" (which comes from ort-perf-batch-11's caching of upstream
> renames from previously cherry-picked commits), then yes.
> 
> Perhaps it's worth noting that rename detection has not yet been run
> for this merge by the time we hit this logic, so the only renames we
> can know are the cached ones from a previous pick.

I was missing the interaction with the cached results, which now makes
much more sense here.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 7/7] merge-ort: restart merge with cached renames to reduce process entry cost
  2021-07-15 16:53       ` Elijah Newren
@ 2021-07-15 17:19         ` Derrick Stolee
  2021-07-15 17:32           ` Elijah Newren
  0 siblings, 1 reply; 52+ messages in thread
From: Derrick Stolee @ 2021-07-15 17:19 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Elijah Newren via GitGitGadget, Git Mailing List,
	Ævar Arnfjörð Bjarmason

On 7/15/2021 12:53 PM, Elijah Newren wrote:
> On Thu, Jul 15, 2021 at 8:10 AM Derrick Stolee <stolee@gmail.com> wrote:
>>
>> On 7/13/2021 3:33 PM, Elijah Newren via GitGitGadget wrote:
>>> From: Elijah Newren <newren@gmail.com>
...
>>> +              * The exact number isn't critical, since the code will
>>> +              * work even if we get the factor wrong -- it just might be
>>> +              * slightly slower if we're a bit off.  For now, just error
>>> +              * on the side of a bigger fudge.  For the linux kernel
>>
>> super-nit: s/linux/Linux/
> 
> Yeah, I tend to refer to projects by the name of their repository
> instead of their proper name.  (I do it with git too.)  Bad habit.
> Will fix.  That is, I will fix this instance.  Not sure I can fix the
> habit.

If you had written it as torvalds/linux, then I wouldn't have batted
an eye, because that is clearly a repo name (at least, clear to me).

>>> +              * testcases I was looking at with massive renames, the
>>> +              * ratio came in around 50 to 250, which clearly would
>>> +              * trigger this optimization and provided some *very* nice
>>> +              * speedups.
>>
>> This bit of your testing might be more appropriate for your commit
>> message. This discussion of a test made at a certain point in time
>> is more likely to go stale than the description of how this factor
>> does not change correctness, only performance.
> 
> The commit message does include discussion on how this factor only
> changes performance, not correctness.  I left this comment in the code
> because I figured it looked weird and magic and deserved an
> explanation without resorting to git-log or git-blame.  Granted, I
> wrote this comment and the commit message at much different times (I
> wrote the comment first, then the commit message many months later)
> and thus summarized a bit differently.  But I thought I had the same
> relevant content in both places.
> 
> Are there pieces you feel are missing from the commit message?  Should
> I trim this comment down in the code and just let people look for the
> commit message for more details?

I meant to say "these kinds of details are better for the commit
message instead of comments" and not say "your commit message
doesn't have enough."

I don't feel strongly about this being present in the comment or
not, but it seems like something that could be dropped from the
comment without loss of information.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 7/7] merge-ort: restart merge with cached renames to reduce process entry cost
  2021-07-15 17:19         ` Derrick Stolee
@ 2021-07-15 17:32           ` Elijah Newren
  0 siblings, 0 replies; 52+ messages in thread
From: Elijah Newren @ 2021-07-15 17:32 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Elijah Newren via GitGitGadget, Git Mailing List,
	Ævar Arnfjörð Bjarmason

On Thu, Jul 15, 2021 at 10:19 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 7/15/2021 12:53 PM, Elijah Newren wrote:
> > On Thu, Jul 15, 2021 at 8:10 AM Derrick Stolee <stolee@gmail.com> wrote:
> >>
> >> On 7/13/2021 3:33 PM, Elijah Newren via GitGitGadget wrote:
> >>> From: Elijah Newren <newren@gmail.com>
> ...
> >>> +              * The exact number isn't critical, since the code will
> >>> +              * work even if we get the factor wrong -- it just might be
> >>> +              * slightly slower if we're a bit off.  For now, just error
> >>> +              * on the side of a bigger fudge.  For the linux kernel
> >>
> >> super-nit: s/linux/Linux/
> >
> > Yeah, I tend to refer to projects by the name of their repository
> > instead of their proper name.  (I do it with git too.)  Bad habit.
> > Will fix.  That is, I will fix this instance.  Not sure I can fix the
> > habit.
>
> If you had written it as torvalds/linux, then I wouldn't have batted
> an eye, because that is clearly a repo name (at least, clear to me).

Yeah, the thing is I use the repo name even in cases where it's not
ambiguous whether the project or the repo is meant.  For example, I
would often write something like "I'm part of the git project."
Christian has been trying to fix my capitalization.  Folks in another
project tried to "fix" my capitalization too.  Perhaps by demanding
five letters of all caps, they were able to get one out of me[1].  I
suspect, though, that if they had just a single repo instead of having
hundreds of repositories, they might not have even gotten that one
letter of capitalization from me.  ;-)

[1] https://blogs.gnome.org/newren/2006/04/22/enlightening-mankind-about-the-correct-spelling-of-gnome/

> >>> +              * testcases I was looking at with massive renames, the
> >>> +              * ratio came in around 50 to 250, which clearly would
> >>> +              * trigger this optimization and provided some *very* nice
> >>> +              * speedups.
> >>
> >> This bit of your testing might be more appropriate for your commit
> >> message. This discussion of a test made at a certain point in time
> >> is more likely to go stale than the description of how this factor
> >> does not change correctness, only performance.
> >
> > The commit message does include discussion on how this factor only
> > changes performance, not correctness.  I left this comment in the code
> > because I figured it looked weird and magic and deserved an
> > explanation without resorting to git-log or git-blame.  Granted, I
> > wrote this comment and the commit message at much different times (I
> > wrote the comment first, then the commit message many months later)
> > and thus summarized a bit differently.  But I thought I had the same
> > relevant content in both places.
> >
> > Are there pieces you feel are missing from the commit message?  Should
> > I trim this comment down in the code and just let people look for the
> > commit message for more details?
>
> I meant to say "these kinds of details are better for the commit
> message instead of comments" and not say "your commit message
> doesn't have enough."
>
> I don't feel strongly about this being present in the comment or
> not, but it seems like something that could be dropped from the
> comment without loss of information.

Ah, makes sense.  I'll trim it down.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v3 0/7] Optimization batch 14: trivial directory resolution
  2021-07-13 19:32 ` [PATCH v2 " Elijah Newren via GitGitGadget
                     ` (6 preceding siblings ...)
  2021-07-13 19:33   ` [PATCH v2 7/7] merge-ort: restart merge with cached renames to reduce process entry cost Elijah Newren via GitGitGadget
@ 2021-07-16  5:22   ` Elijah Newren via GitGitGadget
  2021-07-16  5:22     ` [PATCH v3 1/7] merge-ort: resolve paths early when we have sufficient information Elijah Newren via GitGitGadget
                       ` (8 more replies)
  7 siblings, 9 replies; 52+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-07-16  5:22 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Ævar Arnfjörð Bjarmason,
	Elijah Newren, Bagas Sanjaya, Elijah Newren

This series depends textually on ort-perf-batch-12, but is semantically
independent. (It is both semantically and textually independent of
ort-perf-batch-13.)

Most of my previous series dramatically accelerated cases with lots of
renames, while providing comparatively minor benefits for cases with few or
no renames. This series is the opposite; it provides huge benefits when
there are few or no renames, and comparatively smaller (though still quite
decent) benefits for cases with many uncached renames.

Changes since v2, addressing feedback from Stolee:

 * Created a separate struct for three related variables to hint they are
   related
 * Simplified a lengthy comment that was duplicated by the commit message
 * Various other minor cleanups

Changes since v1:

 * Minor tweak to the final patch to correct implicit assumption that rename
   detection running implies all renames were found (rename limits could
   have been exceeded and prevented finding renames)

=== Basic Optimization idea ===

unpack_trees has had a concept of trivial merges for individual files (see
Documentation/technical/trivial-merge.txt). The same idea can be applied in
merge-ort. It'd be really nice to extend that idea to trees as well, as it
could provide a huge performance boost; sadly however, applying it in
general would wreck both regular rename detection (the unmatched side can
have new files that serve as potential destinations in rename detection) and
directory rename detection (the unmatched side could have a new directory
that was moved into it).

If we somehow knew rename detection wasn't needed, we could do trivial
directory resolution. In the past, this wasn't possible. However...

With recent optimizations we have created a possibility to do trivial
directory resolutions in some cases. These came from the addition of the
"skipping irrelevant renames" optimizations (from ort-perf-batch-9 and
ort-perf-batch-10), and in particular noting that we added an ability to
entirely skip rename detection in commit f89b4f2bee ("merge-ort: skip rename
detection entirely if possible", 2021-03-11) when there are no relevant
sources. We can detect if there are no relevant sources without recursing
into the directories in question.

As a cherry on top, the caching of renames (from ort-perf-batch-11) allows
us to cover additional cases.

This series is all about adding all the special checks needed to safely
perform trival directory resolutions.

=== Results ===

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28), the
changes in just this series improves the performance as follows:

                     Before Series           After Series
no-renames:        5.235 s ±  0.042 s   204.2  ms ±  3.0  ms
mega-renames:      9.419 s ±  0.107 s     1.076 s ±  0.015 s
just-one-mega:   480.1  ms ±  3.9  ms   364.1  ms ±  7.0  ms


As a reminder, before any merge-ort/diffcore-rename performance work, the
performance results we started with (for merge-recursive as of git-2.30.0)
were:

no-renames-am:      6.940 s ±  0.485 s
no-renames:        18.912 s ±  0.174 s
mega-renames:    5964.031 s ± 10.459 s
just-one-mega:    149.583 s ±  0.751 s


Elijah Newren (7):
  merge-ort: resolve paths early when we have sufficient information
  merge-ort: add some more explanations in collect_merge_info_callback()
  merge-ort: add data structures for allowable trivial directory
    resolves
  merge-ort: add a handle_deferred_entries() helper function
  merge-ort: defer recursing into directories when merge base is matched
  merge-ort: avoid recursing into directories when we don't need to
  merge-ort: restart merge with cached renames to reduce process entry
    cost

 merge-ort.c                         | 399 +++++++++++++++++++++++++++-
 t/t6423-merge-rename-directories.sh |   2 +-
 2 files changed, 389 insertions(+), 12 deletions(-)


base-commit: 2eeee12b02e441ac05054a5a5ecbcea6964a1e6b
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-988%2Fnewren%2Fort-perf-batch-14-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-988/newren/ort-perf-batch-14-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/988

Range-diff vs v2:

 1:  5dca982c0b0 = 1:  5dca982c0b0 merge-ort: resolve paths early when we have sufficient information
 2:  8aea3713902 = 2:  8aea3713902 merge-ort: add some more explanations in collect_merge_info_callback()
 3:  f7ac01055d9 ! 3:  c2b45fef1d7 merge-ort: add data structures for allowable trivial directory resolves
     @@ Commit message
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
       ## merge-ort.c ##
     -@@ merge-ort.c: struct rename_info {
     - 	 */
     - 	struct strintmap relevant_sources[3];
     +@@ merge-ort.c: struct traversal_callback_data {
     + 	struct name_entry names[3];
     + };
       
     ++struct deferred_traversal_data {
      +	/*
     -+	 * possible_trivial_merges: directories we defer recursing into
     ++	 * possible_trivial_merges: directories to be explored only when needed
      +	 *
      +	 * possible_trivial_merges is a map of directory names to
      +	 * dir_rename_mask.  When we detect that a directory is unchanged on
     @@ merge-ort.c: struct rename_info {
      +	 * information we need, other than the path, to resume the recursive
      +	 * traversal.
      +	 */
     -+	struct strintmap possible_trivial_merges[3];
     ++	struct strintmap possible_trivial_merges;
      +
      +	/*
      +	 * trivial_merges_okay: if trivial directory merges are okay
     @@ merge-ort.c: struct rename_info {
      +	 * to 1 initially and only clear when we determine it is unsafe to
      +	 * do trivial directory merges.
      +	 */
     -+	unsigned trivial_merges_okay[3];
     ++	unsigned trivial_merges_okay;
      +
      +	/*
      +	 * target_dirs: ancestor directories of rename targets
     @@ merge-ort.c: struct rename_info {
      +	 * target_dirs contains all directory names that are an ancestor of
      +	 * any rename destination.
      +	 */
     -+	struct strset target_dirs[3];
     ++	struct strset target_dirs;
     ++};
     ++
     + struct rename_info {
     + 	/*
     + 	 * All variables that are arrays of size 3 correspond to data tracked
     +@@ merge-ort.c: struct rename_info {
     + 	 */
     + 	struct strintmap relevant_sources[3];
     + 
     ++	struct deferred_traversal_data deferred[3];
      +
       	/*
       	 * dir_rename_mask:
       	 *   0: optimization removing unmodified potential rename source okay
      @@ merge-ort.c: static void clear_or_reinit_internal_opts(struct merge_options_internal *opti,
     - 		strintmap_func(&renames->dirs_removed[i]);
     - 		strmap_func(&renames->dir_renames[i], 0);
     - 		strintmap_func(&renames->relevant_sources[i]);
     -+		strintmap_func(&renames->possible_trivial_merges[i]);
     -+		strset_func(&renames->target_dirs[i]);
     -+		renames->trivial_merges_okay[i] = 1; /* 1 == maybe */
     - 		if (!reinitialize)
     - 			assert(renames->cached_pairs_valid_side == 0);
     - 		if (i != renames->cached_pairs_valid_side) {
     + 				strmap_clear(&renames->dir_rename_count[i], 1);
     + 		}
     + 	}
     ++	for (i = MERGE_SIDE1; i <= MERGE_SIDE2; ++i) {
     ++		strintmap_func(&renames->deferred[i].possible_trivial_merges);
     ++		strset_func(&renames->deferred[i].target_dirs);
     ++		renames->deferred[i].trivial_merges_okay = 1; /* 1 == maybe */
     ++	}
     + 	renames->cached_pairs_valid_side = 0;
     + 	renames->dir_rename_mask = 0;
     + 
      @@ merge-ort.c: static void merge_start(struct merge_options *opt, struct merge_result *result)
     - 		strintmap_init_with_options(&renames->relevant_sources[i],
     - 					    -1 /* explicitly invalid */,
     - 					    NULL, 0);
     -+		strintmap_init_with_options(&renames->possible_trivial_merges[i],
     -+					    0, NULL, 0);
     -+		strset_init_with_options(&renames->target_dirs[i],
     -+					 NULL, 1);
     - 		strmap_init_with_options(&renames->cached_pairs[i],
     - 					 NULL, 1);
     - 		strset_init_with_options(&renames->cached_irrelevant[i],
     - 					 NULL, 1);
       		strset_init_with_options(&renames->cached_target_names[i],
       					 NULL, 0);
     -+		renames->trivial_merges_okay[i] = 1; /* 1 == maybe */
       	}
     ++	for (i = MERGE_SIDE1; i <= MERGE_SIDE2; i++) {
     ++		strintmap_init_with_options(&renames->deferred[i].possible_trivial_merges,
     ++					    0, NULL, 0);
     ++		strset_init_with_options(&renames->deferred[i].target_dirs,
     ++					 NULL, 1);
     ++		renames->deferred[i].trivial_merges_okay = 1; /* 1 == maybe */
     ++	}
       
       	/*
     + 	 * Although we initialize opt->priv->paths with strdup_strings=0,
 4:  7e28323b624 ! 4:  1cf4a47562a merge-ort: add a handle_deferred_entries() helper function
     @@ merge-ort.c: static int collect_merge_info_callback(int n,
      +	int side, ret = 0;
      +
      +	for (side = MERGE_SIDE1; side <= MERGE_SIDE2; side++) {
     -+		renames->trivial_merges_okay[side] = 0;
     -+		strintmap_for_each_entry(&renames->possible_trivial_merges[side],
     ++		renames->deferred[side].trivial_merges_okay = 0;
     ++		strintmap_for_each_entry(&renames->deferred[side].possible_trivial_merges,
      +					 &iter, entry) {
      +			const char *path = entry->key;
      +			unsigned dir_rename_mask = (intptr_t)entry->value;
 5:  317553eadb6 ! 5:  79c51536829 merge-ort: defer recursing into directories when merge base is matched
     @@ merge-ort.c: static int collect_merge_info_callback(int n,
      +			side = dirmask / 2;
      +		}
      +		if (renames->dir_rename_mask != 0x07 &&
     -+		    (side != MERGE_BASE) &&
     -+		    renames->trivial_merges_okay[side] &&
     -+		    !strset_contains(&renames->target_dirs[side], pi.string)) {
     -+			strintmap_set(&renames->possible_trivial_merges[side],
     ++		    side != MERGE_BASE &&
     ++		    renames->deferred[side].trivial_merges_okay &&
     ++		    !strset_contains(&renames->deferred[side].target_dirs,
     ++				     pi.string)) {
     ++			strintmap_set(&renames->deferred[side].possible_trivial_merges,
      +				      pi.string, renames->dir_rename_mask);
      +			renames->dir_rename_mask = prev_dir_rename_mask;
      +			return mask;
 6:  3409a6cd631 ! 6:  572cc5e94d2 merge-ort: avoid recursing into directories when we don't need to
     @@ merge-ort.c: static int handle_deferred_entries(struct merge_options *opt,
       	int side, ret = 0;
       
       	for (side = MERGE_SIDE1; side <= MERGE_SIDE2; side++) {
     --		renames->trivial_merges_okay[side] = 0;
     --		strintmap_for_each_entry(&renames->possible_trivial_merges[side],
     +-		renames->deferred[side].trivial_merges_okay = 0;
     +-		strintmap_for_each_entry(&renames->deferred[side].possible_trivial_merges,
      -					 &iter, entry) {
      +		unsigned optimization_okay = 1;
      +		struct strintmap copy;
     @@ merge-ort.c: static int handle_deferred_entries(struct merge_options *opt,
      +			struct strmap_entry *e;
      +
      +			/*
     -+			 * if we don't know delete/rename info for this path,
     ++			 * If we don't know delete/rename info for this path,
      +			 * then we need to recurse into all trees to get all
      +			 * adds to make sure we have it.
      +			 */
     @@ merge-ort.c: static int handle_deferred_entries(struct merge_options *opt,
      +			dir = xstrdup(rename_target);
      +			while ((dir_marker = strrchr(dir, '/'))) {
      +				*dir_marker = '\0';
     -+				if (strset_contains(&renames->target_dirs[side],
     ++				if (strset_contains(&renames->deferred[side].target_dirs,
      +						    dir))
      +					break;
     -+				strset_add(&renames->target_dirs[side], dir);
     ++				strset_add(&renames->deferred[side].target_dirs,
     ++					   dir);
      +			}
      +			free(dir);
      +		}
     -+		renames->trivial_merges_okay[side] = optimization_okay;
     ++		renames->deferred[side].trivial_merges_okay = optimization_okay;
      +		/*
      +		 * We need to recurse into any directories in
      +		 * possible_trivial_merges[side] found in target_dirs[side].
     @@ merge-ort.c: static int handle_deferred_entries(struct merge_options *opt,
      +		 * and then we'll also iterate anything added into
      +		 * possible_trivial_merges[side] once this loop is done.
      +		 */
     -+		copy = renames->possible_trivial_merges[side];
     -+		strintmap_init_with_options(&renames->possible_trivial_merges[side],
     ++		copy = renames->deferred[side].possible_trivial_merges;
     ++		strintmap_init_with_options(&renames->deferred[side].possible_trivial_merges,
      +					    0,
      +					    NULL,
      +					    0);
     @@ merge-ort.c: static int handle_deferred_entries(struct merge_options *opt,
       			dirmask = ci->dirmask;
       
      +			if (optimization_okay &&
     -+			    !strset_contains(&renames->target_dirs[side],
     ++			    !strset_contains(&renames->deferred[side].target_dirs,
      +					     path)) {
      +				resolve_trivial_directory_merge(ci, side);
      +				continue;
     @@ merge-ort.c: static int handle_deferred_entries(struct merge_options *opt,
       				return ret;
       		}
      +		strintmap_clear(&copy);
     -+		strintmap_for_each_entry(&renames->possible_trivial_merges[side],
     ++		strintmap_for_each_entry(&renames->deferred[side].possible_trivial_merges,
      +					 &iter, entry) {
      +			const char *path = entry->key;
      +			struct conflict_info *ci;
     @@ merge-ort.c: static int handle_deferred_entries(struct merge_options *opt,
      +			ci = strmap_get(&opt->priv->paths, path);
      +			VERIFY_CI(ci);
      +
     -+			assert(renames->trivial_merges_okay[side] &&
     -+			       !strset_contains(&renames->target_dirs[side],
     ++			assert(renames->deferred[side].trivial_merges_okay &&
     ++			       !strset_contains(&renames->deferred[side].target_dirs,
      +						path));
      +			resolve_trivial_directory_merge(ci, side);
      +		}
 7:  7133f0efa52 ! 7:  a9cbc1d4f18 merge-ort: restart merge with cached renames to reduce process entry cost
     @@ merge-ort.c: struct rename_info {
       	 * needed_limit: value needed for inexact rename detection to run
       	 *
      @@ merge-ort.c: static void clear_or_reinit_internal_opts(struct merge_options_internal *opti,
     - 		renames->trivial_merges_okay[i] = 1; /* 1 == maybe */
     + 		strintmap_func(&renames->relevant_sources[i]);
       		if (!reinitialize)
       			assert(renames->cached_pairs_valid_side == 0);
      -		if (i != renames->cached_pairs_valid_side) {
     @@ merge-ort.c: static int handle_deferred_entries(struct merge_options *opt,
       	}
      +	if (path_count_after) {
      +		/*
     -+		 * Not sure were the right cut-off is for the optimization
     -+		 * to redo collect_merge_info after we've cached the
     -+		 * regular renames is.  Basically, collect_merge_info(),
     -+		 * detect_regular_renames(), and process_entries() are
     -+		 * similar costs and all big tent poles.  Caching the
     -+		 * result of detect_regular_renames() means that redoing
     -+		 * that one function will cost us virtually 0 extra, so it
     -+		 * depends on the other two functions, which are both O(N)
     -+		 * cost in the number of paths.  Thus, it makes sense that
     -+		 * if we can cut the number of paths in half, then redoing
     -+		 * collect_merge_info() at half cost in order to get
     -+		 * process_entries() at half cost should be about equal
     -+		 * cost.  If we can cut by more than half, then we would
     -+		 * win.  The fact that process_entries() is about 10%-20%
     -+		 * more expensive than collect_merge_info() suggests we
     -+		 * could make the factor be less than two.  The fact that
     -+		 * even when we have renames cached, we still have to
     -+		 * traverse down to the individual (relevant) renames,
     -+		 * which suggests we should perhaps use a bigger factor.
     -+		 *
     -+		 * The exact number isn't critical, since the code will
     -+		 * work even if we get the factor wrong -- it just might be
     -+		 * slightly slower if we're a bit off.  For now, just error
     -+		 * on the side of a bigger fudge.  For the linux kernel
     -+		 * testcases I was looking at with massive renames, the
     -+		 * ratio came in around 50 to 250, which clearly would
     -+		 * trigger this optimization and provided some *very* nice
     -+		 * speedups.
     ++		 * The choice of wanted_factor here does not affect
     ++		 * correctness, only performance.  When the
     ++		 *    path_count_after / path_count_before
     ++		 * ratio is high, redoing after renames is a big
     ++		 * performance boost.  I suspect that redoing is a wash
     ++		 * somewhere near a value of 2, and below that redoing will
     ++		 * slow things down.  I applied a fudge factor and picked
     ++		 * 3; see the commit message when this was introduced for
     ++		 * back of the envelope calculations for this ratio.
      +		 */
     -+		int wanted_factor = 3;
     ++		const int wanted_factor = 3;
      +
      +		/* We should only redo collect_merge_info one time */
      +		assert(renames->redo_after_renames == 0);
      +
     -+		if (path_count_after / path_count_before > wanted_factor) {
     ++		if (path_count_after / path_count_before >= wanted_factor) {
      +			renames->redo_after_renames = 1;
      +			renames->cached_pairs_valid_side = -1;
      +		}

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v3 1/7] merge-ort: resolve paths early when we have sufficient information
  2021-07-16  5:22   ` [PATCH v3 0/7] Optimization batch 14: trivial directory resolution Elijah Newren via GitGitGadget
@ 2021-07-16  5:22     ` Elijah Newren via GitGitGadget
  2021-07-16  5:22     ` [PATCH v3 2/7] merge-ort: add some more explanations in collect_merge_info_callback() Elijah Newren via GitGitGadget
                       ` (7 subsequent siblings)
  8 siblings, 0 replies; 52+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-07-16  5:22 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Ævar Arnfjörð Bjarmason,
	Elijah Newren, Bagas Sanjaya, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

When there are no directories involved at a given path, and all three
sides have a file at that path, and two of the three sides of history
match, we can immediately resolve the merge of that path in
collect_merge_info() and do not need to wait until process_entries().

This is actually a very minor improvement: half the time when I run it,
I see an improvement; the other half a slowdown.  It seems to be in the
range of noise.  However, this idea serves as the beginning of some
bigger optimizations coming in the following patches.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 37 +++++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/merge-ort.c b/merge-ort.c
index e3a5dfc7b31..6299b4f9413 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -1023,6 +1023,43 @@ static int collect_merge_info_callback(int n,
 		return mask;
 	}
 
+	/*
+	 * If the sides match, and all three paths are present and are
+	 * files, then we can take either as the resolution.  We can't do
+	 * this with trees, because there may be rename sources from the
+	 * merge_base.
+	 */
+	if (sides_match && filemask == 0x07) {
+		/* use side1 (== side2) version as resolution */
+		setup_path_info(opt, &pi, dirname, info->pathlen, fullpath,
+				names, names+1, side1_null, 0,
+				filemask, dirmask, 1);
+		return mask;
+	}
+
+	/*
+	 * If side1 matches mbase and all three paths are present and are
+	 * files, then we can use side2 as the resolution.  We cannot
+	 * necessarily do so this for trees, because there may be rename
+	 * destinations within side2.
+	 */
+	if (side1_matches_mbase && filemask == 0x07) {
+		/* use side2 version as resolution */
+		setup_path_info(opt, &pi, dirname, info->pathlen, fullpath,
+				names, names+2, side2_null, 0,
+				filemask, dirmask, 1);
+		return mask;
+	}
+
+	/* Similar to above but swapping sides 1 and 2 */
+	if (side2_matches_mbase && filemask == 0x07) {
+		/* use side1 version as resolution */
+		setup_path_info(opt, &pi, dirname, info->pathlen, fullpath,
+				names, names+1, side1_null, 0,
+				filemask, dirmask, 1);
+		return mask;
+	}
+
 	/*
 	 * Gather additional information used in rename detection.
 	 */
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v3 2/7] merge-ort: add some more explanations in collect_merge_info_callback()
  2021-07-16  5:22   ` [PATCH v3 0/7] Optimization batch 14: trivial directory resolution Elijah Newren via GitGitGadget
  2021-07-16  5:22     ` [PATCH v3 1/7] merge-ort: resolve paths early when we have sufficient information Elijah Newren via GitGitGadget
@ 2021-07-16  5:22     ` Elijah Newren via GitGitGadget
  2021-07-16  5:22     ` [PATCH v3 3/7] merge-ort: add data structures for allowable trivial directory resolves Elijah Newren via GitGitGadget
                       ` (6 subsequent siblings)
  8 siblings, 0 replies; 52+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-07-16  5:22 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Ævar Arnfjörð Bjarmason,
	Elijah Newren, Bagas Sanjaya, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

The previous patch possibly raises some questions about whether
additional cases in collect_merge_info_callback() can be handled early.
Add some explanations in the form of comments to help explain these
better.  While we're at it, add a few comments to denote what a few
boolean '0' or '1' values stand for.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/merge-ort.c b/merge-ort.c
index 6299b4f9413..843fa693145 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -1018,8 +1018,8 @@ static int collect_merge_info_callback(int n,
 	if (side1_matches_mbase && side2_matches_mbase) {
 		/* mbase, side1, & side2 all match; use mbase as resolution */
 		setup_path_info(opt, &pi, dirname, info->pathlen, fullpath,
-				names, names+0, mbase_null, 0,
-				filemask, dirmask, 1);
+				names, names+0, mbase_null, 0 /* df_conflict */,
+				filemask, dirmask, 1 /* resolved */);
 		return mask;
 	}
 
@@ -1061,14 +1061,24 @@ static int collect_merge_info_callback(int n,
 	}
 
 	/*
-	 * Gather additional information used in rename detection.
+	 * Sometimes we can tell that a source path need not be included in
+	 * rename detection -- namely, whenever either
+	 *    side1_matches_mbase && side2_null
+	 * or
+	 *    side2_matches_mbase && side1_null
+	 * However, we call collect_rename_info() even in those cases,
+	 * because exact renames are cheap and would let us remove both a
+	 * source and destination path.  We'll cull the unneeded sources
+	 * later.
 	 */
 	collect_rename_info(opt, names, dirname, fullpath,
 			    filemask, dirmask, match_mask);
 
 	/*
-	 * Record information about the path so we can resolve later in
-	 * process_entries.
+	 * None of the special cases above matched, so we have a
+	 * provisional conflict.  (Rename detection might allow us to
+	 * unconflict some more cases, but that comes later so all we can
+	 * do now is record the different non-null file hashes.)
 	 */
 	setup_path_info(opt, &pi, dirname, info->pathlen, fullpath,
 			names, NULL, 0, df_conflict, filemask, dirmask, 0);
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v3 3/7] merge-ort: add data structures for allowable trivial directory resolves
  2021-07-16  5:22   ` [PATCH v3 0/7] Optimization batch 14: trivial directory resolution Elijah Newren via GitGitGadget
  2021-07-16  5:22     ` [PATCH v3 1/7] merge-ort: resolve paths early when we have sufficient information Elijah Newren via GitGitGadget
  2021-07-16  5:22     ` [PATCH v3 2/7] merge-ort: add some more explanations in collect_merge_info_callback() Elijah Newren via GitGitGadget
@ 2021-07-16  5:22     ` Elijah Newren via GitGitGadget
  2021-07-16  5:22     ` [PATCH v3 4/7] merge-ort: add a handle_deferred_entries() helper function Elijah Newren via GitGitGadget
                       ` (5 subsequent siblings)
  8 siblings, 0 replies; 52+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-07-16  5:22 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Ævar Arnfjörð Bjarmason,
	Elijah Newren, Bagas Sanjaya, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

As noted a few commits ago, we can resolve individual files early if all
three sides of the merge have a file at the path and two of the three
sides match.  We would really like to do the same thing with
directories, because being able to do a trivial directory resolve means
we don't have to recurse into the directory, potentially saving us a
huge amount of time in both collect_merge_info() and process_entries().
Unfortunately, resolving directories early would mean missing any
renames whose source or destination is underneath that directory.

If we somehow knew there weren't any renames under the directory in
question, then we could resolve it early.  Sadly, it is impossible to
determine whether there are renames under the directory in question
without recursing into it, and this has traditionally kept us from ever
implementing such an optimization.

In commit f89b4f2bee ("merge-ort: skip rename detection entirely if
possible", 2021-03-11), we added an additional reason that rename
detection could be skipped entirely -- namely, if no *relevant* sources
were present.  Without completing collect_merge_info_callback(), we do
not yet know if there are no relevant sources.  However, we do know that
if the current directory on one side matches the merge base, then every
source file within that directory will not be RELEVANT_CONTENT, and a
few simple checks can often let us rule out RELEVANT_LOCATION as well.
This suggests we can just defer recursing into such directories until
the end of collect_merge_info.

Since the deferred directories are known to not add any relevant sources
due to the above properties, then if there are no relevant sources after
we've traversed all paths other than the deferred ones, then we know
there are not any relevant sources.  Under those conditions, rename
detection is unnecessary, and that means we can resolve the deferred
directories without recursing into them.

Note that the logic for skipping rename detection was also modified
further in commit 76e253793c ("merge-ort, diffcore-rename: employ cached
renames when possible", 2021-01-30); in particular rename detection can
be skipped if we already have cached renames for each relevant source.
We can take advantage of this information as well with our deferral of
recursing into directories where one side matches the merge base.

Add some data structures that we will use to do these deferrals, with
some lengthy comments explaining their purpose.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 61 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 61 insertions(+)

diff --git a/merge-ort.c b/merge-ort.c
index 843fa693145..d9263ec5aca 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -62,6 +62,53 @@ struct traversal_callback_data {
 	struct name_entry names[3];
 };
 
+struct deferred_traversal_data {
+	/*
+	 * possible_trivial_merges: directories to be explored only when needed
+	 *
+	 * possible_trivial_merges is a map of directory names to
+	 * dir_rename_mask.  When we detect that a directory is unchanged on
+	 * one side, we can sometimes resolve the directory without recursing
+	 * into it.  Renames are the only things that can prevent such an
+	 * optimization.  However, for rename sources:
+	 *   - If no parent directory needed directory rename detection, then
+	 *     no path under such a directory can be a relevant_source.
+	 * and for rename destinations:
+	 *   - If no cached rename has a target path under the directory AND
+	 *   - If there are no unpaired relevant_sources elsewhere in the
+	 *     repository
+	 * then we don't need any path under this directory for a rename
+	 * destination.  The only way to know the last item above is to defer
+	 * handling such directories until the end of collect_merge_info(),
+	 * in handle_deferred_entries().
+	 *
+	 * For each we store dir_rename_mask, since that's the only bit of
+	 * information we need, other than the path, to resume the recursive
+	 * traversal.
+	 */
+	struct strintmap possible_trivial_merges;
+
+	/*
+	 * trivial_merges_okay: if trivial directory merges are okay
+	 *
+	 * See possible_trivial_merges above.  The "no unpaired
+	 * relevant_sources elsewhere in the repository" is a single boolean
+	 * per merge side, which we store here.  Note that while 0 means no,
+	 * 1 only means "maybe" rather than "yes"; we optimistically set it
+	 * to 1 initially and only clear when we determine it is unsafe to
+	 * do trivial directory merges.
+	 */
+	unsigned trivial_merges_okay;
+
+	/*
+	 * target_dirs: ancestor directories of rename targets
+	 *
+	 * target_dirs contains all directory names that are an ancestor of
+	 * any rename destination.
+	 */
+	struct strset target_dirs;
+};
+
 struct rename_info {
 	/*
 	 * All variables that are arrays of size 3 correspond to data tracked
@@ -119,6 +166,8 @@ struct rename_info {
 	 */
 	struct strintmap relevant_sources[3];
 
+	struct deferred_traversal_data deferred[3];
+
 	/*
 	 * dir_rename_mask:
 	 *   0: optimization removing unmodified potential rename source okay
@@ -501,6 +550,11 @@ static void clear_or_reinit_internal_opts(struct merge_options_internal *opti,
 				strmap_clear(&renames->dir_rename_count[i], 1);
 		}
 	}
+	for (i = MERGE_SIDE1; i <= MERGE_SIDE2; ++i) {
+		strintmap_func(&renames->deferred[i].possible_trivial_merges);
+		strset_func(&renames->deferred[i].target_dirs);
+		renames->deferred[i].trivial_merges_okay = 1; /* 1 == maybe */
+	}
 	renames->cached_pairs_valid_side = 0;
 	renames->dir_rename_mask = 0;
 
@@ -4052,6 +4106,13 @@ static void merge_start(struct merge_options *opt, struct merge_result *result)
 		strset_init_with_options(&renames->cached_target_names[i],
 					 NULL, 0);
 	}
+	for (i = MERGE_SIDE1; i <= MERGE_SIDE2; i++) {
+		strintmap_init_with_options(&renames->deferred[i].possible_trivial_merges,
+					    0, NULL, 0);
+		strset_init_with_options(&renames->deferred[i].target_dirs,
+					 NULL, 1);
+		renames->deferred[i].trivial_merges_okay = 1; /* 1 == maybe */
+	}
 
 	/*
 	 * Although we initialize opt->priv->paths with strdup_strings=0,
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v3 4/7] merge-ort: add a handle_deferred_entries() helper function
  2021-07-16  5:22   ` [PATCH v3 0/7] Optimization batch 14: trivial directory resolution Elijah Newren via GitGitGadget
                       ` (2 preceding siblings ...)
  2021-07-16  5:22     ` [PATCH v3 3/7] merge-ort: add data structures for allowable trivial directory resolves Elijah Newren via GitGitGadget
@ 2021-07-16  5:22     ` Elijah Newren via GitGitGadget
  2021-07-16  5:22     ` [PATCH v3 5/7] merge-ort: defer recursing into directories when merge base is matched Elijah Newren via GitGitGadget
                       ` (4 subsequent siblings)
  8 siblings, 0 replies; 52+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-07-16  5:22 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Ævar Arnfjörð Bjarmason,
	Elijah Newren, Bagas Sanjaya, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

In order to allow trivial directory resolution, we first need to be able
to gather more information to determine if the optimization is safe.  To
enable that, we need a way of deferring the recursion into the directory
until a later time.  Naturally, deferring the entry into a subtree means
that we need some function that will later recurse into the subdirectory
exactly the same way that collect_merge_info_callback() would have done.

Add a helper function that does this.  For now this function is not used
but a subsequent commit will change that.  Future commits will also make
the function sometimes resolve directories instead of traversing inside.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 64 insertions(+)

diff --git a/merge-ort.c b/merge-ort.c
index d9263ec5aca..f0a07684df6 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -1202,6 +1202,70 @@ static int collect_merge_info_callback(int n,
 	return mask;
 }
 
+MAYBE_UNUSED
+static int handle_deferred_entries(struct merge_options *opt,
+				   struct traverse_info *info)
+{
+	struct rename_info *renames = &opt->priv->renames;
+	struct hashmap_iter iter;
+	struct strmap_entry *entry;
+	int side, ret = 0;
+
+	for (side = MERGE_SIDE1; side <= MERGE_SIDE2; side++) {
+		renames->deferred[side].trivial_merges_okay = 0;
+		strintmap_for_each_entry(&renames->deferred[side].possible_trivial_merges,
+					 &iter, entry) {
+			const char *path = entry->key;
+			unsigned dir_rename_mask = (intptr_t)entry->value;
+			struct conflict_info *ci;
+			unsigned dirmask;
+			struct tree_desc t[3];
+			void *buf[3] = {NULL,};
+			int i;
+
+			ci = strmap_get(&opt->priv->paths, path);
+			VERIFY_CI(ci);
+			dirmask = ci->dirmask;
+
+			info->name = path;
+			info->namelen = strlen(path);
+			info->pathlen = info->namelen + 1;
+
+			for (i = 0; i < 3; i++, dirmask >>= 1) {
+				if (i == 1 && ci->match_mask == 3)
+					t[1] = t[0];
+				else if (i == 2 && ci->match_mask == 5)
+					t[2] = t[0];
+				else if (i == 2 && ci->match_mask == 6)
+					t[2] = t[1];
+				else {
+					const struct object_id *oid = NULL;
+					if (dirmask & 1)
+						oid = &ci->stages[i].oid;
+					buf[i] = fill_tree_descriptor(opt->repo,
+								      t+i, oid);
+				}
+			}
+
+			ci->match_mask &= ci->filemask;
+			opt->priv->current_dir_name = path;
+			renames->dir_rename_mask = dir_rename_mask;
+			if (renames->dir_rename_mask == 0 ||
+			    renames->dir_rename_mask == 0x07)
+				ret = traverse_trees(NULL, 3, t, info);
+			else
+				ret = traverse_trees_wrapper(NULL, 3, t, info);
+
+			for (i = MERGE_BASE; i <= MERGE_SIDE2; i++)
+				free(buf[i]);
+
+			if (ret < 0)
+				return ret;
+		}
+	}
+	return ret;
+}
+
 static int collect_merge_info(struct merge_options *opt,
 			      struct tree *merge_base,
 			      struct tree *side1,
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v3 5/7] merge-ort: defer recursing into directories when merge base is matched
  2021-07-16  5:22   ` [PATCH v3 0/7] Optimization batch 14: trivial directory resolution Elijah Newren via GitGitGadget
                       ` (3 preceding siblings ...)
  2021-07-16  5:22     ` [PATCH v3 4/7] merge-ort: add a handle_deferred_entries() helper function Elijah Newren via GitGitGadget
@ 2021-07-16  5:22     ` Elijah Newren via GitGitGadget
  2021-07-16  5:22     ` [PATCH v3 6/7] merge-ort: avoid recursing into directories when we don't need to Elijah Newren via GitGitGadget
                       ` (3 subsequent siblings)
  8 siblings, 0 replies; 52+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-07-16  5:22 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Ævar Arnfjörð Bjarmason,
	Elijah Newren, Bagas Sanjaya, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

When one side of history matches the merge base (including when the
merge base has no entry for the given directory), have
collect_merge_info_callback() defer recursing into the directory.  To
ensure those entries are eventually handled, add a call to
handled_deferred_entries() in collect_merge_info() after
traverse_trees() returns.

Note that the condition in collect_merge_info_callback() may look more
complicated than necessary at first glance;
renames->trivial_merges_okay[side] is always true until
handle_deferred_entries() is called, and possible_trivial_merges[side]
is always empty right now (and in the future won't be filled until
handle_deferred_entries() is called).  However, when
handle_deferred_entries() calls traverse_trees() for the relevant
deferred directories, those traverse_trees() calls will once again end
up in collect_merge_info_callback() for all the entries under those
subdirectories.  The extra conditions are there for such deferred cases
and will be used more as we do more with those variables.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 33 +++++++++++++++++++++++++++++++--
 1 file changed, 31 insertions(+), 2 deletions(-)

diff --git a/merge-ort.c b/merge-ort.c
index f0a07684df6..dbccf8c62e2 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -1147,8 +1147,36 @@ static int collect_merge_info_callback(int n,
 		struct tree_desc t[3];
 		void *buf[3] = {NULL, NULL, NULL};
 		const char *original_dir_name;
-		int i, ret;
+		int i, ret, side;
 
+		/*
+		 * Check for whether we can avoid recursing due to one side
+		 * matching the merge base.  The side that does NOT match is
+		 * the one that might have a rename destination we need.
+		 */
+		assert(!side1_matches_mbase || !side2_matches_mbase);
+		side = side1_matches_mbase ? MERGE_SIDE2 :
+			side2_matches_mbase ? MERGE_SIDE1 : MERGE_BASE;
+		if (filemask == 0 && (dirmask == 2 || dirmask == 4)) {
+			/*
+			 * Also defer recursing into new directories; set up a
+			 * few variables to let us do so.
+			 */
+			ci->match_mask = (7 - dirmask);
+			side = dirmask / 2;
+		}
+		if (renames->dir_rename_mask != 0x07 &&
+		    side != MERGE_BASE &&
+		    renames->deferred[side].trivial_merges_okay &&
+		    !strset_contains(&renames->deferred[side].target_dirs,
+				     pi.string)) {
+			strintmap_set(&renames->deferred[side].possible_trivial_merges,
+				      pi.string, renames->dir_rename_mask);
+			renames->dir_rename_mask = prev_dir_rename_mask;
+			return mask;
+		}
+
+		/* We need to recurse */
 		ci->match_mask &= filemask;
 		newinfo = *info;
 		newinfo.prev = info;
@@ -1202,7 +1230,6 @@ static int collect_merge_info_callback(int n,
 	return mask;
 }
 
-MAYBE_UNUSED
 static int handle_deferred_entries(struct merge_options *opt,
 				   struct traverse_info *info)
 {
@@ -1291,6 +1318,8 @@ static int collect_merge_info(struct merge_options *opt,
 
 	trace2_region_enter("merge", "traverse_trees", opt->repo);
 	ret = traverse_trees(NULL, 3, t, &info);
+	if (ret == 0)
+		ret = handle_deferred_entries(opt, &info);
 	trace2_region_leave("merge", "traverse_trees", opt->repo);
 
 	return ret;
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v3 6/7] merge-ort: avoid recursing into directories when we don't need to
  2021-07-16  5:22   ` [PATCH v3 0/7] Optimization batch 14: trivial directory resolution Elijah Newren via GitGitGadget
                       ` (4 preceding siblings ...)
  2021-07-16  5:22     ` [PATCH v3 5/7] merge-ort: defer recursing into directories when merge base is matched Elijah Newren via GitGitGadget
@ 2021-07-16  5:22     ` Elijah Newren via GitGitGadget
  2021-07-16  5:22     ` [PATCH v3 7/7] merge-ort: restart merge with cached renames to reduce process entry cost Elijah Newren via GitGitGadget
                       ` (2 subsequent siblings)
  8 siblings, 0 replies; 52+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-07-16  5:22 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Ævar Arnfjörð Bjarmason,
	Elijah Newren, Bagas Sanjaya, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

This combines the work of the last several patches, and implements the
conditions when we don't need to recurse into directories.  It's perhaps
easiest to see the logic by separating the fact that a directory might
have both rename sources and rename destinations:

  * rename sources: only files present in the merge base can serve as
    rename sources, and only when one side deletes that file.  When the
    tree on one side matches the merge base, that means every file
    within the subtree matches the merge base.  This means that the
    skip-irrelevant-rename-detection optimization from before kicks in
    and we don't need any of these files as rename sources.

  * rename destinations: the tree that does not match the merge base
    might have newly added and hence unmatched destination files.
    This is what usually prevents us from doing trivial directory
    resolutions in the merge machinery.  However, the fact that we have
    deferred recursing into this directory until the end means we know
    whether there are any unmatched relevant potential rename sources
    elsewhere in this merge.  If there are no unmatched such relevant
    sources anywhere, then there is no need to look for unmatched
    potential rename destinations to match them with.

This informs our algorithm:
  * Search through relevant_sources; if we have entries, they better all
    be reflected in cached_pairs or cached_irrelevant, otherwise they
    represent an unmatched potential rename source (causing the
    optimization to be disallowed).
  * For any relevant_source represented in cached_pairs, we do need to
    to make sure to get the destination for each source, meaning we need
    to recurse into any ancestor directories of those destinations.
  * Once we've recursed into all the rename destinations for any
    relevant_sources in cached_pairs, we can then do the trivial
    directory resolution for the remaining directories.

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:

                            Before                  After
    no-renames:        5.235 s ±  0.042 s   205.1  ms ±  3.8  ms
    mega-renames:      9.419 s ±  0.107 s     1.564 s ±  0.010 s
    just-one-mega:   480.1  ms ±  3.9  ms   479.5  ms ±  3.9  ms

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 102 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 99 insertions(+), 3 deletions(-)

diff --git a/merge-ort.c b/merge-ort.c
index dbccf8c62e2..a013708fa79 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -1230,6 +1230,18 @@ static int collect_merge_info_callback(int n,
 	return mask;
 }
 
+static void resolve_trivial_directory_merge(struct conflict_info *ci, int side)
+{
+	VERIFY_CI(ci);
+	assert((side == 1 && ci->match_mask == 5) ||
+	       (side == 2 && ci->match_mask == 3));
+	oidcpy(&ci->merged.result.oid, &ci->stages[side].oid);
+	ci->merged.result.mode = ci->stages[side].mode;
+	ci->merged.is_null = is_null_oid(&ci->stages[side].oid);
+	ci->match_mask = 0;
+	ci->merged.clean = 1; /* (ci->filemask == 0); */
+}
+
 static int handle_deferred_entries(struct merge_options *opt,
 				   struct traverse_info *info)
 {
@@ -1239,9 +1251,72 @@ static int handle_deferred_entries(struct merge_options *opt,
 	int side, ret = 0;
 
 	for (side = MERGE_SIDE1; side <= MERGE_SIDE2; side++) {
-		renames->deferred[side].trivial_merges_okay = 0;
-		strintmap_for_each_entry(&renames->deferred[side].possible_trivial_merges,
-					 &iter, entry) {
+		unsigned optimization_okay = 1;
+		struct strintmap copy;
+
+		/* Loop over the set of paths we need to know rename info for */
+		strset_for_each_entry(&renames->relevant_sources[side],
+				      &iter, entry) {
+			char *rename_target, *dir, *dir_marker;
+			struct strmap_entry *e;
+
+			/*
+			 * If we don't know delete/rename info for this path,
+			 * then we need to recurse into all trees to get all
+			 * adds to make sure we have it.
+			 */
+			if (strset_contains(&renames->cached_irrelevant[side],
+					    entry->key))
+				continue;
+			e = strmap_get_entry(&renames->cached_pairs[side],
+					     entry->key);
+			if (!e) {
+				optimization_okay = 0;
+				break;
+			}
+
+			/* If this is a delete, we have enough info already */
+			rename_target = e->value;
+			if (!rename_target)
+				continue;
+
+			/* If we already walked the rename target, we're good */
+			if (strmap_contains(&opt->priv->paths, rename_target))
+				continue;
+
+			/*
+			 * Otherwise, we need to get a list of directories that
+			 * will need to be recursed into to get this
+			 * rename_target.
+			 */
+			dir = xstrdup(rename_target);
+			while ((dir_marker = strrchr(dir, '/'))) {
+				*dir_marker = '\0';
+				if (strset_contains(&renames->deferred[side].target_dirs,
+						    dir))
+					break;
+				strset_add(&renames->deferred[side].target_dirs,
+					   dir);
+			}
+			free(dir);
+		}
+		renames->deferred[side].trivial_merges_okay = optimization_okay;
+		/*
+		 * We need to recurse into any directories in
+		 * possible_trivial_merges[side] found in target_dirs[side].
+		 * But when we recurse, we may need to queue up some of the
+		 * subdirectories for possible_trivial_merges[side].  Since
+		 * we can't safely iterate through a hashmap while also adding
+		 * entries, move the entries into 'copy', iterate over 'copy',
+		 * and then we'll also iterate anything added into
+		 * possible_trivial_merges[side] once this loop is done.
+		 */
+		copy = renames->deferred[side].possible_trivial_merges;
+		strintmap_init_with_options(&renames->deferred[side].possible_trivial_merges,
+					    0,
+					    NULL,
+					    0);
+		strintmap_for_each_entry(&copy, &iter, entry) {
 			const char *path = entry->key;
 			unsigned dir_rename_mask = (intptr_t)entry->value;
 			struct conflict_info *ci;
@@ -1254,6 +1329,13 @@ static int handle_deferred_entries(struct merge_options *opt,
 			VERIFY_CI(ci);
 			dirmask = ci->dirmask;
 
+			if (optimization_okay &&
+			    !strset_contains(&renames->deferred[side].target_dirs,
+					     path)) {
+				resolve_trivial_directory_merge(ci, side);
+				continue;
+			}
+
 			info->name = path;
 			info->namelen = strlen(path);
 			info->pathlen = info->namelen + 1;
@@ -1289,6 +1371,20 @@ static int handle_deferred_entries(struct merge_options *opt,
 			if (ret < 0)
 				return ret;
 		}
+		strintmap_clear(&copy);
+		strintmap_for_each_entry(&renames->deferred[side].possible_trivial_merges,
+					 &iter, entry) {
+			const char *path = entry->key;
+			struct conflict_info *ci;
+
+			ci = strmap_get(&opt->priv->paths, path);
+			VERIFY_CI(ci);
+
+			assert(renames->deferred[side].trivial_merges_okay &&
+			       !strset_contains(&renames->deferred[side].target_dirs,
+						path));
+			resolve_trivial_directory_merge(ci, side);
+		}
 	}
 	return ret;
 }
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v3 7/7] merge-ort: restart merge with cached renames to reduce process entry cost
  2021-07-16  5:22   ` [PATCH v3 0/7] Optimization batch 14: trivial directory resolution Elijah Newren via GitGitGadget
                       ` (5 preceding siblings ...)
  2021-07-16  5:22     ` [PATCH v3 6/7] merge-ort: avoid recursing into directories when we don't need to Elijah Newren via GitGitGadget
@ 2021-07-16  5:22     ` Elijah Newren via GitGitGadget
  2021-07-20 13:00     ` [PATCH v3 0/7] Optimization batch 14: trivial directory resolution Derrick Stolee
  2021-07-21  4:23     ` [PATCH v4 " Elijah Newren via GitGitGadget
  8 siblings, 0 replies; 52+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-07-16  5:22 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Ævar Arnfjörð Bjarmason,
	Elijah Newren, Bagas Sanjaya, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

The merge algorithm mostly consists of the following three functions:
   collect_merge_info()
   detect_and_process_renames()
   process_entries()
Prior to the trivial directory resolution optimization of the last half
dozen commits, process_entries() was consistently the slowest, followed
by collect_merge_info(), then detect_and_process_renames().  When the
trivial directory resolution applies, it often dramatically decreases
the amount of time spent in the two slower functions.

Looking at the performance results in the previous commit, the trivial
directory resolution optimization helps amazingly well when there are no
relevant renames.  It also helps really well when reapplying a long
series of linear commits (such as in a rebase or cherry-pick), since the
relevant renames may well be cached from the first reapplied commit.
But when there are any relevant renames that are not cached (represented
by the just-one-mega testcase), then the optimization does not help at
all.

Often, I noticed that when the optimization does not apply, it is
because there are a handful of relevant sources -- maybe even only one.
It felt frustrating to need to recurse into potentially hundreds or even
thousands of directories just for a single rename, but it was needed for
correctness.

However, staring at this list of functions and noticing that
process_entries() is the most expensive and knowing I could avoid it if
I had cached renames suggested a simple idea: change
   collect_merge_info()
   detect_and_process_renames()
   process_entries()
into
   collect_merge_info()
   detect_and_process_renames()
   <cache all the renames, and restart>
   collect_merge_info()
   detect_and_process_renames()
   process_entries()

This may seem odd and look like more work.  However, note that although
we run collect_merge_info() twice, the second time we get to employ
trivial directory resolves, which makes it much faster, so the increased
time in collect_merge_info() is small.  While we run
detect_and_process_renames() again, all renames are cached so it's
nearly a no-op (we don't call into diffcore_rename_extended() but we do
have a little bit of data structure checking and fixing up).  And the
big payoff comes from the fact that process_entries(), will be much
faster due to having far fewer entries to process.

This restarting only makes sense if we can save recursing into enough
directories to make it worth our while.  Introduce a simple heuristic to
guide this.  Note that this heuristic uses a "wanted_factor" that I have
virtually no actual real world data for, just some back-of-the-envelope
quasi-scientific calculations that I included in some comments and then
plucked a simple round number out of thin air.  It could be that
tweaking this number to make it either higher or lower improves the
optimization.  (There's slightly more here; when I first introduced this
optimization, I used a factor of 10, because I was completely confident
it was big enough to not cause slowdowns in special cases.  I was
certain it was higher than needed.  Several months later, I added the
rough calculations which make me think the optimal number is close to 2;
but instead of pushing to the limit, I just bumped it to 3 to reduce the
risk that there are special cases where this optimization can result in
slowing down the code a little.  If the ratio of path counts is below 3,
we probably will only see minor performance improvements at best
anyway.)

Also, note that while the diffstat looks kind of long (nearly 100
lines), more than half of it is in two comments explaining how things
work.

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:

                            Before                  After
    no-renames:      205.1  ms ±  3.8  ms   204.2  ms ±  3.0  ms
    mega-renames:      1.564 s ±  0.010 s     1.076 s ±  0.015 s
    just-one-mega:   479.5  ms ±  3.9  ms   364.1  ms ±  7.0  ms

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c                         | 92 +++++++++++++++++++++++++++--
 t/t6423-merge-rename-directories.sh |  2 +-
 2 files changed, 87 insertions(+), 7 deletions(-)

diff --git a/merge-ort.c b/merge-ort.c
index a013708fa79..e361443087a 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -213,6 +213,7 @@ struct rename_info {
 	 *   MERGE_SIDE2: cached data from side2 can be reused
 	 *   MERGE_SIDE1: cached data from side1 can be reused
 	 *   0:           no cached data can be reused
+	 *   -1:          See redo_after_renames; both sides can be reused.
 	 */
 	int cached_pairs_valid_side;
 
@@ -258,6 +259,28 @@ struct rename_info {
 	 */
 	struct strset cached_irrelevant[3];
 
+	/*
+	 * redo_after_renames: optimization flag for "restarting" the merge
+	 *
+	 * Sometimes it pays to detect renames, cache them, and then
+	 * restart the merge operation from the beginning.  The reason for
+	 * this is that when we know where all the renames are, we know
+	 * whether a certain directory has any paths under it affected --
+	 * and if a directory is not affected then it permits us to do
+	 * trivial tree merging in more cases.  Doing trivial tree merging
+	 * prevents the need to run process_entry() on every path
+	 * underneath trees that can be trivially merged, and
+	 * process_entry() is more expensive than collect_merge_info() --
+	 * plus, the second collect_merge_info() will be much faster since
+	 * it doesn't have to recurse into the relevant trees.
+	 *
+	 * Values for this flag:
+	 *   0 = don't bother, not worth it (or conditions not yet checked)
+	 *   1 = conditions for optimization met, optimization worthwhile
+	 *   2 = we already did it (don't restart merge yet again)
+	 */
+	unsigned redo_after_renames;
+
 	/*
 	 * needed_limit: value needed for inexact rename detection to run
 	 *
@@ -541,7 +564,8 @@ static void clear_or_reinit_internal_opts(struct merge_options_internal *opti,
 		strintmap_func(&renames->relevant_sources[i]);
 		if (!reinitialize)
 			assert(renames->cached_pairs_valid_side == 0);
-		if (i != renames->cached_pairs_valid_side) {
+		if (i != renames->cached_pairs_valid_side &&
+		    -1 != renames->cached_pairs_valid_side) {
 			strset_func(&renames->cached_target_names[i]);
 			strmap_func(&renames->cached_pairs[i], 1);
 			strset_func(&renames->cached_irrelevant[i]);
@@ -1249,7 +1273,9 @@ static int handle_deferred_entries(struct merge_options *opt,
 	struct hashmap_iter iter;
 	struct strmap_entry *entry;
 	int side, ret = 0;
+	int path_count_before, path_count_after = 0;
 
+	path_count_before = strmap_get_size(&opt->priv->paths);
 	for (side = MERGE_SIDE1; side <= MERGE_SIDE2; side++) {
 		unsigned optimization_okay = 1;
 		struct strintmap copy;
@@ -1385,7 +1411,32 @@ static int handle_deferred_entries(struct merge_options *opt,
 						path));
 			resolve_trivial_directory_merge(ci, side);
 		}
+		if (!optimization_okay || path_count_after)
+			path_count_after = strmap_get_size(&opt->priv->paths);
 	}
+	if (path_count_after) {
+		/*
+		 * The choice of wanted_factor here does not affect
+		 * correctness, only performance.  When the
+		 *    path_count_after / path_count_before
+		 * ratio is high, redoing after renames is a big
+		 * performance boost.  I suspect that redoing is a wash
+		 * somewhere near a value of 2, and below that redoing will
+		 * slow things down.  I applied a fudge factor and picked
+		 * 3; see the commit message when this was introduced for
+		 * back of the envelope calculations for this ratio.
+		 */
+		const int wanted_factor = 3;
+
+		/* We should only redo collect_merge_info one time */
+		assert(renames->redo_after_renames == 0);
+
+		if (path_count_after / path_count_before >= wanted_factor) {
+			renames->redo_after_renames = 1;
+			renames->cached_pairs_valid_side = -1;
+		}
+	} else if (renames->redo_after_renames == 2)
+		renames->redo_after_renames = 0;
 	return ret;
 }
 
@@ -2828,8 +2879,8 @@ static int compare_pairs(const void *a_, const void *b_)
 }
 
 /* Call diffcore_rename() to update deleted/added pairs into rename pairs */
-static void detect_regular_renames(struct merge_options *opt,
-				   unsigned side_index)
+static int detect_regular_renames(struct merge_options *opt,
+				  unsigned side_index)
 {
 	struct diff_options diff_opts;
 	struct rename_info *renames = &opt->priv->renames;
@@ -2842,7 +2893,7 @@ static void detect_regular_renames(struct merge_options *opt,
 		 * side had directory renames.
 		 */
 		resolve_diffpair_statuses(&renames->pairs[side_index]);
-		return;
+		return 0;
 	}
 
 	partial_clear_dir_rename_count(&renames->dir_rename_count[side_index]);
@@ -2868,6 +2919,8 @@ static void detect_regular_renames(struct merge_options *opt,
 	trace2_region_leave("diff", "diffcore_rename", opt->repo);
 	resolve_diffpair_statuses(&diff_queued_diff);
 
+	if (diff_opts.needed_rename_limit > 0)
+		renames->redo_after_renames = 0;
 	if (diff_opts.needed_rename_limit > renames->needed_limit)
 		renames->needed_limit = diff_opts.needed_rename_limit;
 
@@ -2877,6 +2930,8 @@ static void detect_regular_renames(struct merge_options *opt,
 	diff_queued_diff.nr = 0;
 	diff_queued_diff.queue = NULL;
 	diff_flush(&diff_opts);
+
+	return 1;
 }
 
 /*
@@ -2966,14 +3021,32 @@ static int detect_and_process_renames(struct merge_options *opt,
 	struct diff_queue_struct combined;
 	struct rename_info *renames = &opt->priv->renames;
 	int need_dir_renames, s, clean = 1;
+	unsigned detection_run = 0;
 
 	memset(&combined, 0, sizeof(combined));
 	if (!possible_renames(renames))
 		goto cleanup;
 
 	trace2_region_enter("merge", "regular renames", opt->repo);
-	detect_regular_renames(opt, MERGE_SIDE1);
-	detect_regular_renames(opt, MERGE_SIDE2);
+	detection_run |= detect_regular_renames(opt, MERGE_SIDE1);
+	detection_run |= detect_regular_renames(opt, MERGE_SIDE2);
+	if (renames->redo_after_renames && detection_run) {
+		int i, side;
+		struct diff_filepair *p;
+
+		/* Cache the renames, we found */
+		for (side = MERGE_SIDE1; side <= MERGE_SIDE2; side++) {
+			for (i = 0; i < renames->pairs[side].nr; ++i) {
+				p = renames->pairs[side].queue[i];
+				possibly_cache_new_pair(renames, p, side, NULL);
+			}
+		}
+
+		/* Restart the merge with the cached renames */
+		renames->redo_after_renames = 2;
+		trace2_region_leave("merge", "regular renames", opt->repo);
+		goto cleanup;
+	}
 	use_cached_pairs(opt, &renames->cached_pairs[1], &renames->pairs[1]);
 	use_cached_pairs(opt, &renames->cached_pairs[2], &renames->pairs[2]);
 	trace2_region_leave("merge", "regular renames", opt->repo);
@@ -4390,6 +4463,7 @@ static void merge_ort_nonrecursive_internal(struct merge_options *opt,
 					       opt->subtree_shift);
 	}
 
+redo:
 	trace2_region_enter("merge", "collect_merge_info", opt->repo);
 	if (collect_merge_info(opt, merge_base, side1, side2) != 0) {
 		/*
@@ -4409,6 +4483,12 @@ static void merge_ort_nonrecursive_internal(struct merge_options *opt,
 	result->clean = detect_and_process_renames(opt, merge_base,
 						   side1, side2);
 	trace2_region_leave("merge", "renames", opt->repo);
+	if (opt->priv->renames.redo_after_renames == 2) {
+		trace2_region_enter("merge", "reset_maps", opt->repo);
+		clear_or_reinit_internal_opts(opt->priv, 1);
+		trace2_region_leave("merge", "reset_maps", opt->repo);
+		goto redo;
+	}
 
 	trace2_region_enter("merge", "process_entries", opt->repo);
 	process_entries(opt, &working_tree_oid);
diff --git a/t/t6423-merge-rename-directories.sh b/t/t6423-merge-rename-directories.sh
index e834b7e6efe..d8919d276a1 100755
--- a/t/t6423-merge-rename-directories.sh
+++ b/t/t6423-merge-rename-directories.sh
@@ -4797,7 +4797,7 @@ test_setup_12f () {
 	)
 }
 
-test_expect_merge_algorithm failure failure '12f: Trivial directory resolve, caching, all kinds of fun' '
+test_expect_merge_algorithm failure success '12f: Trivial directory resolve, caching, all kinds of fun' '
 	test_setup_12f &&
 	(
 		cd 12f &&
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 0/7] Optimization batch 14: trivial directory resolution
  2021-07-16  5:22   ` [PATCH v3 0/7] Optimization batch 14: trivial directory resolution Elijah Newren via GitGitGadget
                       ` (6 preceding siblings ...)
  2021-07-16  5:22     ` [PATCH v3 7/7] merge-ort: restart merge with cached renames to reduce process entry cost Elijah Newren via GitGitGadget
@ 2021-07-20 13:00     ` Derrick Stolee
  2021-07-20 21:43       ` Junio C Hamano
  2021-07-21  4:23     ` [PATCH v4 " Elijah Newren via GitGitGadget
  8 siblings, 1 reply; 52+ messages in thread
From: Derrick Stolee @ 2021-07-20 13:00 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget, git
  Cc: Ævar Arnfjörð Bjarmason, Elijah Newren, Bagas Sanjaya

On 7/16/2021 1:22 AM, Elijah Newren via GitGitGadget wrote:
> This series depends textually on ort-perf-batch-12, but is semantically
> independent. (It is both semantically and textually independent of
> ort-perf-batch-13.)
...
> Range-diff vs v2:

Thank you for making these edits.

>      ++struct deferred_traversal_data {

In particular, I think this new struct works rather well.

This version LGTM.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v3 0/7] Optimization batch 14: trivial directory resolution
  2021-07-20 13:00     ` [PATCH v3 0/7] Optimization batch 14: trivial directory resolution Derrick Stolee
@ 2021-07-20 21:43       ` Junio C Hamano
  0 siblings, 0 replies; 52+ messages in thread
From: Junio C Hamano @ 2021-07-20 21:43 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Elijah Newren via GitGitGadget, git,
	Ævar Arnfjörð Bjarmason, Elijah Newren,
	Bagas Sanjaya

Derrick Stolee <stolee@gmail.com> writes:

> On 7/16/2021 1:22 AM, Elijah Newren via GitGitGadget wrote:
>> This series depends textually on ort-perf-batch-12, but is semantically
>> independent. (It is both semantically and textually independent of
>> ort-perf-batch-13.)
> ...
>> Range-diff vs v2:
>
> Thank you for making these edits.
>
>>      ++struct deferred_traversal_data {
>
> In particular, I think this new struct works rather well.
>
> This version LGTM.
>
> Thanks,
> -Stolee

Thanks, both.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v4 0/7] Optimization batch 14: trivial directory resolution
  2021-07-16  5:22   ` [PATCH v3 0/7] Optimization batch 14: trivial directory resolution Elijah Newren via GitGitGadget
                       ` (7 preceding siblings ...)
  2021-07-20 13:00     ` [PATCH v3 0/7] Optimization batch 14: trivial directory resolution Derrick Stolee
@ 2021-07-21  4:23     ` Elijah Newren via GitGitGadget
  2021-07-21  4:23       ` [PATCH v4 1/7] merge-ort: resolve paths early when we have sufficient information Elijah Newren via GitGitGadget
                         ` (6 more replies)
  8 siblings, 7 replies; 52+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-07-21  4:23 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Ævar Arnfjörð Bjarmason,
	Elijah Newren, Bagas Sanjaya, Elijah Newren

This series depends textually on ort-perf-batch-12, but is semantically
independent. (It is both semantically and textually independent of
ort-perf-batch-13.)

Most of my previous series dramatically accelerated cases with lots of
renames, while providing comparatively minor benefits for cases with few or
no renames. This series is the opposite; it provides huge benefits when
there are few or no renames, and comparatively smaller (though still quite
decent) benefits for cases with many uncached renames.

Changes since v3:

 * Add Stolee's Acked-by.

Changes since v2, addressing feedback from Stolee:

 * Created a separate struct for three related variables to hint they are
   related
 * Simplified a lengthy comment that was duplicated by the commit message
 * Various other minor cleanups

Changes since v1:

 * Minor tweak to the final patch to correct implicit assumption that rename
   detection running implies all renames were found (rename limits could
   have been exceeded and prevented finding renames)

=== Basic Optimization idea ===

unpack_trees has had a concept of trivial merges for individual files (see
Documentation/technical/trivial-merge.txt). The same idea can be applied in
merge-ort. It'd be really nice to extend that idea to trees as well, as it
could provide a huge performance boost; sadly however, applying it in
general would wreck both regular rename detection (the unmatched side can
have new files that serve as potential destinations in rename detection) and
directory rename detection (the unmatched side could have a new directory
that was moved into it).

If we somehow knew rename detection wasn't needed, we could do trivial
directory resolution. In the past, this wasn't possible. However...

With recent optimizations we have created a possibility to do trivial
directory resolutions in some cases. These came from the addition of the
"skipping irrelevant renames" optimizations (from ort-perf-batch-9 and
ort-perf-batch-10), and in particular noting that we added an ability to
entirely skip rename detection in commit f89b4f2bee ("merge-ort: skip rename
detection entirely if possible", 2021-03-11) when there are no relevant
sources. We can detect if there are no relevant sources without recursing
into the directories in question.

As a cherry on top, the caching of renames (from ort-perf-batch-11) allows
us to cover additional cases.

This series is all about adding all the special checks needed to safely
perform trival directory resolutions.

=== Results ===

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28), the
changes in just this series improves the performance as follows:

                     Before Series           After Series
no-renames:        5.235 s ±  0.042 s   204.2  ms ±  3.0  ms
mega-renames:      9.419 s ±  0.107 s     1.076 s ±  0.015 s
just-one-mega:   480.1  ms ±  3.9  ms   364.1  ms ±  7.0  ms


As a reminder, before any merge-ort/diffcore-rename performance work, the
performance results we started with (for merge-recursive as of git-2.30.0)
were:

no-renames-am:      6.940 s ±  0.485 s
no-renames:        18.912 s ±  0.174 s
mega-renames:    5964.031 s ± 10.459 s
just-one-mega:    149.583 s ±  0.751 s


Elijah Newren (7):
  merge-ort: resolve paths early when we have sufficient information
  merge-ort: add some more explanations in collect_merge_info_callback()
  merge-ort: add data structures for allowable trivial directory
    resolves
  merge-ort: add a handle_deferred_entries() helper function
  merge-ort: defer recursing into directories when merge base is matched
  merge-ort: avoid recursing into directories when we don't need to
  merge-ort: restart merge with cached renames to reduce process entry
    cost

 merge-ort.c                         | 399 +++++++++++++++++++++++++++-
 t/t6423-merge-rename-directories.sh |   2 +-
 2 files changed, 389 insertions(+), 12 deletions(-)


base-commit: 2eeee12b02e441ac05054a5a5ecbcea6964a1e6b
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-988%2Fnewren%2Fort-perf-batch-14-v4
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-988/newren/ort-perf-batch-14-v4
Pull-Request: https://github.com/gitgitgadget/git/pull/988

Range-diff vs v3:

 1:  5dca982c0b0 ! 1:  7fdfeb159d0 merge-ort: resolve paths early when we have sufficient information
     @@ Commit message
          range of noise.  However, this idea serves as the beginning of some
          bigger optimizations coming in the following patches.
      
     +    Acked-by: Derrick Stolee <stolee@gmail.com>
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
       ## merge-ort.c ##
 2:  8aea3713902 ! 2:  7a0085f2da9 merge-ort: add some more explanations in collect_merge_info_callback()
     @@ Commit message
          better.  While we're at it, add a few comments to denote what a few
          boolean '0' or '1' values stand for.
      
     +    Acked-by: Derrick Stolee <stolee@gmail.com>
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
       ## merge-ort.c ##
 3:  c2b45fef1d7 ! 3:  d8165740316 merge-ort: add data structures for allowable trivial directory resolves
     @@ Commit message
          Add some data structures that we will use to do these deferrals, with
          some lengthy comments explaining their purpose.
      
     +    Acked-by: Derrick Stolee <stolee@gmail.com>
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
       ## merge-ort.c ##
 4:  1cf4a47562a ! 4:  ff568a612f5 merge-ort: add a handle_deferred_entries() helper function
     @@ Commit message
          but a subsequent commit will change that.  Future commits will also make
          the function sometimes resolve directories instead of traversing inside.
      
     +    Acked-by: Derrick Stolee <stolee@gmail.com>
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
       ## merge-ort.c ##
 5:  79c51536829 ! 5:  5b01c118f10 merge-ort: defer recursing into directories when merge base is matched
     @@ Commit message
          subdirectories.  The extra conditions are there for such deferred cases
          and will be used more as we do more with those variables.
      
     +    Acked-by: Derrick Stolee <stolee@gmail.com>
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
       ## merge-ort.c ##
 6:  572cc5e94d2 ! 6:  7b211271815 merge-ort: avoid recursing into directories when we don't need to
     @@ Commit message
              mega-renames:      9.419 s ±  0.107 s     1.564 s ±  0.010 s
              just-one-mega:   480.1  ms ±  3.9  ms   479.5  ms ±  3.9  ms
      
     +    Acked-by: Derrick Stolee <stolee@gmail.com>
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
       ## merge-ort.c ##
 7:  a9cbc1d4f18 ! 7:  c9ada8369e6 merge-ort: restart merge with cached renames to reduce process entry cost
     @@ Commit message
              mega-renames:      1.564 s ±  0.010 s     1.076 s ±  0.015 s
              just-one-mega:   479.5  ms ±  3.9  ms   364.1  ms ±  7.0  ms
      
     +    Acked-by: Derrick Stolee <stolee@gmail.com>
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
       ## merge-ort.c ##

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v4 1/7] merge-ort: resolve paths early when we have sufficient information
  2021-07-21  4:23     ` [PATCH v4 " Elijah Newren via GitGitGadget
@ 2021-07-21  4:23       ` Elijah Newren via GitGitGadget
  2021-07-21  4:23       ` [PATCH v4 2/7] merge-ort: add some more explanations in collect_merge_info_callback() Elijah Newren via GitGitGadget
                         ` (5 subsequent siblings)
  6 siblings, 0 replies; 52+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-07-21  4:23 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Ævar Arnfjörð Bjarmason,
	Elijah Newren, Bagas Sanjaya, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

When there are no directories involved at a given path, and all three
sides have a file at that path, and two of the three sides of history
match, we can immediately resolve the merge of that path in
collect_merge_info() and do not need to wait until process_entries().

This is actually a very minor improvement: half the time when I run it,
I see an improvement; the other half a slowdown.  It seems to be in the
range of noise.  However, this idea serves as the beginning of some
bigger optimizations coming in the following patches.

Acked-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 37 +++++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/merge-ort.c b/merge-ort.c
index e3a5dfc7b31..6299b4f9413 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -1023,6 +1023,43 @@ static int collect_merge_info_callback(int n,
 		return mask;
 	}
 
+	/*
+	 * If the sides match, and all three paths are present and are
+	 * files, then we can take either as the resolution.  We can't do
+	 * this with trees, because there may be rename sources from the
+	 * merge_base.
+	 */
+	if (sides_match && filemask == 0x07) {
+		/* use side1 (== side2) version as resolution */
+		setup_path_info(opt, &pi, dirname, info->pathlen, fullpath,
+				names, names+1, side1_null, 0,
+				filemask, dirmask, 1);
+		return mask;
+	}
+
+	/*
+	 * If side1 matches mbase and all three paths are present and are
+	 * files, then we can use side2 as the resolution.  We cannot
+	 * necessarily do so this for trees, because there may be rename
+	 * destinations within side2.
+	 */
+	if (side1_matches_mbase && filemask == 0x07) {
+		/* use side2 version as resolution */
+		setup_path_info(opt, &pi, dirname, info->pathlen, fullpath,
+				names, names+2, side2_null, 0,
+				filemask, dirmask, 1);
+		return mask;
+	}
+
+	/* Similar to above but swapping sides 1 and 2 */
+	if (side2_matches_mbase && filemask == 0x07) {
+		/* use side1 version as resolution */
+		setup_path_info(opt, &pi, dirname, info->pathlen, fullpath,
+				names, names+1, side1_null, 0,
+				filemask, dirmask, 1);
+		return mask;
+	}
+
 	/*
 	 * Gather additional information used in rename detection.
 	 */
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v4 2/7] merge-ort: add some more explanations in collect_merge_info_callback()
  2021-07-21  4:23     ` [PATCH v4 " Elijah Newren via GitGitGadget
  2021-07-21  4:23       ` [PATCH v4 1/7] merge-ort: resolve paths early when we have sufficient information Elijah Newren via GitGitGadget
@ 2021-07-21  4:23       ` Elijah Newren via GitGitGadget
  2021-07-21  4:24       ` [PATCH v4 3/7] merge-ort: add data structures for allowable trivial directory resolves Elijah Newren via GitGitGadget
                         ` (4 subsequent siblings)
  6 siblings, 0 replies; 52+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-07-21  4:23 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Ævar Arnfjörð Bjarmason,
	Elijah Newren, Bagas Sanjaya, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

The previous patch possibly raises some questions about whether
additional cases in collect_merge_info_callback() can be handled early.
Add some explanations in the form of comments to help explain these
better.  While we're at it, add a few comments to denote what a few
boolean '0' or '1' values stand for.

Acked-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/merge-ort.c b/merge-ort.c
index 6299b4f9413..843fa693145 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -1018,8 +1018,8 @@ static int collect_merge_info_callback(int n,
 	if (side1_matches_mbase && side2_matches_mbase) {
 		/* mbase, side1, & side2 all match; use mbase as resolution */
 		setup_path_info(opt, &pi, dirname, info->pathlen, fullpath,
-				names, names+0, mbase_null, 0,
-				filemask, dirmask, 1);
+				names, names+0, mbase_null, 0 /* df_conflict */,
+				filemask, dirmask, 1 /* resolved */);
 		return mask;
 	}
 
@@ -1061,14 +1061,24 @@ static int collect_merge_info_callback(int n,
 	}
 
 	/*
-	 * Gather additional information used in rename detection.
+	 * Sometimes we can tell that a source path need not be included in
+	 * rename detection -- namely, whenever either
+	 *    side1_matches_mbase && side2_null
+	 * or
+	 *    side2_matches_mbase && side1_null
+	 * However, we call collect_rename_info() even in those cases,
+	 * because exact renames are cheap and would let us remove both a
+	 * source and destination path.  We'll cull the unneeded sources
+	 * later.
 	 */
 	collect_rename_info(opt, names, dirname, fullpath,
 			    filemask, dirmask, match_mask);
 
 	/*
-	 * Record information about the path so we can resolve later in
-	 * process_entries.
+	 * None of the special cases above matched, so we have a
+	 * provisional conflict.  (Rename detection might allow us to
+	 * unconflict some more cases, but that comes later so all we can
+	 * do now is record the different non-null file hashes.)
 	 */
 	setup_path_info(opt, &pi, dirname, info->pathlen, fullpath,
 			names, NULL, 0, df_conflict, filemask, dirmask, 0);
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v4 3/7] merge-ort: add data structures for allowable trivial directory resolves
  2021-07-21  4:23     ` [PATCH v4 " Elijah Newren via GitGitGadget
  2021-07-21  4:23       ` [PATCH v4 1/7] merge-ort: resolve paths early when we have sufficient information Elijah Newren via GitGitGadget
  2021-07-21  4:23       ` [PATCH v4 2/7] merge-ort: add some more explanations in collect_merge_info_callback() Elijah Newren via GitGitGadget
@ 2021-07-21  4:24       ` Elijah Newren via GitGitGadget
  2021-07-21  4:24       ` [PATCH v4 4/7] merge-ort: add a handle_deferred_entries() helper function Elijah Newren via GitGitGadget
                         ` (3 subsequent siblings)
  6 siblings, 0 replies; 52+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-07-21  4:24 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Ævar Arnfjörð Bjarmason,
	Elijah Newren, Bagas Sanjaya, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

As noted a few commits ago, we can resolve individual files early if all
three sides of the merge have a file at the path and two of the three
sides match.  We would really like to do the same thing with
directories, because being able to do a trivial directory resolve means
we don't have to recurse into the directory, potentially saving us a
huge amount of time in both collect_merge_info() and process_entries().
Unfortunately, resolving directories early would mean missing any
renames whose source or destination is underneath that directory.

If we somehow knew there weren't any renames under the directory in
question, then we could resolve it early.  Sadly, it is impossible to
determine whether there are renames under the directory in question
without recursing into it, and this has traditionally kept us from ever
implementing such an optimization.

In commit f89b4f2bee ("merge-ort: skip rename detection entirely if
possible", 2021-03-11), we added an additional reason that rename
detection could be skipped entirely -- namely, if no *relevant* sources
were present.  Without completing collect_merge_info_callback(), we do
not yet know if there are no relevant sources.  However, we do know that
if the current directory on one side matches the merge base, then every
source file within that directory will not be RELEVANT_CONTENT, and a
few simple checks can often let us rule out RELEVANT_LOCATION as well.
This suggests we can just defer recursing into such directories until
the end of collect_merge_info.

Since the deferred directories are known to not add any relevant sources
due to the above properties, then if there are no relevant sources after
we've traversed all paths other than the deferred ones, then we know
there are not any relevant sources.  Under those conditions, rename
detection is unnecessary, and that means we can resolve the deferred
directories without recursing into them.

Note that the logic for skipping rename detection was also modified
further in commit 76e253793c ("merge-ort, diffcore-rename: employ cached
renames when possible", 2021-01-30); in particular rename detection can
be skipped if we already have cached renames for each relevant source.
We can take advantage of this information as well with our deferral of
recursing into directories where one side matches the merge base.

Add some data structures that we will use to do these deferrals, with
some lengthy comments explaining their purpose.

Acked-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 61 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 61 insertions(+)

diff --git a/merge-ort.c b/merge-ort.c
index 843fa693145..d9263ec5aca 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -62,6 +62,53 @@ struct traversal_callback_data {
 	struct name_entry names[3];
 };
 
+struct deferred_traversal_data {
+	/*
+	 * possible_trivial_merges: directories to be explored only when needed
+	 *
+	 * possible_trivial_merges is a map of directory names to
+	 * dir_rename_mask.  When we detect that a directory is unchanged on
+	 * one side, we can sometimes resolve the directory without recursing
+	 * into it.  Renames are the only things that can prevent such an
+	 * optimization.  However, for rename sources:
+	 *   - If no parent directory needed directory rename detection, then
+	 *     no path under such a directory can be a relevant_source.
+	 * and for rename destinations:
+	 *   - If no cached rename has a target path under the directory AND
+	 *   - If there are no unpaired relevant_sources elsewhere in the
+	 *     repository
+	 * then we don't need any path under this directory for a rename
+	 * destination.  The only way to know the last item above is to defer
+	 * handling such directories until the end of collect_merge_info(),
+	 * in handle_deferred_entries().
+	 *
+	 * For each we store dir_rename_mask, since that's the only bit of
+	 * information we need, other than the path, to resume the recursive
+	 * traversal.
+	 */
+	struct strintmap possible_trivial_merges;
+
+	/*
+	 * trivial_merges_okay: if trivial directory merges are okay
+	 *
+	 * See possible_trivial_merges above.  The "no unpaired
+	 * relevant_sources elsewhere in the repository" is a single boolean
+	 * per merge side, which we store here.  Note that while 0 means no,
+	 * 1 only means "maybe" rather than "yes"; we optimistically set it
+	 * to 1 initially and only clear when we determine it is unsafe to
+	 * do trivial directory merges.
+	 */
+	unsigned trivial_merges_okay;
+
+	/*
+	 * target_dirs: ancestor directories of rename targets
+	 *
+	 * target_dirs contains all directory names that are an ancestor of
+	 * any rename destination.
+	 */
+	struct strset target_dirs;
+};
+
 struct rename_info {
 	/*
 	 * All variables that are arrays of size 3 correspond to data tracked
@@ -119,6 +166,8 @@ struct rename_info {
 	 */
 	struct strintmap relevant_sources[3];
 
+	struct deferred_traversal_data deferred[3];
+
 	/*
 	 * dir_rename_mask:
 	 *   0: optimization removing unmodified potential rename source okay
@@ -501,6 +550,11 @@ static void clear_or_reinit_internal_opts(struct merge_options_internal *opti,
 				strmap_clear(&renames->dir_rename_count[i], 1);
 		}
 	}
+	for (i = MERGE_SIDE1; i <= MERGE_SIDE2; ++i) {
+		strintmap_func(&renames->deferred[i].possible_trivial_merges);
+		strset_func(&renames->deferred[i].target_dirs);
+		renames->deferred[i].trivial_merges_okay = 1; /* 1 == maybe */
+	}
 	renames->cached_pairs_valid_side = 0;
 	renames->dir_rename_mask = 0;
 
@@ -4052,6 +4106,13 @@ static void merge_start(struct merge_options *opt, struct merge_result *result)
 		strset_init_with_options(&renames->cached_target_names[i],
 					 NULL, 0);
 	}
+	for (i = MERGE_SIDE1; i <= MERGE_SIDE2; i++) {
+		strintmap_init_with_options(&renames->deferred[i].possible_trivial_merges,
+					    0, NULL, 0);
+		strset_init_with_options(&renames->deferred[i].target_dirs,
+					 NULL, 1);
+		renames->deferred[i].trivial_merges_okay = 1; /* 1 == maybe */
+	}
 
 	/*
 	 * Although we initialize opt->priv->paths with strdup_strings=0,
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v4 4/7] merge-ort: add a handle_deferred_entries() helper function
  2021-07-21  4:23     ` [PATCH v4 " Elijah Newren via GitGitGadget
                         ` (2 preceding siblings ...)
  2021-07-21  4:24       ` [PATCH v4 3/7] merge-ort: add data structures for allowable trivial directory resolves Elijah Newren via GitGitGadget
@ 2021-07-21  4:24       ` Elijah Newren via GitGitGadget
  2021-07-21  4:24       ` [PATCH v4 5/7] merge-ort: defer recursing into directories when merge base is matched Elijah Newren via GitGitGadget
                         ` (2 subsequent siblings)
  6 siblings, 0 replies; 52+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-07-21  4:24 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Ævar Arnfjörð Bjarmason,
	Elijah Newren, Bagas Sanjaya, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

In order to allow trivial directory resolution, we first need to be able
to gather more information to determine if the optimization is safe.  To
enable that, we need a way of deferring the recursion into the directory
until a later time.  Naturally, deferring the entry into a subtree means
that we need some function that will later recurse into the subdirectory
exactly the same way that collect_merge_info_callback() would have done.

Add a helper function that does this.  For now this function is not used
but a subsequent commit will change that.  Future commits will also make
the function sometimes resolve directories instead of traversing inside.

Acked-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 64 insertions(+)

diff --git a/merge-ort.c b/merge-ort.c
index d9263ec5aca..f0a07684df6 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -1202,6 +1202,70 @@ static int collect_merge_info_callback(int n,
 	return mask;
 }
 
+MAYBE_UNUSED
+static int handle_deferred_entries(struct merge_options *opt,
+				   struct traverse_info *info)
+{
+	struct rename_info *renames = &opt->priv->renames;
+	struct hashmap_iter iter;
+	struct strmap_entry *entry;
+	int side, ret = 0;
+
+	for (side = MERGE_SIDE1; side <= MERGE_SIDE2; side++) {
+		renames->deferred[side].trivial_merges_okay = 0;
+		strintmap_for_each_entry(&renames->deferred[side].possible_trivial_merges,
+					 &iter, entry) {
+			const char *path = entry->key;
+			unsigned dir_rename_mask = (intptr_t)entry->value;
+			struct conflict_info *ci;
+			unsigned dirmask;
+			struct tree_desc t[3];
+			void *buf[3] = {NULL,};
+			int i;
+
+			ci = strmap_get(&opt->priv->paths, path);
+			VERIFY_CI(ci);
+			dirmask = ci->dirmask;
+
+			info->name = path;
+			info->namelen = strlen(path);
+			info->pathlen = info->namelen + 1;
+
+			for (i = 0; i < 3; i++, dirmask >>= 1) {
+				if (i == 1 && ci->match_mask == 3)
+					t[1] = t[0];
+				else if (i == 2 && ci->match_mask == 5)
+					t[2] = t[0];
+				else if (i == 2 && ci->match_mask == 6)
+					t[2] = t[1];
+				else {
+					const struct object_id *oid = NULL;
+					if (dirmask & 1)
+						oid = &ci->stages[i].oid;
+					buf[i] = fill_tree_descriptor(opt->repo,
+								      t+i, oid);
+				}
+			}
+
+			ci->match_mask &= ci->filemask;
+			opt->priv->current_dir_name = path;
+			renames->dir_rename_mask = dir_rename_mask;
+			if (renames->dir_rename_mask == 0 ||
+			    renames->dir_rename_mask == 0x07)
+				ret = traverse_trees(NULL, 3, t, info);
+			else
+				ret = traverse_trees_wrapper(NULL, 3, t, info);
+
+			for (i = MERGE_BASE; i <= MERGE_SIDE2; i++)
+				free(buf[i]);
+
+			if (ret < 0)
+				return ret;
+		}
+	}
+	return ret;
+}
+
 static int collect_merge_info(struct merge_options *opt,
 			      struct tree *merge_base,
 			      struct tree *side1,
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v4 5/7] merge-ort: defer recursing into directories when merge base is matched
  2021-07-21  4:23     ` [PATCH v4 " Elijah Newren via GitGitGadget
                         ` (3 preceding siblings ...)
  2021-07-21  4:24       ` [PATCH v4 4/7] merge-ort: add a handle_deferred_entries() helper function Elijah Newren via GitGitGadget
@ 2021-07-21  4:24       ` Elijah Newren via GitGitGadget
  2021-07-21  4:24       ` [PATCH v4 6/7] merge-ort: avoid recursing into directories when we don't need to Elijah Newren via GitGitGadget
  2021-07-21  4:24       ` [PATCH v4 7/7] merge-ort: restart merge with cached renames to reduce process entry cost Elijah Newren via GitGitGadget
  6 siblings, 0 replies; 52+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-07-21  4:24 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Ævar Arnfjörð Bjarmason,
	Elijah Newren, Bagas Sanjaya, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

When one side of history matches the merge base (including when the
merge base has no entry for the given directory), have
collect_merge_info_callback() defer recursing into the directory.  To
ensure those entries are eventually handled, add a call to
handled_deferred_entries() in collect_merge_info() after
traverse_trees() returns.

Note that the condition in collect_merge_info_callback() may look more
complicated than necessary at first glance;
renames->trivial_merges_okay[side] is always true until
handle_deferred_entries() is called, and possible_trivial_merges[side]
is always empty right now (and in the future won't be filled until
handle_deferred_entries() is called).  However, when
handle_deferred_entries() calls traverse_trees() for the relevant
deferred directories, those traverse_trees() calls will once again end
up in collect_merge_info_callback() for all the entries under those
subdirectories.  The extra conditions are there for such deferred cases
and will be used more as we do more with those variables.

Acked-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 33 +++++++++++++++++++++++++++++++--
 1 file changed, 31 insertions(+), 2 deletions(-)

diff --git a/merge-ort.c b/merge-ort.c
index f0a07684df6..dbccf8c62e2 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -1147,8 +1147,36 @@ static int collect_merge_info_callback(int n,
 		struct tree_desc t[3];
 		void *buf[3] = {NULL, NULL, NULL};
 		const char *original_dir_name;
-		int i, ret;
+		int i, ret, side;
 
+		/*
+		 * Check for whether we can avoid recursing due to one side
+		 * matching the merge base.  The side that does NOT match is
+		 * the one that might have a rename destination we need.
+		 */
+		assert(!side1_matches_mbase || !side2_matches_mbase);
+		side = side1_matches_mbase ? MERGE_SIDE2 :
+			side2_matches_mbase ? MERGE_SIDE1 : MERGE_BASE;
+		if (filemask == 0 && (dirmask == 2 || dirmask == 4)) {
+			/*
+			 * Also defer recursing into new directories; set up a
+			 * few variables to let us do so.
+			 */
+			ci->match_mask = (7 - dirmask);
+			side = dirmask / 2;
+		}
+		if (renames->dir_rename_mask != 0x07 &&
+		    side != MERGE_BASE &&
+		    renames->deferred[side].trivial_merges_okay &&
+		    !strset_contains(&renames->deferred[side].target_dirs,
+				     pi.string)) {
+			strintmap_set(&renames->deferred[side].possible_trivial_merges,
+				      pi.string, renames->dir_rename_mask);
+			renames->dir_rename_mask = prev_dir_rename_mask;
+			return mask;
+		}
+
+		/* We need to recurse */
 		ci->match_mask &= filemask;
 		newinfo = *info;
 		newinfo.prev = info;
@@ -1202,7 +1230,6 @@ static int collect_merge_info_callback(int n,
 	return mask;
 }
 
-MAYBE_UNUSED
 static int handle_deferred_entries(struct merge_options *opt,
 				   struct traverse_info *info)
 {
@@ -1291,6 +1318,8 @@ static int collect_merge_info(struct merge_options *opt,
 
 	trace2_region_enter("merge", "traverse_trees", opt->repo);
 	ret = traverse_trees(NULL, 3, t, &info);
+	if (ret == 0)
+		ret = handle_deferred_entries(opt, &info);
 	trace2_region_leave("merge", "traverse_trees", opt->repo);
 
 	return ret;
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v4 6/7] merge-ort: avoid recursing into directories when we don't need to
  2021-07-21  4:23     ` [PATCH v4 " Elijah Newren via GitGitGadget
                         ` (4 preceding siblings ...)
  2021-07-21  4:24       ` [PATCH v4 5/7] merge-ort: defer recursing into directories when merge base is matched Elijah Newren via GitGitGadget
@ 2021-07-21  4:24       ` Elijah Newren via GitGitGadget
  2021-07-21  4:24       ` [PATCH v4 7/7] merge-ort: restart merge with cached renames to reduce process entry cost Elijah Newren via GitGitGadget
  6 siblings, 0 replies; 52+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-07-21  4:24 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Ævar Arnfjörð Bjarmason,
	Elijah Newren, Bagas Sanjaya, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

This combines the work of the last several patches, and implements the
conditions when we don't need to recurse into directories.  It's perhaps
easiest to see the logic by separating the fact that a directory might
have both rename sources and rename destinations:

  * rename sources: only files present in the merge base can serve as
    rename sources, and only when one side deletes that file.  When the
    tree on one side matches the merge base, that means every file
    within the subtree matches the merge base.  This means that the
    skip-irrelevant-rename-detection optimization from before kicks in
    and we don't need any of these files as rename sources.

  * rename destinations: the tree that does not match the merge base
    might have newly added and hence unmatched destination files.
    This is what usually prevents us from doing trivial directory
    resolutions in the merge machinery.  However, the fact that we have
    deferred recursing into this directory until the end means we know
    whether there are any unmatched relevant potential rename sources
    elsewhere in this merge.  If there are no unmatched such relevant
    sources anywhere, then there is no need to look for unmatched
    potential rename destinations to match them with.

This informs our algorithm:
  * Search through relevant_sources; if we have entries, they better all
    be reflected in cached_pairs or cached_irrelevant, otherwise they
    represent an unmatched potential rename source (causing the
    optimization to be disallowed).
  * For any relevant_source represented in cached_pairs, we do need to
    to make sure to get the destination for each source, meaning we need
    to recurse into any ancestor directories of those destinations.
  * Once we've recursed into all the rename destinations for any
    relevant_sources in cached_pairs, we can then do the trivial
    directory resolution for the remaining directories.

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:

                            Before                  After
    no-renames:        5.235 s ±  0.042 s   205.1  ms ±  3.8  ms
    mega-renames:      9.419 s ±  0.107 s     1.564 s ±  0.010 s
    just-one-mega:   480.1  ms ±  3.9  ms   479.5  ms ±  3.9  ms

Acked-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 102 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 99 insertions(+), 3 deletions(-)

diff --git a/merge-ort.c b/merge-ort.c
index dbccf8c62e2..a013708fa79 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -1230,6 +1230,18 @@ static int collect_merge_info_callback(int n,
 	return mask;
 }
 
+static void resolve_trivial_directory_merge(struct conflict_info *ci, int side)
+{
+	VERIFY_CI(ci);
+	assert((side == 1 && ci->match_mask == 5) ||
+	       (side == 2 && ci->match_mask == 3));
+	oidcpy(&ci->merged.result.oid, &ci->stages[side].oid);
+	ci->merged.result.mode = ci->stages[side].mode;
+	ci->merged.is_null = is_null_oid(&ci->stages[side].oid);
+	ci->match_mask = 0;
+	ci->merged.clean = 1; /* (ci->filemask == 0); */
+}
+
 static int handle_deferred_entries(struct merge_options *opt,
 				   struct traverse_info *info)
 {
@@ -1239,9 +1251,72 @@ static int handle_deferred_entries(struct merge_options *opt,
 	int side, ret = 0;
 
 	for (side = MERGE_SIDE1; side <= MERGE_SIDE2; side++) {
-		renames->deferred[side].trivial_merges_okay = 0;
-		strintmap_for_each_entry(&renames->deferred[side].possible_trivial_merges,
-					 &iter, entry) {
+		unsigned optimization_okay = 1;
+		struct strintmap copy;
+
+		/* Loop over the set of paths we need to know rename info for */
+		strset_for_each_entry(&renames->relevant_sources[side],
+				      &iter, entry) {
+			char *rename_target, *dir, *dir_marker;
+			struct strmap_entry *e;
+
+			/*
+			 * If we don't know delete/rename info for this path,
+			 * then we need to recurse into all trees to get all
+			 * adds to make sure we have it.
+			 */
+			if (strset_contains(&renames->cached_irrelevant[side],
+					    entry->key))
+				continue;
+			e = strmap_get_entry(&renames->cached_pairs[side],
+					     entry->key);
+			if (!e) {
+				optimization_okay = 0;
+				break;
+			}
+
+			/* If this is a delete, we have enough info already */
+			rename_target = e->value;
+			if (!rename_target)
+				continue;
+
+			/* If we already walked the rename target, we're good */
+			if (strmap_contains(&opt->priv->paths, rename_target))
+				continue;
+
+			/*
+			 * Otherwise, we need to get a list of directories that
+			 * will need to be recursed into to get this
+			 * rename_target.
+			 */
+			dir = xstrdup(rename_target);
+			while ((dir_marker = strrchr(dir, '/'))) {
+				*dir_marker = '\0';
+				if (strset_contains(&renames->deferred[side].target_dirs,
+						    dir))
+					break;
+				strset_add(&renames->deferred[side].target_dirs,
+					   dir);
+			}
+			free(dir);
+		}
+		renames->deferred[side].trivial_merges_okay = optimization_okay;
+		/*
+		 * We need to recurse into any directories in
+		 * possible_trivial_merges[side] found in target_dirs[side].
+		 * But when we recurse, we may need to queue up some of the
+		 * subdirectories for possible_trivial_merges[side].  Since
+		 * we can't safely iterate through a hashmap while also adding
+		 * entries, move the entries into 'copy', iterate over 'copy',
+		 * and then we'll also iterate anything added into
+		 * possible_trivial_merges[side] once this loop is done.
+		 */
+		copy = renames->deferred[side].possible_trivial_merges;
+		strintmap_init_with_options(&renames->deferred[side].possible_trivial_merges,
+					    0,
+					    NULL,
+					    0);
+		strintmap_for_each_entry(&copy, &iter, entry) {
 			const char *path = entry->key;
 			unsigned dir_rename_mask = (intptr_t)entry->value;
 			struct conflict_info *ci;
@@ -1254,6 +1329,13 @@ static int handle_deferred_entries(struct merge_options *opt,
 			VERIFY_CI(ci);
 			dirmask = ci->dirmask;
 
+			if (optimization_okay &&
+			    !strset_contains(&renames->deferred[side].target_dirs,
+					     path)) {
+				resolve_trivial_directory_merge(ci, side);
+				continue;
+			}
+
 			info->name = path;
 			info->namelen = strlen(path);
 			info->pathlen = info->namelen + 1;
@@ -1289,6 +1371,20 @@ static int handle_deferred_entries(struct merge_options *opt,
 			if (ret < 0)
 				return ret;
 		}
+		strintmap_clear(&copy);
+		strintmap_for_each_entry(&renames->deferred[side].possible_trivial_merges,
+					 &iter, entry) {
+			const char *path = entry->key;
+			struct conflict_info *ci;
+
+			ci = strmap_get(&opt->priv->paths, path);
+			VERIFY_CI(ci);
+
+			assert(renames->deferred[side].trivial_merges_okay &&
+			       !strset_contains(&renames->deferred[side].target_dirs,
+						path));
+			resolve_trivial_directory_merge(ci, side);
+		}
 	}
 	return ret;
 }
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v4 7/7] merge-ort: restart merge with cached renames to reduce process entry cost
  2021-07-21  4:23     ` [PATCH v4 " Elijah Newren via GitGitGadget
                         ` (5 preceding siblings ...)
  2021-07-21  4:24       ` [PATCH v4 6/7] merge-ort: avoid recursing into directories when we don't need to Elijah Newren via GitGitGadget
@ 2021-07-21  4:24       ` Elijah Newren via GitGitGadget
  6 siblings, 0 replies; 52+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-07-21  4:24 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Ævar Arnfjörð Bjarmason,
	Elijah Newren, Bagas Sanjaya, Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

The merge algorithm mostly consists of the following three functions:
   collect_merge_info()
   detect_and_process_renames()
   process_entries()
Prior to the trivial directory resolution optimization of the last half
dozen commits, process_entries() was consistently the slowest, followed
by collect_merge_info(), then detect_and_process_renames().  When the
trivial directory resolution applies, it often dramatically decreases
the amount of time spent in the two slower functions.

Looking at the performance results in the previous commit, the trivial
directory resolution optimization helps amazingly well when there are no
relevant renames.  It also helps really well when reapplying a long
series of linear commits (such as in a rebase or cherry-pick), since the
relevant renames may well be cached from the first reapplied commit.
But when there are any relevant renames that are not cached (represented
by the just-one-mega testcase), then the optimization does not help at
all.

Often, I noticed that when the optimization does not apply, it is
because there are a handful of relevant sources -- maybe even only one.
It felt frustrating to need to recurse into potentially hundreds or even
thousands of directories just for a single rename, but it was needed for
correctness.

However, staring at this list of functions and noticing that
process_entries() is the most expensive and knowing I could avoid it if
I had cached renames suggested a simple idea: change
   collect_merge_info()
   detect_and_process_renames()
   process_entries()
into
   collect_merge_info()
   detect_and_process_renames()
   <cache all the renames, and restart>
   collect_merge_info()
   detect_and_process_renames()
   process_entries()

This may seem odd and look like more work.  However, note that although
we run collect_merge_info() twice, the second time we get to employ
trivial directory resolves, which makes it much faster, so the increased
time in collect_merge_info() is small.  While we run
detect_and_process_renames() again, all renames are cached so it's
nearly a no-op (we don't call into diffcore_rename_extended() but we do
have a little bit of data structure checking and fixing up).  And the
big payoff comes from the fact that process_entries(), will be much
faster due to having far fewer entries to process.

This restarting only makes sense if we can save recursing into enough
directories to make it worth our while.  Introduce a simple heuristic to
guide this.  Note that this heuristic uses a "wanted_factor" that I have
virtually no actual real world data for, just some back-of-the-envelope
quasi-scientific calculations that I included in some comments and then
plucked a simple round number out of thin air.  It could be that
tweaking this number to make it either higher or lower improves the
optimization.  (There's slightly more here; when I first introduced this
optimization, I used a factor of 10, because I was completely confident
it was big enough to not cause slowdowns in special cases.  I was
certain it was higher than needed.  Several months later, I added the
rough calculations which make me think the optimal number is close to 2;
but instead of pushing to the limit, I just bumped it to 3 to reduce the
risk that there are special cases where this optimization can result in
slowing down the code a little.  If the ratio of path counts is below 3,
we probably will only see minor performance improvements at best
anyway.)

Also, note that while the diffstat looks kind of long (nearly 100
lines), more than half of it is in two comments explaining how things
work.

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:

                            Before                  After
    no-renames:      205.1  ms ±  3.8  ms   204.2  ms ±  3.0  ms
    mega-renames:      1.564 s ±  0.010 s     1.076 s ±  0.015 s
    just-one-mega:   479.5  ms ±  3.9  ms   364.1  ms ±  7.0  ms

Acked-by: Derrick Stolee <stolee@gmail.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c                         | 92 +++++++++++++++++++++++++++--
 t/t6423-merge-rename-directories.sh |  2 +-
 2 files changed, 87 insertions(+), 7 deletions(-)

diff --git a/merge-ort.c b/merge-ort.c
index a013708fa79..e361443087a 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -213,6 +213,7 @@ struct rename_info {
 	 *   MERGE_SIDE2: cached data from side2 can be reused
 	 *   MERGE_SIDE1: cached data from side1 can be reused
 	 *   0:           no cached data can be reused
+	 *   -1:          See redo_after_renames; both sides can be reused.
 	 */
 	int cached_pairs_valid_side;
 
@@ -258,6 +259,28 @@ struct rename_info {
 	 */
 	struct strset cached_irrelevant[3];
 
+	/*
+	 * redo_after_renames: optimization flag for "restarting" the merge
+	 *
+	 * Sometimes it pays to detect renames, cache them, and then
+	 * restart the merge operation from the beginning.  The reason for
+	 * this is that when we know where all the renames are, we know
+	 * whether a certain directory has any paths under it affected --
+	 * and if a directory is not affected then it permits us to do
+	 * trivial tree merging in more cases.  Doing trivial tree merging
+	 * prevents the need to run process_entry() on every path
+	 * underneath trees that can be trivially merged, and
+	 * process_entry() is more expensive than collect_merge_info() --
+	 * plus, the second collect_merge_info() will be much faster since
+	 * it doesn't have to recurse into the relevant trees.
+	 *
+	 * Values for this flag:
+	 *   0 = don't bother, not worth it (or conditions not yet checked)
+	 *   1 = conditions for optimization met, optimization worthwhile
+	 *   2 = we already did it (don't restart merge yet again)
+	 */
+	unsigned redo_after_renames;
+
 	/*
 	 * needed_limit: value needed for inexact rename detection to run
 	 *
@@ -541,7 +564,8 @@ static void clear_or_reinit_internal_opts(struct merge_options_internal *opti,
 		strintmap_func(&renames->relevant_sources[i]);
 		if (!reinitialize)
 			assert(renames->cached_pairs_valid_side == 0);
-		if (i != renames->cached_pairs_valid_side) {
+		if (i != renames->cached_pairs_valid_side &&
+		    -1 != renames->cached_pairs_valid_side) {
 			strset_func(&renames->cached_target_names[i]);
 			strmap_func(&renames->cached_pairs[i], 1);
 			strset_func(&renames->cached_irrelevant[i]);
@@ -1249,7 +1273,9 @@ static int handle_deferred_entries(struct merge_options *opt,
 	struct hashmap_iter iter;
 	struct strmap_entry *entry;
 	int side, ret = 0;
+	int path_count_before, path_count_after = 0;
 
+	path_count_before = strmap_get_size(&opt->priv->paths);
 	for (side = MERGE_SIDE1; side <= MERGE_SIDE2; side++) {
 		unsigned optimization_okay = 1;
 		struct strintmap copy;
@@ -1385,7 +1411,32 @@ static int handle_deferred_entries(struct merge_options *opt,
 						path));
 			resolve_trivial_directory_merge(ci, side);
 		}
+		if (!optimization_okay || path_count_after)
+			path_count_after = strmap_get_size(&opt->priv->paths);
 	}
+	if (path_count_after) {
+		/*
+		 * The choice of wanted_factor here does not affect
+		 * correctness, only performance.  When the
+		 *    path_count_after / path_count_before
+		 * ratio is high, redoing after renames is a big
+		 * performance boost.  I suspect that redoing is a wash
+		 * somewhere near a value of 2, and below that redoing will
+		 * slow things down.  I applied a fudge factor and picked
+		 * 3; see the commit message when this was introduced for
+		 * back of the envelope calculations for this ratio.
+		 */
+		const int wanted_factor = 3;
+
+		/* We should only redo collect_merge_info one time */
+		assert(renames->redo_after_renames == 0);
+
+		if (path_count_after / path_count_before >= wanted_factor) {
+			renames->redo_after_renames = 1;
+			renames->cached_pairs_valid_side = -1;
+		}
+	} else if (renames->redo_after_renames == 2)
+		renames->redo_after_renames = 0;
 	return ret;
 }
 
@@ -2828,8 +2879,8 @@ static int compare_pairs(const void *a_, const void *b_)
 }
 
 /* Call diffcore_rename() to update deleted/added pairs into rename pairs */
-static void detect_regular_renames(struct merge_options *opt,
-				   unsigned side_index)
+static int detect_regular_renames(struct merge_options *opt,
+				  unsigned side_index)
 {
 	struct diff_options diff_opts;
 	struct rename_info *renames = &opt->priv->renames;
@@ -2842,7 +2893,7 @@ static void detect_regular_renames(struct merge_options *opt,
 		 * side had directory renames.
 		 */
 		resolve_diffpair_statuses(&renames->pairs[side_index]);
-		return;
+		return 0;
 	}
 
 	partial_clear_dir_rename_count(&renames->dir_rename_count[side_index]);
@@ -2868,6 +2919,8 @@ static void detect_regular_renames(struct merge_options *opt,
 	trace2_region_leave("diff", "diffcore_rename", opt->repo);
 	resolve_diffpair_statuses(&diff_queued_diff);
 
+	if (diff_opts.needed_rename_limit > 0)
+		renames->redo_after_renames = 0;
 	if (diff_opts.needed_rename_limit > renames->needed_limit)
 		renames->needed_limit = diff_opts.needed_rename_limit;
 
@@ -2877,6 +2930,8 @@ static void detect_regular_renames(struct merge_options *opt,
 	diff_queued_diff.nr = 0;
 	diff_queued_diff.queue = NULL;
 	diff_flush(&diff_opts);
+
+	return 1;
 }
 
 /*
@@ -2966,14 +3021,32 @@ static int detect_and_process_renames(struct merge_options *opt,
 	struct diff_queue_struct combined;
 	struct rename_info *renames = &opt->priv->renames;
 	int need_dir_renames, s, clean = 1;
+	unsigned detection_run = 0;
 
 	memset(&combined, 0, sizeof(combined));
 	if (!possible_renames(renames))
 		goto cleanup;
 
 	trace2_region_enter("merge", "regular renames", opt->repo);
-	detect_regular_renames(opt, MERGE_SIDE1);
-	detect_regular_renames(opt, MERGE_SIDE2);
+	detection_run |= detect_regular_renames(opt, MERGE_SIDE1);
+	detection_run |= detect_regular_renames(opt, MERGE_SIDE2);
+	if (renames->redo_after_renames && detection_run) {
+		int i, side;
+		struct diff_filepair *p;
+
+		/* Cache the renames, we found */
+		for (side = MERGE_SIDE1; side <= MERGE_SIDE2; side++) {
+			for (i = 0; i < renames->pairs[side].nr; ++i) {
+				p = renames->pairs[side].queue[i];
+				possibly_cache_new_pair(renames, p, side, NULL);
+			}
+		}
+
+		/* Restart the merge with the cached renames */
+		renames->redo_after_renames = 2;
+		trace2_region_leave("merge", "regular renames", opt->repo);
+		goto cleanup;
+	}
 	use_cached_pairs(opt, &renames->cached_pairs[1], &renames->pairs[1]);
 	use_cached_pairs(opt, &renames->cached_pairs[2], &renames->pairs[2]);
 	trace2_region_leave("merge", "regular renames", opt->repo);
@@ -4390,6 +4463,7 @@ static void merge_ort_nonrecursive_internal(struct merge_options *opt,
 					       opt->subtree_shift);
 	}
 
+redo:
 	trace2_region_enter("merge", "collect_merge_info", opt->repo);
 	if (collect_merge_info(opt, merge_base, side1, side2) != 0) {
 		/*
@@ -4409,6 +4483,12 @@ static void merge_ort_nonrecursive_internal(struct merge_options *opt,
 	result->clean = detect_and_process_renames(opt, merge_base,
 						   side1, side2);
 	trace2_region_leave("merge", "renames", opt->repo);
+	if (opt->priv->renames.redo_after_renames == 2) {
+		trace2_region_enter("merge", "reset_maps", opt->repo);
+		clear_or_reinit_internal_opts(opt->priv, 1);
+		trace2_region_leave("merge", "reset_maps", opt->repo);
+		goto redo;
+	}
 
 	trace2_region_enter("merge", "process_entries", opt->repo);
 	process_entries(opt, &working_tree_oid);
diff --git a/t/t6423-merge-rename-directories.sh b/t/t6423-merge-rename-directories.sh
index e834b7e6efe..d8919d276a1 100755
--- a/t/t6423-merge-rename-directories.sh
+++ b/t/t6423-merge-rename-directories.sh
@@ -4797,7 +4797,7 @@ test_setup_12f () {
 	)
 }
 
-test_expect_merge_algorithm failure failure '12f: Trivial directory resolve, caching, all kinds of fun' '
+test_expect_merge_algorithm failure success '12f: Trivial directory resolve, caching, all kinds of fun' '
 	test_setup_12f &&
 	(
 		cd 12f &&
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2021-07-21  4:25 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-01  3:46 [PATCH 0/7] Optimization batch 14: trivial directory resolution Elijah Newren via GitGitGadget
2021-07-01  3:46 ` [PATCH 1/7] merge-ort: resolve paths early when we have sufficient information Elijah Newren via GitGitGadget
2021-07-01  3:46 ` [PATCH 2/7] merge-ort: add some more explanations in collect_merge_info_callback() Elijah Newren via GitGitGadget
2021-07-01  3:46 ` [PATCH 3/7] merge-ort: add data structures for allowable trivial directory resolves Elijah Newren via GitGitGadget
2021-07-01  3:46 ` [PATCH 4/7] merge-ort: add a handle_deferred_entries() helper function Elijah Newren via GitGitGadget
2021-07-01  3:46 ` [PATCH 5/7] merge-ort: defer recursing into directories when merge base is matched Elijah Newren via GitGitGadget
2021-07-01  3:46 ` [PATCH 6/7] merge-ort: avoid recursing into directories when we don't need to Elijah Newren via GitGitGadget
2021-07-01  3:46 ` [PATCH 7/7] merge-ort: restart merge with cached renames to reduce process entry cost Elijah Newren via GitGitGadget
2021-07-01 13:21 ` [PATCH 0/7] Optimization batch 14: trivial directory resolution Ævar Arnfjörð Bjarmason
2021-07-01 15:04   ` Elijah Newren
2021-07-01 19:22     ` Elijah Newren
2021-07-13 19:32 ` [PATCH v2 " Elijah Newren via GitGitGadget
2021-07-13 19:32   ` [PATCH v2 1/7] merge-ort: resolve paths early when we have sufficient information Elijah Newren via GitGitGadget
2021-07-13 19:32   ` [PATCH v2 2/7] merge-ort: add some more explanations in collect_merge_info_callback() Elijah Newren via GitGitGadget
2021-07-13 23:34     ` Bagas Sanjaya
2021-07-14  0:19       ` Elijah Newren
2021-07-13 19:32   ` [PATCH v2 3/7] merge-ort: add data structures for allowable trivial directory resolves Elijah Newren via GitGitGadget
2021-07-15 13:54     ` Derrick Stolee
2021-07-15 15:54       ` Elijah Newren
2021-07-13 19:33   ` [PATCH v2 4/7] merge-ort: add a handle_deferred_entries() helper function Elijah Newren via GitGitGadget
2021-07-15 14:32     ` Derrick Stolee
2021-07-15 15:59       ` Elijah Newren
2021-07-13 19:33   ` [PATCH v2 5/7] merge-ort: defer recursing into directories when merge base is matched Elijah Newren via GitGitGadget
2021-07-15 14:43     ` Derrick Stolee
2021-07-15 16:03       ` Elijah Newren
2021-07-15 17:14         ` Derrick Stolee
2021-07-13 19:33   ` [PATCH v2 6/7] merge-ort: avoid recursing into directories when we don't need to Elijah Newren via GitGitGadget
2021-07-15 14:55     ` Derrick Stolee
2021-07-15 16:28       ` Elijah Newren
2021-07-13 19:33   ` [PATCH v2 7/7] merge-ort: restart merge with cached renames to reduce process entry cost Elijah Newren via GitGitGadget
2021-07-15 15:09     ` Derrick Stolee
2021-07-15 16:53       ` Elijah Newren
2021-07-15 17:19         ` Derrick Stolee
2021-07-15 17:32           ` Elijah Newren
2021-07-16  5:22   ` [PATCH v3 0/7] Optimization batch 14: trivial directory resolution Elijah Newren via GitGitGadget
2021-07-16  5:22     ` [PATCH v3 1/7] merge-ort: resolve paths early when we have sufficient information Elijah Newren via GitGitGadget
2021-07-16  5:22     ` [PATCH v3 2/7] merge-ort: add some more explanations in collect_merge_info_callback() Elijah Newren via GitGitGadget
2021-07-16  5:22     ` [PATCH v3 3/7] merge-ort: add data structures for allowable trivial directory resolves Elijah Newren via GitGitGadget
2021-07-16  5:22     ` [PATCH v3 4/7] merge-ort: add a handle_deferred_entries() helper function Elijah Newren via GitGitGadget
2021-07-16  5:22     ` [PATCH v3 5/7] merge-ort: defer recursing into directories when merge base is matched Elijah Newren via GitGitGadget
2021-07-16  5:22     ` [PATCH v3 6/7] merge-ort: avoid recursing into directories when we don't need to Elijah Newren via GitGitGadget
2021-07-16  5:22     ` [PATCH v3 7/7] merge-ort: restart merge with cached renames to reduce process entry cost Elijah Newren via GitGitGadget
2021-07-20 13:00     ` [PATCH v3 0/7] Optimization batch 14: trivial directory resolution Derrick Stolee
2021-07-20 21:43       ` Junio C Hamano
2021-07-21  4:23     ` [PATCH v4 " Elijah Newren via GitGitGadget
2021-07-21  4:23       ` [PATCH v4 1/7] merge-ort: resolve paths early when we have sufficient information Elijah Newren via GitGitGadget
2021-07-21  4:23       ` [PATCH v4 2/7] merge-ort: add some more explanations in collect_merge_info_callback() Elijah Newren via GitGitGadget
2021-07-21  4:24       ` [PATCH v4 3/7] merge-ort: add data structures for allowable trivial directory resolves Elijah Newren via GitGitGadget
2021-07-21  4:24       ` [PATCH v4 4/7] merge-ort: add a handle_deferred_entries() helper function Elijah Newren via GitGitGadget
2021-07-21  4:24       ` [PATCH v4 5/7] merge-ort: defer recursing into directories when merge base is matched Elijah Newren via GitGitGadget
2021-07-21  4:24       ` [PATCH v4 6/7] merge-ort: avoid recursing into directories when we don't need to Elijah Newren via GitGitGadget
2021-07-21  4:24       ` [PATCH v4 7/7] merge-ort: restart merge with cached renames to reduce process entry cost Elijah Newren via GitGitGadget

git@vger.kernel.org list mirror (unofficial, one of many)

This inbox may be cloned and mirrored by anyone:

	git clone --mirror https://public-inbox.org/git
	git clone --mirror http://ou63pmih66umazou.onion/git
	git clone --mirror http://czquwvybam4bgbro.onion/git
	git clone --mirror http://hjrcffqmbrq6wope.onion/git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V1 git git/ https://public-inbox.org/git \
		git@vger.kernel.org
	public-inbox-index git

Example config snippet for mirrors.
Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.version-control.git
	nntp://7fh6tueqddpjyxjmgtdiueylzoqt6pt7hec3pukyptlmohoowvhde4yd.onion/inbox.comp.version-control.git
	nntp://ie5yzdi7fg72h7s4sdcztq5evakq23rdt33mfyfcddc5u3ndnw24ogqd.onion/inbox.comp.version-control.git
	nntp://4uok3hntl7oi7b4uf4rtfwefqeexfzil2w6kgk2jn5z2f764irre7byd.onion/inbox.comp.version-control.git
	nntp://news.gmane.io/gmane.comp.version-control.git
 note: .onion URLs require Tor: https://www.torproject.org/

code repositories for project(s) associated with this inbox:

	https://80x24.org/mirrors/git.git

AGPL code for this site: git clone https://public-inbox.org/public-inbox.git