git@vger.kernel.org list mirror (unofficial, one of many)
* [PATCH 00/10] Optimization batch 8: use file basenames even more
@ 2021-02-14  7:58 Elijah Newren via GitGitGadget
  2021-02-14  7:58 ` [PATCH 01/10] Move computation of dir_rename_count from merge-ort to diffcore-rename Elijah Newren via GitGitGadget
                   ` (10 more replies)
  0 siblings, 11 replies; 61+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-14  7:58 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren

This series depends on en/diffcore-rename (a concatenation of what I was
calling ort-perf-batch-6 and ort-perf-batch-7).

I apologize for submitting this so soon after a re-roll of the other two
series, but ideas from it kept coming up in the discussions, so I figured
it'd help to just have them all be submitted. This series can also be rolled
into en/diffcore-rename if it helps simplify.

=== Optimization idea ===

This series uses file basenames (portions of the path after the last '/',
including file extension) in a more involved fashion to guide rename
detection. It's a follow-on improvement to "Optimization #3" from my Git
Merge 2020 talk[1]. The basic idea behind this series is the same as the
last series: people frequently move files across directories while keeping
the filenames the same, thus files with the same basename are likely rename
candidates. However, the previous optimization only applies when basenames
are unique among remaining adds and deletes after exact rename detection, so
we need to do something else to match up the remaining basenames. When there
are many files with the same basename (e.g. .gitignore, Makefile,
build.gradle, or maybe even setup.c, AbstractFactory.java, etc.), being able
to "guess" which directory a given file likely would have moved to can
provide us with a likely rename candidate if there is a file with the same
basename in that directory. Since exact rename detection is done first, we
can use nearby exact renames to help us guess where any given non-unique
basename file may have moved; it just means doing "directory rename
detection" limited to exact renames.
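
As a rough illustration of the idea above, the following C sketch shows how
one nearby exact rename could suggest where to look for a file with a
non-unique basename. The helper names (basename_of, guess_candidate) are
hypothetical and not part of this series. The series itself bases its guess
on counts gathered from all exact renames out of a directory, and it still
compares the two files against min_basename_score before pairing them.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return the part of path after the last '/', or the whole path if none. */
static const char *basename_of(const char *path)
{
	const char *slash = strrchr(path, '/');
	return slash ? slash + 1 : path;
}

/*
 * Hypothetical helper: given one exact rename (exact_old -> exact_new) and
 * an unmatched deleted file from the same directory as exact_old, write
 * the path where we would look for a rename candidate into out.
 * Returns 0 on success, -1 if the directories differ or out is too small.
 */
static int guess_candidate(const char *exact_old, const char *exact_new,
			   const char *deleted, char *out, size_t outlen)
{
	size_t old_dir_len = basename_of(exact_old) - exact_old;
	size_t new_dir_len = basename_of(exact_new) - exact_new;
	size_t del_dir_len = basename_of(deleted) - deleted;
	int n;

	/* The deleted file must live in the directory the exact rename left */
	if (del_dir_len != old_dir_len ||
	    strncmp(deleted, exact_old, old_dir_len))
		return -1;

	/* Predicted path = new directory prefix + unchanged basename */
	n = snprintf(out, outlen, "%.*s%s", (int)new_dir_len, exact_new,
		     basename_of(deleted));
	return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

With the exact rename src/old/util.c -> lib/new/util.c, calling
guess_candidate() for the deleted file src/old/Makefile fills the buffer
with lib/new/Makefile. That path only matters if an unmatched add actually
exists there.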

There are definitely cases when this strategy still won't help us: (1) We
only use this strategy when the directory in which the original file was
found has also been removed, (2) a lack of exact renames from the given
directory will prevent us from making a new directory prediction, (3) even
if we predict a new directory there may be no file with the given basename
in it, and (4) even if there is an unmatched add with the appropriate
basename in the predicted directory, it may not meet the higher
min_basename_score similarity threshold.

It may be worth noting that directory rename detection at most predicts one
new directory, which we use to ensure that we only compare any given file
with at most one other file. That's important for compatibility with future
optimizations.
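
One way to picture "at most one new directory" is a simple majority pick
over the exact-rename counts for a given old directory. The sketch below
assumes a flat array of hypothetical dir_count records rather than the
strmap/strintmap structures the patches use, and its tie-breaking rule
(earliest entry wins) is an assumption of the sketch, not something this
cover letter specifies.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record: number of exact renames from one old dir to new_dir */
struct dir_count {
	const char *new_dir;
	int count;
};

/*
 * Return the destination directory with the highest count, or NULL if
 * there are no entries.  Ties go to the earliest entry (an assumption of
 * this sketch).
 */
static const char *pick_new_dir(const struct dir_count *counts, size_t nr)
{
	const char *best = NULL;
	int best_count = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		if (counts[i].count > best_count) {
			best_count = counts[i].count;
			best = counts[i].new_dir;
		}
	}
	return best;
}
```

Because only one directory comes back, each remaining file is compared with
at most one candidate: the file with the same basename in that directory,
if there is one.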

However, despite the caveats and limited applicability, this idea provides
some nice speedups.

=== Results ===

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28), the
changes in just this series improve the performance as follows:

                     Before Series           After Series
no-renames:       12.775 s ±  0.062 s    12.596 s ±  0.061 s
mega-renames:    188.754 s ±  0.284 s   130.465 s ±  0.259 s
just-one-mega:     5.599 s ±  0.019 s     3.958 s ±  0.010 s


As a reminder, before any merge-ort/diffcore-rename performance work, the
performance results we started with (as noted in the same commit message)
were:

no-renames-am:      6.940 s ±  0.485 s
no-renames:        18.912 s ±  0.174 s
mega-renames:    5964.031 s ± 10.459 s
just-one-mega:    149.583 s ±  0.751 s


=== Alternative, rejected idea ===

There was an alternative idea to the series presented here that I also
tried: instead of using directory rename detection based on exact renames to
predict where files would be renamed and then comparing to the file with the
same basename in the new directory, one could instead take all files with
the same basename -- both sources and destinations -- and then do a smaller
M x N comparison on all those files to find renames. Any non-matches after
that step could be combined with all other files for the big inexact rename
detection step.

There are two problems with such a strategy, though.

One is that in the worst case, you approximately double the cost of rename
detection (if most potential rename pairs all have the same basename but
they aren't actually matches, you end up comparing twice).

The second issue isn't clear until trying to combine this idea with later
performance optimizations. The next optimization will provide a way to
filter out several of the rename sources. If our inexact rename detection
matrix is sized 1 x 4000 because we can remove all but one source file, but
we have 100 files with the same basename, then a 100 x 100 comparison is
actually more costly than a 1 x 4000 comparison -- and we don't need most of
the renames from the 100 x 100 comparison. The advantage of using directory
rename detection to find which basenames to match up is that the cost for
each file is linear (or, said another way, scales proportionally to doing a
diff on that file). As such, the costs for this preliminary
optimization are nicely controlled and the worst case scenario is it has
spent a little extra time upfront but still has to do the full inexact
rename detection.
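
To make the arithmetic in the example above concrete, here is a small C
sketch that counts pairwise comparisons. Both function names are
hypothetical, and the model is deliberately crude: it counts file pairs
only and ignores the per-pair cost of the similarity computation.

```c
#include <assert.h>

/* Number of pairwise similarity comparisons in an M x N rename matrix. */
static unsigned long matrix_cost(unsigned long sources, unsigned long dests)
{
	return sources * dests;
}

/*
 * Directory-guided matching compares each remaining source with at most
 * one destination, so the comparison count is bounded by the number of
 * sources.
 */
static unsigned long guided_cost(unsigned long sources)
{
	return sources;
}
```

Here matrix_cost(100, 100) is 10000, more than twice matrix_cost(1, 4000),
which is 4000. guided_cost(100) is at most 100 even when every basename
collides.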

[1]
https://github.com/newren/presentations/blob/pdfs/merge-performance/merge-performance-slides.pdf

Elijah Newren (10):
  Move computation of dir_rename_count from merge-ort to diffcore-rename
  diffcore-rename: add functions for clearing dir_rename_count
  diffcore-rename: move dir_rename_counts into a dir_rename_info struct
  diffcore-rename: extend cleanup_dir_rename_info()
  diffcore-rename: compute dir_rename_counts in stages
  diffcore-rename: add a mapping of destination names to their indices
  diffcore-rename: add a dir_rename_guess field to dir_rename_info
  diffcore-rename: add a new idx_possible_rename function
  diffcore-rename: limit dir_rename_counts computation to relevant dirs
  diffcore-rename: use directory rename guided basename comparisons

 Documentation/gitdiffcore.txt |   2 +-
 diffcore-rename.c             | 439 ++++++++++++++++++++++++++++++++--
 diffcore.h                    |   7 +
 merge-ort.c                   | 144 +----------
 4 files changed, 439 insertions(+), 153 deletions(-)


base-commit: aeca14f748afc7fb5b65bca56ea2ebd970729814
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-844%2Fnewren%2Fort-perf-batch-8-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-844/newren/ort-perf-batch-8-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/844
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH 01/10] Move computation of dir_rename_count from merge-ort to diffcore-rename
  2021-02-14  7:58 [PATCH 00/10] Optimization batch 8: use file basenames even more Elijah Newren via GitGitGadget
@ 2021-02-14  7:58 ` Elijah Newren via GitGitGadget
  2021-02-14  7:58 ` [PATCH 02/10] diffcore-rename: add functions for clearing dir_rename_count Elijah Newren via GitGitGadget
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-14  7:58 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

A previous commit noted that it is very common for people to move files
across directories while keeping their filename the same.  The last few
commits took advantage of this and showed that we can accelerate rename
detection significantly using basenames; since files with the same
basename serve as likely rename candidates, we can check those first and
remove them from the rename candidate pool if they are sufficiently
similar.

Unfortunately, the previous optimization was limited by the fact that
the remaining basenames after exact rename detection are not always
unique.  Many repositories have hundreds of build files with the same
name (e.g. Makefile, .gitignore, build.gradle, etc.), and may even have
hundreds of source files with the same name.  (For example, the Linux
kernel has 100 setup.c, 87 irq.c, and 112 core.c files.  A repository at
$DAYJOB has a lot of ObjectFactory.java and Plugin.java files.)

For these files with non-unique basenames, we are faced with the task of
attempting to determine or guess which directory they may have been
relocated to.  Such a task is precisely the job of directory rename
detection.  However, there are two catches: (1) the directory rename
detection code has traditionally been part of the merge machinery rather
than diffcore-rename.c, and (2) directory rename detection currently
runs after regular rename detection is complete.  The 1st catch is just
an implementation issue that can be overcome by some code shuffling.
The 2nd requires us to add a further approximation: we only have access
to exact renames at this point, so we need to do directory rename
detection based on just exact renames.  In some cases we won't have
exact renames, in which case this extra optimization won't apply.  We
also choose to not apply the optimization unless we know that the
underlying directory was removed, which will require extra data to be
passed in to diffcore_rename_extended().  Also, even if we get a
prediction about which directory a file may have relocated to, we will
still need to check to see if there is a file in the predicted
directory, and then compare the two files to see if they meet the higher
min_basename_score threshold required for marking the two files as
renames.

This commit and the next few will set up the necessary infrastructure to
do such computations.  This commit merely moves the computation of
dir_rename_count from merge-ort.c to diffcore-rename.c, making slight
adjustments to the data structures based on the move.  While the
diffstat looks large, viewing this commit with --color-moved makes it
clear that only about 20 lines changed.  With this patch, the
computation of dir_rename_count is still only done after inexact rename
detection, but subsequent commits will add a preliminary computation of
dir_rename_count after exact rename detection, followed by some updates
after inexact rename detection.
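
For readers who want the gist of the dir_rename_count computation without
reading the moved code, here is a simplified, self-contained sketch of the
directory walk.  It is not the code in this patch: it uses fixed-size
buffers and a hypothetical dir_pair array instead of strmap/strintmap, it
omits the dirs_removed check, and it records pairs rather than
incrementing counters.

```c
#include <assert.h>
#include <string.h>

#define MAX_PAIRS 16
#define MAX_PATH_LEN 128

struct dir_pair {
	char old_dir[MAX_PATH_LEN];
	char new_dir[MAX_PATH_LEN];
};

/* Truncate path at its last '/'; a path with no '/' becomes "". */
static void strip_last_component(char *path)
{
	char *slash = strrchr(path, '/');
	if (!slash)
		slash = path;
	*slash = '\0';
}

/* Return the last component of path (the part after the final '/'). */
static const char *last_component(const char *path)
{
	const char *slash = strrchr(path, '/');
	return slash ? slash + 1 : path;
}

/*
 * Collect the directory renames implied by the file rename
 * oldname -> newname into pairs[]; returns how many were found.
 */
static int implied_dir_renames(const char *oldname, const char *newname,
			       struct dir_pair *pairs, int max_pairs)
{
	char old_dir[MAX_PATH_LEN], new_dir[MAX_PATH_LEN];
	int n = 0;

	if (strlen(oldname) >= MAX_PATH_LEN || strlen(newname) >= MAX_PATH_LEN)
		return 0;
	strcpy(old_dir, oldname);
	strcpy(new_dir, newname);

	/* First step: drop the file basenames, which are allowed to differ */
	strip_last_component(old_dir);
	strip_last_component(new_dir);

	while (n < max_pairs) {
		strcpy(pairs[n].old_dir, old_dir);
		strcpy(pairs[n].new_dir, new_dir);
		n++;

		/* Stop at the toplevel directory ("") on either side */
		if (!*old_dir || !*new_dir)
			break;
		/* Keep walking up only while the trailing dir names match */
		if (strcmp(last_component(old_dir), last_component(new_dir)))
			break;
		strip_last_component(old_dir);
		strip_last_component(new_dir);
	}
	return n;
}
```

For the rename a/b/c/d/e/foo.c -> a/b/some/thing/else/e/foo.c this yields
two pairs, a/b/c/d/e => a/b/some/thing/else/e and
a/b/c/d => a/b/some/thing/else, and stops before suggesting
a/b/c => a/b/some/thing.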

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 134 +++++++++++++++++++++++++++++++++++++++++++++-
 diffcore.h        |   5 ++
 merge-ort.c       | 132 ++-------------------------------------------
 3 files changed, 141 insertions(+), 130 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 41558185ae1d..33cfc5848611 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -367,6 +367,125 @@ static int find_exact_renames(struct diff_options *options)
 	return renames;
 }
 
+static void dirname_munge(char *filename)
+{
+	char *slash = strrchr(filename, '/');
+	if (!slash)
+		slash = filename;
+	*slash = '\0';
+}
+
+static void increment_count(struct strmap *dir_rename_count,
+			    char *old_dir,
+			    char *new_dir)
+{
+	struct strintmap *counts;
+	struct strmap_entry *e;
+
+	/* Get the {new_dirs -> counts} mapping using old_dir */
+	e = strmap_get_entry(dir_rename_count, old_dir);
+	if (e) {
+		counts = e->value;
+	} else {
+		counts = xmalloc(sizeof(*counts));
+		strintmap_init_with_options(counts, 0, NULL, 1);
+		strmap_put(dir_rename_count, old_dir, counts);
+	}
+
+	/* Increment the count for new_dir */
+	strintmap_incr(counts, new_dir, 1);
+}
+
+static void update_dir_rename_counts(struct strmap *dir_rename_count,
+				     struct strset *dirs_removed,
+				     const char *oldname,
+				     const char *newname)
+{
+	char *old_dir = xstrdup(oldname);
+	char *new_dir = xstrdup(newname);
+	char new_dir_first_char = new_dir[0];
+	int first_time_in_loop = 1;
+
+	while (1) {
+		dirname_munge(old_dir);
+		dirname_munge(new_dir);
+
+		/*
+		 * When renaming
+		 *   "a/b/c/d/e/foo.c" -> "a/b/some/thing/else/e/foo.c"
+		 * then this suggests that both
+		 *   a/b/c/d/e/ => a/b/some/thing/else/e/
+		 *   a/b/c/d/   => a/b/some/thing/else/
+		 * so we want to increment counters for both.  We do NOT,
+		 * however, also want to suggest that there was the following
+		 * rename:
+		 *   a/b/c/ => a/b/some/thing/
+		 * so we need to quit at that point.
+		 *
+		 * Note that when first_time_in_loop, we only strip off the
+		 * basename, and we don't care if that's different.
+		 */
+		if (!first_time_in_loop) {
+			char *old_sub_dir = strchr(old_dir, '\0')+1;
+			char *new_sub_dir = strchr(new_dir, '\0')+1;
+			if (!*new_dir) {
+				/*
+				 * Special case when renaming to root directory,
+				 * i.e. when new_dir == "".  In this case, we had
+				 * something like
+				 *    a/b/subdir => subdir
+				 * and so dirname_munge() sets things up so that
+				 *    old_dir = "a/b\0subdir\0"
+				 *    new_dir = "\0ubdir\0"
+				 * We didn't have a '/' to overwrite a '\0' onto
+				 * in new_dir, so we have to compare differently.
+				 */
+				if (new_dir_first_char != old_sub_dir[0] ||
+				    strcmp(old_sub_dir+1, new_sub_dir))
+					break;
+			} else {
+				if (strcmp(old_sub_dir, new_sub_dir))
+					break;
+			}
+		}
+
+		if (strset_contains(dirs_removed, old_dir))
+			increment_count(dir_rename_count, old_dir, new_dir);
+		else
+			break;
+
+		/* If we hit toplevel directory ("") for old or new dir, quit */
+		if (!*old_dir || !*new_dir)
+			break;
+
+		first_time_in_loop = 0;
+	}
+
+	/* Free resources we don't need anymore */
+	free(old_dir);
+	free(new_dir);
+}
+
+static void compute_dir_rename_counts(struct strmap *dir_rename_count,
+				      struct strset *dirs_removed)
+{
+	int i;
+
+	/* Set up dir_rename_count */
+	for (i = 0; i < rename_dst_nr; ++i) {
+		/*
+		 * Make dir_rename_count contain a map of a map:
+		 *   old_directory -> {new_directory -> count}
+		 * In other words, for every pair look at the directories for
+		 * the old filename and the new filename and count how many
+		 * times that pairing occurs.
+		 */
+		update_dir_rename_counts(dir_rename_count, dirs_removed,
+					 rename_dst[i].p->one->path,
+					 rename_dst[i].p->two->path);
+	}
+}
+
 static const char *get_basename(const char *filename)
 {
 	/*
@@ -640,7 +759,9 @@ static void remove_unneeded_paths_from_src(int detecting_copies)
 	rename_src_nr = new_num_src;
 }
 
-void diffcore_rename(struct diff_options *options)
+void diffcore_rename_extended(struct diff_options *options,
+			      struct strset *dirs_removed,
+			      struct strmap *dir_rename_count)
 {
 	int detect_rename = options->detect_rename;
 	int minimum_score = options->rename_score;
@@ -653,6 +774,7 @@ void diffcore_rename(struct diff_options *options)
 	struct progress *progress = NULL;
 
 	trace2_region_enter("diff", "setup", options->repo);
+	assert(!dir_rename_count || strmap_empty(dir_rename_count));
 	want_copies = (detect_rename == DIFF_DETECT_COPY);
 	if (!minimum_score)
 		minimum_score = DEFAULT_RENAME_SCORE;
@@ -841,6 +963,11 @@ void diffcore_rename(struct diff_options *options)
 	trace2_region_leave("diff", "inexact renames", options->repo);
 
  cleanup:
+	/*
+	 * Now that renames have been computed, compute dir_rename_count */
+	if (dirs_removed && dir_rename_count)
+		compute_dir_rename_counts(dir_rename_count, dirs_removed);
+
 	/* At this point, we have found some renames and copies and they
 	 * are recorded in rename_dst.  The original list is still in *q.
 	 */
@@ -923,3 +1050,8 @@ void diffcore_rename(struct diff_options *options)
 	trace2_region_leave("diff", "write back to queue", options->repo);
 	return;
 }
+
+void diffcore_rename(struct diff_options *options)
+{
+	diffcore_rename_extended(options, NULL, NULL);
+}
diff --git a/diffcore.h b/diffcore.h
index d2a63c5c71f4..db55d3853071 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -8,6 +8,8 @@
 
 struct diff_options;
 struct repository;
+struct strmap;
+struct strset;
 struct userdiff_driver;
 
 /* This header file is internal between diff.c and its diff transformers
@@ -161,6 +163,9 @@ void diff_q(struct diff_queue_struct *, struct diff_filepair *);
 
 void diffcore_break(struct repository *, int);
 void diffcore_rename(struct diff_options *);
+void diffcore_rename_extended(struct diff_options *options,
+			      struct strset *dirs_removed,
+			      struct strmap *dir_rename_count);
 void diffcore_merge_broken(void);
 void diffcore_pickaxe(struct diff_options *);
 void diffcore_order(const char *orderfile);
diff --git a/merge-ort.c b/merge-ort.c
index 603d30c52170..c4467e073b45 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -1302,131 +1302,6 @@ static char *handle_path_level_conflicts(struct merge_options *opt,
 	return new_path;
 }
 
-static void dirname_munge(char *filename)
-{
-	char *slash = strrchr(filename, '/');
-	if (!slash)
-		slash = filename;
-	*slash = '\0';
-}
-
-static void increment_count(struct strmap *dir_rename_count,
-			    char *old_dir,
-			    char *new_dir)
-{
-	struct strintmap *counts;
-	struct strmap_entry *e;
-
-	/* Get the {new_dirs -> counts} mapping using old_dir */
-	e = strmap_get_entry(dir_rename_count, old_dir);
-	if (e) {
-		counts = e->value;
-	} else {
-		counts = xmalloc(sizeof(*counts));
-		strintmap_init_with_options(counts, 0, NULL, 1);
-		strmap_put(dir_rename_count, old_dir, counts);
-	}
-
-	/* Increment the count for new_dir */
-	strintmap_incr(counts, new_dir, 1);
-}
-
-static void update_dir_rename_counts(struct strmap *dir_rename_count,
-				     struct strset *dirs_removed,
-				     const char *oldname,
-				     const char *newname)
-{
-	char *old_dir = xstrdup(oldname);
-	char *new_dir = xstrdup(newname);
-	char new_dir_first_char = new_dir[0];
-	int first_time_in_loop = 1;
-
-	while (1) {
-		dirname_munge(old_dir);
-		dirname_munge(new_dir);
-
-		/*
-		 * When renaming
-		 *   "a/b/c/d/e/foo.c" -> "a/b/some/thing/else/e/foo.c"
-		 * then this suggests that both
-		 *   a/b/c/d/e/ => a/b/some/thing/else/e/
-		 *   a/b/c/d/   => a/b/some/thing/else/
-		 * so we want to increment counters for both.  We do NOT,
-		 * however, also want to suggest that there was the following
-		 * rename:
-		 *   a/b/c/ => a/b/some/thing/
-		 * so we need to quit at that point.
-		 *
-		 * Note the when first_time_in_loop, we only strip off the
-		 * basename, and we don't care if that's different.
-		 */
-		if (!first_time_in_loop) {
-			char *old_sub_dir = strchr(old_dir, '\0')+1;
-			char *new_sub_dir = strchr(new_dir, '\0')+1;
-			if (!*new_dir) {
-				/*
-				 * Special case when renaming to root directory,
-				 * i.e. when new_dir == "".  In this case, we had
-				 * something like
-				 *    a/b/subdir => subdir
-				 * and so dirname_munge() sets things up so that
-				 *    old_dir = "a/b\0subdir\0"
-				 *    new_dir = "\0ubdir\0"
-				 * We didn't have a '/' to overwrite a '\0' onto
-				 * in new_dir, so we have to compare differently.
-				 */
-				if (new_dir_first_char != old_sub_dir[0] ||
-				    strcmp(old_sub_dir+1, new_sub_dir))
-					break;
-			} else {
-				if (strcmp(old_sub_dir, new_sub_dir))
-					break;
-			}
-		}
-
-		if (strset_contains(dirs_removed, old_dir))
-			increment_count(dir_rename_count, old_dir, new_dir);
-		else
-			break;
-
-		/* If we hit toplevel directory ("") for old or new dir, quit */
-		if (!*old_dir || !*new_dir)
-			break;
-
-		first_time_in_loop = 0;
-	}
-
-	/* Free resources we don't need anymore */
-	free(old_dir);
-	free(new_dir);
-}
-
-static void compute_rename_counts(struct diff_queue_struct *pairs,
-				  struct strmap *dir_rename_count,
-				  struct strset *dirs_removed)
-{
-	int i;
-
-	for (i = 0; i < pairs->nr; ++i) {
-		struct diff_filepair *pair = pairs->queue[i];
-
-		/* File not part of directory rename if it wasn't renamed */
-		if (pair->status != 'R')
-			continue;
-
-		/*
-		 * Make dir_rename_count contain a map of a map:
-		 *   old_directory -> {new_directory -> count}
-		 * In other words, for every pair look at the directories for
-		 * the old filename and the new filename and count how many
-		 * times that pairing occurs.
-		 */
-		update_dir_rename_counts(dir_rename_count, dirs_removed,
-					 pair->one->path,
-					 pair->two->path);
-	}
-}
-
 static void get_provisional_directory_renames(struct merge_options *opt,
 					      unsigned side,
 					      int *clean)
@@ -1435,9 +1310,6 @@ static void get_provisional_directory_renames(struct merge_options *opt,
 	struct strmap_entry *entry;
 	struct rename_info *renames = &opt->priv->renames;
 
-	compute_rename_counts(&renames->pairs[side],
-			      &renames->dir_rename_count[side],
-			      &renames->dirs_removed[side]);
 	/*
 	 * Collapse
 	 *    dir_rename_count: old_directory -> {new_directory -> count}
@@ -2162,7 +2034,9 @@ static void detect_regular_renames(struct merge_options *opt,
 
 	diff_queued_diff = renames->pairs[side_index];
 	trace2_region_enter("diff", "diffcore_rename", opt->repo);
-	diffcore_rename(&diff_opts);
+	diffcore_rename_extended(&diff_opts,
+				 &renames->dirs_removed[side_index],
+				 &renames->dir_rename_count[side_index]);
 	trace2_region_leave("diff", "diffcore_rename", opt->repo);
 	resolve_diffpair_statuses(&diff_queued_diff);
 
-- 
gitgitgadget



* [PATCH 02/10] diffcore-rename: add functions for clearing dir_rename_count
  2021-02-14  7:58 [PATCH 00/10] Optimization batch 8: use file basenames even more Elijah Newren via GitGitGadget
  2021-02-14  7:58 ` [PATCH 01/10] Move computation of dir_rename_count from merge-ort to diffcore-rename Elijah Newren via GitGitGadget
@ 2021-02-14  7:58 ` Elijah Newren via GitGitGadget
  2021-02-14  7:58 ` [PATCH 03/10] diffcore-rename: move dir_rename_counts into a dir_rename_info struct Elijah Newren via GitGitGadget
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-14  7:58 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

As we adjust the usage of dir_rename_count, we want functions for
clearing it out, either fully or partially.  Add such functions.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 19 +++++++++++++++++++
 diffcore.h        |  2 ++
 merge-ort.c       | 12 +++---------
 3 files changed, 24 insertions(+), 9 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 33cfc5848611..614a8d63012d 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -486,6 +486,25 @@ static void compute_dir_rename_counts(struct strmap *dir_rename_count,
 	}
 }
 
+void partial_clear_dir_rename_count(struct strmap *dir_rename_count)
+{
+	struct hashmap_iter iter;
+	struct strmap_entry *entry;
+
+	strmap_for_each_entry(dir_rename_count, &iter, entry) {
+		struct strintmap *counts = entry->value;
+		strintmap_clear(counts);
+	}
+	strmap_partial_clear(dir_rename_count, 1);
+}
+
+MAYBE_UNUSED
+static void clear_dir_rename_count(struct strmap *dir_rename_count)
+{
+	partial_clear_dir_rename_count(dir_rename_count);
+	strmap_clear(dir_rename_count, 1);
+}
+
 static const char *get_basename(const char *filename)
 {
 	/*
diff --git a/diffcore.h b/diffcore.h
index db55d3853071..c6ba64abd198 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -161,6 +161,8 @@ struct diff_filepair *diff_queue(struct diff_queue_struct *,
 				 struct diff_filespec *);
 void diff_q(struct diff_queue_struct *, struct diff_filepair *);
 
+void partial_clear_dir_rename_count(struct strmap *dir_rename_count);
+
 void diffcore_break(struct repository *, int);
 void diffcore_rename(struct diff_options *);
 void diffcore_rename_extended(struct diff_options *options,
diff --git a/merge-ort.c b/merge-ort.c
index c4467e073b45..467404cc0a35 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -351,17 +351,11 @@ static void clear_or_reinit_internal_opts(struct merge_options_internal *opti,
 
 	/* Free memory used by various renames maps */
 	for (i = MERGE_SIDE1; i <= MERGE_SIDE2; ++i) {
-		struct hashmap_iter iter;
-		struct strmap_entry *entry;
-
 		strset_func(&renames->dirs_removed[i]);
 
-		strmap_for_each_entry(&renames->dir_rename_count[i],
-				      &iter, entry) {
-			struct strintmap *counts = entry->value;
-			strintmap_clear(counts);
-		}
-		strmap_func(&renames->dir_rename_count[i], 1);
+		partial_clear_dir_rename_count(&renames->dir_rename_count[i]);
+		if (!reinitialize)
+			strmap_clear(&renames->dir_rename_count[i], 1);
 
 		strmap_func(&renames->dir_renames[i], 0);
 	}
-- 
gitgitgadget



* [PATCH 03/10] diffcore-rename: move dir_rename_counts into a dir_rename_info struct
  2021-02-14  7:58 [PATCH 00/10] Optimization batch 8: use file basenames even more Elijah Newren via GitGitGadget
  2021-02-14  7:58 ` [PATCH 01/10] Move computation of dir_rename_count from merge-ort to diffcore-rename Elijah Newren via GitGitGadget
  2021-02-14  7:58 ` [PATCH 02/10] diffcore-rename: add functions for clearing dir_rename_count Elijah Newren via GitGitGadget
@ 2021-02-14  7:58 ` Elijah Newren via GitGitGadget
  2021-02-14  7:58 ` [PATCH 04/10] diffcore-rename: extend cleanup_dir_rename_info() Elijah Newren via GitGitGadget
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-14  7:58 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

This is a purely cosmetic change for now, but we will be adding
additional information to the struct and changing where and how it is
set up and used in subsequent patches.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 39 ++++++++++++++++++++++++++-------------
 1 file changed, 26 insertions(+), 13 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 614a8d63012d..7759c9a3a2ed 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -367,6 +367,11 @@ static int find_exact_renames(struct diff_options *options)
 	return renames;
 }
 
+struct dir_rename_info {
+	struct strmap *dir_rename_count;
+	unsigned setup;
+};
+
 static void dirname_munge(char *filename)
 {
 	char *slash = strrchr(filename, '/');
@@ -375,7 +380,7 @@ static void dirname_munge(char *filename)
 	*slash = '\0';
 }
 
-static void increment_count(struct strmap *dir_rename_count,
+static void increment_count(struct dir_rename_info *info,
 			    char *old_dir,
 			    char *new_dir)
 {
@@ -383,20 +388,20 @@ static void increment_count(struct strmap *dir_rename_count,
 	struct strmap_entry *e;
 
 	/* Get the {new_dirs -> counts} mapping using old_dir */
-	e = strmap_get_entry(dir_rename_count, old_dir);
+	e = strmap_get_entry(info->dir_rename_count, old_dir);
 	if (e) {
 		counts = e->value;
 	} else {
 		counts = xmalloc(sizeof(*counts));
 		strintmap_init_with_options(counts, 0, NULL, 1);
-		strmap_put(dir_rename_count, old_dir, counts);
+		strmap_put(info->dir_rename_count, old_dir, counts);
 	}
 
 	/* Increment the count for new_dir */
 	strintmap_incr(counts, new_dir, 1);
 }
 
-static void update_dir_rename_counts(struct strmap *dir_rename_count,
+static void update_dir_rename_counts(struct dir_rename_info *info,
 				     struct strset *dirs_removed,
 				     const char *oldname,
 				     const char *newname)
@@ -450,7 +455,7 @@ static void update_dir_rename_counts(struct strmap *dir_rename_count,
 		}
 
 		if (strset_contains(dirs_removed, old_dir))
-			increment_count(dir_rename_count, old_dir, new_dir);
+			increment_count(info, old_dir, new_dir);
 		else
 			break;
 
@@ -466,12 +471,15 @@ static void update_dir_rename_counts(struct strmap *dir_rename_count,
 	free(new_dir);
 }
 
-static void compute_dir_rename_counts(struct strmap *dir_rename_count,
-				      struct strset *dirs_removed)
+static void compute_dir_rename_counts(struct dir_rename_info *info,
+				      struct strset *dirs_removed,
+				      struct strmap *dir_rename_count)
 {
 	int i;
 
-	/* Set up dir_rename_count */
+	info->setup = 1;
+	info->dir_rename_count = dir_rename_count;
+
 	for (i = 0; i < rename_dst_nr; ++i) {
 		/*
 		 * Make dir_rename_count contain a map of a map:
@@ -480,7 +488,7 @@ static void compute_dir_rename_counts(struct strmap *dir_rename_count,
 		 * the old filename and the new filename and count how many
 		 * times that pairing occurs.
 		 */
-		update_dir_rename_counts(dir_rename_count, dirs_removed,
+		update_dir_rename_counts(info, dirs_removed,
 					 rename_dst[i].p->one->path,
 					 rename_dst[i].p->two->path);
 	}
@@ -499,10 +507,13 @@ void partial_clear_dir_rename_count(struct strmap *dir_rename_count)
 }
 
 MAYBE_UNUSED
-static void clear_dir_rename_count(struct strmap *dir_rename_count)
+static void cleanup_dir_rename_info(struct dir_rename_info *info)
 {
-	partial_clear_dir_rename_count(dir_rename_count);
-	strmap_clear(dir_rename_count, 1);
+	if (!info->setup)
+		return;
+
+	partial_clear_dir_rename_count(info->dir_rename_count);
+	strmap_clear(info->dir_rename_count, 1);
 }
 
 static const char *get_basename(const char *filename)
@@ -791,8 +802,10 @@ void diffcore_rename_extended(struct diff_options *options,
 	int num_destinations, dst_cnt;
 	int num_sources, want_copies;
 	struct progress *progress = NULL;
+	struct dir_rename_info info;
 
 	trace2_region_enter("diff", "setup", options->repo);
+	info.setup = 0;
 	assert(!dir_rename_count || strmap_empty(dir_rename_count));
 	want_copies = (detect_rename == DIFF_DETECT_COPY);
 	if (!minimum_score)
@@ -985,7 +998,7 @@ void diffcore_rename_extended(struct diff_options *options,
 	/*
 	 * Now that renames have been computed, compute dir_rename_count */
 	if (dirs_removed && dir_rename_count)
-		compute_dir_rename_counts(dir_rename_count, dirs_removed);
+		compute_dir_rename_counts(&info, dirs_removed, dir_rename_count);
 
 	/* At this point, we have found some renames and copies and they
 	 * are recorded in rename_dst.  The original list is still in *q.
-- 
gitgitgadget



* [PATCH 04/10] diffcore-rename: extend cleanup_dir_rename_info()
  2021-02-14  7:58 [PATCH 00/10] Optimization batch 8: use file basenames even more Elijah Newren via GitGitGadget
                   ` (2 preceding siblings ...)
  2021-02-14  7:58 ` [PATCH 03/10] diffcore-rename: move dir_rename_counts into a dir_rename_info struct Elijah Newren via GitGitGadget
@ 2021-02-14  7:58 ` Elijah Newren via GitGitGadget
  2021-02-14  7:58 ` [PATCH 05/10] diffcore-rename: compute dir_rename_counts in stages Elijah Newren via GitGitGadget
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-14  7:58 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

When diffcore_rename_extended() is passed a NULL dir_rename_count, we
will still want to create a temporary one for use by
find_basename_matches(), but have it fully deallocated before
diffcore_rename_extended() returns.  When diffcore_rename_extended() is
passed a non-NULL dir_rename_count, on the other hand, we want to fill
that strmap with appropriate values and return it.  For our interim
purposes, though, we may also add entries corresponding to directories
that cannot have been renamed because they still exist on both sides.

Extend cleanup_dir_rename_info() to handle these two different cases,
cleaning up the relevant bits of information for each case.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 38 +++++++++++++++++++++++++++++++++++---
 1 file changed, 35 insertions(+), 3 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 7759c9a3a2ed..aa21d4e7175c 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -507,13 +507,45 @@ void partial_clear_dir_rename_count(struct strmap *dir_rename_count)
 }
 
 MAYBE_UNUSED
-static void cleanup_dir_rename_info(struct dir_rename_info *info)
+static void cleanup_dir_rename_info(struct dir_rename_info *info,
+				    struct strset *dirs_removed,
+				    int keep_dir_rename_count)
 {
+	struct hashmap_iter iter;
+	struct strmap_entry *entry;
+
 	if (!info->setup)
 		return;
 
-	partial_clear_dir_rename_count(info->dir_rename_count);
-	strmap_clear(info->dir_rename_count, 1);
+	if (!keep_dir_rename_count) {
+		partial_clear_dir_rename_count(info->dir_rename_count);
+		strmap_clear(info->dir_rename_count, 1);
+		FREE_AND_NULL(info->dir_rename_count);
+	} else {
+		/*
+		 * Although dir_rename_count was passed in
+		 * diffcore_rename_extended() and we want to keep it around and
+		 * return it to that caller, we first want to remove any data
+		 * associated with directories that weren't renamed.
+		 */
+		struct string_list to_remove = STRING_LIST_INIT_NODUP;
+		int i;
+
+		strmap_for_each_entry(info->dir_rename_count, &iter, entry) {
+			const char *source_dir = entry->key;
+			struct strintmap *counts = entry->value;
+
+			if (!strset_contains(dirs_removed, source_dir)) {
+				string_list_append(&to_remove, source_dir);
+				strintmap_clear(counts);
+				continue;
+			}
+		}
+		for (i=0; i<to_remove.nr; ++i)
+			strmap_remove(info->dir_rename_count,
+				      to_remove.items[i].string, 1);
+		string_list_clear(&to_remove, 0);
+	}
 }
 
 static const char *get_basename(const char *filename)
-- 
gitgitgadget



* [PATCH 05/10] diffcore-rename: compute dir_rename_counts in stages
  2021-02-14  7:58 [PATCH 00/10] Optimization batch 8: use file basenames even more Elijah Newren via GitGitGadget
                   ` (3 preceding siblings ...)
  2021-02-14  7:58 ` [PATCH 04/10] diffcore-rename: extend cleanup_dir_rename_info() Elijah Newren via GitGitGadget
@ 2021-02-14  7:58 ` Elijah Newren via GitGitGadget
  2021-02-14  7:58 ` [PATCH 06/10] diffcore-rename: add a mapping of destination names to their indices Elijah Newren via GitGitGadget
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-14  7:58 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

We want to first compute dir_rename_counts based on exact renames alone,
as that can provide useful information to find_basename_matches().  That
gives us an incomplete result, which we then augment as basename and
inexact rename matches are found.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 76 ++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 62 insertions(+), 14 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index aa21d4e7175c..489e9cb0871e 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -411,6 +411,28 @@ static void update_dir_rename_counts(struct dir_rename_info *info,
 	char new_dir_first_char = new_dir[0];
 	int first_time_in_loop = 1;
 
+	if (!info->setup)
+		/*
+		 * info->setup is 0 here in two cases: (1) all auxiliary
+		 * vars (like dirs_removed) were NULL so
+		 * initialize_dir_rename_info() returned early, or (2)
+		 * either break detection or copy detection is active so
+		 * that we never called initialize_dir_rename_info().  In
+		 * the former case, we don't have enough info to know if
+		 * directories were renamed (because dirs_removed lets us
+		 * know about a necessary prerequisite, namely if they were
+		 * removed), and in the latter, we don't care about
+		 * directory renames or find_basename_matches.
+		 *
+		 * This matters because both basename and inexact matching
+		 * will also call update_dir_rename_counts().  In either of
+		 * the above two cases info->dir_rename_count will not
+		 * have been properly initialized which prevents us from
+		 * updating it, but in these two cases we don't care about
+		 * dir_rename_counts anyway, so we can just exit early.
+		 */
+		return;
+
 	while (1) {
 		dirname_munge(old_dir);
 		dirname_munge(new_dir);
@@ -471,14 +493,22 @@ static void update_dir_rename_counts(struct dir_rename_info *info,
 	free(new_dir);
 }
 
-static void compute_dir_rename_counts(struct dir_rename_info *info,
-				      struct strset *dirs_removed,
-				      struct strmap *dir_rename_count)
+static void initialize_dir_rename_info(struct dir_rename_info *info,
+				       struct strset *dirs_removed,
+				       struct strmap *dir_rename_count)
 {
 	int i;
 
+	info->setup = 0;
+	if (!dirs_removed)
+		return;
 	info->setup = 1;
+
 	info->dir_rename_count = dir_rename_count;
+	if (!info->dir_rename_count) {
+		info->dir_rename_count = xmalloc(sizeof(*dir_rename_count));
+		strmap_init(info->dir_rename_count);
+	}
 
 	for (i = 0; i < rename_dst_nr; ++i) {
 		/*
@@ -506,7 +536,6 @@ void partial_clear_dir_rename_count(struct strmap *dir_rename_count)
 	strmap_partial_clear(dir_rename_count, 1);
 }
 
-MAYBE_UNUSED
 static void cleanup_dir_rename_info(struct dir_rename_info *info,
 				    struct strset *dirs_removed,
 				    int keep_dir_rename_count)
@@ -561,7 +590,9 @@ static const char *get_basename(const char *filename)
 }
 
 static int find_basename_matches(struct diff_options *options,
-				 int minimum_score)
+				 int minimum_score,
+				 struct dir_rename_info *info,
+				 struct strset *dirs_removed)
 {
 	/*
 	 * When I checked in early 2020, over 76% of file renames in linux
@@ -669,6 +700,8 @@ static int find_basename_matches(struct diff_options *options,
 				continue;
 			record_rename_pair(dst_index, src_index, score);
 			renames++;
+			update_dir_rename_counts(info, dirs_removed,
+						 one->path, two->path);
 
 			/*
 			 * Found a rename so don't need text anymore; if we
@@ -752,7 +785,12 @@ static int too_many_rename_candidates(int num_destinations, int num_sources,
 	return 1;
 }
 
-static int find_renames(struct diff_score *mx, int dst_cnt, int minimum_score, int copies)
+static int find_renames(struct diff_score *mx,
+			int dst_cnt,
+			int minimum_score,
+			int copies,
+			struct dir_rename_info *info,
+			struct strset *dirs_removed)
 {
 	int count = 0, i;
 
@@ -769,6 +807,9 @@ static int find_renames(struct diff_score *mx, int dst_cnt, int minimum_score, i
 			continue;
 		record_rename_pair(mx[i].dst, mx[i].src, mx[i].score);
 		count++;
+		update_dir_rename_counts(info, dirs_removed,
+					 rename_src[mx[i].src].p->one->path,
+					 rename_dst[mx[i].dst].p->two->path);
 	}
 	return count;
 }
@@ -840,6 +881,8 @@ void diffcore_rename_extended(struct diff_options *options,
 	info.setup = 0;
 	assert(!dir_rename_count || strmap_empty(dir_rename_count));
 	want_copies = (detect_rename == DIFF_DETECT_COPY);
+	if (dirs_removed && (break_idx || want_copies))
+		BUG("dirs_removed incompatible with break/copy detection");
 	if (!minimum_score)
 		minimum_score = DEFAULT_RENAME_SCORE;
 
@@ -931,10 +974,17 @@ void diffcore_rename_extended(struct diff_options *options,
 		remove_unneeded_paths_from_src(want_copies);
 		trace2_region_leave("diff", "cull after exact", options->repo);
 
+		/* Preparation for basename-driven matching. */
+		trace2_region_enter("diff", "dir rename setup", options->repo);
+		initialize_dir_rename_info(&info,
+					   dirs_removed, dir_rename_count);
+		trace2_region_leave("diff", "dir rename setup", options->repo);
+
 		/* Utilize file basenames to quickly find renames. */
 		trace2_region_enter("diff", "basename matches", options->repo);
 		rename_count += find_basename_matches(options,
-						      min_basename_score);
+						      min_basename_score,
+						      &info, dirs_removed);
 		trace2_region_leave("diff", "basename matches", options->repo);
 
 		/*
@@ -1020,18 +1070,15 @@ void diffcore_rename_extended(struct diff_options *options,
 	/* cost matrix sorted by most to least similar pair */
 	STABLE_QSORT(mx, dst_cnt * NUM_CANDIDATE_PER_DST, score_compare);
 
-	rename_count += find_renames(mx, dst_cnt, minimum_score, 0);
+	rename_count += find_renames(mx, dst_cnt, minimum_score, 0,
+				     &info, dirs_removed);
 	if (want_copies)
-		rename_count += find_renames(mx, dst_cnt, minimum_score, 1);
+		rename_count += find_renames(mx, dst_cnt, minimum_score, 1,
+					     &info, dirs_removed);
 	free(mx);
 	trace2_region_leave("diff", "inexact renames", options->repo);
 
  cleanup:
-	/*
-	 * Now that renames have been computed, compute dir_rename_count */
-	if (dirs_removed && dir_rename_count)
-		compute_dir_rename_counts(&info, dirs_removed, dir_rename_count);
-
 	/* At this point, we have found some renames and copies and they
 	 * are recorded in rename_dst.  The original list is still in *q.
 	 */
@@ -1103,6 +1150,7 @@ void diffcore_rename_extended(struct diff_options *options,
 		if (rename_dst[i].filespec_to_free)
 			free_filespec(rename_dst[i].filespec_to_free);
 
+	cleanup_dir_rename_info(&info, dirs_removed, dir_rename_count != NULL);
 	FREE_AND_NULL(rename_dst);
 	rename_dst_nr = rename_dst_alloc = 0;
 	FREE_AND_NULL(rename_src);
-- 
gitgitgadget



* [PATCH 06/10] diffcore-rename: add a mapping of destination names to their indices
  2021-02-14  7:58 [PATCH 00/10] Optimization batch 8: use file basenames even more Elijah Newren via GitGitGadget
                   ` (4 preceding siblings ...)
  2021-02-14  7:58 ` [PATCH 05/10] diffcore-rename: compute dir_rename_counts in stages Elijah Newren via GitGitGadget
@ 2021-02-14  7:58 ` Elijah Newren via GitGitGadget
  2021-02-14  7:59 ` [PATCH 07/10] diffcore-rename: add a dir_rename_guess field to dir_rename_info Elijah Newren via GitGitGadget
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-14  7:58 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

Add an idx_map member to struct dir_rename_info, which tracks a mapping
of the full filename to the index within rename_dst where that filename
is found.  We will later use this for quickly finding an array entry in
rename_dst given the pathname.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 489e9cb0871e..db569e4a0b0a 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -368,6 +368,7 @@ static int find_exact_renames(struct diff_options *options)
 }
 
 struct dir_rename_info {
+	struct strintmap idx_map;
 	struct strmap *dir_rename_count;
 	unsigned setup;
 };
@@ -509,10 +510,26 @@ static void initialize_dir_rename_info(struct dir_rename_info *info,
 		info->dir_rename_count = xmalloc(sizeof(*dir_rename_count));
 		strmap_init(info->dir_rename_count);
 	}
+	strintmap_init_with_options(&info->idx_map, -1, NULL, 0);
 
+	/*
+	 * Loop setting up both info->idx_map, and doing setup of
+	 * info->dir_rename_count.
+	 */
 	for (i = 0; i < rename_dst_nr; ++i) {
 		/*
-		 * Make dir_rename_count contain a map of a map:
+		 * For non-renamed files, make idx_map contain mapping of
+		 *   filename -> index (index within rename_dst, that is)
+		 */
+		if (!rename_dst[i].is_rename) {
+			char *filename = rename_dst[i].p->two->path;
+			strintmap_set(&info->idx_map, filename, i);
+			continue;
+		}
+
+		/*
+		 * For everything else (i.e. renamed files), make
+		 * dir_rename_count contain a map of a map:
 		 *   old_directory -> {new_directory -> count}
 		 * In other words, for every pair look at the directories for
 		 * the old filename and the new filename and count how many
@@ -546,6 +563,9 @@ static void cleanup_dir_rename_info(struct dir_rename_info *info,
 	if (!info->setup)
 		return;
 
+	/* idx_map */
+	strintmap_clear(&info->idx_map);
+
 	if (!keep_dir_rename_count) {
 		partial_clear_dir_rename_count(info->dir_rename_count);
 		strmap_clear(info->dir_rename_count, 1);
-- 
gitgitgadget



* [PATCH 07/10] diffcore-rename: add a dir_rename_guess field to dir_rename_info
  2021-02-14  7:58 [PATCH 00/10] Optimization batch 8: use file basenames even more Elijah Newren via GitGitGadget
                   ` (5 preceding siblings ...)
  2021-02-14  7:58 ` [PATCH 06/10] diffcore-rename: add a mapping of destination names to their indices Elijah Newren via GitGitGadget
@ 2021-02-14  7:59 ` Elijah Newren via GitGitGadget
  2021-02-14  7:59 ` [PATCH 08/10] diffcore-rename: add a new idx_possible_rename function Elijah Newren via GitGitGadget
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-14  7:59 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

dir_rename_counts is a mapping of a mapping; in particular, it has
   old_dir => { new_dir => count }
We want a simple mapping of
   old_dir => new_dir
based on which new_dir had the highest count for a given old_dir.
Introduce dir_rename_guess for this purpose.
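
The collapse from counts to a single guess is an argmax over the inner
map.  A minimal sketch, assuming an array in place of the strintmap
(toy_count and toy_best_dir() are hypothetical names, not part of the
patch):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one {new_dir => count} entry. */
struct toy_count {
	const char *new_dir;
	int count;
};

/*
 * Return the new_dir with the highest count, or NULL if there are no
 * entries.  The comparison is a strict '>', so on a tie the entry seen
 * first wins.
 */
static const char *toy_best_dir(const struct toy_count *counts, size_t nr)
{
	const char *best = NULL;
	int highest = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		if (counts[i].count > highest) {
			highest = counts[i].count;
			best = counts[i].new_dir;
		}
	}
	return best;
}
```

In this sketch a tie goes to the first array element; with a hash map the
winner of a tie would instead depend on iteration order.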

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index db569e4a0b0a..d24f104aa81c 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -369,6 +369,7 @@ static int find_exact_renames(struct diff_options *options)
 
 struct dir_rename_info {
 	struct strintmap idx_map;
+	struct strmap dir_rename_guess;
 	struct strmap *dir_rename_count;
 	unsigned setup;
 };
@@ -381,6 +382,24 @@ static void dirname_munge(char *filename)
 	*slash = '\0';
 }
 
+static const char *get_highest_rename_path(struct strintmap *counts)
+{
+	int highest_count = 0;
+	const char *highest_destination_dir = NULL;
+	struct hashmap_iter iter;
+	struct strmap_entry *entry;
+
+	strintmap_for_each_entry(counts, &iter, entry) {
+		const char *destination_dir = entry->key;
+		intptr_t count = (intptr_t)entry->value;
+		if (count > highest_count) {
+			highest_count = count;
+			highest_destination_dir = destination_dir;
+		}
+	}
+	return highest_destination_dir;
+}
+
 static void increment_count(struct dir_rename_info *info,
 			    char *old_dir,
 			    char *new_dir)
@@ -498,6 +517,8 @@ static void initialize_dir_rename_info(struct dir_rename_info *info,
 				       struct strset *dirs_removed,
 				       struct strmap *dir_rename_count)
 {
+	struct hashmap_iter iter;
+	struct strmap_entry *entry;
 	int i;
 
 	info->setup = 0;
@@ -511,6 +532,7 @@ static void initialize_dir_rename_info(struct dir_rename_info *info,
 		strmap_init(info->dir_rename_count);
 	}
 	strintmap_init_with_options(&info->idx_map, -1, NULL, 0);
+	strmap_init_with_options(&info->dir_rename_guess, NULL, 0);
 
 	/*
 	 * Loop setting up both info->idx_map, and doing setup of
@@ -539,6 +561,23 @@ static void initialize_dir_rename_info(struct dir_rename_info *info,
 					 rename_dst[i].p->one->path,
 					 rename_dst[i].p->two->path);
 	}
+
+	/*
+	 * Now we collapse
+	 *    dir_rename_count: old_directory -> {new_directory -> count}
+	 * down to
+	 *    dir_rename_guess: old_directory -> best_new_directory
+	 * where best_new_directory is the one with the highest count.
+	 */
+	strmap_for_each_entry(info->dir_rename_count, &iter, entry) {
+		/* entry->key is source_dir */
+		struct strintmap *counts = entry->value;
+		char *best_newdir;
+
+		best_newdir = xstrdup(get_highest_rename_path(counts));
+		strmap_put(&info->dir_rename_guess, entry->key,
+			   best_newdir);
+	}
 }
 
 void partial_clear_dir_rename_count(struct strmap *dir_rename_count)
@@ -566,6 +605,9 @@ static void cleanup_dir_rename_info(struct dir_rename_info *info,
 	/* idx_map */
 	strintmap_clear(&info->idx_map);
 
+	/* dir_rename_guess */
+	strmap_clear(&info->dir_rename_guess, 1);
+
 	if (!keep_dir_rename_count) {
 		partial_clear_dir_rename_count(info->dir_rename_count);
 		strmap_clear(info->dir_rename_count, 1);
-- 
gitgitgadget



* [PATCH 08/10] diffcore-rename: add a new idx_possible_rename function
  2021-02-14  7:58 [PATCH 00/10] Optimization batch 8: use file basenames even more Elijah Newren via GitGitGadget
                   ` (6 preceding siblings ...)
  2021-02-14  7:59 ` [PATCH 07/10] diffcore-rename: add a dir_rename_guess field to dir_rename_info Elijah Newren via GitGitGadget
@ 2021-02-14  7:59 ` Elijah Newren via GitGitGadget
  2021-02-14  7:59 ` [PATCH 09/10] diffcore-rename: limit dir_rename_counts computation to relevant dirs Elijah Newren via GitGitGadget
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-14  7:59 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

find_basename_matches() is great when both the remaining set of possible
rename sources and the remaining set of possible rename destinations
have exactly one file each with a given basename.  It allows us to match
up files that have been moved to different directories without changing
filenames.

When basenames are not unique, though, we want to be able to guess which
directories the source files have been moved to.  Since this is the job
of directory rename detection, we employ it.  However, since it is a
directory rename detection idea, we also limit it to cases where we know
there could have been a directory rename, i.e. where the source
directory has been removed.  This has to be signalled by dirs_removed
being non-NULL and containing an entry for the relevant directory.
Since merge-ort.c is the only caller that currently does so, this
optimization is only effective for merge-ort right now.  In the future,
this condition could be reconsidered or we could modify other callers to
pass the necessary strset.

Anyway, that's a lot of background so that we can actually describe the
new function.  Add an idx_possible_rename() function which combines the
recently added dir_rename_guess and idx_map fields to provide the index
within rename_dst of a potential match for a given file.

Future commits will add checks after calling this function to compare
the resulting 'likely rename' candidates to see if the two files meet
the elevated min_basename_score threshold for marking them as actual
renames.
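
To make the lookup concrete, here is a minimal sketch of the same idea
using plain arrays instead of strmap/strintmap.  toy_guess and
toy_idx_possible_rename() are hypothetical names, and the fixed-size
buffer is an assumption made to keep the example free of allocation:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for one dir_rename_guess entry. */
struct toy_guess {
	const char *old_dir;
	const char *new_dir;
};

/*
 * Return the index in dests[] of the path obtained by moving filename
 * from its directory to the guessed new directory, or -1 if there is
 * no guess for that directory or no such destination.
 */
static int toy_idx_possible_rename(const char *filename,
				   const struct toy_guess *guesses,
				   size_t nr_guesses,
				   const char *const *dests,
				   size_t nr_dests)
{
	const char *slash = strrchr(filename, '/');
	size_t dir_len = slash ? (size_t)(slash - filename) : 0;
	const char *base = slash ? slash + 1 : filename;
	char new_path[4096];
	size_t i;
	int len;

	/* Find the guess whose old_dir equals filename's directory. */
	for (i = 0; i < nr_guesses; i++)
		if (strlen(guesses[i].old_dir) == dir_len &&
		    !strncmp(guesses[i].old_dir, filename, dir_len))
			break;
	if (i == nr_guesses)
		return -1;	/* no guess for this directory */

	len = snprintf(new_path, sizeof(new_path), "%s/%s",
		       guesses[i].new_dir, base);
	if (len < 0 || (size_t)len >= sizeof(new_path))
		return -1;	/* path too long for this toy buffer */

	/* Look the candidate path up among the destinations. */
	for (i = 0; i < nr_dests; i++)
		if (!strcmp(dests[i], new_path))
			return (int)i;
	return -1;
}
```

Given a guess of old/dir => new/dir, the source old/dir/Makefile maps to
the candidate new/dir/Makefile, and the function returns that path's
index among the destinations if it is present, or -1 otherwise.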

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 81 insertions(+)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index d24f104aa81c..1e4a56adde2c 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -374,6 +374,12 @@ struct dir_rename_info {
 	unsigned setup;
 };
 
+static char *get_dirname(const char *filename)
+{
+	char *slash = strrchr(filename, '/');
+	return slash ? xstrndup(filename, slash-filename) : xstrdup("");
+}
+
 static void dirname_munge(char *filename)
 {
 	char *slash = strrchr(filename, '/');
@@ -651,6 +657,81 @@ static const char *get_basename(const char *filename)
 	return base ? base + 1 : filename;
 }
 
+MAYBE_UNUSED
+static int idx_possible_rename(char *filename, struct dir_rename_info *info)
+{
+	/*
+	 * Our comparison of files with the same basename (see
+	 * find_basename_matches() below), is only helpful when after exact
+	 * rename detection we have exactly one file with a given basename
+	 * among the rename sources and also only exactly one file with
+	 * that basename among the rename destinations.  When we have
+	 * multiple files with the same basename in either set, we do not
+	 * know which to compare against.  However, there are some
+	 * filenames that occur in large numbers (particularly
+	 * build-related filenames such as 'Makefile', '.gitignore', or
+	 * 'build.gradle' that potentially exist within every single
+	 * subdirectory), and for performance we want to be able to quickly
+	 * find renames for these files too.
+	 *
+	 * The reason basename comparisons are a useful heuristic is that it
+	 * is common for people to move files across directories while keeping
+	 * their filename the same.  If we had a way of determining or even
+	 * making a good educated guess about which directory these non-unique
+	 * basename files had moved to, we could check it.
+	 * Luckily...
+	 *
+	 * When an entire directory is in fact renamed, we have two factors
+	 * helping us out:
+	 *   (a) the original directory disappeared giving us a hint
+	 *       about when we can apply an extra heuristic.
+	 *   (b) we often have several files within that directory and
+	 *       subdirectories that are renamed without changes
+	 * So, rules for a heuristic:
+	 *   (0) If the basename matches are non-unique (the condition under
+	 *       which this function is called) AND
+	 *   (1) the directory in which the file was found has disappeared
+	 *       (i.e. dirs_removed is non-NULL and has a relevant entry) THEN
+	 *   (2) use exact renames of files within the directory to determine
+	 *       where the directory is likely to have been renamed to.  IF
+	 *       there is at least one exact rename from within that
+	 *       directory, we can proceed.
+	 *   (3) If there are multiple places the directory could have been
+	 *       renamed to based on exact renames, ignore all but one of them.
+	 *       Just use the destination with the most renames going to it.
+	 *   (4) Check if applying that directory rename to the original file
+	 *       would result in a destination filename that is in the
+	 *       potential rename set.  If so, return the index of the
+	 *       destination file (the index within rename_dst).
+	 *   (5) Compare the original file and returned destination for
+	 *       similarity, and if they are sufficiently similar, record the
+	 *       rename.
+	 *
+	 * This function, idx_possible_rename(), is only responsible for (4).
+	 * The conditions/steps in (1)-(3) are handled via setting up
+	 * dir_rename_count and dir_rename_guess in
+	 * initialize_dir_rename_info().  Steps (0) and (5) are handled by
+	 * the caller of this function.
+	 */
+	char *old_dir, *new_dir, *new_path;
+	int idx;
+
+	if (!info->setup)
+		return -1;
+
+	old_dir = get_dirname(filename);
+	new_dir = strmap_get(&info->dir_rename_guess, old_dir);
+	free(old_dir);
+	if (!new_dir)
+		return -1;
+
+	new_path = xstrfmt("%s/%s", new_dir, get_basename(filename));
+
+	idx = strintmap_get(&info->idx_map, new_path);
+	free(new_path);
+	return idx;
+}
+
 static int find_basename_matches(struct diff_options *options,
 				 int minimum_score,
 				 struct dir_rename_info *info,
-- 
gitgitgadget



* [PATCH 09/10] diffcore-rename: limit dir_rename_counts computation to relevant dirs
  2021-02-14  7:58 [PATCH 00/10] Optimization batch 8: use file basenames even more Elijah Newren via GitGitGadget
                   ` (7 preceding siblings ...)
  2021-02-14  7:59 ` [PATCH 08/10] diffcore-rename: add a new idx_possible_rename function Elijah Newren via GitGitGadget
@ 2021-02-14  7:59 ` Elijah Newren via GitGitGadget
  2021-02-14  7:59 ` [PATCH 10/10] diffcore-rename: use directory rename guided basename comparisons Elijah Newren via GitGitGadget
  2021-02-23 23:43 ` [PATCH v2 00/10] Optimization batch 8: use file basenames even more Elijah Newren via GitGitGadget
  10 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-14  7:59 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

We are using dir_rename_counts to count the number of other directories
that files within a directory moved to.  We only need this information
for directories that disappeared, though, so we can return early from
update_dir_rename_counts() for other paths.

While dirs_removed provides the relevant information for us right now,
we introduce a new info->relevant_source_dirs field because future
optimizations will want to change how things are called somewhat.
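
The effect of the early exit can be sketched as follows.  This is a
simplified model, not the real loop: it only walks the source side,
whereas update_dir_rename_counts() also tracks new_dir and has its own
exit conditions.  toy_count_relevant_levels() is a hypothetical name and
an array stands in for the strset:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Truncate path at its last '/'; a path with no '/' becomes "". */
static void toy_dirname_munge(char *path)
{
	char *slash = strrchr(path, '/');

	if (!slash)
		slash = path;
	*slash = '\0';
}

/*
 * Walk up the parent directories of old_path (modified in place) and
 * count how many are in the relevant set, stopping at the first one
 * that is not.
 */
static int toy_count_relevant_levels(char *old_path,
				     const char *const *relevant,
				     size_t nr)
{
	int levels = 0;

	while (*old_path) {
		size_t i;

		toy_dirname_munge(old_path);
		for (i = 0; i < nr; i++)
			if (!strcmp(relevant[i], old_path))
				break;
		if (i == nr)
			break;	/* irrelevant directory: skip the rest */
		levels++;
	}
	return levels;
}
```

For a/b/c/file.c with a/b/c and a/b in the relevant set, two levels are
processed and the walk stops at a; if a/b/c itself is not relevant,
nothing is processed at all.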

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 1e4a56adde2c..5de4497e04fa 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -371,6 +371,7 @@ struct dir_rename_info {
 	struct strintmap idx_map;
 	struct strmap dir_rename_guess;
 	struct strmap *dir_rename_count;
+	struct strset *relevant_source_dirs;
 	unsigned setup;
 };
 
@@ -460,7 +461,13 @@ static void update_dir_rename_counts(struct dir_rename_info *info,
 		return;
 
 	while (1) {
+		/* Get old_dir, skip if its directory isn't relevant. */
 		dirname_munge(old_dir);
+		if (info->relevant_source_dirs &&
+		    !strset_contains(info->relevant_source_dirs, old_dir))
+			break;
+
+		/* Get new_dir */
 		dirname_munge(new_dir);
 
 		/*
@@ -540,6 +547,9 @@ static void initialize_dir_rename_info(struct dir_rename_info *info,
 	strintmap_init_with_options(&info->idx_map, -1, NULL, 0);
 	strmap_init_with_options(&info->dir_rename_guess, NULL, 0);
 
+	/* Setup info->relevant_source_dirs */
+	info->relevant_source_dirs = dirs_removed;
+
 	/*
 	 * Loop setting up both info->idx_map, and doing setup of
 	 * info->dir_rename_count.
-- 
gitgitgadget



* [PATCH 10/10] diffcore-rename: use directory rename guided basename comparisons
  2021-02-14  7:58 [PATCH 00/10] Optimization batch 8: use file basenames even more Elijah Newren via GitGitGadget
                   ` (8 preceding siblings ...)
  2021-02-14  7:59 ` [PATCH 09/10] diffcore-rename: limit dir_rename_counts computation to relevant dirs Elijah Newren via GitGitGadget
@ 2021-02-14  7:59 ` Elijah Newren via GitGitGadget
  2021-02-23 23:43 ` [PATCH v2 00/10] Optimization batch 8: use file basenames even more Elijah Newren via GitGitGadget
  10 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-14  7:59 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

Hook the work from the last several patches together so that when
basenames in the sets of possible remaining rename sources or
destinations aren't unique, we can guess which directory source files
were renamed into.  When that guess gives us a pairing of files, and
those files are sufficiently similar, we record the two files as a
rename and remove them from the large matrix of comparisons for inexact
rename detection.

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:

                            Before                  After
    no-renames:       12.775 s ±  0.062 s    12.596 s ±  0.061 s
    mega-renames:    188.754 s ±  0.284 s   130.465 s ±  0.259 s
    just-one-mega:     5.599 s ±  0.019 s     3.958 s ±  0.010 s
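
Expressed as ratios, the timings above work out to roughly 1.45x for
mega-renames, 1.41x for just-one-mega, and about 1% for no-renames.  A
small sketch of that arithmetic, using only the mean values from the
table (toy_speedup() and toy_print_speedups() are hypothetical helpers,
and the error bars are ignored):

```c
#include <assert.h>
#include <stdio.h>

/* Ratio of old to new wall-clock time; > 1.0 means the new code is faster. */
static double toy_speedup(double before_s, double after_s)
{
	return before_s / after_s;
}

/* Print the ratio for each testcase, using the means quoted above. */
static void toy_print_speedups(void)
{
	printf("no-renames:    %.2fx\n", toy_speedup(12.775, 12.596));
	printf("mega-renames:  %.2fx\n", toy_speedup(188.754, 130.465));
	printf("just-one-mega: %.2fx\n", toy_speedup(5.599, 3.958));
}
```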

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 Documentation/gitdiffcore.txt |  2 +-
 diffcore-rename.c             | 32 +++++++++++++++++++++++---------
 2 files changed, 24 insertions(+), 10 deletions(-)

diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt
index 80fcf9542441..8673a5c5b2f2 100644
--- a/Documentation/gitdiffcore.txt
+++ b/Documentation/gitdiffcore.txt
@@ -186,7 +186,7 @@ mark a file pair as a rename and stop considering other candidates for
 better matches.  At most, one comparison is done per file in this
 preliminary pass; so if there are several remaining ext.txt files
 throughout the directory hierarchy after exact rename detection, this
-preliminary step will be skipped for those files.
+preliminary step may be skipped for those files.
 
 Note.  When the "-C" option is used with `--find-copies-harder`
 option, 'git diff-{asterisk}' commands feed unmodified filepairs to
diff --git a/diffcore-rename.c b/diffcore-rename.c
index 5de4497e04fa..70a484b9b63e 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -667,7 +667,6 @@ static const char *get_basename(const char *filename)
 	return base ? base + 1 : filename;
 }
 
-MAYBE_UNUSED
 static int idx_possible_rename(char *filename, struct dir_rename_info *info)
 {
 	/*
@@ -780,8 +779,6 @@ static int find_basename_matches(struct diff_options *options,
 	int i, renames = 0;
 	struct strintmap sources;
 	struct strintmap dests;
-	struct hashmap_iter iter;
-	struct strmap_entry *entry;
 
 	/*
 	 * The prefeteching stuff wants to know if it can skip prefetching
@@ -831,17 +828,34 @@ static int find_basename_matches(struct diff_options *options,
 	}
 
 	/* Now look for basename matchups and do similarity estimation */
-	strintmap_for_each_entry(&sources, &iter, entry) {
-		const char *base = entry->key;
-		intptr_t src_index = (intptr_t)entry->value;
+	for (i = 0; i < rename_src_nr; ++i) {
+		char *filename = rename_src[i].p->one->path;
+		const char *base = NULL;
+		intptr_t src_index;
 		intptr_t dst_index;
-		if (src_index == -1)
-			continue;
 
-		if (0 <= (dst_index = strintmap_get(&dests, base))) {
+		/* Is this basename unique among remaining sources? */
+		base = get_basename(filename);
+		src_index = strintmap_get(&sources, base);
+		assert(src_index == -1 || src_index == i);
+
+		if (strintmap_contains(&dests, base)) {
 			struct diff_filespec *one, *two;
 			int score;
 
+			/* Find a matching destination, if possible */
+			dst_index = strintmap_get(&dests, base);
+			if (src_index == -1 || dst_index == -1) {
+				src_index = i;
+				dst_index = idx_possible_rename(filename, info);
+			}
+			if (dst_index == -1)
+				continue;
+
+			/* Ignore this dest if already used in a rename */
+			if (rename_dst[dst_index].is_rename)
+				continue; /* already used previously */
+
 			/* Estimate the similarity */
 			one = rename_src[src_index].p->one;
 			two = rename_dst[dst_index].p->two;
-- 
gitgitgadget


* [PATCH v2 00/10] Optimization batch 8: use file basenames even more
  2021-02-14  7:58 [PATCH 00/10] Optimization batch 8: use file basenames even more Elijah Newren via GitGitGadget
                   ` (9 preceding siblings ...)
  2021-02-14  7:59 ` [PATCH 10/10] diffcore-rename: use directory rename guided basename comparisons Elijah Newren via GitGitGadget
@ 2021-02-23 23:43 ` Elijah Newren via GitGitGadget
  2021-02-23 23:43   ` [PATCH v2 01/10] Move computation of dir_rename_count from merge-ort to diffcore-rename Elijah Newren via GitGitGadget
                     ` (11 more replies)
  10 siblings, 12 replies; 61+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-23 23:43 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren

This series depends on en/diffcore-rename (a concatenation of what I was
calling ort-perf-batch-6 and ort-perf-batch-7).

There are no changes since v1; it's just a resend a week and a half later to
bump it so it isn't lost.

=== Optimization idea ===

This series uses file basenames (portions of the path after the last '/',
including file extension) in a more involved fashion to guide rename
detection. It's a follow-on improvement to "Optimization #3" from my Git
Merge 2020 talk[1]. The basic idea behind this series is the same as the
last series: people frequently move files across directories while keeping
the filenames the same, thus files with the same basename are likely rename
candidates. However, the previous optimization only applies when basenames
are unique among remaining adds and deletes after exact rename detection, so
we need to do something else to match up the remaining basenames. When there
are many files with the same basename (e.g. .gitignore, Makefile,
build.gradle, or maybe even setup.c, AbstractFactory.java, etc.), being able
to "guess" which directory a given file likely would have moved to can
provide us with a likely rename candidate if there is a file with the same
basename in that directory. Since exact rename detection is done first, we
can use nearby exact renames to help us guess where any given non-unique
basename file may have moved; it just means doing "directory rename
detection" limited to exact renames.
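
As a rough illustration of "directory rename detection limited to exact
renames", the sketch below picks, for a given old directory, the new
directory that most exact renames out of it went to. The struct dir_pair
type, the guess_new_dir() name, and the majority-vote tie handling are
assumptions made for this example, not code from the series.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record of one exact rename, reduced to its directories. */
struct dir_pair {
	const char *old_dir;
	const char *new_dir;
};

/*
 * Return the new_dir that files from old_dir were most often moved to
 * among the given exact renames, or NULL if no exact rename left old_dir.
 * Ties go to the first candidate seen.  Quadratic, but short.
 */
static const char *guess_new_dir(const struct dir_pair *pairs, size_t nr,
				 const char *old_dir)
{
	const char *best = NULL;
	size_t best_count = 0;
	size_t i, j;

	for (i = 0; i < nr; i++) {
		size_t count = 0;

		if (strcmp(pairs[i].old_dir, old_dir))
			continue;
		/* Count every exact rename with this same old/new pairing. */
		for (j = 0; j < nr; j++)
			if (!strcmp(pairs[j].old_dir, old_dir) &&
			    !strcmp(pairs[j].new_dir, pairs[i].new_dir))
				count++;
		if (count > best_count) {
			best_count = count;
			best = pairs[i].new_dir;
		}
	}
	return best;
}
```

A file with a non-unique basename in old_dir would then be compared only
against the file with the same basename in the returned directory, if one
exists.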

There are definitely cases when this strategy still won't help us: (1) we
only use this strategy when the directory in which the original file was
found has also been removed, (2) a lack of exact renames from the given
directory will prevent us from making a new directory prediction, (3) even
if we predict a new directory there may be no file with the given basename
in it, and (4) even if there is an unmatched add with the appropriate
basename in the predicted directory, it may not meet the higher
min_basename_score similarity threshold.

It may be worth noting that directory rename detection at most predicts one
new directory, which we use to ensure that we only compare any given file
with at most one other file. That's important for compatibility with future
optimizations.

However, despite the caveats and limited applicability, this idea provides
some nice speedups.

=== Results ===

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28), the
changes in just this series improve the performance as follows:

                     Before Series           After Series
no-renames:       12.775 s ±  0.062 s    12.596 s ±  0.061 s
mega-renames:    188.754 s ±  0.284 s   130.465 s ±  0.259 s
just-one-mega:     5.599 s ±  0.019 s     3.958 s ±  0.010 s


As a reminder, before any merge-ort/diffcore-rename performance work, the
performance results we started with (as noted in the same commit message)
were:

no-renames-am:      6.940 s ±  0.485 s
no-renames:        18.912 s ±  0.174 s
mega-renames:    5964.031 s ± 10.459 s
just-one-mega:    149.583 s ±  0.751 s


=== Alternative, rejected idea ===

There was an alternative idea to the series presented here that I also
tried: instead of using directory rename detection based on exact renames to
predict where files would be renamed and then comparing to the file with the
same basename in the new directory, one could instead take all files with
the same basename -- both sources and destinations -- and then do a smaller
M x N comparison on all those files to find renames. Any non-matches after
that step could be combined with all other files for the big inexact rename
detection step.

There are two problems with such a strategy, though.

One is that in the worst case, you approximately double the cost of rename
detection (if most potential rename pairs all have the same basename but
they aren't actually matches, you end up comparing twice).

The second issue isn't clear until trying to combine this idea with later
performance optimizations. The next optimization will provide a way to
filter out several of the rename sources. If our inexact rename detection
matrix is sized 1 x 4000 because we can remove all but one source file, but
we have 100 files with the same basename, then a 100 x 100 comparison is
actually more costly than a 1 x 4000 comparison -- and we don't need most of
the renames from the 100 x 100 comparison. The advantage of the directory
rename detection based idea for finding which basenames to match up is that
the cost for each file is linear (or, said another way, scales proportionally
to doing a diff on that file). As such, the costs for this preliminary
optimization are nicely controlled and the worst case scenario is it has
spent a little extra time upfront but still has to do the full inexact
rename detection.
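
The arithmetic behind that comparison can be written out as a small sketch.
It assumes every file-to-file comparison costs the same, which is a
simplification; the function names are made up for this example.

```c
#include <assert.h>

/* Comparisons needed to fill a full M x N similarity matrix. */
static unsigned long matrix_comparisons(unsigned long nr_sources,
					unsigned long nr_dests)
{
	return nr_sources * nr_dests;
}

/*
 * Upper bound for the directory-guided approach: each source is compared
 * with at most one destination, so the cost grows linearly.
 */
static unsigned long guided_comparisons(unsigned long nr_sources)
{
	return nr_sources;
}
```

With the numbers above, matrix_comparisons(100, 100) is 10000 while
matrix_comparisons(1, 4000) is 4000, and guided_comparisons(100) is at most
100.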

[1]
https://github.com/newren/presentations/blob/pdfs/merge-performance/merge-performance-slides.pdf

Elijah Newren (10):
  Move computation of dir_rename_count from merge-ort to diffcore-rename
  diffcore-rename: add functions for clearing dir_rename_count
  diffcore-rename: move dir_rename_counts into a dir_rename_info struct
  diffcore-rename: extend cleanup_dir_rename_info()
  diffcore-rename: compute dir_rename_counts in stages
  diffcore-rename: add a mapping of destination names to their indices
  diffcore-rename: add a dir_rename_guess field to dir_rename_info
  diffcore-rename: add a new idx_possible_rename function
  diffcore-rename: limit dir_rename_counts computation to relevant dirs
  diffcore-rename: use directory rename guided basename comparisons

 Documentation/gitdiffcore.txt |   2 +-
 diffcore-rename.c             | 439 ++++++++++++++++++++++++++++++++--
 diffcore.h                    |   7 +
 merge-ort.c                   | 144 +----------
 4 files changed, 439 insertions(+), 153 deletions(-)


base-commit: aeca14f748afc7fb5b65bca56ea2ebd970729814
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-844%2Fnewren%2Fort-perf-batch-8-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-844/newren/ort-perf-batch-8-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/844

Range-diff vs v1:

  1:  fec4f1d44c06 =  1:  fec4f1d44c06 Move computation of dir_rename_count from merge-ort to diffcore-rename
  2:  612da82f049c =  2:  612da82f049c diffcore-rename: add functions for clearing dir_rename_count
  3:  93f98fc0b264 =  3:  93f98fc0b264 diffcore-rename: move dir_rename_counts into a dir_rename_info struct
  4:  f7bdad78219d =  4:  f7bdad78219d diffcore-rename: extend cleanup_dir_rename_info()
  5:  3a29cf9e526f =  5:  3a29cf9e526f diffcore-rename: compute dir_rename_counts in stages
  6:  dffecc064dd3 =  6:  dffecc064dd3 diffcore-rename: add a mapping of destination names to their indices
  7:  4983a1c2f908 =  7:  4983a1c2f908 diffcore-rename: add a dir_rename_guess field to dir_rename_info
  8:  cbd055ab3399 =  8:  cbd055ab3399 diffcore-rename: add a new idx_possible_rename function
  9:  4e095ea7c439 =  9:  4e095ea7c439 diffcore-rename: limit dir_rename_counts computation to relevant dirs
 10:  1df498b3a2f0 = 10:  805c101cfd84 diffcore-rename: use directory rename guided basename comparisons

-- 
gitgitgadget


* [PATCH v2 01/10] Move computation of dir_rename_count from merge-ort to diffcore-rename
  2021-02-23 23:43 ` [PATCH v2 00/10] Optimization batch 8: use file basenames even more Elijah Newren via GitGitGadget
@ 2021-02-23 23:43   ` Elijah Newren via GitGitGadget
  2021-02-24 15:25     ` Derrick Stolee
  2021-02-23 23:43   ` [PATCH v2 02/10] diffcore-rename: add functions for clearing dir_rename_count Elijah Newren via GitGitGadget
                     ` (10 subsequent siblings)
  11 siblings, 1 reply; 61+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-23 23:43 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

A previous commit noted that it is very common for people to move files
across directories while keeping their filename the same.  The last few
commits took advantage of this and showed that we can accelerate rename
detection significantly using basenames; since files with the same
basename serve as likely rename candidates, we can check those first and
remove them from the rename candidate pool if they are sufficiently
similar.

Unfortunately, the previous optimization was limited by the fact that
the remaining basenames after exact rename detection are not always
unique.  Many repositories have hundreds of build files with the same
name (e.g. Makefile, .gitignore, build.gradle, etc.), and may even have
hundreds of source files with the same name.  (For example, the Linux
kernel has 100 setup.c, 87 irq.c, and 112 core.c files.  A repository at
$DAYJOB has a lot of ObjectFactory.java and Plugin.java files).
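
To make "non-unique basename" concrete, here is a minimal sketch that
extracts a basename and counts how many paths in a list share it.  The
helper names and the paths used with it are illustrative assumptions; the
real get_basename() in diffcore-rename.c may differ in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the part of path after its last '/', or all of it if none. */
static const char *basename_of(const char *path)
{
	const char *slash = strrchr(path, '/');

	return slash ? slash + 1 : path;
}

/* Count how many of the given paths share the basename of target. */
static size_t count_same_basename(const char *const *paths, size_t nr,
				  const char *target)
{
	const char *base = basename_of(target);
	size_t i, count = 0;

	for (i = 0; i < nr; i++)
		if (!strcmp(basename_of(paths[i]), base))
			count++;
	return count;
}
```

A count greater than one among the remaining adds or deletes is exactly the
situation in which the unique-basename optimization cannot pick a
candidate on its own.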

For these files with non-unique basenames, we are faced with the task of
attempting to determine or guess which directory they may have been
relocated to.  Such a task is precisely the job of directory rename
detection.  However, there are two catches: (1) the directory rename
detection code has traditionally been part of the merge machinery rather
than diffcore-rename.c, and (2) directory rename detection currently
runs after regular rename detection is complete.  The 1st catch is just
an implementation issue that can be overcome by some code shuffling.
The 2nd requires us to add a further approximation: we only have access
to exact renames at this point, so we need to do directory rename
detection based on just exact renames.  In some cases we won't have
exact renames, in which case this extra optimization won't apply.  We
also choose to not apply the optimization unless we know that the
underlying directory was removed, which will require extra data to be
passed in to diffcore_rename_extended().  Also, even if we get a
prediction about which directory a file may have relocated to, we will
still need to check to see if there is a file in the predicted
directory, and then compare the two files to see if they meet the higher
min_basename_score threshold required for marking the two files as
renames.

This commit and the next few will set up the necessary infrastructure to
do such computations.  This commit merely moves the computation of
dir_rename_count from merge-ort.c to diffcore-rename.c, making slight
adjustments to the data structures based on the move.  While the
diffstat looks large, viewing this commit with --color-moved makes it
clear that only about 20 lines changed.  With this patch, the
computation of dir_rename_count is still only done after inexact rename
detection, but subsequent commits will add a preliminary computation of
dir_rename_count after exact rename detection, followed by some updates
after inexact rename detection.
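
For readers who want to experiment with the counting rule outside of Git,
the following standalone sketch lists which directory renames a single file
rename implies, following the rule described in the comment inside
update_dir_rename_counts().  It is a simplified re-implementation under
stated assumptions: it uses fixed-size buffers, copies the stripped
component instead of relying on the in-place NUL trick, and omits the
dirs_removed check entirely.  The names implied_dir_renames(),
struct implied_rename, MAX_IMPLIED, and MAX_PATH_LEN are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define MAX_IMPLIED 16
#define MAX_PATH_LEN 256

struct implied_rename {
	char old_dir[MAX_PATH_LEN];
	char new_dir[MAX_PATH_LEN];
};

/* Cut path at its last '/' (or to "" if none); save what was cut off. */
static void strip_last_component(char *path, char *component)
{
	char *slash = strrchr(path, '/');

	strcpy(component, slash ? slash + 1 : path);
	if (slash)
		*slash = '\0';
	else
		*path = '\0';
}

/* Fill out[] with the implied directory renames; return how many. */
static int implied_dir_renames(const char *old_path, const char *new_path,
			       struct implied_rename *out, int max)
{
	char old_dir[MAX_PATH_LEN], new_dir[MAX_PATH_LEN];
	char old_part[MAX_PATH_LEN], new_part[MAX_PATH_LEN];
	int nr = 0, first = 1;

	assert(strlen(old_path) < MAX_PATH_LEN);
	assert(strlen(new_path) < MAX_PATH_LEN);
	strcpy(old_dir, old_path);
	strcpy(new_dir, new_path);

	while (nr < max) {
		strip_last_component(old_dir, old_part);
		strip_last_component(new_dir, new_part);

		/* After the basename, trailing components must match. */
		if (!first && strcmp(old_part, new_part))
			break;

		strcpy(out[nr].old_dir, old_dir);
		strcpy(out[nr].new_dir, new_dir);
		nr++;

		/* Stop once either side reaches the toplevel (""). */
		if (!*old_dir || !*new_dir)
			break;
		first = 0;
	}
	return nr;
}
```

For "a/b/c/d/e/foo.c" -> "a/b/some/thing/else/e/foo.c" this yields the two
pairs named in the code comment and stops before suggesting
a/b/c/ => a/b/some/thing/.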

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 134 +++++++++++++++++++++++++++++++++++++++++++++-
 diffcore.h        |   5 ++
 merge-ort.c       | 132 ++-------------------------------------------
 3 files changed, 141 insertions(+), 130 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 41558185ae1d..33cfc5848611 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -367,6 +367,125 @@ static int find_exact_renames(struct diff_options *options)
 	return renames;
 }
 
+static void dirname_munge(char *filename)
+{
+	char *slash = strrchr(filename, '/');
+	if (!slash)
+		slash = filename;
+	*slash = '\0';
+}
+
+static void increment_count(struct strmap *dir_rename_count,
+			    char *old_dir,
+			    char *new_dir)
+{
+	struct strintmap *counts;
+	struct strmap_entry *e;
+
+	/* Get the {new_dirs -> counts} mapping using old_dir */
+	e = strmap_get_entry(dir_rename_count, old_dir);
+	if (e) {
+		counts = e->value;
+	} else {
+		counts = xmalloc(sizeof(*counts));
+		strintmap_init_with_options(counts, 0, NULL, 1);
+		strmap_put(dir_rename_count, old_dir, counts);
+	}
+
+	/* Increment the count for new_dir */
+	strintmap_incr(counts, new_dir, 1);
+}
+
+static void update_dir_rename_counts(struct strmap *dir_rename_count,
+				     struct strset *dirs_removed,
+				     const char *oldname,
+				     const char *newname)
+{
+	char *old_dir = xstrdup(oldname);
+	char *new_dir = xstrdup(newname);
+	char new_dir_first_char = new_dir[0];
+	int first_time_in_loop = 1;
+
+	while (1) {
+		dirname_munge(old_dir);
+		dirname_munge(new_dir);
+
+		/*
+		 * When renaming
+		 *   "a/b/c/d/e/foo.c" -> "a/b/some/thing/else/e/foo.c"
+		 * then this suggests that both
+		 *   a/b/c/d/e/ => a/b/some/thing/else/e/
+		 *   a/b/c/d/   => a/b/some/thing/else/
+		 * so we want to increment counters for both.  We do NOT,
+		 * however, also want to suggest that there was the following
+		 * rename:
+		 *   a/b/c/ => a/b/some/thing/
+		 * so we need to quit at that point.
+		 *
+		 * Note the when first_time_in_loop, we only strip off the
+		 * basename, and we don't care if that's different.
+		 */
+		if (!first_time_in_loop) {
+			char *old_sub_dir = strchr(old_dir, '\0')+1;
+			char *new_sub_dir = strchr(new_dir, '\0')+1;
+			if (!*new_dir) {
+				/*
+				 * Special case when renaming to root directory,
+				 * i.e. when new_dir == "".  In this case, we had
+				 * something like
+				 *    a/b/subdir => subdir
+				 * and so dirname_munge() sets things up so that
+				 *    old_dir = "a/b\0subdir\0"
+				 *    new_dir = "\0ubdir\0"
+				 * We didn't have a '/' to overwrite a '\0' onto
+				 * in new_dir, so we have to compare differently.
+				 */
+				if (new_dir_first_char != old_sub_dir[0] ||
+				    strcmp(old_sub_dir+1, new_sub_dir))
+					break;
+			} else {
+				if (strcmp(old_sub_dir, new_sub_dir))
+					break;
+			}
+		}
+
+		if (strset_contains(dirs_removed, old_dir))
+			increment_count(dir_rename_count, old_dir, new_dir);
+		else
+			break;
+
+		/* If we hit toplevel directory ("") for old or new dir, quit */
+		if (!*old_dir || !*new_dir)
+			break;
+
+		first_time_in_loop = 0;
+	}
+
+	/* Free resources we don't need anymore */
+	free(old_dir);
+	free(new_dir);
+}
+
+static void compute_dir_rename_counts(struct strmap *dir_rename_count,
+				      struct strset *dirs_removed)
+{
+	int i;
+
+	/* Set up dir_rename_count */
+	for (i = 0; i < rename_dst_nr; ++i) {
+		/*
+		 * Make dir_rename_count contain a map of a map:
+		 *   old_directory -> {new_directory -> count}
+		 * In other words, for every pair look at the directories for
+		 * the old filename and the new filename and count how many
+		 * times that pairing occurs.
+		 */
+		update_dir_rename_counts(dir_rename_count, dirs_removed,
+					 rename_dst[i].p->one->path,
+					 rename_dst[i].p->two->path);
+	}
+}
+
 static const char *get_basename(const char *filename)
 {
 	/*
@@ -640,7 +759,9 @@ static void remove_unneeded_paths_from_src(int detecting_copies)
 	rename_src_nr = new_num_src;
 }
 
-void diffcore_rename(struct diff_options *options)
+void diffcore_rename_extended(struct diff_options *options,
+			      struct strset *dirs_removed,
+			      struct strmap *dir_rename_count)
 {
 	int detect_rename = options->detect_rename;
 	int minimum_score = options->rename_score;
@@ -653,6 +774,7 @@ void diffcore_rename(struct diff_options *options)
 	struct progress *progress = NULL;
 
 	trace2_region_enter("diff", "setup", options->repo);
+	assert(!dir_rename_count || strmap_empty(dir_rename_count));
 	want_copies = (detect_rename == DIFF_DETECT_COPY);
 	if (!minimum_score)
 		minimum_score = DEFAULT_RENAME_SCORE;
@@ -841,6 +963,11 @@ void diffcore_rename(struct diff_options *options)
 	trace2_region_leave("diff", "inexact renames", options->repo);
 
  cleanup:
+	/*
+	 * Now that renames have been computed, compute dir_rename_count */
+	if (dirs_removed && dir_rename_count)
+		compute_dir_rename_counts(dir_rename_count, dirs_removed);
+
 	/* At this point, we have found some renames and copies and they
 	 * are recorded in rename_dst.  The original list is still in *q.
 	 */
@@ -923,3 +1050,8 @@ void diffcore_rename(struct diff_options *options)
 	trace2_region_leave("diff", "write back to queue", options->repo);
 	return;
 }
+
+void diffcore_rename(struct diff_options *options)
+{
+	diffcore_rename_extended(options, NULL, NULL);
+}
diff --git a/diffcore.h b/diffcore.h
index d2a63c5c71f4..db55d3853071 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -8,6 +8,8 @@
 
 struct diff_options;
 struct repository;
+struct strmap;
+struct strset;
 struct userdiff_driver;
 
 /* This header file is internal between diff.c and its diff transformers
@@ -161,6 +163,9 @@ void diff_q(struct diff_queue_struct *, struct diff_filepair *);
 
 void diffcore_break(struct repository *, int);
 void diffcore_rename(struct diff_options *);
+void diffcore_rename_extended(struct diff_options *options,
+			      struct strset *dirs_removed,
+			      struct strmap *dir_rename_count);
 void diffcore_merge_broken(void);
 void diffcore_pickaxe(struct diff_options *);
 void diffcore_order(const char *orderfile);
diff --git a/merge-ort.c b/merge-ort.c
index 603d30c52170..c4467e073b45 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -1302,131 +1302,6 @@ static char *handle_path_level_conflicts(struct merge_options *opt,
 	return new_path;
 }
 
-static void dirname_munge(char *filename)
-{
-	char *slash = strrchr(filename, '/');
-	if (!slash)
-		slash = filename;
-	*slash = '\0';
-}
-
-static void increment_count(struct strmap *dir_rename_count,
-			    char *old_dir,
-			    char *new_dir)
-{
-	struct strintmap *counts;
-	struct strmap_entry *e;
-
-	/* Get the {new_dirs -> counts} mapping using old_dir */
-	e = strmap_get_entry(dir_rename_count, old_dir);
-	if (e) {
-		counts = e->value;
-	} else {
-		counts = xmalloc(sizeof(*counts));
-		strintmap_init_with_options(counts, 0, NULL, 1);
-		strmap_put(dir_rename_count, old_dir, counts);
-	}
-
-	/* Increment the count for new_dir */
-	strintmap_incr(counts, new_dir, 1);
-}
-
-static void update_dir_rename_counts(struct strmap *dir_rename_count,
-				     struct strset *dirs_removed,
-				     const char *oldname,
-				     const char *newname)
-{
-	char *old_dir = xstrdup(oldname);
-	char *new_dir = xstrdup(newname);
-	char new_dir_first_char = new_dir[0];
-	int first_time_in_loop = 1;
-
-	while (1) {
-		dirname_munge(old_dir);
-		dirname_munge(new_dir);
-
-		/*
-		 * When renaming
-		 *   "a/b/c/d/e/foo.c" -> "a/b/some/thing/else/e/foo.c"
-		 * then this suggests that both
-		 *   a/b/c/d/e/ => a/b/some/thing/else/e/
-		 *   a/b/c/d/   => a/b/some/thing/else/
-		 * so we want to increment counters for both.  We do NOT,
-		 * however, also want to suggest that there was the following
-		 * rename:
-		 *   a/b/c/ => a/b/some/thing/
-		 * so we need to quit at that point.
-		 *
-		 * Note the when first_time_in_loop, we only strip off the
-		 * basename, and we don't care if that's different.
-		 */
-		if (!first_time_in_loop) {
-			char *old_sub_dir = strchr(old_dir, '\0')+1;
-			char *new_sub_dir = strchr(new_dir, '\0')+1;
-			if (!*new_dir) {
-				/*
-				 * Special case when renaming to root directory,
-				 * i.e. when new_dir == "".  In this case, we had
-				 * something like
-				 *    a/b/subdir => subdir
-				 * and so dirname_munge() sets things up so that
-				 *    old_dir = "a/b\0subdir\0"
-				 *    new_dir = "\0ubdir\0"
-				 * We didn't have a '/' to overwrite a '\0' onto
-				 * in new_dir, so we have to compare differently.
-				 */
-				if (new_dir_first_char != old_sub_dir[0] ||
-				    strcmp(old_sub_dir+1, new_sub_dir))
-					break;
-			} else {
-				if (strcmp(old_sub_dir, new_sub_dir))
-					break;
-			}
-		}
-
-		if (strset_contains(dirs_removed, old_dir))
-			increment_count(dir_rename_count, old_dir, new_dir);
-		else
-			break;
-
-		/* If we hit toplevel directory ("") for old or new dir, quit */
-		if (!*old_dir || !*new_dir)
-			break;
-
-		first_time_in_loop = 0;
-	}
-
-	/* Free resources we don't need anymore */
-	free(old_dir);
-	free(new_dir);
-}
-
-static void compute_rename_counts(struct diff_queue_struct *pairs,
-				  struct strmap *dir_rename_count,
-				  struct strset *dirs_removed)
-{
-	int i;
-
-	for (i = 0; i < pairs->nr; ++i) {
-		struct diff_filepair *pair = pairs->queue[i];
-
-		/* File not part of directory rename if it wasn't renamed */
-		if (pair->status != 'R')
-			continue;
-
-		/*
-		 * Make dir_rename_count contain a map of a map:
-		 *   old_directory -> {new_directory -> count}
-		 * In other words, for every pair look at the directories for
-		 * the old filename and the new filename and count how many
-		 * times that pairing occurs.
-		 */
-		update_dir_rename_counts(dir_rename_count, dirs_removed,
-					 pair->one->path,
-					 pair->two->path);
-	}
-}
-
 static void get_provisional_directory_renames(struct merge_options *opt,
 					      unsigned side,
 					      int *clean)
@@ -1435,9 +1310,6 @@ static void get_provisional_directory_renames(struct merge_options *opt,
 	struct strmap_entry *entry;
 	struct rename_info *renames = &opt->priv->renames;
 
-	compute_rename_counts(&renames->pairs[side],
-			      &renames->dir_rename_count[side],
-			      &renames->dirs_removed[side]);
 	/*
 	 * Collapse
 	 *    dir_rename_count: old_directory -> {new_directory -> count}
@@ -2162,7 +2034,9 @@ static void detect_regular_renames(struct merge_options *opt,
 
 	diff_queued_diff = renames->pairs[side_index];
 	trace2_region_enter("diff", "diffcore_rename", opt->repo);
-	diffcore_rename(&diff_opts);
+	diffcore_rename_extended(&diff_opts,
+				 &renames->dirs_removed[side_index],
+				 &renames->dir_rename_count[side_index]);
 	trace2_region_leave("diff", "diffcore_rename", opt->repo);
 	resolve_diffpair_statuses(&diff_queued_diff);
 
-- 
gitgitgadget



* [PATCH v2 02/10] diffcore-rename: add functions for clearing dir_rename_count
  2021-02-23 23:43 ` [PATCH v2 00/10] Optimization batch 8: use file basenames even more Elijah Newren via GitGitGadget
  2021-02-23 23:43   ` [PATCH v2 01/10] Move computation of dir_rename_count from merge-ort to diffcore-rename Elijah Newren via GitGitGadget
@ 2021-02-23 23:43   ` Elijah Newren via GitGitGadget
  2021-02-23 23:44   ` [PATCH v2 03/10] diffcore-rename: move dir_rename_counts into a dir_rename_info struct Elijah Newren via GitGitGadget
                     ` (9 subsequent siblings)
  11 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-23 23:43 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

As we adjust the usage of dir_rename_count, we want functions for
clearing it out, either fully or partially.  Add such functions.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 19 +++++++++++++++++++
 diffcore.h        |  2 ++
 merge-ort.c       | 12 +++---------
 3 files changed, 24 insertions(+), 9 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 33cfc5848611..614a8d63012d 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -486,6 +486,25 @@ static void compute_dir_rename_counts(struct strmap *dir_rename_count,
 	}
 }
 
+void partial_clear_dir_rename_count(struct strmap *dir_rename_count)
+{
+	struct hashmap_iter iter;
+	struct strmap_entry *entry;
+
+	strmap_for_each_entry(dir_rename_count, &iter, entry) {
+		struct strintmap *counts = entry->value;
+		strintmap_clear(counts);
+	}
+	strmap_partial_clear(dir_rename_count, 1);
+}
+
+MAYBE_UNUSED
+static void clear_dir_rename_count(struct strmap *dir_rename_count)
+{
+	partial_clear_dir_rename_count(dir_rename_count);
+	strmap_clear(dir_rename_count, 1);
+}
+
 static const char *get_basename(const char *filename)
 {
 	/*
diff --git a/diffcore.h b/diffcore.h
index db55d3853071..c6ba64abd198 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -161,6 +161,8 @@ struct diff_filepair *diff_queue(struct diff_queue_struct *,
 				 struct diff_filespec *);
 void diff_q(struct diff_queue_struct *, struct diff_filepair *);
 
+void partial_clear_dir_rename_count(struct strmap *dir_rename_count);
+
 void diffcore_break(struct repository *, int);
 void diffcore_rename(struct diff_options *);
 void diffcore_rename_extended(struct diff_options *options,
diff --git a/merge-ort.c b/merge-ort.c
index c4467e073b45..467404cc0a35 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -351,17 +351,11 @@ static void clear_or_reinit_internal_opts(struct merge_options_internal *opti,
 
 	/* Free memory used by various renames maps */
 	for (i = MERGE_SIDE1; i <= MERGE_SIDE2; ++i) {
-		struct hashmap_iter iter;
-		struct strmap_entry *entry;
-
 		strset_func(&renames->dirs_removed[i]);
 
-		strmap_for_each_entry(&renames->dir_rename_count[i],
-				      &iter, entry) {
-			struct strintmap *counts = entry->value;
-			strintmap_clear(counts);
-		}
-		strmap_func(&renames->dir_rename_count[i], 1);
+		partial_clear_dir_rename_count(&renames->dir_rename_count[i]);
+		if (!reinitialize)
+			strmap_clear(&renames->dir_rename_count[i], 1);
 
 		strmap_func(&renames->dir_renames[i], 0);
 	}
-- 
gitgitgadget



* [PATCH v2 03/10] diffcore-rename: move dir_rename_counts into a dir_rename_info struct
  2021-02-23 23:43 ` [PATCH v2 00/10] Optimization batch 8: use file basenames even more Elijah Newren via GitGitGadget
  2021-02-23 23:43   ` [PATCH v2 01/10] Move computation of dir_rename_count from merge-ort to diffcore-rename Elijah Newren via GitGitGadget
  2021-02-23 23:43   ` [PATCH v2 02/10] diffcore-rename: add functions for clearing dir_rename_count Elijah Newren via GitGitGadget
@ 2021-02-23 23:44   ` Elijah Newren via GitGitGadget
  2021-02-23 23:44   ` [PATCH v2 04/10] diffcore-rename: extend cleanup_dir_rename_info() Elijah Newren via GitGitGadget
                     ` (8 subsequent siblings)
  11 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-23 23:44 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

This is a purely cosmetic change for now, but we will be adding
additional information to the struct and changing where and how it is
set up and used in subsequent patches.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 39 ++++++++++++++++++++++++++-------------
 1 file changed, 26 insertions(+), 13 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 614a8d63012d..7759c9a3a2ed 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -367,6 +367,11 @@ static int find_exact_renames(struct diff_options *options)
 	return renames;
 }
 
+struct dir_rename_info {
+	struct strmap *dir_rename_count;
+	unsigned setup;
+};
+
 static void dirname_munge(char *filename)
 {
 	char *slash = strrchr(filename, '/');
@@ -375,7 +380,7 @@ static void dirname_munge(char *filename)
 	*slash = '\0';
 }
 
-static void increment_count(struct strmap *dir_rename_count,
+static void increment_count(struct dir_rename_info *info,
 			    char *old_dir,
 			    char *new_dir)
 {
@@ -383,20 +388,20 @@ static void increment_count(struct strmap *dir_rename_count,
 	struct strmap_entry *e;
 
 	/* Get the {new_dirs -> counts} mapping using old_dir */
-	e = strmap_get_entry(dir_rename_count, old_dir);
+	e = strmap_get_entry(info->dir_rename_count, old_dir);
 	if (e) {
 		counts = e->value;
 	} else {
 		counts = xmalloc(sizeof(*counts));
 		strintmap_init_with_options(counts, 0, NULL, 1);
-		strmap_put(dir_rename_count, old_dir, counts);
+		strmap_put(info->dir_rename_count, old_dir, counts);
 	}
 
 	/* Increment the count for new_dir */
 	strintmap_incr(counts, new_dir, 1);
 }
 
-static void update_dir_rename_counts(struct strmap *dir_rename_count,
+static void update_dir_rename_counts(struct dir_rename_info *info,
 				     struct strset *dirs_removed,
 				     const char *oldname,
 				     const char *newname)
@@ -450,7 +455,7 @@ static void update_dir_rename_counts(struct strmap *dir_rename_count,
 		}
 
 		if (strset_contains(dirs_removed, old_dir))
-			increment_count(dir_rename_count, old_dir, new_dir);
+			increment_count(info, old_dir, new_dir);
 		else
 			break;
 
@@ -466,12 +471,15 @@ static void update_dir_rename_counts(struct strmap *dir_rename_count,
 	free(new_dir);
 }
 
-static void compute_dir_rename_counts(struct strmap *dir_rename_count,
-				      struct strset *dirs_removed)
+static void compute_dir_rename_counts(struct dir_rename_info *info,
+				      struct strset *dirs_removed,
+				      struct strmap *dir_rename_count)
 {
 	int i;
 
-	/* Set up dir_rename_count */
+	info->setup = 1;
+	info->dir_rename_count = dir_rename_count;
+
 	for (i = 0; i < rename_dst_nr; ++i) {
 		/*
 		 * Make dir_rename_count contain a map of a map:
@@ -480,7 +488,7 @@ static void compute_dir_rename_counts(struct strmap *dir_rename_count,
 		 * the old filename and the new filename and count how many
 		 * times that pairing occurs.
 		 */
-		update_dir_rename_counts(dir_rename_count, dirs_removed,
+		update_dir_rename_counts(info, dirs_removed,
 					 rename_dst[i].p->one->path,
 					 rename_dst[i].p->two->path);
 	}
@@ -499,10 +507,13 @@ void partial_clear_dir_rename_count(struct strmap *dir_rename_count)
 }
 
 MAYBE_UNUSED
-static void clear_dir_rename_count(struct strmap *dir_rename_count)
+static void cleanup_dir_rename_info(struct dir_rename_info *info)
 {
-	partial_clear_dir_rename_count(dir_rename_count);
-	strmap_clear(dir_rename_count, 1);
+	if (!info->setup)
+		return;
+
+	partial_clear_dir_rename_count(info->dir_rename_count);
+	strmap_clear(info->dir_rename_count, 1);
 }
 
 static const char *get_basename(const char *filename)
@@ -791,8 +802,10 @@ void diffcore_rename_extended(struct diff_options *options,
 	int num_destinations, dst_cnt;
 	int num_sources, want_copies;
 	struct progress *progress = NULL;
+	struct dir_rename_info info;
 
 	trace2_region_enter("diff", "setup", options->repo);
+	info.setup = 0;
 	assert(!dir_rename_count || strmap_empty(dir_rename_count));
 	want_copies = (detect_rename == DIFF_DETECT_COPY);
 	if (!minimum_score)
@@ -985,7 +998,7 @@ void diffcore_rename_extended(struct diff_options *options,
 	/*
 	 * Now that renames have been computed, compute dir_rename_count */
 	if (dirs_removed && dir_rename_count)
-		compute_dir_rename_counts(dir_rename_count, dirs_removed);
+		compute_dir_rename_counts(&info, dirs_removed, dir_rename_count);
 
 	/* At this point, we have found some renames and copies and they
 	 * are recorded in rename_dst.  The original list is still in *q.
-- 
gitgitgadget



* [PATCH v2 04/10] diffcore-rename: extend cleanup_dir_rename_info()
  2021-02-23 23:43 ` [PATCH v2 00/10] Optimization batch 8: use file basenames even more Elijah Newren via GitGitGadget
                     ` (2 preceding siblings ...)
  2021-02-23 23:44   ` [PATCH v2 03/10] diffcore-rename: move dir_rename_counts into a dir_rename_info struct Elijah Newren via GitGitGadget
@ 2021-02-23 23:44   ` Elijah Newren via GitGitGadget
  2021-02-24 15:37     ` Derrick Stolee
  2021-02-25  2:16     ` Ævar Arnfjörð Bjarmason
  2021-02-23 23:44   ` [PATCH v2 05/10] diffcore-rename: compute dir_rename_counts in stages Elijah Newren via GitGitGadget
                     ` (7 subsequent siblings)
  11 siblings, 2 replies; 61+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-23 23:44 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

When diffcore_rename_extended() is passed a NULL dir_rename_count, we
will still want to create a temporary one for use by
find_basename_matches(), but have it fully deallocated before
diffcore_rename_extended() returns.  When diffcore_rename_extended() is
passed a non-NULL dir_rename_count, on the other hand, we want to fill
that strmap with appropriate values and return it.  For our interim
purposes, though, we may also add entries corresponding to directories
that cannot have been renamed because they still exist on both sides.

Extend cleanup_dir_rename_info() to handle these two different cases,
cleaning up the relevant bits of information for each case.
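
The two cases can be sketched on a plain array instead of a strmap.  This is
only an analogy under stated assumptions: struct count_entry,
dir_was_removed(), and cleanup_counts() are hypothetical names, the nested
counts are flattened to a single integer, and the array is compacted in
place rather than collecting keys in a string_list for later removal.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flattened stand-in for one old_dir -> counts entry. */
struct count_entry {
	const char *source_dir;
	int count;
};

static int dir_was_removed(const char *const *dirs_removed,
			   size_t nr_removed, const char *dir)
{
	size_t i;

	for (i = 0; i < nr_removed; i++)
		if (!strcmp(dirs_removed[i], dir))
			return 1;
	return 0;
}

/*
 * Compact entries in place and return how many remain.  With keep == 0
 * everything is dropped (the temporary-map case); otherwise only entries
 * whose source directory was removed survive (the caller's-map case).
 */
static size_t cleanup_counts(struct count_entry *entries, size_t nr,
			     const char *const *dirs_removed,
			     size_t nr_removed, int keep)
{
	size_t i, kept = 0;

	if (!keep)
		return 0;
	for (i = 0; i < nr; i++)
		if (dir_was_removed(dirs_removed, nr_removed,
				    entries[i].source_dir))
			entries[kept++] = entries[i];
	return kept;
}
```

The patch itself gathers the keys to drop first and removes them after the
loop, presumably because removing entries from a hash-based map while
iterating over it is unsafe; an array has no such constraint.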

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 38 +++++++++++++++++++++++++++++++++++---
 1 file changed, 35 insertions(+), 3 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 7759c9a3a2ed..aa21d4e7175c 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -507,13 +507,45 @@ void partial_clear_dir_rename_count(struct strmap *dir_rename_count)
 }
 
 MAYBE_UNUSED
-static void cleanup_dir_rename_info(struct dir_rename_info *info)
+static void cleanup_dir_rename_info(struct dir_rename_info *info,
+				    struct strset *dirs_removed,
+				    int keep_dir_rename_count)
 {
+	struct hashmap_iter iter;
+	struct strmap_entry *entry;
+
 	if (!info->setup)
 		return;
 
-	partial_clear_dir_rename_count(info->dir_rename_count);
-	strmap_clear(info->dir_rename_count, 1);
+	if (!keep_dir_rename_count) {
+		partial_clear_dir_rename_count(info->dir_rename_count);
+		strmap_clear(info->dir_rename_count, 1);
+		FREE_AND_NULL(info->dir_rename_count);
+	} else {
+		/*
+		 * Although dir_rename_count was passed in to
+		 * diffcore_rename_extended() and we want to keep it around and
+		 * return it to that caller, we first want to remove any data
+		 * associated with directories that weren't renamed.
+		 */
+		struct string_list to_remove = STRING_LIST_INIT_NODUP;
+		int i;
+
+		strmap_for_each_entry(info->dir_rename_count, &iter, entry) {
+			const char *source_dir = entry->key;
+			struct strintmap *counts = entry->value;
+
+			if (!strset_contains(dirs_removed, source_dir)) {
+				string_list_append(&to_remove, source_dir);
+				strintmap_clear(counts);
+				continue;
+			}
+		}
+		for (i=0; i<to_remove.nr; ++i)
+			strmap_remove(info->dir_rename_count,
+				      to_remove.items[i].string, 1);
+		string_list_clear(&to_remove, 0);
+	}
 }
 
 static const char *get_basename(const char *filename)
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH v2 05/10] diffcore-rename: compute dir_rename_counts in stages
  2021-02-23 23:43 ` [PATCH v2 00/10] Optimization batch 8: use file basenames even more Elijah Newren via GitGitGadget
                     ` (3 preceding siblings ...)
  2021-02-23 23:44   ` [PATCH v2 04/10] diffcore-rename: extend cleanup_dir_rename_info() Elijah Newren via GitGitGadget
@ 2021-02-23 23:44   ` Elijah Newren via GitGitGadget
  2021-02-24 15:43     ` Derrick Stolee
  2021-02-23 23:44   ` [PATCH v2 06/10] diffcore-rename: add a mapping of destination names to their indices Elijah Newren via GitGitGadget
                     ` (6 subsequent siblings)
  11 siblings, 1 reply; 61+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-23 23:44 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

We want to compute dir_rename_counts in stages, starting with just the
exact renames, as that can provide us useful information in
find_basename_matches().  That will give us an incomplete result, which
we can then later augment as basename and inexact rename matches are
found.
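
A minimal standalone sketch of the staged counting, under the
assumption that a flat table is good enough for illustration.  The
names pair_counts and bump() are hypothetical; the real code uses
update_dir_rename_counts() on nested maps:

```c
#include <assert.h>
#include <string.h>

#define MAX_PAIRS 16

/* Hypothetical flat stand-in for old_dir -> {new_dir -> count}. */
struct pair_counts {
	int setup;			/* plays the role of info->setup */
	const char *old_dir[MAX_PAIRS];
	const char *new_dir[MAX_PAIRS];
	int count[MAX_PAIRS];
	int nr;
};

/* Record one file moving from old_dir to new_dir; no-op if not set up. */
static void bump(struct pair_counts *pc, const char *old_dir,
		 const char *new_dir)
{
	if (!pc->setup)
		return;	/* nobody asked for directory rename info */

	for (int i = 0; i < pc->nr; i++) {
		if (!strcmp(pc->old_dir[i], old_dir) &&
		    !strcmp(pc->new_dir[i], new_dir)) {
			pc->count[i]++;
			return;
		}
	}
	assert(pc->nr < MAX_PAIRS);
	pc->old_dir[pc->nr] = old_dir;
	pc->new_dir[pc->nr] = new_dir;
	pc->count[pc->nr++] = 1;
}
```

The same bump() would be called once per exact rename up front, and
then again for each rename found later by basename or inexact matching,
so the counts simply grow as more renames are discovered.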

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 76 ++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 62 insertions(+), 14 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index aa21d4e7175c..489e9cb0871e 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -411,6 +411,28 @@ static void update_dir_rename_counts(struct dir_rename_info *info,
 	char new_dir_first_char = new_dir[0];
 	int first_time_in_loop = 1;
 
+	if (!info->setup)
+		/*
+		 * info->setup is 0 here in two cases: (1) all auxiliary
+		 * vars (like dirs_removed) were NULL so
+		 * initialize_dir_rename_info() returned early, or (2)
+		 * either break detection or copy detection are active so
+		 * that we never called initialize_dir_rename_info().  In
+		 * the former case, we don't have enough info to know if
+		 * directories were renamed (because dirs_removed lets us
+		 * know about a necessary prerequisite, namely if they were
+		 * removed), and in the latter, we don't care about
+		 * directory renames or find_basename_matches.
+		 *
+		 * This matters because both basename and inexact matching
+		 * will also call update_dir_rename_counts().  In either of
+		 * the above two cases info->dir_rename_counts will not
+		 * have been properly initialized which prevents us from
+		 * updating it, but in these two cases we don't care about
+		 * dir_rename_counts anyway, so we can just exit early.
+		 */
+		return;
+
 	while (1) {
 		dirname_munge(old_dir);
 		dirname_munge(new_dir);
@@ -471,14 +493,22 @@ static void update_dir_rename_counts(struct dir_rename_info *info,
 	free(new_dir);
 }
 
-static void compute_dir_rename_counts(struct dir_rename_info *info,
-				      struct strset *dirs_removed,
-				      struct strmap *dir_rename_count)
+static void initialize_dir_rename_info(struct dir_rename_info *info,
+				       struct strset *dirs_removed,
+				       struct strmap *dir_rename_count)
 {
 	int i;
 
+	info->setup = 0;
+	if (!dirs_removed)
+		return;
 	info->setup = 1;
+
 	info->dir_rename_count = dir_rename_count;
+	if (!info->dir_rename_count) {
+		info->dir_rename_count = xmalloc(sizeof(*dir_rename_count));
+		strmap_init(info->dir_rename_count);
+	}
 
 	for (i = 0; i < rename_dst_nr; ++i) {
 		/*
@@ -506,7 +536,6 @@ void partial_clear_dir_rename_count(struct strmap *dir_rename_count)
 	strmap_partial_clear(dir_rename_count, 1);
 }
 
-MAYBE_UNUSED
 static void cleanup_dir_rename_info(struct dir_rename_info *info,
 				    struct strset *dirs_removed,
 				    int keep_dir_rename_count)
@@ -561,7 +590,9 @@ static const char *get_basename(const char *filename)
 }
 
 static int find_basename_matches(struct diff_options *options,
-				 int minimum_score)
+				 int minimum_score,
+				 struct dir_rename_info *info,
+				 struct strset *dirs_removed)
 {
 	/*
 	 * When I checked in early 2020, over 76% of file renames in linux
@@ -669,6 +700,8 @@ static int find_basename_matches(struct diff_options *options,
 				continue;
 			record_rename_pair(dst_index, src_index, score);
 			renames++;
+			update_dir_rename_counts(info, dirs_removed,
+						 one->path, two->path);
 
 			/*
 			 * Found a rename so don't need text anymore; if we
@@ -752,7 +785,12 @@ static int too_many_rename_candidates(int num_destinations, int num_sources,
 	return 1;
 }
 
-static int find_renames(struct diff_score *mx, int dst_cnt, int minimum_score, int copies)
+static int find_renames(struct diff_score *mx,
+			int dst_cnt,
+			int minimum_score,
+			int copies,
+			struct dir_rename_info *info,
+			struct strset *dirs_removed)
 {
 	int count = 0, i;
 
@@ -769,6 +807,9 @@ static int find_renames(struct diff_score *mx, int dst_cnt, int minimum_score, i
 			continue;
 		record_rename_pair(mx[i].dst, mx[i].src, mx[i].score);
 		count++;
+		update_dir_rename_counts(info, dirs_removed,
+					 rename_src[mx[i].src].p->one->path,
+					 rename_dst[mx[i].dst].p->two->path);
 	}
 	return count;
 }
@@ -840,6 +881,8 @@ void diffcore_rename_extended(struct diff_options *options,
 	info.setup = 0;
 	assert(!dir_rename_count || strmap_empty(dir_rename_count));
 	want_copies = (detect_rename == DIFF_DETECT_COPY);
+	if (dirs_removed && (break_idx || want_copies))
+		BUG("dirs_removed incompatible with break/copy detection");
 	if (!minimum_score)
 		minimum_score = DEFAULT_RENAME_SCORE;
 
@@ -931,10 +974,17 @@ void diffcore_rename_extended(struct diff_options *options,
 		remove_unneeded_paths_from_src(want_copies);
 		trace2_region_leave("diff", "cull after exact", options->repo);
 
+		/* Preparation for basename-driven matching. */
+		trace2_region_enter("diff", "dir rename setup", options->repo);
+		initialize_dir_rename_info(&info,
+					   dirs_removed, dir_rename_count);
+		trace2_region_leave("diff", "dir rename setup", options->repo);
+
 		/* Utilize file basenames to quickly find renames. */
 		trace2_region_enter("diff", "basename matches", options->repo);
 		rename_count += find_basename_matches(options,
-						      min_basename_score);
+						      min_basename_score,
+						      &info, dirs_removed);
 		trace2_region_leave("diff", "basename matches", options->repo);
 
 		/*
@@ -1020,18 +1070,15 @@ void diffcore_rename_extended(struct diff_options *options,
 	/* cost matrix sorted by most to least similar pair */
 	STABLE_QSORT(mx, dst_cnt * NUM_CANDIDATE_PER_DST, score_compare);
 
-	rename_count += find_renames(mx, dst_cnt, minimum_score, 0);
+	rename_count += find_renames(mx, dst_cnt, minimum_score, 0,
+				     &info, dirs_removed);
 	if (want_copies)
-		rename_count += find_renames(mx, dst_cnt, minimum_score, 1);
+		rename_count += find_renames(mx, dst_cnt, minimum_score, 1,
+					     &info, dirs_removed);
 	free(mx);
 	trace2_region_leave("diff", "inexact renames", options->repo);
 
  cleanup:
-	/*
-	 * Now that renames have been computed, compute dir_rename_count */
-	if (dirs_removed && dir_rename_count)
-		compute_dir_rename_counts(&info, dirs_removed, dir_rename_count);
-
 	/* At this point, we have found some renames and copies and they
 	 * are recorded in rename_dst.  The original list is still in *q.
 	 */
@@ -1103,6 +1150,7 @@ void diffcore_rename_extended(struct diff_options *options,
 		if (rename_dst[i].filespec_to_free)
 			free_filespec(rename_dst[i].filespec_to_free);
 
+	cleanup_dir_rename_info(&info, dirs_removed, dir_rename_count != NULL);
 	FREE_AND_NULL(rename_dst);
 	rename_dst_nr = rename_dst_alloc = 0;
 	FREE_AND_NULL(rename_src);
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH v2 06/10] diffcore-rename: add a mapping of destination names to their indices
  2021-02-23 23:43 ` [PATCH v2 00/10] Optimization batch 8: use file basenames even more Elijah Newren via GitGitGadget
                     ` (4 preceding siblings ...)
  2021-02-23 23:44   ` [PATCH v2 05/10] diffcore-rename: compute dir_rename_counts in stages Elijah Newren via GitGitGadget
@ 2021-02-23 23:44   ` Elijah Newren via GitGitGadget
  2021-02-23 23:44   ` [PATCH v2 07/10] diffcore-rename: add a dir_rename_guess field to dir_rename_info Elijah Newren via GitGitGadget
                     ` (5 subsequent siblings)
  11 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-23 23:44 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

Add an idx_map member to struct dir_rename_info, which tracks a mapping
of the full filename to the index within rename_dst where that filename
is found.  We will later use this for quickly finding an array entry in
rename_dst given the pathname.
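
To make the idea concrete, here is a minimal standalone sketch.  The
types dst_entry and idx_map and the functions idx_map_build() and
idx_map_get() are hypothetical, and a linear scan stands in for the
strintmap hash lookup:

```c
#include <assert.h>
#include <string.h>

#define MAX_DST 16

/* Hypothetical, simplified destination record. */
struct dst_entry {
	const char *path;
	int is_rename;	/* already paired up by exact rename detection */
};

/* Hypothetical stand-in for the strintmap: path -> index in dst[]. */
struct idx_map {
	const char *key[MAX_DST];
	int value[MAX_DST];
	int nr;
};

/* Only destinations not yet used in a rename get an entry. */
static void idx_map_build(struct idx_map *m, const struct dst_entry *dst,
			  int dst_nr)
{
	m->nr = 0;
	for (int i = 0; i < dst_nr; i++) {
		if (dst[i].is_rename)
			continue;
		assert(m->nr < MAX_DST);
		m->key[m->nr] = dst[i].path;
		m->value[m->nr++] = i;
	}
}

/* Return the index recorded for path, or -1 (the default) if absent. */
static int idx_map_get(const struct idx_map *m, const char *path)
{
	for (int i = 0; i < m->nr; i++)
		if (!strcmp(m->key[i], path))
			return m->value[i];
	return -1;
}
```

The -1 default mirrors the default value the patch passes to
strintmap_init_with_options(), so "not found" and "already renamed"
look the same to the caller.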

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 489e9cb0871e..db569e4a0b0a 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -368,6 +368,7 @@ static int find_exact_renames(struct diff_options *options)
 }
 
 struct dir_rename_info {
+	struct strintmap idx_map;
 	struct strmap *dir_rename_count;
 	unsigned setup;
 };
@@ -509,10 +510,26 @@ static void initialize_dir_rename_info(struct dir_rename_info *info,
 		info->dir_rename_count = xmalloc(sizeof(*dir_rename_count));
 		strmap_init(info->dir_rename_count);
 	}
+	strintmap_init_with_options(&info->idx_map, -1, NULL, 0);
 
+	/*
+	 * Loop setting up both info->idx_map, and doing setup of
+	 * info->dir_rename_count.
+	 */
 	for (i = 0; i < rename_dst_nr; ++i) {
 		/*
-		 * Make dir_rename_count contain a map of a map:
+		 * For non-renamed files, make idx_map contain mapping of
+		 *   filename -> index (index within rename_dst, that is)
+		 */
+		if (!rename_dst[i].is_rename) {
+			char *filename = rename_dst[i].p->two->path;
+			strintmap_set(&info->idx_map, filename, i);
+			continue;
+		}
+
+		/*
+		 * For everything else (i.e. renamed files), make
+		 * dir_rename_count contain a map of a map:
 		 *   old_directory -> {new_directory -> count}
 		 * In other words, for every pair look at the directories for
 		 * the old filename and the new filename and count how many
@@ -546,6 +563,9 @@ static void cleanup_dir_rename_info(struct dir_rename_info *info,
 	if (!info->setup)
 		return;
 
+	/* idx_map */
+	strintmap_clear(&info->idx_map);
+
 	if (!keep_dir_rename_count) {
 		partial_clear_dir_rename_count(info->dir_rename_count);
 		strmap_clear(info->dir_rename_count, 1);
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH v2 07/10] diffcore-rename: add a dir_rename_guess field to dir_rename_info
  2021-02-23 23:43 ` [PATCH v2 00/10] Optimization batch 8: use file basenames even more Elijah Newren via GitGitGadget
                     ` (5 preceding siblings ...)
  2021-02-23 23:44   ` [PATCH v2 06/10] diffcore-rename: add a mapping of destination names to their indices Elijah Newren via GitGitGadget
@ 2021-02-23 23:44   ` Elijah Newren via GitGitGadget
  2021-02-23 23:44   ` [PATCH v2 08/10] diffcore-rename: add a new idx_possible_rename function Elijah Newren via GitGitGadget
                     ` (4 subsequent siblings)
  11 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-23 23:44 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

dir_rename_counts is a mapping of a mapping; in particular, it has
   old_dir => { new_dir => count }
We want a simple mapping of
   old_dir => new_dir
based on which new_dir had the highest count for a given old_dir.
Introduce dir_rename_guess for this purpose.
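
The collapse from counts to a single guess can be sketched in a few
lines of standalone C.  The dir_count struct and best_new_dir() are
hypothetical names, and an array stands in for the inner strintmap:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical entry of the inner map: new_dir -> count. */
struct dir_count {
	const char *new_dir;
	int count;
};

/*
 * Return the new_dir with the highest count, or NULL if there are no
 * entries with a positive count.
 */
static const char *best_new_dir(const struct dir_count *counts, int nr)
{
	int highest = 0;
	const char *best = NULL;

	assert(nr >= 0);
	for (int i = 0; i < nr; i++) {
		if (counts[i].count > highest) {
			highest = counts[i].count;
			best = counts[i].new_dir;
		}
	}
	return best;
}
```

Because the comparison is strict, ties in this sketch go to whichever
entry is seen first.  With a hash map, as in the patch, "first" depends
on iteration order.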

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index db569e4a0b0a..d24f104aa81c 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -369,6 +369,7 @@ static int find_exact_renames(struct diff_options *options)
 
 struct dir_rename_info {
 	struct strintmap idx_map;
+	struct strmap dir_rename_guess;
 	struct strmap *dir_rename_count;
 	unsigned setup;
 };
@@ -381,6 +382,24 @@ static void dirname_munge(char *filename)
 	*slash = '\0';
 }
 
+static const char *get_highest_rename_path(struct strintmap *counts)
+{
+	int highest_count = 0;
+	const char *highest_destination_dir = NULL;
+	struct hashmap_iter iter;
+	struct strmap_entry *entry;
+
+	strintmap_for_each_entry(counts, &iter, entry) {
+		const char *destination_dir = entry->key;
+		intptr_t count = (intptr_t)entry->value;
+		if (count > highest_count) {
+			highest_count = count;
+			highest_destination_dir = destination_dir;
+		}
+	}
+	return highest_destination_dir;
+}
+
 static void increment_count(struct dir_rename_info *info,
 			    char *old_dir,
 			    char *new_dir)
@@ -498,6 +517,8 @@ static void initialize_dir_rename_info(struct dir_rename_info *info,
 				       struct strset *dirs_removed,
 				       struct strmap *dir_rename_count)
 {
+	struct hashmap_iter iter;
+	struct strmap_entry *entry;
 	int i;
 
 	info->setup = 0;
@@ -511,6 +532,7 @@ static void initialize_dir_rename_info(struct dir_rename_info *info,
 		strmap_init(info->dir_rename_count);
 	}
 	strintmap_init_with_options(&info->idx_map, -1, NULL, 0);
+	strmap_init_with_options(&info->dir_rename_guess, NULL, 0);
 
 	/*
 	 * Loop setting up both info->idx_map, and doing setup of
@@ -539,6 +561,23 @@ static void initialize_dir_rename_info(struct dir_rename_info *info,
 					 rename_dst[i].p->one->path,
 					 rename_dst[i].p->two->path);
 	}
+
+	/*
+	 * Now we collapse
+	 *    dir_rename_count: old_directory -> {new_directory -> count}
+	 * down to
+	 *    dir_rename_guess: old_directory -> best_new_directory
+	 * where best_new_directory is the one with the highest count.
+	 */
+	strmap_for_each_entry(info->dir_rename_count, &iter, entry) {
+		/* entry->key is source_dir */
+		struct strintmap *counts = entry->value;
+		char *best_newdir;
+
+		best_newdir = xstrdup(get_highest_rename_path(counts));
+		strmap_put(&info->dir_rename_guess, entry->key,
+			   best_newdir);
+	}
 }
 
 void partial_clear_dir_rename_count(struct strmap *dir_rename_count)
@@ -566,6 +605,9 @@ static void cleanup_dir_rename_info(struct dir_rename_info *info,
 	/* idx_map */
 	strintmap_clear(&info->idx_map);
 
+	/* dir_rename_guess */
+	strmap_clear(&info->dir_rename_guess, 1);
+
 	if (!keep_dir_rename_count) {
 		partial_clear_dir_rename_count(info->dir_rename_count);
 		strmap_clear(info->dir_rename_count, 1);
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH v2 08/10] diffcore-rename: add a new idx_possible_rename function
  2021-02-23 23:43 ` [PATCH v2 00/10] Optimization batch 8: use file basenames even more Elijah Newren via GitGitGadget
                     ` (6 preceding siblings ...)
  2021-02-23 23:44   ` [PATCH v2 07/10] diffcore-rename: add a dir_rename_guess field to dir_rename_info Elijah Newren via GitGitGadget
@ 2021-02-23 23:44   ` Elijah Newren via GitGitGadget
  2021-02-24 17:35     ` Derrick Stolee
  2021-02-23 23:44   ` [PATCH v2 09/10] diffcore-rename: limit dir_rename_counts computation to relevant dirs Elijah Newren via GitGitGadget
                     ` (3 subsequent siblings)
  11 siblings, 1 reply; 61+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-23 23:44 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

find_basename_matches() is great when both the remaining set of possible
rename sources and the remaining set of possible rename destinations
have exactly one file each with a given basename.  It allows us to match
up files that have been moved to different directories without changing
filenames.

When basenames are not unique, though, we want to be able to guess which
directories the source files have been moved to.  Since this is the job
of directory rename detection, we employ it.  However, since it is a
directory rename detection idea, we also limit it to cases where we know
there could have been a directory rename, i.e. where the source
directory has been removed.  This has to be signalled by dirs_removed
being non-NULL and containing an entry for the relevant directory.
Since merge-ort.c is the only caller that currently does so, this
optimization is only effective for merge-ort right now.  In the future,
this condition could be reconsidered or we could modify other callers to
pass the necessary strset.

Anyway, that's a lot of background so that we can actually describe the
new function.  Add an idx_possible_rename() function which combines the
recently added dir_rename_guess and idx_map fields to provide the index
within rename_dst of a potential match for a given file.

Future commits will add checks after calling this function to compare
the resulting 'likely rename' candidates to see if the two files meet
the elevated min_basename_score threshold for marking them as actual
renames.
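
For readers who want to see the lookup in isolation, here is a minimal
standalone sketch of the same idea.  The dir_guess struct and
possible_rename_idx() are hypothetical, arrays with linear scans stand
in for dir_rename_guess and idx_map, and a fixed-size buffer replaces
the allocating helpers:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical entry of dir_rename_guess: old_dir -> new_dir. */
struct dir_guess {
	const char *old_dir;
	const char *new_dir;
};

/*
 * Apply the guessed directory rename to filename and return the index
 * of the resulting path within dst[], or -1 if there is no guess or no
 * such destination.
 */
static int possible_rename_idx(const char *filename,
			       const struct dir_guess *guess, int guess_nr,
			       const char **dst, int dst_nr)
{
	const char *slash = strrchr(filename, '/');
	size_t dir_len = slash ? (size_t)(slash - filename) : 0;
	const char *base = slash ? slash + 1 : filename;
	char new_path[256];
	int i, n;

	/* Find the guess for this file's directory. */
	for (i = 0; i < guess_nr; i++)
		if (strlen(guess[i].old_dir) == dir_len &&
		    !strncmp(guess[i].old_dir, filename, dir_len))
			break;
	if (i == guess_nr)
		return -1;	/* no guess for this directory */

	n = snprintf(new_path, sizeof(new_path), "%s/%s",
		     guess[i].new_dir, base);
	if (n < 0 || (size_t)n >= sizeof(new_path))
		return -1;	/* too long for this sketch's buffer */

	for (i = 0; i < dst_nr; i++)
		if (!strcmp(dst[i], new_path))
			return i;
	return -1;
}
```

Given a guess of old/sub => new/sub, the file old/sub/Makefile maps to
new/sub/Makefile; the caller would then still need to compare contents
before recording a rename.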

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 81 insertions(+)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index d24f104aa81c..1e4a56adde2c 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -374,6 +374,12 @@ struct dir_rename_info {
 	unsigned setup;
 };
 
+static char *get_dirname(const char *filename)
+{
+	char *slash = strrchr(filename, '/');
+	return slash ? xstrndup(filename, slash-filename) : xstrdup("");
+}
+
 static void dirname_munge(char *filename)
 {
 	char *slash = strrchr(filename, '/');
@@ -651,6 +657,81 @@ static const char *get_basename(const char *filename)
 	return base ? base + 1 : filename;
 }
 
+MAYBE_UNUSED
+static int idx_possible_rename(char *filename, struct dir_rename_info *info)
+{
+	/*
+	 * Our comparison of files with the same basename (see
+	 * find_basename_matches() below), is only helpful when after exact
+	 * rename detection we have exactly one file with a given basename
+	 * among the rename sources and also only exactly one file with
+	 * that basename among the rename destinations.  When we have
+	 * multiple files with the same basename in either set, we do not
+	 * know which to compare against.  However, there are some
+	 * filenames that occur in large numbers (particularly
+	 * build-related filenames such as 'Makefile', '.gitignore', or
+	 * 'build.gradle' that potentially exist within every single
+	 * subdirectory), and for performance we want to be able to quickly
+	 * find renames for these files too.
+	 *
+	 * The reason basename comparisons are a useful heuristic is that it
+	 * is common for people to move files across directories while keeping
+	 * their filename the same.  If we had a way of determining or even
+	 * making a good educated guess about which directory these non-unique
+	 * basename files had been moved to, we could check it.
+	 * Luckily...
+	 *
+	 * When an entire directory is in fact renamed, we have two factors
+	 * helping us out:
+	 *   (a) the original directory disappeared giving us a hint
+	 *       about when we can apply an extra heuristic.
+	 *   (b) we often have several files within that directory and
+	 *       subdirectories that are renamed without changes
+	 * So, rules for a heuristic:
+	 *   (0) If the basename matches are non-unique (the condition under
+	 *       which this function is called) AND
+	 *   (1) the directory in which the file was found has disappeared
+	 *       (i.e. dirs_removed is non-NULL and has a relevant entry) THEN
+	 *   (2) use exact renames of files within the directory to determine
+	 *       where the directory is likely to have been renamed to.  IF
+	 *       there is at least one exact rename from within that
+	 *       directory, we can proceed.
+	 *   (3) If there are multiple places the directory could have been
+	 *       renamed to based on exact renames, ignore all but one of them.
+	 *       Just use the destination with the most renames going to it.
+	 *   (4) Check if applying that directory rename to the original file
+	 *       would result in a destination filename that is in the
+	 *       potential rename set.  If so, return the index of the
+	 *       destination file (the index within rename_dst).
+	 *   (5) Compare the original file and returned destination for
+	 *       similarity, and if they are sufficiently similar, record the
+	 *       rename.
+	 *
+	 * This function, idx_possible_rename(), is only responsible for (4).
+	 * The conditions/steps in (1)-(3) are handled via setting up
+	 * dir_rename_count and dir_rename_guess in
+	 * initialize_dir_rename_info().  Steps (0) and (5) are handled by
+	 * the caller of this function.
+	 */
+	char *old_dir, *new_dir, *new_path;
+	int idx;
+
+	if (!info->setup)
+		return -1;
+
+	old_dir = get_dirname(filename);
+	new_dir = strmap_get(&info->dir_rename_guess, old_dir);
+	free(old_dir);
+	if (!new_dir)
+		return -1;
+
+	new_path = xstrfmt("%s/%s", new_dir, get_basename(filename));
+
+	idx = strintmap_get(&info->idx_map, new_path);
+	free(new_path);
+	return idx;
+}
+
 static int find_basename_matches(struct diff_options *options,
 				 int minimum_score,
 				 struct dir_rename_info *info,
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH v2 09/10] diffcore-rename: limit dir_rename_counts computation to relevant dirs
  2021-02-23 23:43 ` [PATCH v2 00/10] Optimization batch 8: use file basenames even more Elijah Newren via GitGitGadget
                     ` (7 preceding siblings ...)
  2021-02-23 23:44   ` [PATCH v2 08/10] diffcore-rename: add a new idx_possible_rename function Elijah Newren via GitGitGadget
@ 2021-02-23 23:44   ` Elijah Newren via GitGitGadget
  2021-02-23 23:44   ` [PATCH v2 10/10] diffcore-rename: use directory rename guided basename comparisons Elijah Newren via GitGitGadget
                     ` (2 subsequent siblings)
  11 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-23 23:44 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

We are using dir_rename_counts to count, for each directory, how many of
its files moved to each other directory.  We only need this information
for directories that disappeared, though, so we can return early from
update_dir_rename_counts() for other paths.

While dirs_removed provides the relevant information for us right now,
we introduce a new info->relevant_source_dirs field because future
optimizations will want to change how things are called somewhat.
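
The early exit can be pictured with a small standalone sketch.  The
helpers to_parent(), in_set() and count_relevant_levels() are
hypothetical, and the loop is simplified: the real
update_dir_rename_counts() has further termination conditions that are
left out here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Truncate path to its parent directory ("" when no '/' remains). */
static void to_parent(char *path)
{
	char *slash = strrchr(path, '/');

	if (!slash)
		slash = path;
	*slash = '\0';
}

static int in_set(const char *dir, const char **set, int nr)
{
	for (int i = 0; i < nr; i++)
		if (!strcmp(dir, set[i]))
			return 1;
	return 0;
}

/*
 * Walk up from old_path and count how many directory levels would be
 * processed, stopping at the first directory that is not relevant.
 * A NULL relevant set means "no filtering".
 */
static int count_relevant_levels(const char *old_path,
				 const char **relevant, int nr)
{
	char buf[256];
	int levels = 0;

	assert(strlen(old_path) < sizeof(buf));
	strcpy(buf, old_path);
	while (buf[0]) {
		to_parent(buf);
		if (relevant && !in_set(buf, relevant, nr))
			break;
		levels++;
	}
	return levels;
}
```

Stopping at the first irrelevant directory is enough in the
dirs_removed case: if a/b still exists, then a still exists too, so
there is nothing to gain from looking at the remaining ancestors.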

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 1e4a56adde2c..5de4497e04fa 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -371,6 +371,7 @@ struct dir_rename_info {
 	struct strintmap idx_map;
 	struct strmap dir_rename_guess;
 	struct strmap *dir_rename_count;
+	struct strset *relevant_source_dirs;
 	unsigned setup;
 };
 
@@ -460,7 +461,13 @@ static void update_dir_rename_counts(struct dir_rename_info *info,
 		return;
 
 	while (1) {
+		/* Get old_dir, skip if its directory isn't relevant. */
 		dirname_munge(old_dir);
+		if (info->relevant_source_dirs &&
+		    !strset_contains(info->relevant_source_dirs, old_dir))
+			break;
+
+		/* Get new_dir */
 		dirname_munge(new_dir);
 
 		/*
@@ -540,6 +547,9 @@ static void initialize_dir_rename_info(struct dir_rename_info *info,
 	strintmap_init_with_options(&info->idx_map, -1, NULL, 0);
 	strmap_init_with_options(&info->dir_rename_guess, NULL, 0);
 
+	/* Setup info->relevant_source_dirs */
+	info->relevant_source_dirs = dirs_removed;
+
 	/*
 	 * Loop setting up both info->idx_map, and doing setup of
 	 * info->dir_rename_count.
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH v2 10/10] diffcore-rename: use directory rename guided basename comparisons
  2021-02-23 23:43 ` [PATCH v2 00/10] Optimization batch 8: use file basenames even more Elijah Newren via GitGitGadget
                     ` (8 preceding siblings ...)
  2021-02-23 23:44   ` [PATCH v2 09/10] diffcore-rename: limit dir_rename_counts computation to relevant dirs Elijah Newren via GitGitGadget
@ 2021-02-23 23:44   ` Elijah Newren via GitGitGadget
  2021-02-24 17:44     ` Derrick Stolee
  2021-02-24 17:50   ` [PATCH v2 00/10] Optimization batch 8: use file basenames even more Derrick Stolee
  2021-02-26  1:58   ` [PATCH v3 " Elijah Newren via GitGitGadget
  11 siblings, 1 reply; 61+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-23 23:44 UTC (permalink / raw)
  To: git; +Cc: Elijah Newren, Elijah Newren

From: Elijah Newren <newren@gmail.com>

Hook the work from the last several patches together so that when
basenames in the sets of possible remaining rename sources or
destinations aren't unique, we can guess which directory source files
were renamed into.  When that guess gives us a pairing of files, and
those files are sufficiently similar, we record the two files as a
rename and remove them from the large matrix of comparisons for inexact
rename detection.

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:

                            Before                  After
    no-renames:       12.775 s ±  0.062 s    12.596 s ±  0.061 s
    mega-renames:    188.754 s ±  0.284 s   130.465 s ±  0.259 s
    just-one-mega:     5.599 s ±  0.019 s     3.958 s ±  0.010 s

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 Documentation/gitdiffcore.txt |  2 +-
 diffcore-rename.c             | 32 +++++++++++++++++++++++---------
 2 files changed, 24 insertions(+), 10 deletions(-)

diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt
index 80fcf9542441..8673a5c5b2f2 100644
--- a/Documentation/gitdiffcore.txt
+++ b/Documentation/gitdiffcore.txt
@@ -186,7 +186,7 @@ mark a file pair as a rename and stop considering other candidates for
 better matches.  At most, one comparison is done per file in this
 preliminary pass; so if there are several remaining ext.txt files
 throughout the directory hierarchy after exact rename detection, this
-preliminary step will be skipped for those files.
+preliminary step may be skipped for those files.
 
 Note.  When the "-C" option is used with `--find-copies-harder`
 option, 'git diff-{asterisk}' commands feed unmodified filepairs to
diff --git a/diffcore-rename.c b/diffcore-rename.c
index 5de4497e04fa..70a484b9b63e 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -667,7 +667,6 @@ static const char *get_basename(const char *filename)
 	return base ? base + 1 : filename;
 }
 
-MAYBE_UNUSED
 static int idx_possible_rename(char *filename, struct dir_rename_info *info)
 {
 	/*
@@ -780,8 +779,6 @@ static int find_basename_matches(struct diff_options *options,
 	int i, renames = 0;
 	struct strintmap sources;
 	struct strintmap dests;
-	struct hashmap_iter iter;
-	struct strmap_entry *entry;
 
 	/*
 	 * The prefeteching stuff wants to know if it can skip prefetching
@@ -831,17 +828,34 @@ static int find_basename_matches(struct diff_options *options,
 	}
 
 	/* Now look for basename matchups and do similarity estimation */
-	strintmap_for_each_entry(&sources, &iter, entry) {
-		const char *base = entry->key;
-		intptr_t src_index = (intptr_t)entry->value;
+	for (i = 0; i < rename_src_nr; ++i) {
+		char *filename = rename_src[i].p->one->path;
+		const char *base = NULL;
+		intptr_t src_index;
 		intptr_t dst_index;
-		if (src_index == -1)
-			continue;
 
-		if (0 <= (dst_index = strintmap_get(&dests, base))) {
+		/* Is this basename unique among remaining sources? */
+		base = get_basename(filename);
+		src_index = strintmap_get(&sources, base);
+		assert(src_index == -1 || src_index == i);
+
+		if (strintmap_contains(&dests, base)) {
 			struct diff_filespec *one, *two;
 			int score;
 
+			/* Find a matching destination, if possible */
+			dst_index = strintmap_get(&dests, base);
+			if (src_index == -1 || dst_index == -1) {
+				src_index = i;
+				dst_index = idx_possible_rename(filename, info);
+			}
+			if (dst_index == -1)
+				continue;
+
+			/* Ignore this dest if already used in a rename */
+			if (rename_dst[dst_index].is_rename)
+				continue; /* already used previously */
+
 			/* Estimate the similarity */
 			one = rename_src[src_index].p->one;
 			two = rename_dst[dst_index].p->two;
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH v2 01/10] Move computation of dir_rename_count from merge-ort to diffcore-rename
  2021-02-23 23:43   ` [PATCH v2 01/10] Move computation of dir_rename_count from merge-ort to diffcore-rename Elijah Newren via GitGitGadget
@ 2021-02-24 15:25     ` Derrick Stolee
  2021-02-24 18:50       ` Elijah Newren
  0 siblings, 1 reply; 61+ messages in thread
From: Derrick Stolee @ 2021-02-24 15:25 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget, git; +Cc: Elijah Newren

On 2/23/2021 6:43 PM, Elijah Newren via GitGitGadget wrote:
> ... While the
> diffstat looks large, viewing this commit with --color-moved makes it
> clear that only about 20 lines changed.  With this patch, the
> computation of dir_rename_count is still only done after inexact rename
> detection, but subsequent commits will add a preliminary computation of
> dir_rename_count after exact rename detection, followed by some updates
> after inexact rename detection.

The --color-moved recommendation is a good one. Everything seems to be
pretty standard, except this function:

> +static void compute_dir_rename_counts(struct strmap *dir_rename_count,
> +				      struct strset *dirs_removed)
> +{
> +	int i;
> +
> +	/* Set up dir_rename_count */
> +	for (i = 0; i < rename_dst_nr; ++i) {
> +		/*
> +		 * Make dir_rename_count contain a map of a map:
> +		 *   old_directory -> {new_directory -> count}
> +		 * In other words, for every pair look at the directories for
> +		 * the old filename and the new filename and count how many
> +		 * times that pairing occurs.
> +		 */
> +		update_dir_rename_counts(dir_rename_count, dirs_removed,
> +					 rename_dst[i].p->one->path,
> +					 rename_dst[i].p->two->path);
> +	}
> +}

is very similar to this:

> -static void compute_rename_counts(struct diff_queue_struct *pairs,
> -				  struct strmap *dir_rename_count,
> -				  struct strset *dirs_removed)
> -{
> -	int i;
> -
> -	for (i = 0; i < pairs->nr; ++i) {
> -		struct diff_filepair *pair = pairs->queue[i];
> -
> -		/* File not part of directory rename if it wasn't renamed */
> -		if (pair->status != 'R')
> -			continue;
> -
> -		/*
> -		 * Make dir_rename_count contain a map of a map:
> -		 *   old_directory -> {new_directory -> count}
> -		 * In other words, for every pair look at the directories for
> -		 * the old filename and the new filename and count how many
> -		 * times that pairing occurs.
> -		 */
> -		update_dir_rename_counts(dir_rename_count, dirs_removed,
> -					 pair->one->path,
> -					 pair->two->path);
> -	}
> -}
> -

but we dropped that "File not part of directory rename" check.

It seems that is no longer possible to use with the new data structure,
but I wonder if this will cause a slowdown in the directory renames when
merging? Or, has the data already been filtered before calling
compute_dir_rename_counts()?

Thanks,
-Stolee

* Re: [PATCH v2 04/10] diffcore-rename: extend cleanup_dir_rename_info()
  2021-02-23 23:44   ` [PATCH v2 04/10] diffcore-rename: extend cleanup_dir_rename_info() Elijah Newren via GitGitGadget
@ 2021-02-24 15:37     ` Derrick Stolee
  2021-02-25  2:16     ` Ævar Arnfjörð Bjarmason
  1 sibling, 0 replies; 61+ messages in thread
From: Derrick Stolee @ 2021-02-24 15:37 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget, git; +Cc: Elijah Newren

On 2/23/2021 6:44 PM, Elijah Newren via GitGitGadget wrote:
> +		for (i=0; i<to_remove.nr; ++i)

nit: "i < to_remove.nr"

I'm really stretching to find something valuable to say about
these well-constructed patches.

-Stolee

* Re: [PATCH v2 05/10] diffcore-rename: compute dir_rename_counts in stages
  2021-02-23 23:44   ` [PATCH v2 05/10] diffcore-rename: compute dir_rename_counts in stages Elijah Newren via GitGitGadget
@ 2021-02-24 15:43     ` Derrick Stolee
  0 siblings, 0 replies; 61+ messages in thread
From: Derrick Stolee @ 2021-02-24 15:43 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget, git; +Cc: Elijah Newren

On 2/23/2021 6:44 PM, Elijah Newren via GitGitGadget wrote:
> +	info->setup = 0;
> +	if (!dirs_removed)
> +		return;
>  	info->setup = 1;

This would probably be clearer as

	if (!dirs_removed) {
		info->setup = 0;
		return;
	}
	info->setup = 1;

> -MAYBE_UNUSED

Good cleanup.

> @@ -931,10 +974,17 @@ void diffcore_rename_extended(struct diff_options *options,
>  		remove_unneeded_paths_from_src(want_copies);
>  		trace2_region_leave("diff", "cull after exact", options->repo);
>  
> +		/* Preparation for basename-driven matching. */
> +		trace2_region_enter("diff", "dir rename setup", options->repo);
> +		initialize_dir_rename_info(&info,
> +					   dirs_removed, dir_rename_count);
> +		trace2_region_leave("diff", "dir rename setup", options->repo);
> +

The parts visible in this context are pretty trivial, but this
method _is_ doing a lot of work. Good to mark it with a trace
region so we can identify if/when it is a problem.

Thanks,
-Stolee

* Re: [PATCH v2 08/10] diffcore-rename: add a new idx_possible_rename function
  2021-02-23 23:44   ` [PATCH v2 08/10] diffcore-rename: add a new idx_possible_rename function Elijah Newren via GitGitGadget
@ 2021-02-24 17:35     ` Derrick Stolee
  2021-02-25  1:13       ` Elijah Newren
  0 siblings, 1 reply; 61+ messages in thread
From: Derrick Stolee @ 2021-02-24 17:35 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget, git; +Cc: Elijah Newren

On 2/23/2021 6:44 PM, Elijah Newren via GitGitGadget wrote:
> +static char *get_dirname(const char *filename)
> +{
> +	char *slash = strrchr(filename, '/');
> +	return slash ? xstrndup(filename, slash-filename) : xstrdup("");

My brain interpreted "slash-filename" as a single token on first
read, which confused me briefly. Inserting spaces would help
readers like me.
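
For illustration, the same helper with the subtraction spaced out might look
like the sketch below. It is a standalone approximation that uses only libc
instead of Git's xstrndup()/xstrdup() wrappers, and the name dirname_copy()
is made up for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return a newly allocated copy of the directory part of filename
 * (everything before the last '/'), or "" if there is no '/'.
 * The caller owns the result and must free() it.
 */
static char *dirname_copy(const char *filename)
{
	const char *slash;
	size_t len;
	char *dir;

	assert(filename);
	slash = strrchr(filename, '/');
	len = slash ? (size_t)(slash - filename) : 0;

	dir = malloc(len + 1);
	if (!dir)
		abort(); /* stand-in for the die-on-failure x*() wrappers */
	memcpy(dir, filename, len);
	dir[len] = '\0';
	return dir;
}
```

Computing the length into a named variable also keeps the pointer
arithmetic out of the return statement.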

> +	 *   (4) Check if applying that directory rename to the original file
> +	 *       would result in a destination filename that is in the
> +	 *       potential rename set.  If so, return the index of the
> +	 *       destination file (the index within rename_dst).

> +	 * This function, idx_possible_rename(), is only responsible for (4).

This helps isolate the important step to care about for the implementation,
while the rest of the context is important, too.

> +	char *old_dir, *new_dir, *new_path;
> +	int idx;
> +
> +	if (!info->setup)
> +		return -1;
> +
> +	old_dir = get_dirname(filename);
> +	new_dir = strmap_get(&info->dir_rename_guess, old_dir);
> +	free(old_dir);
> +	if (!new_dir)
> +		return -1;
> +
> +	new_path = xstrfmt("%s/%s", new_dir, get_basename(filename));

This is running in a loop, so `xstrfmt()` might be overkill compared
to something like

	strbuf_addstr(&new_path, new_dir);
	strbuf_addch(&new_path, '/');
	strbuf_addstr(&new_path, get_basename(filename));

but maybe the difference is too small to notice. (notice the type
change to "struct strbuf new_path = STRBUF_INIT;")

> +
> +	idx = strintmap_get(&info->idx_map, new_path);
> +	free(new_path);
> +	return idx;
> +}

Does what it says it does.

Thanks,
-Stolee

* Re: [PATCH v2 10/10] diffcore-rename: use directory rename guided basename comparisons
  2021-02-23 23:44   ` [PATCH v2 10/10] diffcore-rename: use directory rename guided basename comparisons Elijah Newren via GitGitGadget
@ 2021-02-24 17:44     ` Derrick Stolee
  0 siblings, 0 replies; 61+ messages in thread
From: Derrick Stolee @ 2021-02-24 17:44 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget, git; +Cc: Elijah Newren

On 2/23/2021 6:44 PM, Elijah Newren via GitGitGadget wrote:
> For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
> performance work; instrument with trace2_region_* calls", 2020-10-28),
> this change improves the performance as follows:
> 
>                             Before                  After
>     no-renames:       12.775 s ±  0.062 s    12.596 s ±  0.061 s
>     mega-renames:    188.754 s ±  0.284 s   130.465 s ±  0.259 s
>     just-one-mega:     5.599 s ±  0.019 s     3.958 s ±  0.010 s

Hooray!

> +	for (i = 0; i < rename_src_nr; ++i) {
> +		char *filename = rename_src[i].p->one->path;
> +		const char *base = NULL;
> +		intptr_t src_index;
>  		intptr_t dst_index;
>
> +		/* Is this basename unique among remaining sources? */

This comment sent me down a confusing direction. Perhaps, we can
instead say:

	/*
	 * If the basename is unique among remaining sources, then
	 * src_index will equal 'i' and we can attempt to match it
	 * to a unique basename in the destinations. Otherwise, use
	 * directory rename heuristics, if possible.
	 */

> +		base = get_basename(filename);
> +		src_index = strintmap_get(&sources, base);
> +		assert(src_index == -1 || src_index == i);
> +
> +		if (strintmap_contains(&dests, base)) {
>  			struct diff_filespec *one, *two;
>  			int score;
>  
> +			/* Find a matching destination, if possible */
> +			dst_index = strintmap_get(&dests, base);
> +			if (src_index == -1 || dst_index == -1) {
> +				src_index = i;
> +				dst_index = idx_possible_rename(filename, info);
> +			}

It is important that 'src_index == i' from this point on, no
matter whether it was unique or not.
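
The index selection in the hunk above can be modelled in isolation roughly as
follows. This is only a sketch: choose_dst() is a hypothetical name, the
"guess" parameter stands in for the result of idx_possible_rename(), and -1
is assumed to mean "no unique match" for both lookups.

```c
#include <assert.h>

/*
 * Model of the destination choice for source number i.
 *   src_index, dst_index: results of the basename lookups
 *                         (-1 assumed to mean "no unique match").
 *   guess:                result of a directory-rename-based lookup
 *                         (-1 if it found nothing).
 * Returns the destination index to compare against, or -1 for none.
 */
static int choose_dst(int i, int src_index, int dst_index, int guess)
{
	/* A unique basename always maps back to the current source. */
	assert(src_index == -1 || src_index == i);

	/* Not unique on either side: fall back to the directory guess. */
	if (src_index == -1 || dst_index == -1)
		dst_index = guess;
	return dst_index;
}
```

Either way the caller ends up pairing source i with the returned
destination, which is why src_index can simply be set to i on the fallback
path.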

> +			if (dst_index == -1)
> +				continue;
> +
> +			/* Ignore this dest if already used in a rename */
> +			if (rename_dst[dst_index].is_rename)
> +				continue; /* already used previously */
> +

This seems to match all of the complicated special cases.

Thanks,
-Stolee

* Re: [PATCH v2 00/10] Optimization batch 8: use file basenames even more
  2021-02-23 23:43 ` [PATCH v2 00/10] Optimization batch 8: use file basenames even more Elijah Newren via GitGitGadget
                     ` (9 preceding siblings ...)
  2021-02-23 23:44   ` [PATCH v2 10/10] diffcore-rename: use directory rename guided basename comparisons Elijah Newren via GitGitGadget
@ 2021-02-24 17:50   ` Derrick Stolee
  2021-02-25  1:38     ` Elijah Newren
  2021-02-26  1:58   ` [PATCH v3 " Elijah Newren via GitGitGadget
  11 siblings, 1 reply; 61+ messages in thread
From: Derrick Stolee @ 2021-02-24 17:50 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget, git; +Cc: Elijah Newren

On 2/23/2021 6:43 PM, Elijah Newren via GitGitGadget wrote:
> This series depends on en/diffcore-rename (a concatenation of what I was
> calling ort-perf-batch-6 and ort-perf-batch-7).
> 
> There are no changes since v1; it's just a resend a week and a half later to
> bump it so it isn't lost.

Thank you for re-sending. I intended to review it before but got redirected
and forgot to pick it up again.

> === Results ===
> 
> For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
> performance work; instrument with trace2_region_* calls", 2020-10-28), the
> changes in just this series improve the performance as follows:
> 
>                      Before Series           After Series
> no-renames:       12.775 s ±  0.062 s    12.596 s ±  0.061 s
> mega-renames:    188.754 s ±  0.284 s   130.465 s ±  0.259 s
> just-one-mega:     5.599 s ±  0.019 s     3.958 s ±  0.010 s
> 
> 
> As a reminder, before any merge-ort/diffcore-rename performance work, the
> performance results we started with (as noted in the same commit message)
> were:
> 
> no-renames-am:      6.940 s ±  0.485 s
> no-renames:        18.912 s ±  0.174 s
> mega-renames:    5964.031 s ± 10.459 s
> just-one-mega:    149.583 s ±  0.751 s

These are good results.

I reviewed the patches and believe they do the optimizations claimed. I
only found some nits for comments and whitespace things.

You are very careful to create the necessary pieces and connect them
from the bottom-up. However, this leads to one big "now everything is
done" commit with performance improvements. It seems that there are
some smaller performance improvements that could be measured if the
logic was instead built from the top-down with stubs for the complicated
logic.

For example, the final patch links the rename logic with a call to
idx_possible_rename(). But, that could just as well always return -1
and the implementation would be correct. Then, it would be good to see
if the performance changes with that non-functional update. It would
also help me read the series in patch order and understand the context
of the methods a bit better before seeing their implementation.
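
A stub of the kind described here could be as small as the sketch below. The
parameter list is an assumption inferred from the call site quoted in the
final patch, and the _stub suffix is only there to mark it as a placeholder.

```c
#include <assert.h>
#include <stddef.h>

/* Opaque here; the real struct is defined in diffcore-rename.c. */
struct dir_rename_info;

/*
 * Placeholder that reports "no directory-rename-guided candidate" for
 * every file, so callers fall through to their existing behavior.
 */
static int idx_possible_rename_stub(const char *filename,
				    struct dir_rename_info *info)
{
	(void)filename;
	(void)info;
	return -1;
}
```

With such a stub in place, the caller can be wired up and measured first,
and the real lookup added in a later patch.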

This is _not_ a recommendation that you rewrite the series. Just food
for thought as we continue with similar enhancements in the future.

Thanks,
-Stolee

* Re: [PATCH v2 01/10] Move computation of dir_rename_count from merge-ort to diffcore-rename
  2021-02-24 15:25     ` Derrick Stolee
@ 2021-02-24 18:50       ` Elijah Newren
  0 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren @ 2021-02-24 18:50 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Elijah Newren via GitGitGadget, Git Mailing List

On Wed, Feb 24, 2021 at 7:25 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 2/23/2021 6:43 PM, Elijah Newren via GitGitGadget wrote:
> > ... While the
> > diffstat looks large, viewing this commit with --color-moved makes it
> > clear that only about 20 lines changed.  With this patch, the
> > computation of dir_rename_count is still only done after inexact rename
> > detection, but subsequent commits will add a preliminary computation of
> > dir_rename_count after exact rename detection, followed by some updates
> > after inexact rename detection.
>
> The --color-moved recommendation is a good one. Everything seems to be
> pretty standard, except this function:
>
> > +static void compute_dir_rename_counts(struct strmap *dir_rename_count,
> > +                                   struct strset *dirs_removed)
> > +{
> > +     int i;
> > +
> > +     /* Set up dir_rename_count */
> > +     for (i = 0; i < rename_dst_nr; ++i) {
> > +             /*
> > +              * Make dir_rename_count contain a map of a map:
> > +              *   old_directory -> {new_directory -> count}
> > +              * In other words, for every pair look at the directories for
> > +              * the old filename and the new filename and count how many
> > +              * times that pairing occurs.
> > +              */
> > +             update_dir_rename_counts(dir_rename_count, dirs_removed,
> > +                                      rename_dst[i].p->one->path,
> > +                                      rename_dst[i].p->two->path);
> > +     }
> > +}
>
> is very similar to this:
>
> > -static void compute_rename_counts(struct diff_queue_struct *pairs,
> > -                               struct strmap *dir_rename_count,
> > -                               struct strset *dirs_removed)
> > -{
> > -     int i;
> > -
> > -     for (i = 0; i < pairs->nr; ++i) {
> > -             struct diff_filepair *pair = pairs->queue[i];
> > -
> > -             /* File not part of directory rename if it wasn't renamed */
> > -             if (pair->status != 'R')
> > -                     continue;
> > -
> > -             /*
> > -              * Make dir_rename_count contain a map of a map:
> > -              *   old_directory -> {new_directory -> count}
> > -              * In other words, for every pair look at the directories for
> > -              * the old filename and the new filename and count how many
> > -              * times that pairing occurs.
> > -              */
> > -             update_dir_rename_counts(dir_rename_count, dirs_removed,
> > -                                      pair->one->path,
> > -                                      pair->two->path);
> > -     }
> > -}
> > -
>
> but we dropped that "File not part of directory rename" check.
>
> It seems that is no longer possible to use with the new data structure,
> but I wonder if this will cause a slowdown in the directory renames when
> merging? Or, has the data already been filtered before calling
> compute_dir_rename_counts()?

Oh, indeed, good catch.  The
    if (pair->status != 'R')
        continue;
check should have been translated to
    if (!rename_dst[i].is_rename)
        continue;
Such a check _was_ added later in the series except with more code
than just the "continue" that didn't make sense at this stage; the
extra code made me overlook it when I was splitting.  Oops.  I'll add
this check back in; thanks for catching.
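
As a standalone illustration of why the filter matters, the sketch below
counts directory pairings while skipping entries that were never marked as
renames. It is not the code from the series: struct dst_entry and
count_dir_renames() are hypothetical stand-ins for rename_dst[] and
update_dir_rename_counts().

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for one rename_dst[] entry. */
struct dst_entry {
	const char *old_dir; /* directory of the source path, if any */
	const char *new_dir; /* directory of the destination path */
	int is_rename;       /* nonzero once paired with a source */
};

/*
 * Count how many detected renames moved a file from old_dir to new_dir.
 * Entries that are not renames are skipped first, mirroring the
 * "if (!rename_dst[i].is_rename) continue;" check.
 */
static int count_dir_renames(const struct dst_entry *dst, int nr,
			     const char *old_dir, const char *new_dir)
{
	int i, count = 0;

	assert(dst || !nr);
	for (i = 0; i < nr; ++i) {
		if (!dst[i].is_rename)
			continue; /* plain add: no source side to look at */
		if (!strcmp(dst[i].old_dir, old_dir) &&
		    !strcmp(dst[i].new_dir, new_dir))
			count++;
	}
	return count;
}
```

In this model an unpaired destination has no old_dir at all, so the early
"continue" is what keeps the loop from touching it.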

* Re: [PATCH v2 08/10] diffcore-rename: add a new idx_possible_rename function
  2021-02-24 17:35     ` Derrick Stolee
@ 2021-02-25  1:13       ` Elijah Newren
  0 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren @ 2021-02-25  1:13 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Elijah Newren via GitGitGadget, Git Mailing List

On Wed, Feb 24, 2021 at 9:35 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 2/23/2021 6:44 PM, Elijah Newren via GitGitGadget wrote:
> > +static char *get_dirname(const char *filename)
> > +{
> > +     char *slash = strrchr(filename, '/');
> > +     return slash ? xstrndup(filename, slash-filename) : xstrdup("");
>
> My brain interpreted "slash-filename" as a single token on first
> read, which confused me briefly. Inserting spaces would help
> readers like me.
>
> > +      *   (4) Check if applying that directory rename to the original file
> > +      *       would result in a destination filename that is in the
> > +      *       potential rename set.  If so, return the index of the
> > +      *       destination file (the index within rename_dst).
>
> > +      * This function, idx_possible_rename(), is only responsible for (4).
>
> This helps isolate the important step to care about for the implementation,
> while the rest of the context is important, too.
>
> > +     char *old_dir, *new_dir, *new_path;
> > +     int idx;
> > +
> > +     if (!info->setup)
> > +             return -1;
> > +
> > +     old_dir = get_dirname(filename);
> > +     new_dir = strmap_get(&info->dir_rename_guess, old_dir);
> > +     free(old_dir);
> > +     if (!new_dir)
> > +             return -1;
> > +
> > +     new_path = xstrfmt("%s/%s", new_dir, get_basename(filename));
>
> This is running in a loop, so `xstrfmt()` might be overkill compared
> to something like
>
>         strbuf_addstr(&new_path, new_dir);
>         strbuf_addch(&new_path, '/');
>         strbuf_addstr(&new_path, get_basename(filename));
>
> but maybe the difference is too small to notice. (notice the type
> change to "struct strbuf new_path = STRBUF_INIT;")

Ooh, nice find.  Since this is in a loop over the renames as you point
out, this is an O(N) improvement (with N = number of renames) rather
than an O(1) improvement.  It does turn out to be hard to notice,
though.  Since we still have some O(N^2) code (all the inexact rename
detection that our exact- and basename-guided detection
optimizations can't handle), with that N^2 actually being multiplied
by the average number of lines in the given files, this improvement
does seem to mostly get lost in the noise.

I tried a bunch of times to measure the performance with these
changes.  After a bunch of runs, it seems that this optimization saves
somewhere between 3-10ms (depending on which testcase, whether at this
point in the series or at the very end, etc.).  It's hard to pin down,
because the savings are less than the standard deviation of any given
set of runs.  I don't think it's big enough to warrant restating the
performance measurements, but I'm very happy to include this
suggestion in my reroll.
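
For readers without the strbuf API at hand, the idea of replacing a
printf-style format with plain appends can be sketched with libc alone.
join_path() below is a hypothetical helper written for this illustration; it
is not the strbuf-based code proposed above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Build "<dir>/<base>" with two copies and no format-string parsing.
 * The caller owns the result and must free() it.
 */
static char *join_path(const char *dir, const char *base)
{
	size_t dlen, blen;
	char *path;

	assert(dir && base);
	dlen = strlen(dir);
	blen = strlen(base);

	path = malloc(dlen + 1 + blen + 1); /* dir + '/' + base + NUL */
	if (!path)
		abort(); /* stand-in for Git's die-on-failure wrappers */
	memcpy(path, dir, dlen);
	path[dlen] = '/';
	memcpy(path + dlen + 1, base, blen + 1); /* includes the NUL */
	return path;
}
```

The per-call work is still linear in the path length, so any gain is a
small constant per rename, consistent with it being hard to measure.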

>
> > +
> > +     idx = strintmap_get(&info->idx_map, new_path);
> > +     free(new_path);
> > +     return idx;
> > +}
>
> Does what it says it does.
>
> Thanks,
> -Stolee

* Re: [PATCH v2 00/10] Optimization batch 8: use file basenames even more
  2021-02-24 17:50   ` [PATCH v2 00/10] Optimization batch 8: use file basenames even more Derrick Stolee
@ 2021-02-25  1:38     ` Elijah Newren
  0 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren @ 2021-02-25  1:38 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Elijah Newren via GitGitGadget, Git Mailing List

On Wed, Feb 24, 2021 at 9:50 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 2/23/2021 6:43 PM, Elijah Newren via GitGitGadget wrote:
> > This series depends on en/diffcore-rename (a concatenation of what I was
> > calling ort-perf-batch-6 and ort-perf-batch-7).
> >
> > There are no changes since v1; it's just a resend a week and a half later to
> > bump it so it isn't lost.
>
> Thank you for re-sending. I intended to review it before but got redirected
> and forgot to pick it up again.
>
> > === Results ===
> >
> > For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
> > performance work; instrument with trace2_region_* calls", 2020-10-28), the
> > changes in just this series improve the performance as follows:
> >
> >                      Before Series           After Series
> > no-renames:       12.775 s ±  0.062 s    12.596 s ±  0.061 s
> > mega-renames:    188.754 s ±  0.284 s   130.465 s ±  0.259 s
> > just-one-mega:     5.599 s ±  0.019 s     3.958 s ±  0.010 s
> >
> >
> > As a reminder, before any merge-ort/diffcore-rename performance work, the
> > performance results we started with (as noted in the same commit message)
> > were:
> >
> > no-renames-am:      6.940 s ±  0.485 s
> > no-renames:        18.912 s ±  0.174 s
> > mega-renames:    5964.031 s ± 10.459 s
> > just-one-mega:    149.583 s ±  0.751 s
>
> These are good results.
>
> I reviewed the patches and believe they do the optimizations claimed. I
> only found some nits for comments and whitespace things.

Thanks for taking a look; I'll get those fixed up.  Also, I think your
performance improvement of switching from xstrfmt to a few strbuf
calls, even if small, counts as more than "nits for comments and
whitespace things".  :-)

> You are very careful to create the necessary pieces and connect them
> from the bottom-up. However, this leads to one big "now everything is
> done" commit with performance improvements. It seems that there are
> some smaller performance improvements that could be measured if the
> logic was instead built from the top-down with stubs for the complicated
> logic.
>
> For example, the final patch links the rename logic with a call to
> idx_possible_rename(). But, that could just as well always return -1
> and the implementation would be correct. Then, it would be good to see
> if the performance changes with that non-functional update. It would
> also help me read the series in patch order and understand the context
> of the methods a bit better before seeing their implementation.
>
> This is _not_ a recommendation that you rewrite the series. Just food
> for thought as we continue with similar enhancements in the future.

I can give it a shot for future relevant patch series (some of the
series this wouldn't be relevant for because they just include a
collection of patches implementing separate improvements that are just
batched together).  A couple of the series are already structured this
way, in fact, but the next series after this one has one patch that I
think I could reorder to make it more like this.

* Re: [PATCH v2 04/10] diffcore-rename: extend cleanup_dir_rename_info()
  2021-02-23 23:44   ` [PATCH v2 04/10] diffcore-rename: extend cleanup_dir_rename_info() Elijah Newren via GitGitGadget
  2021-02-24 15:37     ` Derrick Stolee
@ 2021-02-25  2:16     ` Ævar Arnfjörð Bjarmason
  2021-02-25  2:26       ` Ævar Arnfjörð Bjarmason
  2021-02-25  2:34       ` Junio C Hamano
  1 sibling, 2 replies; 61+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-02-25  2:16 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget; +Cc: git, Elijah Newren


On Wed, Feb 24 2021, Elijah Newren via GitGitGadget wrote:

> From: Elijah Newren <newren@gmail.com>
> [...]
>  MAYBE_UNUSED
> -static void cleanup_dir_rename_info(struct dir_rename_info *info)
> +static void cleanup_dir_rename_info(struct dir_rename_info *info,
> +				    struct strset *dirs_removed,
> +				    int keep_dir_rename_count)
>  {
> +	struct hashmap_iter iter;
> +	struct strmap_entry *entry;
> +
>  	if (!info->setup)
>  		return;
>  
> -	partial_clear_dir_rename_count(info->dir_rename_count);
> -	strmap_clear(info->dir_rename_count, 1);
> +	if (!keep_dir_rename_count) {
> +		partial_clear_dir_rename_count(info->dir_rename_count);
> +		strmap_clear(info->dir_rename_count, 1);
> +		FREE_AND_NULL(info->dir_rename_count);
> +	} else {
> +		/*
> +		 * Although dir_rename_count was passed in
> +		 * diffcore_rename_extended() and we want to keep it around and
> +		 * return it to that caller, we first want to remove any data
> +		 * associated with directories that weren't renamed.
> +		 */
> +		struct string_list to_remove = STRING_LIST_INIT_NODUP;
> +		int i;
> +
> [...]

I find the pattern in patch 02 and 03 and leading up to this 04/05
confusing to review.

First we add a clear_dir_rename_count() in 02 but nothing uses it, then
in 03 it's renamed to cleanup_dir_rename_info() and its code changed,
but still nothing uses it. Here we're changing the function nothing
uses, and then finally in 05 we make use of it, and the MAYBE_UNUSED
attribute is removed.

I appreciate trying to split these large and complex patches into more
digestible pieces. I think that sometimes it's more readable to have a
patch that adds a function and a subsequent one that uses it.

But in this case where we've gone through stages of changing code that's
never been used I think we're making it harder to read than not. I'd
prefer just to see this cleanup_dir_rename_info() function pop into
existence in 05.

Just my 0.02.

Style nit/preference: I think code like this is easier to read as:

    if (simple-case) {
        blah
        blah;
        return;
    }
    complex_case;

Than not having the "return" and having most of the interesting logic in
an indented "else" block. Or maybe just this on top of the whole thing
(a -w diff, hopefully more readable, but still understandable):
    
    diff --git a/diffcore-rename.c b/diffcore-rename.c
    index 70a484b9b6..5a5c62ec79 100644
    --- a/diffcore-rename.c
    +++ b/diffcore-rename.c
    @@ -609,11 +609,12 @@ void partial_clear_dir_rename_count(struct strmap *dir_rename_count)
     }
     
     static void cleanup_dir_rename_info(struct dir_rename_info *info,
    -				    struct strset *dirs_removed,
    -				    int keep_dir_rename_count)
    +				    struct strset *dirs_removed)
     {
     	struct hashmap_iter iter;
     	struct strmap_entry *entry;
    +	struct string_list to_remove = STRING_LIST_INIT_NODUP;
    +	int i;
     
     	if (!info->setup)
     		return;
    @@ -624,20 +625,12 @@ static void cleanup_dir_rename_info(struct dir_rename_info *info,
     	/* dir_rename_guess */
     	strmap_clear(&info->dir_rename_guess, 1);
     
    -	if (!keep_dir_rename_count) {
    -		partial_clear_dir_rename_count(info->dir_rename_count);
    -		strmap_clear(info->dir_rename_count, 1);
    -		FREE_AND_NULL(info->dir_rename_count);
    -	} else {
     	/*
     	 * Although dir_rename_count was passed in
     	 * diffcore_rename_extended() and we want to keep it around and
     	 * return it to that caller, we first want to remove any data
     	 * associated with directories that weren't renamed.
     	 */
    -		struct string_list to_remove = STRING_LIST_INIT_NODUP;
    -		int i;
    -
     	strmap_for_each_entry(info->dir_rename_count, &iter, entry) {
     		const char *source_dir = entry->key;
     		struct strintmap *counts = entry->value;
    @@ -653,7 +646,6 @@ static void cleanup_dir_rename_info(struct dir_rename_info *info,
     			      to_remove.items[i].string, 1);
     	string_list_clear(&to_remove, 0);
     }
    -}
     
     static const char *get_basename(const char *filename)
     {
    @@ -1317,7 +1309,13 @@ void diffcore_rename_extended(struct diff_options *options,
     		if (rename_dst[i].filespec_to_free)
     			free_filespec(rename_dst[i].filespec_to_free);
     
    -	cleanup_dir_rename_info(&info, dirs_removed, dir_rename_count != NULL);
    +	if (!dir_rename_count) {
    +		cleanup_dir_rename_info(&info, dirs_removed);
    +	} else {
    +		partial_clear_dir_rename_count(info.dir_rename_count);
    +		strmap_clear(info.dir_rename_count, 1);
    +		FREE_AND_NULL(info.dir_rename_count);
    +	}
     	FREE_AND_NULL(rename_dst);
     	rename_dst_nr = rename_dst_alloc = 0;
     	FREE_AND_NULL(rename_src);

I also wonder if that strmap_clear() wouldn't be better moved into
partial_clear_dir_rename_count():
    
    diff --git a/diffcore-rename.c b/diffcore-rename.c
    index 5a5c62ec790..5f6c5745d64 100644
    --- a/diffcore-rename.c
    +++ b/diffcore-rename.c
    @@ -596,7 +596,8 @@ static void initialize_dir_rename_info(struct dir_rename_info *info,
     	}
     }
     
    -void partial_clear_dir_rename_count(struct strmap *dir_rename_count)
    +void partial_clear_dir_rename_count(struct strmap *dir_rename_count,
    +				    int clear_strmap)
     {
     	struct hashmap_iter iter;
     	struct strmap_entry *entry;
    @@ -606,6 +607,9 @@ void partial_clear_dir_rename_count(struct strmap *dir_rename_count)
     		strintmap_clear(counts);
     	}
     	strmap_partial_clear(dir_rename_count, 1);
    +	if (clear_strmap)
    +		strmap_clear(dir_rename_count, 1);
    +
     }
     
     static void cleanup_dir_rename_info(struct dir_rename_info *info,
    @@ -1312,8 +1316,7 @@ void diffcore_rename_extended(struct diff_options *options,
     	if (!dir_rename_count) {
     		cleanup_dir_rename_info(&info, dirs_removed);
     	} else {
    -		partial_clear_dir_rename_count(info.dir_rename_count);
    -		strmap_clear(info.dir_rename_count, 1);
    +		partial_clear_dir_rename_count(info.dir_rename_count, 1);
     		FREE_AND_NULL(info.dir_rename_count);
     	}
     	FREE_AND_NULL(rename_dst);
    diff --git a/diffcore.h b/diffcore.h
    index c6ba64abd19..de8667bfa04 100644
    --- a/diffcore.h
    +++ b/diffcore.h
    @@ -161,7 +161,8 @@ struct diff_filepair *diff_queue(struct diff_queue_struct *,
     				 struct diff_filespec *);
     void diff_q(struct diff_queue_struct *, struct diff_filepair *);
     
    -void partial_clear_dir_rename_count(struct strmap *dir_rename_count);
    +void partial_clear_dir_rename_count(struct strmap *dir_rename_count,
    +				    int clear_strmap);
     
     void diffcore_break(struct repository *, int);
     void diffcore_rename(struct diff_options *);
    diff --git a/merge-ort.c b/merge-ort.c
    index 467404cc0a3..0bbd49f0d78 100644
    --- a/merge-ort.c
    +++ b/merge-ort.c
    @@ -353,9 +353,9 @@ static void clear_or_reinit_internal_opts(struct merge_options_internal *opti,
     	for (i = MERGE_SIDE1; i <= MERGE_SIDE2; ++i) {
     		strset_func(&renames->dirs_removed[i]);
     
    -		partial_clear_dir_rename_count(&renames->dir_rename_count[i]);
    -		if (!reinitialize)
    -			strmap_clear(&renames->dir_rename_count[i], 1);
    +		partial_clear_dir_rename_count(&renames->dir_rename_count[i],
    +					       !reinitialize);
    +		free(&renames->dir_rename_count[i]);
     
     		strmap_func(&renames->dir_renames[i], 0);
     	}
    

* Re: [PATCH v2 04/10] diffcore-rename: extend cleanup_dir_rename_info()
  2021-02-25  2:16     ` Ævar Arnfjörð Bjarmason
@ 2021-02-25  2:26       ` Ævar Arnfjörð Bjarmason
  2021-02-25  2:34       ` Junio C Hamano
  1 sibling, 0 replies; 61+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-02-25  2:26 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget; +Cc: git, Elijah Newren


On Thu, Feb 25 2021, Ævar Arnfjörð Bjarmason wrote:

>     -		partial_clear_dir_rename_count(&renames->dir_rename_count[i]);
>     -		if (!reinitialize)
>     -			strmap_clear(&renames->dir_rename_count[i], 1);
>     +		partial_clear_dir_rename_count(&renames->dir_rename_count[i],
>     +					       !reinitialize);
>     +		free(&renames->dir_rename_count[i]);
>      
>      		strmap_func(&renames->dir_renames[i], 0);
>      	}

That free() wasn't supposed to be in there, I was still experimenting
with whether partial_clear_dir_rename_count() should also free() this
itself.

But now that I notice I left that in there and in the meantime ran all
the tests, they all passed. So maybe this & the if/else in
diffcore_rename_extended() can lose its braces by just calling either
cleanup_dir_rename_info() or partial_clear_dir_rename_count(). I didn't
look in any detail at whether the free() vs. FREE_AND_NULL() distinction
matters here.

* Re: [PATCH v2 04/10] diffcore-rename: extend cleanup_dir_rename_info()
  2021-02-25  2:16     ` Ævar Arnfjörð Bjarmason
  2021-02-25  2:26       ` Ævar Arnfjörð Bjarmason
@ 2021-02-25  2:34       ` Junio C Hamano
  1 sibling, 0 replies; 61+ messages in thread
From: Junio C Hamano @ 2021-02-25  2:34 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Elijah Newren via GitGitGadget, git, Elijah Newren

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> But in this case where we've gone through stages of changing code that's
> never been used I think we're making it harder to read than not. I'd
> prefer just to see this cleanup_dir_rename_info() function pop into
> existence in 05.

I had a similar impression on parts of the earlier series where
new helpers without users were given as standalone patches.  I found
it a bit disorienting.

> Style nit/preference: I think code like this is easier to read as:
>
>     if (simple-case) {
>         blah
>         blah;
>         return;
>     }
>     complex_case;
>
> Than not having the "return" and having most of the interesting logic in
> an indented "else" block. Or maybe just this on top of the whole thing
> (a -w diff, hopefully more readable, but still understandable):

Yes, that is also a good tip for a more readable patch, but that
applies only for if/else at the end of the function.

In general, formulating the condition so that the smaller body comes
first for "if" and the larger one goes to the body of "else" would
make the if/else easier to understand, as you can often hold the
condition and smaller body just before "else" in your head, and
after clearly understanding that part, it becomes easier to
concentrate on the other side, i.e. "now we know what happens if the
condition is true, what about the other case?  Let's read on ...".
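
A small sketch of the two tips combined (short case first, then an early
return instead of an "else").  The function sum_positive() is a
hypothetical example, not code from this series:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: sum the positive entries, or return -1 for no input. */
static long sum_positive(const int *values, size_t nr)
{
	long total = 0;
	size_t i;

	/* Simple case first: short enough to hold in your head, then leave. */
	if (!values || !nr)
		return -1;

	/* Complex case: no "else" needed, so no extra indentation. */
	for (i = 0; i < nr; i++)
		if (values[i] > 0)
			total += values[i];
	return total;
}
```

The early return only works this way because the if/else would have
been the last thing in the function.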

Thanks.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH v3 00/10] Optimization batch 8: use file basenames even more
  2021-02-23 23:43 ` [PATCH v2 00/10] Optimization batch 8: use file basenames even more Elijah Newren via GitGitGadget
                     ` (10 preceding siblings ...)
  2021-02-24 17:50   ` [PATCH v2 00/10] Optimization batch 8: use file basenames even more Derrick Stolee
@ 2021-02-26  1:58   ` Elijah Newren via GitGitGadget
  2021-02-26  1:58     ` [PATCH v3 01/10] diffcore-rename: use directory rename guided basename comparisons Elijah Newren via GitGitGadget
                       ` (11 more replies)
  11 siblings, 12 replies; 61+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-26  1:58 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Elijah Newren, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Elijah Newren

This series depends on en/diffcore-rename (a concatenation of what I was
calling ort-perf-batch-6 and ort-perf-batch-7).

Changes since v2:

 * Rearrange the patches in the series to have a top-down ordering rather
   than bottom-up -- as suggested by Stolee, Ævar, and Junio
 * Several comments and style improvements suggested by Stolee
 * Replace xstrfmt() with a few strbuf_add*() calls, as suggested by Stolee
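
As a rough illustration of the last item, building "<new_dir>/<basename>"
piece by piece looks like the sketch below.  It is standalone and does
not use Git's strbuf API: struct toybuf and its helpers are a
hypothetical, simplified stand-in, and basename_of() only mirrors the
idea of get_basename() from the patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy growable buffer; a stand-in for, not a copy of, Git's struct strbuf. */
struct toybuf {
	char *buf;
	size_t len, alloc;
};

/* Append n bytes and keep the buffer NUL-terminated. */
static void toybuf_add(struct toybuf *sb, const char *data, size_t n)
{
	if (sb->len + n + 1 > sb->alloc) {
		sb->alloc = (sb->len + n + 1) * 2;
		sb->buf = realloc(sb->buf, sb->alloc);
		if (!sb->buf)
			abort();
	}
	memcpy(sb->buf + sb->len, data, n);
	sb->len += n;
	sb->buf[sb->len] = '\0';
}

/* Free the storage and return the buffer to its empty state. */
static void toybuf_release(struct toybuf *sb)
{
	free(sb->buf);
	sb->buf = NULL;
	sb->len = sb->alloc = 0;
}

/* Same idea as get_basename() in the patch: text after the last '/'. */
static const char *basename_of(const char *path)
{
	const char *slash = strrchr(path, '/');
	return slash ? slash + 1 : path;
}

/* Append "<new_dir>/<basename of old_path>" to sb, piece by piece. */
static void add_guessed_path(struct toybuf *sb, const char *new_dir,
			     const char *old_path)
{
	const char *base = basename_of(old_path);

	toybuf_add(sb, new_dir, strlen(new_dir));
	toybuf_add(sb, "/", 1);
	toybuf_add(sb, base, strlen(base));
}
```

The resulting string is only needed as a lookup key, so the buffer can
be released as soon as the lookup is done.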

Elijah Newren (10):
  diffcore-rename: use directory rename guided basename comparisons
  diffcore-rename: add a new idx_possible_rename function
  diffcore-rename: add a mapping of destination names to their indices
  Move computation of dir_rename_count from merge-ort to diffcore-rename
  diffcore-rename: add function for clearing dir_rename_count
  diffcore-rename: move dir_rename_counts into dir_rename_info struct
  diffcore-rename: extend cleanup_dir_rename_info()
  diffcore-rename: compute dir_rename_counts in stages
  diffcore-rename: limit dir_rename_counts computation to relevant dirs
  diffcore-rename: compute dir_rename_guess from dir_rename_counts

 Documentation/gitdiffcore.txt |   2 +-
 diffcore-rename.c             | 449 ++++++++++++++++++++++++++++++++--
 diffcore.h                    |   7 +
 merge-ort.c                   | 144 +----------
 4 files changed, 449 insertions(+), 153 deletions(-)


base-commit: aeca14f748afc7fb5b65bca56ea2ebd970729814
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-844%2Fnewren%2Fort-perf-batch-8-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-844/newren/ort-perf-batch-8-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/844

Range-diff vs v2:

 10:  805c101cfd84 !  1:  6afa9add40b9 diffcore-rename: use directory rename guided basename comparisons
     @@ Metadata
       ## Commit message ##
          diffcore-rename: use directory rename guided basename comparisons
      
     -    Hook the work from the last several patches together so that when
     -    basenames in the sets of possible remaining rename sources or
     -    destinations aren't unique, we can guess which directory source files
     -    were renamed into.  When that guess gives us a pairing of files, and
     -    those files are sufficiently similar, we record the two files as a
     -    rename and remove them from the large matrix of comparisons for inexact
     -    rename detection.
     +    A previous commit noted that it is very common for people to move files
     +    across directories while keeping their filename the same.  The last few
     +    commits took advantage of this and showed that we can accelerate rename
     +    detection significantly using basenames; since files with the same
     +    basename serve as likely rename candidates, we can check those first and
     +    remove them from the rename candidate pool if they are sufficiently
     +    similar.
      
     -    For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
     -    performance work; instrument with trace2_region_* calls", 2020-10-28),
     -    this change improves the performance as follows:
     +    Unfortunately, the previous optimization was limited by the fact that
     +    the remaining basenames after exact rename detection are not always
     +    unique.  Many repositories have hundreds of build files with the same
     +    name (e.g. Makefile, .gitignore, build.gradle, etc.), and may even have
     +    hundreds of source files with the same name.  (For example, the linux
     +    kernel has 100 setup.c, 87 irq.c, and 112 core.c files.  A repository at
     +    $DAYJOB has a lot of ObjectFactory.java and Plugin.java files).
      
     -                                Before                  After
     -        no-renames:       12.775 s ±  0.062 s    12.596 s ±  0.061 s
     -        mega-renames:    188.754 s ±  0.284 s   130.465 s ±  0.259 s
     -        just-one-mega:     5.599 s ±  0.019 s     3.958 s ±  0.010 s
     +    For these files with non-unique basenames, we are faced with the task of
     +    attempting to determine or guess which directory they may have been
     +    relocated to.  Such a task is precisely the job of directory rename
     +    detection.  However, there are two catches: (1) the directory rename
     +    detection code has traditionally been part of the merge machinery rather
     +    than diffcore-rename.c, and (2) directory rename detection currently
     +    runs after regular rename detection is complete.  The 1st catch is just
     +    an implementation issue that can be overcome by some code shuffling.
     +    The 2nd requires us to add a further approximation: we only have access
     +    to exact renames at this point, so we need to do directory rename
     +    detection based on just exact renames.  In some cases we won't have
     +    exact renames, in which case this extra optimization won't apply.  We
     +    also choose to not apply the optimization unless we know that the
     +    underlying directory was removed, which will require extra data to be
     +    passed in to diffcore_rename_extended().  Also, even if we get a
     +    prediction about which directory a file may have relocated to, we will
     +    still need to check to see if there is a file in the predicted
     +    directory, and then compare the two files to see if they meet the higher
     +    min_basename_score threshold required for marking the two files as
     +    renames.
     +
     +    This commit introduces an idx_possible_rename() function which will
     +    do this directory rename detection for us and give us the index within
     +    rename_dst of the resulting filename.  For now, this function is
     +    hardcoded to return -1 (not found) and just hooks up how its results
     +    would be used once we have a more complete implementation in place.
      
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
     @@ diffcore-rename.c: static const char *get_basename(const char *filename)
       	return base ? base + 1 : filename;
       }
       
     --MAYBE_UNUSED
     - static int idx_possible_rename(char *filename, struct dir_rename_info *info)
     ++static int idx_possible_rename(char *filename)
     ++{
     ++	/* Unconditionally return -1, "not found", for now */
     ++	return -1;
     ++}
     ++
     + static int find_basename_matches(struct diff_options *options,
     + 				 int minimum_score)
       {
     - 	/*
      @@ diffcore-rename.c: static int find_basename_matches(struct diff_options *options,
       	int i, renames = 0;
       	struct strintmap sources;
     @@ diffcore-rename.c: static int find_basename_matches(struct diff_options *options
      -			continue;
       
      -		if (0 <= (dst_index = strintmap_get(&dests, base))) {
     -+		/* Is this basename unique among remaining sources? */
     ++		/*
     ++		 * If the basename is unique among remaining sources, then
     ++		 * src_index will equal 'i' and we can attempt to match it
     ++		 * to a unique basename in the destinations.  Otherwise,
     ++		 * use directory rename heuristics, if possible.
     ++		 */
      +		base = get_basename(filename);
      +		src_index = strintmap_get(&sources, base);
      +		assert(src_index == -1 || src_index == i);
     @@ diffcore-rename.c: static int find_basename_matches(struct diff_options *options
      +			dst_index = strintmap_get(&dests, base);
      +			if (src_index == -1 || dst_index == -1) {
      +				src_index = i;
     -+				dst_index = idx_possible_rename(filename, info);
     ++				dst_index = idx_possible_rename(filename);
      +			}
      +			if (dst_index == -1)
      +				continue;
  8:  cbd055ab3399 !  2:  40f57bcc2055 diffcore-rename: add a new idx_possible_rename function
     @@ Commit message
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
       ## diffcore-rename.c ##
     -@@ diffcore-rename.c: struct dir_rename_info {
     - 	unsigned setup;
     - };
     +@@ diffcore-rename.c: static int find_exact_renames(struct diff_options *options)
     + 	return renames;
     + }
       
     ++struct dir_rename_info {
     ++	struct strintmap idx_map;
     ++	struct strmap dir_rename_guess;
     ++	struct strmap *dir_rename_count;
     ++	unsigned setup;
     ++};
     ++
      +static char *get_dirname(const char *filename)
      +{
      +	char *slash = strrchr(filename, '/');
     -+	return slash ? xstrndup(filename, slash-filename) : xstrdup("");
     ++	return slash ? xstrndup(filename, slash - filename) : xstrdup("");
      +}
      +
     - static void dirname_munge(char *filename)
     + static const char *get_basename(const char *filename)
       {
     - 	char *slash = strrchr(filename, '/');
     + 	/*
      @@ diffcore-rename.c: static const char *get_basename(const char *filename)
       	return base ? base + 1 : filename;
       }
       
     -+MAYBE_UNUSED
     +-static int idx_possible_rename(char *filename)
      +static int idx_possible_rename(char *filename, struct dir_rename_info *info)
     -+{
     + {
     +-	/* Unconditionally return -1, "not found", for now */
     +-	return -1;
      +	/*
      +	 * Our comparison of files with the same basename (see
      +	 * find_basename_matches() below), is only helpful when after exact
     @@ diffcore-rename.c: static const char *get_basename(const char *filename)
      +	 *       rename.
      +	 *
      +	 * This function, idx_possible_rename(), is only responsible for (4).
     -+	 * The conditions/steps in (1)-(3) are handled via setting up
     -+	 * dir_rename_count and dir_rename_guess in
     -+	 * initialize_dir_rename_info().  Steps (0) and (5) are handled by
     -+	 * the caller of this function.
     ++	 * The conditions/steps in (1)-(3) will be handled via setting up
     ++	 * dir_rename_count and dir_rename_guess in a future
     ++	 * initialize_dir_rename_info() function.  Steps (0) and (5) are
     ++	 * handled by the caller of this function.
      +	 */
     -+	char *old_dir, *new_dir, *new_path;
     ++	char *old_dir, *new_dir;
     ++	struct strbuf new_path = STRBUF_INIT;
      +	int idx;
      +
      +	if (!info->setup)
     @@ diffcore-rename.c: static const char *get_basename(const char *filename)
      +	if (!new_dir)
      +		return -1;
      +
     -+	new_path = xstrfmt("%s/%s", new_dir, get_basename(filename));
     ++	strbuf_addstr(&new_path, new_dir);
     ++	strbuf_addch(&new_path, '/');
     ++	strbuf_addstr(&new_path, get_basename(filename));
      +
     -+	idx = strintmap_get(&info->idx_map, new_path);
     -+	free(new_path);
     ++	idx = strintmap_get(&info->idx_map, new_path.buf);
     ++	strbuf_release(&new_path);
      +	return idx;
     -+}
     -+
     + }
     + 
       static int find_basename_matches(struct diff_options *options,
     - 				 int minimum_score,
     - 				 struct dir_rename_info *info,
     +-				 int minimum_score)
     ++				 int minimum_score,
     ++				 struct dir_rename_info *info)
     + {
     + 	/*
     + 	 * When I checked in early 2020, over 76% of file renames in linux
     +@@ diffcore-rename.c: static int find_basename_matches(struct diff_options *options,
     + 			dst_index = strintmap_get(&dests, base);
     + 			if (src_index == -1 || dst_index == -1) {
     + 				src_index = i;
     +-				dst_index = idx_possible_rename(filename);
     ++				dst_index = idx_possible_rename(filename, info);
     + 			}
     + 			if (dst_index == -1)
     + 				continue;
     +@@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
     + 	int num_destinations, dst_cnt;
     + 	int num_sources, want_copies;
     + 	struct progress *progress = NULL;
     ++	struct dir_rename_info info;
     + 
     + 	trace2_region_enter("diff", "setup", options->repo);
     ++	info.setup = 0;
     + 	want_copies = (detect_rename == DIFF_DETECT_COPY);
     + 	if (!minimum_score)
     + 		minimum_score = DEFAULT_RENAME_SCORE;
     +@@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
     + 		/* Utilize file basenames to quickly find renames. */
     + 		trace2_region_enter("diff", "basename matches", options->repo);
     + 		rename_count += find_basename_matches(options,
     +-						      min_basename_score);
     ++						      min_basename_score,
     ++						      &info);
     + 		trace2_region_leave("diff", "basename matches", options->repo);
     + 
     + 		/*
  6:  dffecc064dd3 !  3:  0e14961574ea diffcore-rename: add a mapping of destination names to their indices
     @@ Metadata
       ## Commit message ##
          diffcore-rename: add a mapping of destination names to their indices
      
     -    Add an idx_map member to struct dir_rename_info, which tracks a mapping
     -    of the full filename to the index within rename_dst where that filename
     -    is found.  We will later use this for quickly finding an array entry in
     -    rename_dst given the pathname.
     +    Compute a mapping of full filename to the index within rename_dst where
     +    that filename is found, and store it in idx_map.  idx_possible_rename()
     +    needs this to quickly find an array entry in rename_dst given the
     +    pathname.
     +
     +    While at it, add placeholder initializations for dir_rename_count and
     +    dir_rename_guess; these will be more fully populated in subsequent
     +    commits.
      
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
       ## diffcore-rename.c ##
     -@@ diffcore-rename.c: static int find_exact_renames(struct diff_options *options)
     +@@ diffcore-rename.c: static char *get_dirname(const char *filename)
     + 	return slash ? xstrndup(filename, slash - filename) : xstrdup("");
       }
       
     - struct dir_rename_info {
     -+	struct strintmap idx_map;
     - 	struct strmap *dir_rename_count;
     - 	unsigned setup;
     - };
     -@@ diffcore-rename.c: static void initialize_dir_rename_info(struct dir_rename_info *info,
     - 		info->dir_rename_count = xmalloc(sizeof(*dir_rename_count));
     - 		strmap_init(info->dir_rename_count);
     - 	}
     ++static void initialize_dir_rename_info(struct dir_rename_info *info)
     ++{
     ++	int i;
     ++
     ++	info->setup = 1;
     ++
      +	strintmap_init_with_options(&info->idx_map, -1, NULL, 0);
     - 
     ++	strmap_init_with_options(&info->dir_rename_guess, NULL, 0);
     ++	info->dir_rename_count = NULL;
     ++
      +	/*
     -+	 * Loop setting up both info->idx_map, and doing setup of
     -+	 * info->dir_rename_count.
     ++	 * Loop setting up info->idx_map.
      +	 */
     - 	for (i = 0; i < rename_dst_nr; ++i) {
     - 		/*
     --		 * Make dir_rename_count contain a map of a map:
     ++	for (i = 0; i < rename_dst_nr; ++i) {
     ++		/*
      +		 * For non-renamed files, make idx_map contain mapping of
      +		 *   filename -> index (index within rename_dst, that is)
      +		 */
      +		if (!rename_dst[i].is_rename) {
      +			char *filename = rename_dst[i].p->two->path;
      +			strintmap_set(&info->idx_map, filename, i);
     -+			continue;
      +		}
     ++	}
     ++}
     ++
     ++static void cleanup_dir_rename_info(struct dir_rename_info *info)
     ++{
     ++	if (!info->setup)
     ++		return;
      +
     -+		/*
     -+		 * For everything else (i.e. renamed files), make
     -+		 * dir_rename_count contain a map of a map:
     - 		 *   old_directory -> {new_directory -> count}
     - 		 * In other words, for every pair look at the directories for
     - 		 * the old filename and the new filename and count how many
     -@@ diffcore-rename.c: static void cleanup_dir_rename_info(struct dir_rename_info *info,
     - 	if (!info->setup)
     - 		return;
     - 
      +	/* idx_map */
      +	strintmap_clear(&info->idx_map);
      +
     - 	if (!keep_dir_rename_count) {
     - 		partial_clear_dir_rename_count(info->dir_rename_count);
     - 		strmap_clear(info->dir_rename_count, 1);
     ++	/* dir_rename_guess */
     ++	strmap_clear(&info->dir_rename_guess, 1);
     ++
     ++	/* Nothing to do for dir_rename_count, yet */
     ++}
     ++
     + static const char *get_basename(const char *filename)
     + {
     + 	/*
     +@@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
     + 		remove_unneeded_paths_from_src(want_copies);
     + 		trace2_region_leave("diff", "cull after exact", options->repo);
     + 
     ++		/* Preparation for basename-driven matching. */
     ++		trace2_region_enter("diff", "dir rename setup", options->repo);
     ++		initialize_dir_rename_info(&info);
     ++		trace2_region_leave("diff", "dir rename setup", options->repo);
     ++
     + 		/* Utilize file basenames to quickly find renames. */
     + 		trace2_region_enter("diff", "basename matches", options->repo);
     + 		rename_count += find_basename_matches(options,
     +@@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
     + 		if (rename_dst[i].filespec_to_free)
     + 			free_filespec(rename_dst[i].filespec_to_free);
     + 
     ++	cleanup_dir_rename_info(&info);
     + 	FREE_AND_NULL(rename_dst);
     + 	rename_dst_nr = rename_dst_alloc = 0;
     + 	FREE_AND_NULL(rename_src);
  1:  fec4f1d44c06 !  4:  9b9d5b207b03 Move computation of dir_rename_count from merge-ort to diffcore-rename
     @@ Metadata
       ## Commit message ##
          Move computation of dir_rename_count from merge-ort to diffcore-rename
      
     -    A previous commit noted that it is very common for people to move files
     -    across directories while keeping their filename the same.  The last few
     -    commits took advantage of this and showed that we can accelerate rename
     -    detection significantly using basenames; since files with the same
     -    basename serve as likely rename candidates, we can check those first and
     -    remove them from the rename candidate pool if they are sufficiently
     -    similar.
     +    Move the computation of dir_rename_count from merge-ort.c to
     +    diffcore-rename.c, making slight adjustments to the data structures
     +    based on the move.  While the diffstat looks large, viewing this commit
     +    with --color-moved makes it clear that only about 20 lines changed.
      
     -    Unfortunately, the previous optimization was limited by the fact that
     -    the remaining basenames after exact rename detection are not always
     -    unique.  Many repositories have hundreds of build files with the same
     -    name (e.g. Makefile, .gitignore, build.gradle, etc.), and may even have
     -    hundreds of source files with the same name.  (For example, the linux
     -    kernel has 100 setup.c, 87 irq.c, and 112 core.c files.  A repository at
     -    $DAYJOB has a lot of ObjectFactory.java and Plugin.java files).
     -
     -    For these files with non-unique basenames, we are faced with the task of
     -    attempting to determine or guess which directory they may have been
     -    relocated to.  Such a task is precisely the job of directory rename
     -    detection.  However, there are two catches: (1) the directory rename
     -    detection code has traditionally been part of the merge machinery rather
     -    than diffcore-rename.c, and (2) directory rename detection currently
     -    runs after regular rename detection is complete.  The 1st catch is just
     -    an implementation issue that can be overcome by some code shuffling.
     -    The 2nd requires us to add a further approximation: we only have access
     -    to exact renames at this point, so we need to do directory rename
     -    detection based on just exact renames.  In some cases we won't have
     -    exact renames, in which case this extra optimization won't apply.  We
     -    also choose to not apply the optimization unless we know that the
     -    underlying directory was removed, which will require extra data to be
     -    passed in to diffcore_rename_extended().  Also, even if we get a
     -    prediction about which directory a file may have relocated to, we will
     -    still need to check to see if there is a file in the predicted
     -    directory, and then compare the two files to see if they meet the higher
     -    min_basename_score threshold required for marking the two files as
     -    renames.
     -
     -    This commit and the next few will set up the necessary infrastructure to
     -    do such computations.  This commit merely moves the computation of
     -    dir_rename_count from merge-ort.c to diffcore-rename.c, making slight
     -    adjustments to the data structures based on the move.  While the
     -    diffstat looks large, viewing this commit with --color-moved makes it
     -    clear that only about 20 lines changed.  With this patch, the
     -    computation of dir_rename_count is still only done after inexact rename
     -    detection, but subsequent commits will add a preliminary computation of
     -    dir_rename_count after exact rename detection, followed by some updates
     -    after inexact rename detection.
     +    With this patch, the computation of dir_rename_count is still only done
     +    after inexact rename detection, but subsequent commits will add a
     +    preliminary computation of dir_rename_count after exact rename
     +    detection, followed by some updates after inexact rename detection.
      
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
       ## diffcore-rename.c ##
     -@@ diffcore-rename.c: static int find_exact_renames(struct diff_options *options)
     - 	return renames;
     +@@ diffcore-rename.c: static char *get_dirname(const char *filename)
     + 	return slash ? xstrndup(filename, slash - filename) : xstrdup("");
       }
       
      +static void dirname_munge(char *filename)
     @@ diffcore-rename.c: static int find_exact_renames(struct diff_options *options)
      +
      +	/* Set up dir_rename_count */
      +	for (i = 0; i < rename_dst_nr; ++i) {
     ++		/* File not part of directory rename counts if not a rename */
     ++		if (!rename_dst[i].is_rename)
     ++			continue;
     ++
      +		/*
      +		 * Make dir_rename_count contain a map of a map:
      +		 *   old_directory -> {new_directory -> count}
     @@ diffcore-rename.c: static int find_exact_renames(struct diff_options *options)
      +	}
      +}
      +
     - static const char *get_basename(const char *filename)
     + static void initialize_dir_rename_info(struct dir_rename_info *info)
       {
     - 	/*
     + 	int i;
      @@ diffcore-rename.c: static void remove_unneeded_paths_from_src(int detecting_copies)
       	rename_src_nr = new_num_src;
       }
     @@ diffcore-rename.c: static void remove_unneeded_paths_from_src(int detecting_copi
       	int detect_rename = options->detect_rename;
       	int minimum_score = options->rename_score;
      @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
     - 	struct progress *progress = NULL;
       
       	trace2_region_enter("diff", "setup", options->repo);
     + 	info.setup = 0;
      +	assert(!dir_rename_count || strmap_empty(dir_rename_count));
       	want_copies = (detect_rename == DIFF_DETECT_COPY);
       	if (!minimum_score)
  2:  612da82f049c !  5:  f286e89464ea diffcore-rename: add functions for clearing dir_rename_count
     @@ Metadata
      Author: Elijah Newren <newren@gmail.com>
      
       ## Commit message ##
     -    diffcore-rename: add functions for clearing dir_rename_count
     +    diffcore-rename: add function for clearing dir_rename_count
      
     -    As we adjust the usage of dir_rename_count we want to have functions for
     -    clearing, or partially clearing it out.  Add such functions.
     +    As we adjust the usage of dir_rename_count we want to have a function
     +    for clearing, or partially clearing it out.  Add a
     +    partial_clear_dir_rename_count() function for this purpose.
      
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
       ## diffcore-rename.c ##
     -@@ diffcore-rename.c: static void compute_dir_rename_counts(struct strmap *dir_rename_count,
     +@@ diffcore-rename.c: static void initialize_dir_rename_info(struct dir_rename_info *info)
       	}
       }
       
     @@ diffcore-rename.c: static void compute_dir_rename_counts(struct strmap *dir_rena
      +	strmap_partial_clear(dir_rename_count, 1);
      +}
      +
     -+MAYBE_UNUSED
     -+static void clear_dir_rename_count(struct strmap *dir_rename_count)
     -+{
     -+	partial_clear_dir_rename_count(dir_rename_count);
     -+	strmap_clear(dir_rename_count, 1);
     -+}
     -+
     - static const char *get_basename(const char *filename)
     + static void cleanup_dir_rename_info(struct dir_rename_info *info)
       {
     - 	/*
     + 	if (!info->setup)
      
       ## diffcore.h ##
      @@ diffcore.h: struct diff_filepair *diff_queue(struct diff_queue_struct *,
  3:  93f98fc0b264 !  6:  ab353f2e75eb diffcore-rename: move dir_rename_counts into a dir_rename_info struct
     @@ Metadata
      Author: Elijah Newren <newren@gmail.com>
      
       ## Commit message ##
     -    diffcore-rename: move dir_rename_counts into a dir_rename_info struct
     +    diffcore-rename: move dir_rename_counts into dir_rename_info struct
      
     -    This is a purely cosmetic change for now, but we will be adding
     -    additional information to the struct and changing where and how it is
     -    setup and used in subsequent patches.
     +    This continues the migration of the directory rename detection code into
     +    diffcore-rename, now taking the simple step of combining it with the
     +    dir_rename_info struct.  Future commits will then make dir_rename_counts
     +    be computed in stages, and add computation of dir_rename_guess.
      
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
       ## diffcore-rename.c ##
     -@@ diffcore-rename.c: static int find_exact_renames(struct diff_options *options)
     - 	return renames;
     - }
     - 
     -+struct dir_rename_info {
     -+	struct strmap *dir_rename_count;
     -+	unsigned setup;
     -+};
     -+
     - static void dirname_munge(char *filename)
     - {
     - 	char *slash = strrchr(filename, '/');
      @@ diffcore-rename.c: static void dirname_munge(char *filename)
       	*slash = '\0';
       }
     @@ diffcore-rename.c: static void update_dir_rename_counts(struct strmap *dir_renam
      +	info->dir_rename_count = dir_rename_count;
      +
       	for (i = 0; i < rename_dst_nr; ++i) {
     - 		/*
     - 		 * Make dir_rename_count contain a map of a map:
     + 		/* File not part of directory rename counts if not a rename */
     + 		if (!rename_dst[i].is_rename)
      @@ diffcore-rename.c: static void compute_dir_rename_counts(struct strmap *dir_rename_count,
       		 * the old filename and the new filename and count how many
       		 * times that pairing occurs.
     @@ diffcore-rename.c: static void compute_dir_rename_counts(struct strmap *dir_rena
       					 rename_dst[i].p->one->path,
       					 rename_dst[i].p->two->path);
       	}
     -@@ diffcore-rename.c: void partial_clear_dir_rename_count(struct strmap *dir_rename_count)
     - }
     +@@ diffcore-rename.c: static void cleanup_dir_rename_info(struct dir_rename_info *info)
     + 	/* dir_rename_guess */
     + 	strmap_clear(&info->dir_rename_guess, 1);
       
     - MAYBE_UNUSED
     --static void clear_dir_rename_count(struct strmap *dir_rename_count)
     -+static void cleanup_dir_rename_info(struct dir_rename_info *info)
     - {
     --	partial_clear_dir_rename_count(dir_rename_count);
     --	strmap_clear(dir_rename_count, 1);
     -+	if (!info->setup)
     -+		return;
     -+
     +-	/* Nothing to do for dir_rename_count, yet */
     ++	/* dir_rename_count */
      +	partial_clear_dir_rename_count(info->dir_rename_count);
      +	strmap_clear(info->dir_rename_count, 1);
       }
       
       static const char *get_basename(const char *filename)
     -@@ diffcore-rename.c: void diffcore_rename_extended(struct diff_options *options,
     - 	int num_destinations, dst_cnt;
     - 	int num_sources, want_copies;
     - 	struct progress *progress = NULL;
     -+	struct dir_rename_info info;
     - 
     - 	trace2_region_enter("diff", "setup", options->repo);
     -+	info.setup = 0;
     - 	assert(!dir_rename_count || strmap_empty(dir_rename_count));
     - 	want_copies = (detect_rename == DIFF_DETECT_COPY);
     - 	if (!minimum_score)
      @@ diffcore-rename.c: void diffcore_rename_extended(struct diff_options *options,
       	/*
       	 * Now that renames have been computed, compute dir_rename_count */
  4:  f7bdad78219d !  7:  bd50d9e53804 diffcore-rename: extend cleanup_dir_rename_info()
     @@ Commit message
      
       ## diffcore-rename.c ##
      @@ diffcore-rename.c: void partial_clear_dir_rename_count(struct strmap *dir_rename_count)
     + 	strmap_partial_clear(dir_rename_count, 1);
       }
       
     - MAYBE_UNUSED
      -static void cleanup_dir_rename_info(struct dir_rename_info *info)
      +static void cleanup_dir_rename_info(struct dir_rename_info *info,
      +				    struct strset *dirs_removed,
     @@ diffcore-rename.c: void partial_clear_dir_rename_count(struct strmap *dir_rename
       {
      +	struct hashmap_iter iter;
      +	struct strmap_entry *entry;
     ++	struct string_list to_remove = STRING_LIST_INIT_NODUP;
     ++	int i;
      +
       	if (!info->setup)
       		return;
       
     +@@ diffcore-rename.c: static void cleanup_dir_rename_info(struct dir_rename_info *info)
     + 	strmap_clear(&info->dir_rename_guess, 1);
     + 
     + 	/* dir_rename_count */
      -	partial_clear_dir_rename_count(info->dir_rename_count);
      -	strmap_clear(info->dir_rename_count, 1);
      +	if (!keep_dir_rename_count) {
      +		partial_clear_dir_rename_count(info->dir_rename_count);
      +		strmap_clear(info->dir_rename_count, 1);
      +		FREE_AND_NULL(info->dir_rename_count);
     -+	} else {
     -+		/*
     -+		 * Although dir_rename_count was passed in
     -+		 * diffcore_rename_extended() and we want to keep it around and
     -+		 * return it to that caller, we first want to remove any data
     -+		 * associated with directories that weren't renamed.
     -+		 */
     -+		struct string_list to_remove = STRING_LIST_INIT_NODUP;
     -+		int i;
     ++		return;
     ++	}
      +
     -+		strmap_for_each_entry(info->dir_rename_count, &iter, entry) {
     -+			const char *source_dir = entry->key;
     -+			struct strintmap *counts = entry->value;
     ++	/*
     ++	 * Although dir_rename_count was passed in
     ++	 * diffcore_rename_extended() and we want to keep it around and
     ++	 * return it to that caller, we first want to remove any data
     ++	 * associated with directories that weren't renamed.
     ++	 */
     ++	strmap_for_each_entry(info->dir_rename_count, &iter, entry) {
     ++		const char *source_dir = entry->key;
     ++		struct strintmap *counts = entry->value;
      +
     -+			if (!strset_contains(dirs_removed, source_dir)) {
     -+				string_list_append(&to_remove, source_dir);
     -+				strintmap_clear(counts);
     -+				continue;
     -+			}
     ++		if (!strset_contains(dirs_removed, source_dir)) {
     ++			string_list_append(&to_remove, source_dir);
     ++			strintmap_clear(counts);
     ++			continue;
      +		}
     -+		for (i=0; i<to_remove.nr; ++i)
     -+			strmap_remove(info->dir_rename_count,
     -+				      to_remove.items[i].string, 1);
     -+		string_list_clear(&to_remove, 0);
      +	}
     ++	for (i = 0; i < to_remove.nr; ++i)
     ++		strmap_remove(info->dir_rename_count,
     ++			      to_remove.items[i].string, 1);
     ++	string_list_clear(&to_remove, 0);
       }
       
       static const char *get_basename(const char *filename)
     +@@ diffcore-rename.c: void diffcore_rename_extended(struct diff_options *options,
     + 		if (rename_dst[i].filespec_to_free)
     + 			free_filespec(rename_dst[i].filespec_to_free);
     + 
     +-	cleanup_dir_rename_info(&info);
     ++	cleanup_dir_rename_info(&info, dirs_removed, dir_rename_count != NULL);
     + 	FREE_AND_NULL(rename_dst);
     + 	rename_dst_nr = rename_dst_alloc = 0;
     + 	FREE_AND_NULL(rename_src);
  5:  3a29cf9e526f !  8:  44cfae6505f2 diffcore-rename: compute dir_rename_counts in stages
     @@ Metadata
       ## Commit message ##
          diffcore-rename: compute dir_rename_counts in stages
      
     -    We want to first compute dir_rename_counts based just on exact renames
     -    to start, as that can provide us useful information in
     -    find_basename_matches().  That will give us an incomplete result, which
     -    we can then later augment as basename and inexact rename matches are
     -    found.
     +    Compute dir_rename_counts based just on exact renames to start, as that
     +    can provide us useful information in find_basename_matches().  This is
     +    done by moving the code from compute_dir_rename_counts() into
     +    initialize_dir_rename_info(), resulting in it being computed earlier and
     +    based just on exact renames.  Since that's an incomplete result, we
     +    augment the counts via calling update_dir_rename_counts() after each
     +    basename-guided and inexact rename detection match is found.
      
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
     @@ diffcore-rename.c: static void update_dir_rename_counts(struct dir_rename_info *
       {
       	int i;
       
     -+	info->setup = 0;
     -+	if (!dirs_removed)
     +-	info->setup = 1;
     +-	info->dir_rename_count = dir_rename_count;
     +-
     +-	for (i = 0; i < rename_dst_nr; ++i) {
     +-		/* File not part of directory rename counts if not a rename */
     +-		if (!rename_dst[i].is_rename)
     +-			continue;
     +-
     +-		/*
     +-		 * Make dir_rename_count contain a map of a map:
     +-		 *   old_directory -> {new_directory -> count}
     +-		 * In other words, for every pair look at the directories for
     +-		 * the old filename and the new filename and count how many
     +-		 * times that pairing occurs.
     +-		 */
     +-		update_dir_rename_counts(info, dirs_removed,
     +-					 rename_dst[i].p->one->path,
     +-					 rename_dst[i].p->two->path);
     ++	if (!dirs_removed) {
     ++		info->setup = 0;
      +		return;
     + 	}
     +-}
     +-
     +-static void initialize_dir_rename_info(struct dir_rename_info *info)
     +-{
     +-	int i;
     +-
       	info->setup = 1;
     -+
     - 	info->dir_rename_count = dir_rename_count;
     + 
     ++	info->dir_rename_count = dir_rename_count;
      +	if (!info->dir_rename_count) {
      +		info->dir_rename_count = xmalloc(sizeof(*dir_rename_count));
      +		strmap_init(info->dir_rename_count);
      +	}
     + 	strintmap_init_with_options(&info->idx_map, -1, NULL, 0);
     + 	strmap_init_with_options(&info->dir_rename_guess, NULL, 0);
     +-	info->dir_rename_count = NULL;
       
     + 	/*
     +-	 * Loop setting up both info->idx_map.
     ++	 * Loop setting up both info->idx_map, and doing setup of
     ++	 * info->dir_rename_count.
     + 	 */
       	for (i = 0; i < rename_dst_nr; ++i) {
       		/*
     -@@ diffcore-rename.c: void partial_clear_dir_rename_count(struct strmap *dir_rename_count)
     - 	strmap_partial_clear(dir_rename_count, 1);
     +@@ diffcore-rename.c: static void initialize_dir_rename_info(struct dir_rename_info *info)
     + 		if (!rename_dst[i].is_rename) {
     + 			char *filename = rename_dst[i].p->two->path;
     + 			strintmap_set(&info->idx_map, filename, i);
     ++			continue;
     + 		}
     ++
     ++		/*
     ++		 * For everything else (i.e. renamed files), make
     ++		 * dir_rename_count contain a map of a map:
     ++		 *   old_directory -> {new_directory -> count}
     ++		 * In other words, for every pair look at the directories for
     ++		 * the old filename and the new filename and count how many
     ++		 * times that pairing occurs.
     ++		 */
     ++		update_dir_rename_counts(info, dirs_removed,
     ++					 rename_dst[i].p->one->path,
     ++					 rename_dst[i].p->two->path);
     + 	}
       }
       
     --MAYBE_UNUSED
     - static void cleanup_dir_rename_info(struct dir_rename_info *info,
     - 				    struct strset *dirs_removed,
     - 				    int keep_dir_rename_count)
     -@@ diffcore-rename.c: static const char *get_basename(const char *filename)
     - }
     +@@ diffcore-rename.c: static int idx_possible_rename(char *filename, struct dir_rename_info *info)
       
       static int find_basename_matches(struct diff_options *options,
     --				 int minimum_score)
     -+				 int minimum_score,
     + 				 int minimum_score,
     +-				 struct dir_rename_info *info)
      +				 struct dir_rename_info *info,
      +				 struct strset *dirs_removed)
       {
     @@ diffcore-rename.c: void diffcore_rename_extended(struct diff_options *options,
       		minimum_score = DEFAULT_RENAME_SCORE;
       
      @@ diffcore-rename.c: void diffcore_rename_extended(struct diff_options *options,
     - 		remove_unneeded_paths_from_src(want_copies);
     - 		trace2_region_leave("diff", "cull after exact", options->repo);
       
     -+		/* Preparation for basename-driven matching. */
     -+		trace2_region_enter("diff", "dir rename setup", options->repo);
     + 		/* Preparation for basename-driven matching. */
     + 		trace2_region_enter("diff", "dir rename setup", options->repo);
     +-		initialize_dir_rename_info(&info);
      +		initialize_dir_rename_info(&info,
      +					   dirs_removed, dir_rename_count);
     -+		trace2_region_leave("diff", "dir rename setup", options->repo);
     -+
     + 		trace2_region_leave("diff", "dir rename setup", options->repo);
     + 
       		/* Utilize file basenames to quickly find renames. */
       		trace2_region_enter("diff", "basename matches", options->repo);
       		rename_count += find_basename_matches(options,
     --						      min_basename_score);
     -+						      min_basename_score,
     + 						      min_basename_score,
     +-						      &info);
      +						      &info, dirs_removed);
       		trace2_region_leave("diff", "basename matches", options->repo);
       
     @@ diffcore-rename.c: void diffcore_rename_extended(struct diff_options *options,
       	/* At this point, we have found some renames and copies and they
       	 * are recorded in rename_dst.  The original list is still in *q.
       	 */
     -@@ diffcore-rename.c: void diffcore_rename_extended(struct diff_options *options,
     - 		if (rename_dst[i].filespec_to_free)
     - 			free_filespec(rename_dst[i].filespec_to_free);
     - 
     -+	cleanup_dir_rename_info(&info, dirs_removed, dir_rename_count != NULL);
     - 	FREE_AND_NULL(rename_dst);
     - 	rename_dst_nr = rename_dst_alloc = 0;
     - 	FREE_AND_NULL(rename_src);
  9:  4e095ea7c439 !  9:  752aff3a7995 diffcore-rename: limit dir_rename_counts computation to relevant dirs
     @@ Commit message
          for directories that disappeared, though, so we can return early from
          update_dir_rename_counts() for other paths.
      
     -    While dirs_removed provides the relevant information for us right now,
     -    we introduce a new info->relevant_source_dirs parameter because future
     -    optimizations will want to change how things are called somewhat.
     +    If dirs_removed is passed to diffcore_rename_extended(), then it
     +    provides the relevant bits of information for us to limit this counting
     +    to relevant dirs.  If dirs_removed is not passed, we would need to
     +    compute some replacement in order to do this limiting.  Introduce a new
     +    info->relevant_source_dirs variable for this purpose, even though at
     +    this stage we will only set it to dirs_removed for simplicity.
      
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
  7:  4983a1c2f908 ! 10:  65f7bfb735f2 diffcore-rename: add a dir_rename_guess field to dir_rename_info
     @@ Metadata
      Author: Elijah Newren <newren@gmail.com>
      
       ## Commit message ##
     -    diffcore-rename: add a dir_rename_guess field to dir_rename_info
     +    diffcore-rename: compute dir_rename_guess from dir_rename_counts
      
          dir_rename_counts has a mapping of a mapping, in particular, it has
             old_dir => { new_dir => count }
          We want a simple mapping of
             old_dir => new_dir
          based on which new_dir had the highest count for a given old_dir.
     -    Introduce dir_rename_guess for this purpose.
     +    Compute this and store it in dir_rename_guess.
     +
     +    This is the final piece of the puzzle needed to make our guesses at
     +    which directory files have been moved to when basenames aren't unique.
     +
     +    For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
     +    performance work; instrument with trace2_region_* calls", 2020-10-28),
     +    this change improves the performance as follows:
     +
     +                                Before                  After
     +        no-renames:       12.775 s ±  0.062 s    12.596 s ±  0.061 s
     +        mega-renames:    188.754 s ±  0.284 s   130.465 s ±  0.259 s
     +        just-one-mega:     5.599 s ±  0.019 s     3.958 s ±  0.010 s
      
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
       ## diffcore-rename.c ##
     -@@ diffcore-rename.c: static int find_exact_renames(struct diff_options *options)
     - 
     - struct dir_rename_info {
     - 	struct strintmap idx_map;
     -+	struct strmap dir_rename_guess;
     - 	struct strmap *dir_rename_count;
     - 	unsigned setup;
     - };
      @@ diffcore-rename.c: static void dirname_munge(char *filename)
       	*slash = '\0';
       }
     @@ diffcore-rename.c: static void initialize_dir_rename_info(struct dir_rename_info
      +	struct strmap_entry *entry;
       	int i;
       
     - 	info->setup = 0;
     -@@ diffcore-rename.c: static void initialize_dir_rename_info(struct dir_rename_info *info,
     - 		strmap_init(info->dir_rename_count);
     - 	}
     - 	strintmap_init_with_options(&info->idx_map, -1, NULL, 0);
     -+	strmap_init_with_options(&info->dir_rename_guess, NULL, 0);
     - 
     - 	/*
     - 	 * Loop setting up both info->idx_map, and doing setup of
     + 	if (!dirs_removed) {
      @@ diffcore-rename.c: static void initialize_dir_rename_info(struct dir_rename_info *info,
       					 rename_dst[i].p->one->path,
       					 rename_dst[i].p->two->path);
     @@ diffcore-rename.c: static void initialize_dir_rename_info(struct dir_rename_info
       }
       
       void partial_clear_dir_rename_count(struct strmap *dir_rename_count)
     -@@ diffcore-rename.c: static void cleanup_dir_rename_info(struct dir_rename_info *info,
     - 	/* idx_map */
     - 	strintmap_clear(&info->idx_map);
     - 
     -+	/* dir_rename_guess */
     -+	strmap_clear(&info->dir_rename_guess, 1);
     -+
     - 	if (!keep_dir_rename_count) {
     - 		partial_clear_dir_rename_count(info->dir_rename_count);
     - 		strmap_clear(info->dir_rename_count, 1);
     +@@ diffcore-rename.c: static int idx_possible_rename(char *filename, struct dir_rename_info *info)
     + 	 *       rename.
     + 	 *
     + 	 * This function, idx_possible_rename(), is only responsible for (4).
     +-	 * The conditions/steps in (1)-(3) will be handled via setting up
     +-	 * dir_rename_count and dir_rename_guess in a future
     +-	 * initialize_dir_rename_info() function.  Steps (0) and (5) are
     +-	 * handled by the caller of this function.
     ++	 * The conditions/steps in (1)-(3) are handled via setting up
     ++	 * dir_rename_count and dir_rename_guess in
     ++	 * initialize_dir_rename_info().  Steps (0) and (5) are handled by
     ++	 * the caller of this function.
     + 	 */
     + 	char *old_dir, *new_dir;
     + 	struct strbuf new_path = STRBUF_INIT;

-- 
gitgitgadget

* [PATCH v3 01/10] diffcore-rename: use directory rename guided basename comparisons
  2021-02-26  1:58   ` [PATCH v3 " Elijah Newren via GitGitGadget
@ 2021-02-26  1:58     ` Elijah Newren via GitGitGadget
  2021-02-26  1:58     ` [PATCH v3 02/10] diffcore-rename: add a new idx_possible_rename function Elijah Newren via GitGitGadget
                       ` (10 subsequent siblings)
  11 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-26  1:58 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Elijah Newren, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

A previous commit noted that it is very common for people to move files
across directories while keeping their filename the same.  The last few
commits took advantage of this and showed that we can accelerate rename
detection significantly using basenames; since files with the same
basename serve as likely rename candidates, we can check those first and
remove them from the rename candidate pool if they are sufficiently
similar.

Unfortunately, the previous optimization was limited by the fact that
the remaining basenames after exact rename detection are not always
unique.  Many repositories have hundreds of build files with the same
name (e.g. Makefile, .gitignore, build.gradle, etc.), and may even have
hundreds of source files with the same name.  (For example, the Linux
kernel has 100 setup.c, 87 irq.c, and 112 core.c files.  A repository at
$DAYJOB has a lot of ObjectFactory.java and Plugin.java files.)

For these files with non-unique basenames, we are faced with the task of
attempting to determine or guess which directory they may have been
relocated to.  Such a task is precisely the job of directory rename
detection.  However, there are two catches: (1) the directory rename
detection code has traditionally been part of the merge machinery rather
than diffcore-rename.c, and (2) directory rename detection currently
runs after regular rename detection is complete.  The 1st catch is just
an implementation issue that can be overcome by some code shuffling.
The 2nd requires us to add a further approximation: we only have access
to exact renames at this point, so we need to do directory rename
detection based on just exact renames.  In some cases we won't have
exact renames, in which case this extra optimization won't apply.  We
also choose to not apply the optimization unless we know that the
underlying directory was removed, which will require extra data to be
passed in to diffcore_rename_extended().  Also, even if we get a
prediction about which directory a file may have relocated to, we will
still need to check to see if there is a file in the predicted
directory, and then compare the two files to see if they meet the higher
min_basename_score threshold required for marking the two files as
renames.
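
As a rough illustration only (this is not code from the series), the idea
of guessing a directory rename from exact renames alone can be sketched
as below.  The struct rename_pair type and the guess_new_dir() helper are
hypothetical names, and the sketch only looks at the immediate containing
directory and uses plain arrays rather than the strmap-based counts the
real code works with.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical record of one exact rename found so far. */
struct rename_pair {
	const char *old_path;
	const char *new_path;
};

/* Length of the directory part of path (0 for a toplevel file). */
static size_t dir_len(const char *path)
{
	const char *slash = strrchr(path, '/');
	return slash ? (size_t)(slash - path) : 0;
}

/* Does the directory part of path equal dir? */
static int in_dir(const char *path, const char *dir)
{
	size_t len = dir_len(path);
	return strlen(dir) == len && !strncmp(path, dir, len);
}

/*
 * Among the exact renames whose source lives directly in old_dir, find
 * the destination directory that occurs most often and copy it into
 * out.  Returns the winning count, or 0 if no exact rename started in
 * old_dir (in which case out is left empty).  Ties keep the first
 * destination seen.
 */
static int guess_new_dir(const struct rename_pair *pairs, int nr,
			 const char *old_dir, char *out, size_t out_size)
{
	int i, j, best = 0;

	assert(out_size > 0);
	out[0] = '\0';
	for (i = 0; i < nr; i++) {
		size_t len = dir_len(pairs[i].new_path);
		int count = 0;

		if (!in_dir(pairs[i].old_path, old_dir))
			continue;
		/* Count pairs from old_dir landing in the same directory. */
		for (j = 0; j < nr; j++)
			if (in_dir(pairs[j].old_path, old_dir) &&
			    dir_len(pairs[j].new_path) == len &&
			    !strncmp(pairs[j].new_path,
				     pairs[i].new_path, len))
				count++;
		if (count > best && len < out_size) {
			best = count;
			memcpy(out, pairs[i].new_path, len);
			out[len] = '\0';
		}
	}
	return best;
}
```

With exact renames src/a.c -> lib/a.c, src/b.c -> lib/b.c and
src/c.c -> util/c.c, calling guess_new_dir() for "src" picks "lib".
The quadratic loop is only for brevity.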

This commit introduces an idx_possible_rename() function which will do
this directory rename detection for us and give us the index within
rename_dst of the resulting filename.  For now, this function is
hardcoded to return -1 (not found) and just hooks up how its results
would be used once we have a more complete implementation in place.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 Documentation/gitdiffcore.txt |  2 +-
 diffcore-rename.c             | 42 ++++++++++++++++++++++++++++-------
 2 files changed, 35 insertions(+), 9 deletions(-)

diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt
index 80fcf9542441..8673a5c5b2f2 100644
--- a/Documentation/gitdiffcore.txt
+++ b/Documentation/gitdiffcore.txt
@@ -186,7 +186,7 @@ mark a file pair as a rename and stop considering other candidates for
 better matches.  At most, one comparison is done per file in this
 preliminary pass; so if there are several remaining ext.txt files
 throughout the directory hierarchy after exact rename detection, this
-preliminary step will be skipped for those files.
+preliminary step may be skipped for those files.
 
 Note.  When the "-C" option is used with `--find-copies-harder`
 option, 'git diff-{asterisk}' commands feed unmodified filepairs to
diff --git a/diffcore-rename.c b/diffcore-rename.c
index 41558185ae1d..b3055683bac2 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -379,6 +379,12 @@ static const char *get_basename(const char *filename)
 	return base ? base + 1 : filename;
 }
 
+static int idx_possible_rename(char *filename)
+{
+	/* Unconditionally return -1, "not found", for now */
+	return -1;
+}
+
 static int find_basename_matches(struct diff_options *options,
 				 int minimum_score)
 {
@@ -415,8 +421,6 @@ static int find_basename_matches(struct diff_options *options,
 	int i, renames = 0;
 	struct strintmap sources;
 	struct strintmap dests;
-	struct hashmap_iter iter;
-	struct strmap_entry *entry;
 
 	/*
 	 * The prefeteching stuff wants to know if it can skip prefetching
@@ -466,17 +470,39 @@ static int find_basename_matches(struct diff_options *options,
 	}
 
 	/* Now look for basename matchups and do similarity estimation */
-	strintmap_for_each_entry(&sources, &iter, entry) {
-		const char *base = entry->key;
-		intptr_t src_index = (intptr_t)entry->value;
+	for (i = 0; i < rename_src_nr; ++i) {
+		char *filename = rename_src[i].p->one->path;
+		const char *base = NULL;
+		intptr_t src_index;
 		intptr_t dst_index;
-		if (src_index == -1)
-			continue;
 
-		if (0 <= (dst_index = strintmap_get(&dests, base))) {
+		/*
+		 * If the basename is unique among remaining sources, then
+		 * src_index will equal 'i' and we can attempt to match it
+		 * to a unique basename in the destinations.  Otherwise,
+		 * use directory rename heuristics, if possible.
+		 */
+		base = get_basename(filename);
+		src_index = strintmap_get(&sources, base);
+		assert(src_index == -1 || src_index == i);
+
+		if (strintmap_contains(&dests, base)) {
 			struct diff_filespec *one, *two;
 			int score;
 
+			/* Find a matching destination, if possible */
+			dst_index = strintmap_get(&dests, base);
+			if (src_index == -1 || dst_index == -1) {
+				src_index = i;
+				dst_index = idx_possible_rename(filename);
+			}
+			if (dst_index == -1)
+				continue;
+
+			/* Ignore this dest if already used in a rename */
+			if (rename_dst[dst_index].is_rename)
+				continue; /* already used previously */
+
 			/* Estimate the similarity */
 			one = rename_src[src_index].p->one;
 			two = rename_dst[dst_index].p->two;
-- 
gitgitgadget


* [PATCH v3 02/10] diffcore-rename: add a new idx_possible_rename function
  2021-02-26  1:58   ` [PATCH v3 " Elijah Newren via GitGitGadget
  2021-02-26  1:58     ` [PATCH v3 01/10] diffcore-rename: use directory rename guided basename comparisons Elijah Newren via GitGitGadget
@ 2021-02-26  1:58     ` Elijah Newren via GitGitGadget
  2021-02-26 15:52       ` Derrick Stolee
  2021-02-26  1:58     ` [PATCH v3 03/10] diffcore-rename: add a mapping of destination names to their indices Elijah Newren via GitGitGadget
                       ` (9 subsequent siblings)
  11 siblings, 1 reply; 61+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-26  1:58 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Elijah Newren, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

find_basename_matches() is great when both the remaining set of possible
rename sources and the remaining set of possible rename destinations
have exactly one file each with a given basename.  It allows us to match
up files that have been moved to different directories without changing
filenames.

When basenames are not unique, though, we want to be able to guess which
directories the source files have been moved to.  Since this is the job
of directory rename detection, we employ it.  However, since it is a
directory rename detection idea, we also limit it to cases where we know
there could have been a directory rename, i.e. where the source
directory has been removed.  This has to be signalled by dirs_removed
being non-NULL and containing an entry for the relevant directory.
Since merge-ort.c is the only caller that currently does so, this
optimization is only effective for merge-ort right now.  In the future,
this condition could be reconsidered or we could modify other callers to
pass the necessary strset.

Anyway, that's a lot of background so that we can actually describe the
new function.  Add an idx_possible_rename() function which combines the
recently added dir_rename_guess and idx_map fields to provide the index
within rename_dst of a potential match for a given file.
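
To make the combination concrete, here is a simplified standalone sketch
(not the patch's implementation).  possible_rename_idx() and lookup_dst()
are hypothetical names; the guessed directory is passed in directly
instead of being read from dir_rename_guess, and a linear scan stands in
for the idx_map lookup.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical stand-in for idx_map: return the position of path
 * within dsts, or -1 if it is not there.
 */
static int lookup_dst(const char **dsts, int nr, const char *path)
{
	int i;

	for (i = 0; i < nr; i++)
		if (!strcmp(dsts[i], path))
			return i;
	return -1;
}

/*
 * Given the guessed new directory for filename's directory (NULL if
 * there is no guess), build "new_dir/basename" and look that path up
 * among the remaining destinations.  Returns the index, or -1.
 */
static int possible_rename_idx(const char *filename, const char *new_dir,
			       const char **dsts, int nr)
{
	const char *slash = strrchr(filename, '/');
	const char *base = slash ? slash + 1 : filename;
	char new_path[4096];
	int n;

	assert(filename);
	if (!new_dir)
		return -1;
	n = snprintf(new_path, sizeof(new_path), "%s/%s", new_dir, base);
	if (n < 0 || (size_t)n >= sizeof(new_path))
		return -1; /* path too long for this sketch */
	return lookup_dst(dsts, nr, new_path);
}
```

For a source src/setup.c with a guess of "lib", the candidate path is
lib/setup.c; the caller would still need to compare the two files for
similarity before recording a rename.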

Future commits will add checks after calling this function to compare
the resulting 'likely rename' candidates to see if the two files meet
the elevated min_basename_score threshold for marking them as actual
renames.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 100 +++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 94 insertions(+), 6 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index b3055683bac2..edb0effb6ef4 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -367,6 +367,19 @@ static int find_exact_renames(struct diff_options *options)
 	return renames;
 }
 
+struct dir_rename_info {
+	struct strintmap idx_map;
+	struct strmap dir_rename_guess;
+	struct strmap *dir_rename_count;
+	unsigned setup;
+};
+
+static char *get_dirname(const char *filename)
+{
+	char *slash = strrchr(filename, '/');
+	return slash ? xstrndup(filename, slash - filename) : xstrdup("");
+}
+
 static const char *get_basename(const char *filename)
 {
 	/*
@@ -379,14 +392,86 @@ static const char *get_basename(const char *filename)
 	return base ? base + 1 : filename;
 }
 
-static int idx_possible_rename(char *filename)
+static int idx_possible_rename(char *filename, struct dir_rename_info *info)
 {
-	/* Unconditionally return -1, "not found", for now */
-	return -1;
+	/*
+	 * Our comparison of files with the same basename (see
+	 * find_basename_matches() below), is only helpful when after exact
+	 * rename detection we have exactly one file with a given basename
+	 * among the rename sources and also only exactly one file with
+	 * that basename among the rename destinations.  When we have
+	 * multiple files with the same basename in either set, we do not
+	 * know which to compare against.  However, there are some
+	 * filenames that occur in large numbers (particularly
+	 * build-related filenames such as 'Makefile', '.gitignore', or
+	 * 'build.gradle' that potentially exist within every single
+	 * subdirectory), and for performance we want to be able to quickly
+	 * find renames for these files too.
+	 *
+	 * The reason basename comparisons are a useful heuristic was that it
+	 * is common for people to move files across directories while keeping
+	 * their filename the same.  If we had a way of determining or even
+	 * making a good educated guess about which directory these non-unique
+	 * basename files had moved the file to, we could check it.
+	 * Luckily...
+	 *
+	 * When an entire directory is in fact renamed, we have two factors
+	 * helping us out:
+	 *   (a) the original directory disappeared giving us a hint
+	 *       about when we can apply an extra heuristic.
+	 *   (b) we often have several files within that directory and
+	 *       subdirectories that are renamed without changes
+	 * So, rules for a heuristic:
+	 *   (0) If the basename matches are non-unique (the condition under
+	 *       which this function is called) AND
+	 *   (1) the directory in which the file was found has disappeared
+	 *       (i.e. dirs_removed is non-NULL and has a relevant entry) THEN
+	 *   (2) use exact renames of files within the directory to determine
+	 *       where the directory is likely to have been renamed to.  IF
+	 *       there is at least one exact rename from within that
+	 *       directory, we can proceed.
+	 *   (3) If there are multiple places the directory could have been
+	 *       renamed to based on exact renames, ignore all but one of them.
+	 *       Just use the destination with the most renames going to it.
+	 *   (4) Check if applying that directory rename to the original file
+	 *       would result in a destination filename that is in the
+	 *       potential rename set.  If so, return the index of the
+	 *       destination file (the index within rename_dst).
+	 *   (5) Compare the original file and returned destination for
+	 *       similarity, and if they are sufficiently similar, record the
+	 *       rename.
+	 *
+	 * This function, idx_possible_rename(), is only responsible for (4).
+	 * The conditions/steps in (1)-(3) will be handled via setting up
+	 * dir_rename_count and dir_rename_guess in a future
+	 * initialize_dir_rename_info() function.  Steps (0) and (5) are
+	 * handled by the caller of this function.
+	 */
+	char *old_dir, *new_dir;
+	struct strbuf new_path = STRBUF_INIT;
+	int idx;
+
+	if (!info->setup)
+		return -1;
+
+	old_dir = get_dirname(filename);
+	new_dir = strmap_get(&info->dir_rename_guess, old_dir);
+	free(old_dir);
+	if (!new_dir)
+		return -1;
+
+	strbuf_addstr(&new_path, new_dir);
+	strbuf_addch(&new_path, '/');
+	strbuf_addstr(&new_path, get_basename(filename));
+
+	idx = strintmap_get(&info->idx_map, new_path.buf);
+	strbuf_release(&new_path);
+	return idx;
 }
 
 static int find_basename_matches(struct diff_options *options,
-				 int minimum_score)
+				 int minimum_score,
+				 struct dir_rename_info *info)
 {
 	/*
 	 * When I checked in early 2020, over 76% of file renames in linux
@@ -494,7 +579,7 @@ static int find_basename_matches(struct diff_options *options,
 			dst_index = strintmap_get(&dests, base);
 			if (src_index == -1 || dst_index == -1) {
 				src_index = i;
-				dst_index = idx_possible_rename(filename);
+				dst_index = idx_possible_rename(filename, info);
 			}
 			if (dst_index == -1)
 				continue;
@@ -677,8 +762,10 @@ void diffcore_rename(struct diff_options *options)
 	int num_destinations, dst_cnt;
 	int num_sources, want_copies;
 	struct progress *progress = NULL;
+	struct dir_rename_info info;
 
 	trace2_region_enter("diff", "setup", options->repo);
+	info.setup = 0;
 	want_copies = (detect_rename == DIFF_DETECT_COPY);
 	if (!minimum_score)
 		minimum_score = DEFAULT_RENAME_SCORE;
@@ -774,7 +861,8 @@ void diffcore_rename(struct diff_options *options)
 		/* Utilize file basenames to quickly find renames. */
 		trace2_region_enter("diff", "basename matches", options->repo);
 		rename_count += find_basename_matches(options,
-						      min_basename_score);
+						      min_basename_score,
+						      &info);
 		trace2_region_leave("diff", "basename matches", options->repo);
 
 		/*
-- 
gitgitgadget


* [PATCH v3 03/10] diffcore-rename: add a mapping of destination names to their indices
  2021-02-26  1:58   ` [PATCH v3 " Elijah Newren via GitGitGadget
  2021-02-26  1:58     ` [PATCH v3 01/10] diffcore-rename: use directory rename guided basename comparisons Elijah Newren via GitGitGadget
  2021-02-26  1:58     ` [PATCH v3 02/10] diffcore-rename: add a new idx_possible_rename function Elijah Newren via GitGitGadget
@ 2021-02-26  1:58     ` Elijah Newren via GitGitGadget
  2021-02-26  1:58     ` [PATCH v3 04/10] Move computation of dir_rename_count from merge-ort to diffcore-rename Elijah Newren via GitGitGadget
                       ` (8 subsequent siblings)
  11 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-26  1:58 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Elijah Newren, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

Compute a mapping of full filename to the index within rename_dst where
that filename is found, and store it in idx_map.  idx_possible_rename()
needs this to quickly find an array entry in rename_dst given the
pathname.
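
The shape of such a name-to-index mapping can be sketched in standalone
C as follows.  This is only an illustration under stated assumptions:
struct dst_entry, struct name_idx, build_idx_map() and idx_map_get() are
hypothetical, and a sorted array searched with bsearch() is swapped in
for the strintmap used by the patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, pared-down view of one rename_dst entry. */
struct dst_entry {
	const char *path;
	int is_rename;	/* already paired by exact rename detection */
};

struct name_idx {
	const char *path;
	int idx;	/* position within the original dst array */
};

static int cmp_name_idx(const void *a, const void *b)
{
	const struct name_idx *x = a, *y = b;
	return strcmp(x->path, y->path);
}

/*
 * Fill map (which must have room for nr entries) with the
 * not-yet-renamed destinations, sorted by path.  Returns the number
 * of entries stored.
 */
static int build_idx_map(const struct dst_entry *dst, int nr,
			 struct name_idx *map)
{
	int i, n = 0;

	assert(nr >= 0);
	for (i = 0; i < nr; i++) {
		if (dst[i].is_rename)
			continue;
		map[n].path = dst[i].path;
		map[n].idx = i;
		n++;
	}
	qsort(map, n, sizeof(*map), cmp_name_idx);
	return n;
}

/* Return the original index of path, or -1 if it is not a candidate. */
static int idx_map_get(const struct name_idx *map, int n, const char *path)
{
	struct name_idx key;
	const struct name_idx *hit;

	key.path = path;
	key.idx = -1;
	hit = bsearch(&key, map, n, sizeof(*map), cmp_name_idx);
	return hit ? hit->idx : -1;
}
```

Entries already marked as renames are left out, so a lookup for such a
path yields -1, matching the "not found" convention used above.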

While at it, add placeholder initializations for dir_rename_count and
dir_rename_guess; these will be more fully populated in subsequent
commits.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index edb0effb6ef4..8eeb8c73664c 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -380,6 +380,45 @@ static char *get_dirname(const char *filename)
 	return slash ? xstrndup(filename, slash - filename) : xstrdup("");
 }
 
+static void initialize_dir_rename_info(struct dir_rename_info *info)
+{
+	int i;
+
+	info->setup = 1;
+
+	strintmap_init_with_options(&info->idx_map, -1, NULL, 0);
+	strmap_init_with_options(&info->dir_rename_guess, NULL, 0);
+	info->dir_rename_count = NULL;
+
+	/*
+	 * Loop setting up both info->idx_map.
+	 */
+	for (i = 0; i < rename_dst_nr; ++i) {
+		/*
+		 * For non-renamed files, make idx_map contain mapping of
+		 *   filename -> index (index within rename_dst, that is)
+		 */
+		if (!rename_dst[i].is_rename) {
+			char *filename = rename_dst[i].p->two->path;
+			strintmap_set(&info->idx_map, filename, i);
+		}
+	}
+}
+
+static void cleanup_dir_rename_info(struct dir_rename_info *info)
+{
+	if (!info->setup)
+		return;
+
+	/* idx_map */
+	strintmap_clear(&info->idx_map);
+
+	/* dir_rename_guess */
+	strmap_clear(&info->dir_rename_guess, 1);
+
+	/* Nothing to do for dir_rename_count, yet */
+}
+
 static const char *get_basename(const char *filename)
 {
 	/*
@@ -858,6 +897,11 @@ void diffcore_rename(struct diff_options *options)
 		remove_unneeded_paths_from_src(want_copies);
 		trace2_region_leave("diff", "cull after exact", options->repo);
 
+		/* Preparation for basename-driven matching. */
+		trace2_region_enter("diff", "dir rename setup", options->repo);
+		initialize_dir_rename_info(&info);
+		trace2_region_leave("diff", "dir rename setup", options->repo);
+
 		/* Utilize file basenames to quickly find renames. */
 		trace2_region_enter("diff", "basename matches", options->repo);
 		rename_count += find_basename_matches(options,
@@ -1026,6 +1070,7 @@ void diffcore_rename(struct diff_options *options)
 		if (rename_dst[i].filespec_to_free)
 			free_filespec(rename_dst[i].filespec_to_free);
 
+	cleanup_dir_rename_info(&info);
 	FREE_AND_NULL(rename_dst);
 	rename_dst_nr = rename_dst_alloc = 0;
 	FREE_AND_NULL(rename_src);
-- 
gitgitgadget



* [PATCH v3 04/10] Move computation of dir_rename_count from merge-ort to diffcore-rename
  2021-02-26  1:58   ` [PATCH v3 " Elijah Newren via GitGitGadget
                       ` (2 preceding siblings ...)
  2021-02-26  1:58     ` [PATCH v3 03/10] diffcore-rename: add a mapping of destination names to their indices Elijah Newren via GitGitGadget
@ 2021-02-26  1:58     ` Elijah Newren via GitGitGadget
  2021-02-26 15:55       ` Derrick Stolee
  2021-02-26  1:58     ` [PATCH v3 05/10] diffcore-rename: add function for clearing dir_rename_count Elijah Newren via GitGitGadget
                       ` (7 subsequent siblings)
  11 siblings, 1 reply; 61+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-26  1:58 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Elijah Newren, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

Move the computation of dir_rename_count from merge-ort.c to
diffcore-rename.c, making slight adjustments to the data structures
based on the move.  While the diffstat looks large, viewing this commit
with --color-moved makes it clear that only about 20 lines changed.

With this patch, the computation of dir_rename_count is still only done
after inexact rename detection, but subsequent commits will add a
preliminary computation of dir_rename_count after exact rename
detection, followed by some updates after inexact rename detection.
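
For readers who have not seen the moved code before, the following
standalone sketch shows which directory pairs a single file rename
contributes to dir_rename_count.  It is a simplification, not the code
in this patch: it copies path components instead of reusing the
in-place NUL trick of dirname_munge(), it assumes every old directory
is present in dirs_removed, and the names strip_last_component() and
implied_dir_renames() are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_PATH_LEN 256

/* Cut `path` at its last '/' and copy the removed component to `tail`. */
static void strip_last_component(char *path, char *tail)
{
	char *slash = strrchr(path, '/');

	if (slash) {
		strcpy(tail, slash + 1);
		*slash = '\0';
	} else {
		strcpy(tail, path);
		*path = '\0';
	}
}

/*
 * Print every old_dir => new_dir pair suggested by one file rename and
 * return the number of pairs.
 */
static int implied_dir_renames(const char *oldname, const char *newname)
{
	char old_dir[MAX_PATH_LEN], new_dir[MAX_PATH_LEN];
	char old_tail[MAX_PATH_LEN], new_tail[MAX_PATH_LEN];
	int first = 1, pairs = 0;

	assert(strlen(oldname) < MAX_PATH_LEN);
	assert(strlen(newname) < MAX_PATH_LEN);
	strcpy(old_dir, oldname);
	strcpy(new_dir, newname);

	for (;;) {
		strip_last_component(old_dir, old_tail);
		strip_last_component(new_dir, new_tail);

		/*
		 * The basenames stripped on the first pass may differ;
		 * every later component must match for the parent
		 * directories to count as renamed too.
		 */
		if (!first && strcmp(old_tail, new_tail))
			break;

		printf("\"%s\" => \"%s\"\n", old_dir, new_dir);
		pairs++;

		/* Stop once either side reaches the toplevel directory. */
		if (!*old_dir || !*new_dir)
			break;
		first = 0;
	}
	return pairs;
}
```

Calling implied_dir_renames("a/b/c/d/e/foo.c",
"a/b/some/thing/else/e/foo.c") reports the pairs for a/b/c/d/e and
a/b/c/d and then stops, because the next stripped components ("d" and
"else") differ.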

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 138 +++++++++++++++++++++++++++++++++++++++++++++-
 diffcore.h        |   5 ++
 merge-ort.c       | 132 +-------------------------------------------
 3 files changed, 145 insertions(+), 130 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 8eeb8c73664c..39e23d57e7bc 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -380,6 +380,129 @@ static char *get_dirname(const char *filename)
 	return slash ? xstrndup(filename, slash - filename) : xstrdup("");
 }
 
+static void dirname_munge(char *filename)
+{
+	char *slash = strrchr(filename, '/');
+	if (!slash)
+		slash = filename;
+	*slash = '\0';
+}
+
+static void increment_count(struct strmap *dir_rename_count,
+			    char *old_dir,
+			    char *new_dir)
+{
+	struct strintmap *counts;
+	struct strmap_entry *e;
+
+	/* Get the {new_dirs -> counts} mapping using old_dir */
+	e = strmap_get_entry(dir_rename_count, old_dir);
+	if (e) {
+		counts = e->value;
+	} else {
+		counts = xmalloc(sizeof(*counts));
+		strintmap_init_with_options(counts, 0, NULL, 1);
+		strmap_put(dir_rename_count, old_dir, counts);
+	}
+
+	/* Increment the count for new_dir */
+	strintmap_incr(counts, new_dir, 1);
+}
+
+static void update_dir_rename_counts(struct strmap *dir_rename_count,
+				     struct strset *dirs_removed,
+				     const char *oldname,
+				     const char *newname)
+{
+	char *old_dir = xstrdup(oldname);
+	char *new_dir = xstrdup(newname);
+	char new_dir_first_char = new_dir[0];
+	int first_time_in_loop = 1;
+
+	while (1) {
+		dirname_munge(old_dir);
+		dirname_munge(new_dir);
+
+		/*
+		 * When renaming
+		 *   "a/b/c/d/e/foo.c" -> "a/b/some/thing/else/e/foo.c"
+		 * then this suggests that both
+		 *   a/b/c/d/e/ => a/b/some/thing/else/e/
+		 *   a/b/c/d/   => a/b/some/thing/else/
+		 * so we want to increment counters for both.  We do NOT,
+		 * however, also want to suggest that there was the following
+		 * rename:
+		 *   a/b/c/ => a/b/some/thing/
+		 * so we need to quit at that point.
+		 *
+		 * Note the when first_time_in_loop, we only strip off the
+		 * basename, and we don't care if that's different.
+		 */
+		if (!first_time_in_loop) {
+			char *old_sub_dir = strchr(old_dir, '\0')+1;
+			char *new_sub_dir = strchr(new_dir, '\0')+1;
+			if (!*new_dir) {
+				/*
+				 * Special case when renaming to root directory,
+				 * i.e. when new_dir == "".  In this case, we had
+				 * something like
+				 *    a/b/subdir => subdir
+				 * and so dirname_munge() sets things up so that
+				 *    old_dir = "a/b\0subdir\0"
+				 *    new_dir = "\0ubdir\0"
+				 * We didn't have a '/' to overwrite a '\0' onto
+				 * in new_dir, so we have to compare differently.
+				 */
+				if (new_dir_first_char != old_sub_dir[0] ||
+				    strcmp(old_sub_dir+1, new_sub_dir))
+					break;
+			} else {
+				if (strcmp(old_sub_dir, new_sub_dir))
+					break;
+			}
+		}
+
+		if (strset_contains(dirs_removed, old_dir))
+			increment_count(dir_rename_count, old_dir, new_dir);
+		else
+			break;
+
+		/* If we hit toplevel directory ("") for old or new dir, quit */
+		if (!*old_dir || !*new_dir)
+			break;
+
+		first_time_in_loop = 0;
+	}
+
+	/* Free resources we don't need anymore */
+	free(old_dir);
+	free(new_dir);
+}
+
+static void compute_dir_rename_counts(struct strmap *dir_rename_count,
+				      struct strset *dirs_removed)
+{
+	int i;
+
+	/* Set up dir_rename_count */
+	for (i = 0; i < rename_dst_nr; ++i) {
+		/* File not part of directory rename counts if not a rename */
+		if (!rename_dst[i].is_rename)
+			continue;
+
+		/*
+		 * Make dir_rename_count contain a map of a map:
+		 *   old_directory -> {new_directory -> count}
+		 * In other words, for every pair look at the directories for
+		 * the old filename and the new filename and count how many
+		 * times that pairing occurs.
+		 */
+		update_dir_rename_counts(dir_rename_count, dirs_removed,
+					 rename_dst[i].p->one->path,
+					 rename_dst[i].p->two->path);
+	}
+}
+
 static void initialize_dir_rename_info(struct dir_rename_info *info)
 {
 	int i;
@@ -790,7 +913,9 @@ static void remove_unneeded_paths_from_src(int detecting_copies)
 	rename_src_nr = new_num_src;
 }
 
-void diffcore_rename(struct diff_options *options)
+void diffcore_rename_extended(struct diff_options *options,
+			      struct strset *dirs_removed,
+			      struct strmap *dir_rename_count)
 {
 	int detect_rename = options->detect_rename;
 	int minimum_score = options->rename_score;
@@ -805,6 +930,7 @@ void diffcore_rename(struct diff_options *options)
 
 	trace2_region_enter("diff", "setup", options->repo);
 	info.setup = 0;
+	assert(!dir_rename_count || strmap_empty(dir_rename_count));
 	want_copies = (detect_rename == DIFF_DETECT_COPY);
 	if (!minimum_score)
 		minimum_score = DEFAULT_RENAME_SCORE;
@@ -999,6 +1125,11 @@ void diffcore_rename(struct diff_options *options)
 	trace2_region_leave("diff", "inexact renames", options->repo);
 
  cleanup:
+	/*
+	 * Now that renames have been computed, compute dir_rename_count */
+	if (dirs_removed && dir_rename_count)
+		compute_dir_rename_counts(dir_rename_count, dirs_removed);
+
 	/* At this point, we have found some renames and copies and they
 	 * are recorded in rename_dst.  The original list is still in *q.
 	 */
@@ -1082,3 +1213,8 @@ void diffcore_rename(struct diff_options *options)
 	trace2_region_leave("diff", "write back to queue", options->repo);
 	return;
 }
+
+void diffcore_rename(struct diff_options *options)
+{
+	diffcore_rename_extended(options, NULL, NULL);
+}
diff --git a/diffcore.h b/diffcore.h
index d2a63c5c71f4..db55d3853071 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -8,6 +8,8 @@
 
 struct diff_options;
 struct repository;
+struct strmap;
+struct strset;
 struct userdiff_driver;
 
 /* This header file is internal between diff.c and its diff transformers
@@ -161,6 +163,9 @@ void diff_q(struct diff_queue_struct *, struct diff_filepair *);
 
 void diffcore_break(struct repository *, int);
 void diffcore_rename(struct diff_options *);
+void diffcore_rename_extended(struct diff_options *options,
+			      struct strset *dirs_removed,
+			      struct strmap *dir_rename_count);
 void diffcore_merge_broken(void);
 void diffcore_pickaxe(struct diff_options *);
 void diffcore_order(const char *orderfile);
diff --git a/merge-ort.c b/merge-ort.c
index 603d30c52170..c4467e073b45 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -1302,131 +1302,6 @@ static char *handle_path_level_conflicts(struct merge_options *opt,
 	return new_path;
 }
 
-static void dirname_munge(char *filename)
-{
-	char *slash = strrchr(filename, '/');
-	if (!slash)
-		slash = filename;
-	*slash = '\0';
-}
-
-static void increment_count(struct strmap *dir_rename_count,
-			    char *old_dir,
-			    char *new_dir)
-{
-	struct strintmap *counts;
-	struct strmap_entry *e;
-
-	/* Get the {new_dirs -> counts} mapping using old_dir */
-	e = strmap_get_entry(dir_rename_count, old_dir);
-	if (e) {
-		counts = e->value;
-	} else {
-		counts = xmalloc(sizeof(*counts));
-		strintmap_init_with_options(counts, 0, NULL, 1);
-		strmap_put(dir_rename_count, old_dir, counts);
-	}
-
-	/* Increment the count for new_dir */
-	strintmap_incr(counts, new_dir, 1);
-}
-
-static void update_dir_rename_counts(struct strmap *dir_rename_count,
-				     struct strset *dirs_removed,
-				     const char *oldname,
-				     const char *newname)
-{
-	char *old_dir = xstrdup(oldname);
-	char *new_dir = xstrdup(newname);
-	char new_dir_first_char = new_dir[0];
-	int first_time_in_loop = 1;
-
-	while (1) {
-		dirname_munge(old_dir);
-		dirname_munge(new_dir);
-
-		/*
-		 * When renaming
-		 *   "a/b/c/d/e/foo.c" -> "a/b/some/thing/else/e/foo.c"
-		 * then this suggests that both
-		 *   a/b/c/d/e/ => a/b/some/thing/else/e/
-		 *   a/b/c/d/   => a/b/some/thing/else/
-		 * so we want to increment counters for both.  We do NOT,
-		 * however, also want to suggest that there was the following
-		 * rename:
-		 *   a/b/c/ => a/b/some/thing/
-		 * so we need to quit at that point.
-		 *
-		 * Note the when first_time_in_loop, we only strip off the
-		 * basename, and we don't care if that's different.
-		 */
-		if (!first_time_in_loop) {
-			char *old_sub_dir = strchr(old_dir, '\0')+1;
-			char *new_sub_dir = strchr(new_dir, '\0')+1;
-			if (!*new_dir) {
-				/*
-				 * Special case when renaming to root directory,
-				 * i.e. when new_dir == "".  In this case, we had
-				 * something like
-				 *    a/b/subdir => subdir
-				 * and so dirname_munge() sets things up so that
-				 *    old_dir = "a/b\0subdir\0"
-				 *    new_dir = "\0ubdir\0"
-				 * We didn't have a '/' to overwrite a '\0' onto
-				 * in new_dir, so we have to compare differently.
-				 */
-				if (new_dir_first_char != old_sub_dir[0] ||
-				    strcmp(old_sub_dir+1, new_sub_dir))
-					break;
-			} else {
-				if (strcmp(old_sub_dir, new_sub_dir))
-					break;
-			}
-		}
-
-		if (strset_contains(dirs_removed, old_dir))
-			increment_count(dir_rename_count, old_dir, new_dir);
-		else
-			break;
-
-		/* If we hit toplevel directory ("") for old or new dir, quit */
-		if (!*old_dir || !*new_dir)
-			break;
-
-		first_time_in_loop = 0;
-	}
-
-	/* Free resources we don't need anymore */
-	free(old_dir);
-	free(new_dir);
-}
-
-static void compute_rename_counts(struct diff_queue_struct *pairs,
-				  struct strmap *dir_rename_count,
-				  struct strset *dirs_removed)
-{
-	int i;
-
-	for (i = 0; i < pairs->nr; ++i) {
-		struct diff_filepair *pair = pairs->queue[i];
-
-		/* File not part of directory rename if it wasn't renamed */
-		if (pair->status != 'R')
-			continue;
-
-		/*
-		 * Make dir_rename_count contain a map of a map:
-		 *   old_directory -> {new_directory -> count}
-		 * In other words, for every pair look at the directories for
-		 * the old filename and the new filename and count how many
-		 * times that pairing occurs.
-		 */
-		update_dir_rename_counts(dir_rename_count, dirs_removed,
-					 pair->one->path,
-					 pair->two->path);
-	}
-}
-
 static void get_provisional_directory_renames(struct merge_options *opt,
 					      unsigned side,
 					      int *clean)
@@ -1435,9 +1310,6 @@ static void get_provisional_directory_renames(struct merge_options *opt,
 	struct strmap_entry *entry;
 	struct rename_info *renames = &opt->priv->renames;
 
-	compute_rename_counts(&renames->pairs[side],
-			      &renames->dir_rename_count[side],
-			      &renames->dirs_removed[side]);
 	/*
 	 * Collapse
 	 *    dir_rename_count: old_directory -> {new_directory -> count}
@@ -2162,7 +2034,9 @@ static void detect_regular_renames(struct merge_options *opt,
 
 	diff_queued_diff = renames->pairs[side_index];
 	trace2_region_enter("diff", "diffcore_rename", opt->repo);
-	diffcore_rename(&diff_opts);
+	diffcore_rename_extended(&diff_opts,
+				 &renames->dirs_removed[side_index],
+				 &renames->dir_rename_count[side_index]);
 	trace2_region_leave("diff", "diffcore_rename", opt->repo);
 	resolve_diffpair_statuses(&diff_queued_diff);
 
-- 
gitgitgadget



* [PATCH v3 05/10] diffcore-rename: add function for clearing dir_rename_count
  2021-02-26  1:58   ` [PATCH v3 " Elijah Newren via GitGitGadget
                       ` (3 preceding siblings ...)
  2021-02-26  1:58     ` [PATCH v3 04/10] Move computation of dir_rename_count from merge-ort to diffcore-rename Elijah Newren via GitGitGadget
@ 2021-02-26  1:58     ` Elijah Newren via GitGitGadget
  2021-02-26  1:58     ` [PATCH v3 06/10] diffcore-rename: move dir_rename_counts into dir_rename_info struct Elijah Newren via GitGitGadget
                       ` (6 subsequent siblings)
  11 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-26  1:58 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Elijah Newren, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

As we adjust the usage of dir_rename_count, we want a function for
clearing it out, either fully or partially.  Add a
partial_clear_dir_rename_count() function for this purpose.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 12 ++++++++++++
 diffcore.h        |  2 ++
 merge-ort.c       | 12 +++---------
 3 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 39e23d57e7bc..7dd475ff9a9f 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -528,6 +528,18 @@ static void initialize_dir_rename_info(struct dir_rename_info *info)
 	}
 }
 
+void partial_clear_dir_rename_count(struct strmap *dir_rename_count)
+{
+	struct hashmap_iter iter;
+	struct strmap_entry *entry;
+
+	strmap_for_each_entry(dir_rename_count, &iter, entry) {
+		struct strintmap *counts = entry->value;
+		strintmap_clear(counts);
+	}
+	strmap_partial_clear(dir_rename_count, 1);
+}
+
 static void cleanup_dir_rename_info(struct dir_rename_info *info)
 {
 	if (!info->setup)
diff --git a/diffcore.h b/diffcore.h
index db55d3853071..c6ba64abd198 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -161,6 +161,8 @@ struct diff_filepair *diff_queue(struct diff_queue_struct *,
 				 struct diff_filespec *);
 void diff_q(struct diff_queue_struct *, struct diff_filepair *);
 
+void partial_clear_dir_rename_count(struct strmap *dir_rename_count);
+
 void diffcore_break(struct repository *, int);
 void diffcore_rename(struct diff_options *);
 void diffcore_rename_extended(struct diff_options *options,
diff --git a/merge-ort.c b/merge-ort.c
index c4467e073b45..467404cc0a35 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -351,17 +351,11 @@ static void clear_or_reinit_internal_opts(struct merge_options_internal *opti,
 
 	/* Free memory used by various renames maps */
 	for (i = MERGE_SIDE1; i <= MERGE_SIDE2; ++i) {
-		struct hashmap_iter iter;
-		struct strmap_entry *entry;
-
 		strset_func(&renames->dirs_removed[i]);
 
-		strmap_for_each_entry(&renames->dir_rename_count[i],
-				      &iter, entry) {
-			struct strintmap *counts = entry->value;
-			strintmap_clear(counts);
-		}
-		strmap_func(&renames->dir_rename_count[i], 1);
+		partial_clear_dir_rename_count(&renames->dir_rename_count[i]);
+		if (!reinitialize)
+			strmap_clear(&renames->dir_rename_count[i], 1);
 
 		strmap_func(&renames->dir_renames[i], 0);
 	}
-- 
gitgitgadget



* [PATCH v3 06/10] diffcore-rename: move dir_rename_counts into dir_rename_info struct
  2021-02-26  1:58   ` [PATCH v3 " Elijah Newren via GitGitGadget
                       ` (4 preceding siblings ...)
  2021-02-26  1:58     ` [PATCH v3 05/10] diffcore-rename: add function for clearing dir_rename_count Elijah Newren via GitGitGadget
@ 2021-02-26  1:58     ` Elijah Newren via GitGitGadget
  2021-02-26  1:58     ` [PATCH v3 07/10] diffcore-rename: extend cleanup_dir_rename_info() Elijah Newren via GitGitGadget
                       ` (5 subsequent siblings)
  11 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-26  1:58 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Elijah Newren, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

This continues the migration of the directory rename detection code into
diffcore-rename, now taking the simple step of combining it with the
dir_rename_info struct.  Future commits will then compute
dir_rename_count in stages, and add computation of dir_rename_guess.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 27 ++++++++++++++++-----------
 1 file changed, 16 insertions(+), 11 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 7dd475ff9a9f..a1ccf14001f5 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -388,7 +388,7 @@ static void dirname_munge(char *filename)
 	*slash = '\0';
 }
 
-static void increment_count(struct strmap *dir_rename_count,
+static void increment_count(struct dir_rename_info *info,
 			    char *old_dir,
 			    char *new_dir)
 {
@@ -396,20 +396,20 @@ static void increment_count(struct strmap *dir_rename_count,
 	struct strmap_entry *e;
 
 	/* Get the {new_dirs -> counts} mapping using old_dir */
-	e = strmap_get_entry(dir_rename_count, old_dir);
+	e = strmap_get_entry(info->dir_rename_count, old_dir);
 	if (e) {
 		counts = e->value;
 	} else {
 		counts = xmalloc(sizeof(*counts));
 		strintmap_init_with_options(counts, 0, NULL, 1);
-		strmap_put(dir_rename_count, old_dir, counts);
+		strmap_put(info->dir_rename_count, old_dir, counts);
 	}
 
 	/* Increment the count for new_dir */
 	strintmap_incr(counts, new_dir, 1);
 }
 
-static void update_dir_rename_counts(struct strmap *dir_rename_count,
+static void update_dir_rename_counts(struct dir_rename_info *info,
 				     struct strset *dirs_removed,
 				     const char *oldname,
 				     const char *newname)
@@ -463,7 +463,7 @@ static void update_dir_rename_counts(struct strmap *dir_rename_count,
 		}
 
 		if (strset_contains(dirs_removed, old_dir))
-			increment_count(dir_rename_count, old_dir, new_dir);
+			increment_count(info, old_dir, new_dir);
 		else
 			break;
 
@@ -479,12 +479,15 @@ static void update_dir_rename_counts(struct strmap *dir_rename_count,
 	free(new_dir);
 }
 
-static void compute_dir_rename_counts(struct strmap *dir_rename_count,
-				      struct strset *dirs_removed)
+static void compute_dir_rename_counts(struct dir_rename_info *info,
+				      struct strset *dirs_removed,
+				      struct strmap *dir_rename_count)
 {
 	int i;
 
-	/* Set up dir_rename_count */
+	info->setup = 1;
+	info->dir_rename_count = dir_rename_count;
+
 	for (i = 0; i < rename_dst_nr; ++i) {
 		/* File not part of directory rename counts if not a rename */
 		if (!rename_dst[i].is_rename)
@@ -497,7 +500,7 @@ static void compute_dir_rename_counts(struct strmap *dir_rename_count,
 		 * the old filename and the new filename and count how many
 		 * times that pairing occurs.
 		 */
-		update_dir_rename_counts(dir_rename_count, dirs_removed,
+		update_dir_rename_counts(info, dirs_removed,
 					 rename_dst[i].p->one->path,
 					 rename_dst[i].p->two->path);
 	}
@@ -551,7 +554,9 @@ static void cleanup_dir_rename_info(struct dir_rename_info *info)
 	/* dir_rename_guess */
 	strmap_clear(&info->dir_rename_guess, 1);
 
-	/* Nothing to do for dir_rename_count, yet */
+	/* dir_rename_count */
+	partial_clear_dir_rename_count(info->dir_rename_count);
+	strmap_clear(info->dir_rename_count, 1);
 }
 
 static const char *get_basename(const char *filename)
@@ -1140,7 +1145,7 @@ void diffcore_rename_extended(struct diff_options *options,
 	/*
 	 * Now that renames have been computed, compute dir_rename_count */
 	if (dirs_removed && dir_rename_count)
-		compute_dir_rename_counts(dir_rename_count, dirs_removed);
+		compute_dir_rename_counts(&info, dirs_removed, dir_rename_count);
 
 	/* At this point, we have found some renames and copies and they
 	 * are recorded in rename_dst.  The original list is still in *q.
-- 
gitgitgadget



* [PATCH v3 07/10] diffcore-rename: extend cleanup_dir_rename_info()
  2021-02-26  1:58   ` [PATCH v3 " Elijah Newren via GitGitGadget
                       ` (5 preceding siblings ...)
  2021-02-26  1:58     ` [PATCH v3 06/10] diffcore-rename: move dir_rename_counts into dir_rename_info struct Elijah Newren via GitGitGadget
@ 2021-02-26  1:58     ` Elijah Newren via GitGitGadget
  2021-02-26  1:58     ` [PATCH v3 08/10] diffcore-rename: compute dir_rename_counts in stages Elijah Newren via GitGitGadget
                       ` (4 subsequent siblings)
  11 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-26  1:58 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Elijah Newren, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

When diffcore_rename_extended() is passed a NULL dir_rename_count, we
will still want to create a temporary one for use by
find_basename_matches(), but have it fully deallocated before
diffcore_rename_extended() returns.  When it is passed a non-NULL
dir_rename_count, on the other hand, we want to fill that strmap with
appropriate values and return it.  For our interim purposes, though, we
may also add entries corresponding to directories that cannot have been
renamed because they still exist on both sides; any data associated
with such directories needs to be removed before returning.

Extend cleanup_dir_rename_info() to handle these two different cases,
cleaning up the relevant bits of information for each case.
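
The two cases can be sketched in isolation as follows.  This is only an
illustration under simplifying assumptions: struct dir_counts is a
hypothetical flat list standing in for the strmap of strintmaps, and
in_set() and cleanup_counts() are made-up names rather than functions
from this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for dir_rename_count: a list of source dirs. */
struct dir_counts {
	const char **source_dirs;
	size_t nr;
};

static bool in_set(const char *const *set, size_t set_nr, const char *key)
{
	for (size_t i = 0; i < set_nr; i++)
		if (!strcmp(set[i], key))
			return true;
	return false;
}

/*
 * If `keep` is false the table was temporary: free it and return NULL.
 * Otherwise the caller owns it: drop entries whose source directory is
 * not in dirs_removed and hand the pruned table back.
 */
static struct dir_counts *cleanup_counts(struct dir_counts *counts,
					 const char *const *dirs_removed,
					 size_t dirs_removed_nr,
					 bool keep)
{
	size_t kept = 0;

	assert(counts);
	if (!keep) {
		free(counts->source_dirs);
		free(counts);
		return NULL;
	}

	/* Compact in place, keeping only directories that were removed. */
	for (size_t i = 0; i < counts->nr; i++)
		if (in_set(dirs_removed, dirs_removed_nr,
			   counts->source_dirs[i]))
			counts->source_dirs[kept++] = counts->source_dirs[i];
	counts->nr = kept;
	return counts;
}
```

The sketch compacts the list in place; the patch instead collects keys
in a string_list first and removes them afterwards, so that the map is
not modified while it is being iterated.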

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 40 ++++++++++++++++++++++++++++++++++++----
 1 file changed, 36 insertions(+), 4 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index a1ccf14001f5..2cf9c47c6364 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -543,8 +543,15 @@ void partial_clear_dir_rename_count(struct strmap *dir_rename_count)
 	strmap_partial_clear(dir_rename_count, 1);
 }
 
-static void cleanup_dir_rename_info(struct dir_rename_info *info)
+static void cleanup_dir_rename_info(struct dir_rename_info *info,
+				    struct strset *dirs_removed,
+				    int keep_dir_rename_count)
 {
+	struct hashmap_iter iter;
+	struct strmap_entry *entry;
+	struct string_list to_remove = STRING_LIST_INIT_NODUP;
+	int i;
+
 	if (!info->setup)
 		return;
 
@@ -555,8 +562,33 @@ static void cleanup_dir_rename_info(struct dir_rename_info *info)
 	strmap_clear(&info->dir_rename_guess, 1);
 
 	/* dir_rename_count */
-	partial_clear_dir_rename_count(info->dir_rename_count);
-	strmap_clear(info->dir_rename_count, 1);
+	if (!keep_dir_rename_count) {
+		partial_clear_dir_rename_count(info->dir_rename_count);
+		strmap_clear(info->dir_rename_count, 1);
+		FREE_AND_NULL(info->dir_rename_count);
+		return;
+	}
+
+	/*
+	 * Although dir_rename_count was passed in
+	 * diffcore_rename_extended() and we want to keep it around and
+	 * return it to that caller, we first want to remove any data
+	 * associated with directories that weren't renamed.
+	 */
+	strmap_for_each_entry(info->dir_rename_count, &iter, entry) {
+		const char *source_dir = entry->key;
+		struct strintmap *counts = entry->value;
+
+		if (!strset_contains(dirs_removed, source_dir)) {
+			string_list_append(&to_remove, source_dir);
+			strintmap_clear(counts);
+			continue;
+		}
+	}
+	for (i = 0; i < to_remove.nr; ++i)
+		strmap_remove(info->dir_rename_count,
+			      to_remove.items[i].string, 1);
+	string_list_clear(&to_remove, 0);
 }
 
 static const char *get_basename(const char *filename)
@@ -1218,7 +1250,7 @@ void diffcore_rename_extended(struct diff_options *options,
 		if (rename_dst[i].filespec_to_free)
 			free_filespec(rename_dst[i].filespec_to_free);
 
-	cleanup_dir_rename_info(&info);
+	cleanup_dir_rename_info(&info, dirs_removed, dir_rename_count != NULL);
 	FREE_AND_NULL(rename_dst);
 	rename_dst_nr = rename_dst_alloc = 0;
 	FREE_AND_NULL(rename_src);
-- 
gitgitgadget



* [PATCH v3 08/10] diffcore-rename: compute dir_rename_counts in stages
  2021-02-26  1:58   ` [PATCH v3 " Elijah Newren via GitGitGadget
                       ` (6 preceding siblings ...)
  2021-02-26  1:58     ` [PATCH v3 07/10] diffcore-rename: extend cleanup_dir_rename_info() Elijah Newren via GitGitGadget
@ 2021-02-26  1:58     ` Elijah Newren via GitGitGadget
  2021-02-26  1:58     ` [PATCH v3 09/10] diffcore-rename: limit dir_rename_counts computation to relevant dirs Elijah Newren via GitGitGadget
                       ` (3 subsequent siblings)
  11 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-26  1:58 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Elijah Newren, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

Compute dir_rename_counts based just on exact renames to start, as that
can provide useful information to find_basename_matches().  This is
done by moving the code from compute_dir_rename_counts() into
initialize_dir_rename_info(), resulting in it being computed earlier and
based just on exact renames.  Since that's an incomplete result, we
augment the counts by calling update_dir_rename_counts() after each
basename-guided or inexact rename detection match is found.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 110 +++++++++++++++++++++++++++++-----------------
 1 file changed, 70 insertions(+), 40 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 2cf9c47c6364..10f8f4a301e3 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -419,6 +419,28 @@ static void update_dir_rename_counts(struct dir_rename_info *info,
 	char new_dir_first_char = new_dir[0];
 	int first_time_in_loop = 1;
 
+	if (!info->setup)
+		/*
+		 * info->setup is 0 here in two cases: (1) all auxiliary
+		 * vars (like dirs_removed) were NULL so
+		 * initialize_dir_rename_info() returned early, or (2)
+		 * either break detection or copy detection are active so
+		 * that we never called initialize_dir_rename_info().  In
+		 * the former case, we don't have enough info to know if
+		 * directories were renamed (because dirs_removed lets us
+		 * know about a necessary prerequisite, namely if they were
+		 * removed), and in the latter, we don't care about
+		 * directory renames or find_basename_matches.
+		 *
+		 * This matters because both basename and inexact matching
+		 * will also call update_dir_rename_counts().  In either of
+		 * the above two cases info->dir_rename_counts will not
+		 * have been properly initialized which prevents us from
+		 * updating it, but in these two cases we don't care about
+		 * dir_rename_counts anyway, so we can just exit early.
+		 */
+		return;
+
 	while (1) {
 		dirname_munge(old_dir);
 		dirname_munge(new_dir);
@@ -479,45 +501,29 @@ static void update_dir_rename_counts(struct dir_rename_info *info,
 	free(new_dir);
 }
 
-static void compute_dir_rename_counts(struct dir_rename_info *info,
-				      struct strset *dirs_removed,
-				      struct strmap *dir_rename_count)
+static void initialize_dir_rename_info(struct dir_rename_info *info,
+				       struct strset *dirs_removed,
+				       struct strmap *dir_rename_count)
 {
 	int i;
 
-	info->setup = 1;
-	info->dir_rename_count = dir_rename_count;
-
-	for (i = 0; i < rename_dst_nr; ++i) {
-		/* File not part of directory rename counts if not a rename */
-		if (!rename_dst[i].is_rename)
-			continue;
-
-		/*
-		 * Make dir_rename_count contain a map of a map:
-		 *   old_directory -> {new_directory -> count}
-		 * In other words, for every pair look at the directories for
-		 * the old filename and the new filename and count how many
-		 * times that pairing occurs.
-		 */
-		update_dir_rename_counts(info, dirs_removed,
-					 rename_dst[i].p->one->path,
-					 rename_dst[i].p->two->path);
+	if (!dirs_removed) {
+		info->setup = 0;
+		return;
 	}
-}
-
-static void initialize_dir_rename_info(struct dir_rename_info *info)
-{
-	int i;
-
 	info->setup = 1;
 
+	info->dir_rename_count = dir_rename_count;
+	if (!info->dir_rename_count) {
+		info->dir_rename_count = xmalloc(sizeof(*dir_rename_count));
+		strmap_init(info->dir_rename_count);
+	}
 	strintmap_init_with_options(&info->idx_map, -1, NULL, 0);
 	strmap_init_with_options(&info->dir_rename_guess, NULL, 0);
-	info->dir_rename_count = NULL;
 
 	/*
-	 * Loop setting up both info->idx_map.
+	 * Loop setting up both info->idx_map, and doing setup of
+	 * info->dir_rename_count.
 	 */
 	for (i = 0; i < rename_dst_nr; ++i) {
 		/*
@@ -527,7 +533,20 @@ static void initialize_dir_rename_info(struct dir_rename_info *info)
 		if (!rename_dst[i].is_rename) {
 			char *filename = rename_dst[i].p->two->path;
 			strintmap_set(&info->idx_map, filename, i);
+			continue;
 		}
+
+		/*
+		 * For everything else (i.e. renamed files), make
+		 * dir_rename_count contain a map of a map:
+		 *   old_directory -> {new_directory -> count}
+		 * In other words, for every pair look at the directories for
+		 * the old filename and the new filename and count how many
+		 * times that pairing occurs.
+		 */
+		update_dir_rename_counts(info, dirs_removed,
+					 rename_dst[i].p->one->path,
+					 rename_dst[i].p->two->path);
 	}
 }
 
@@ -682,7 +701,8 @@ static int idx_possible_rename(char *filename, struct dir_rename_info *info)
 
 static int find_basename_matches(struct diff_options *options,
 				 int minimum_score,
-				 struct dir_rename_info *info)
+				 struct dir_rename_info *info,
+				 struct strset *dirs_removed)
 {
 	/*
 	 * When I checked in early 2020, over 76% of file renames in linux
@@ -810,6 +830,8 @@ static int find_basename_matches(struct diff_options *options,
 				continue;
 			record_rename_pair(dst_index, src_index, score);
 			renames++;
+			update_dir_rename_counts(info, dirs_removed,
+						 one->path, two->path);
 
 			/*
 			 * Found a rename so don't need text anymore; if we
@@ -893,7 +915,12 @@ static int too_many_rename_candidates(int num_destinations, int num_sources,
 	return 1;
 }
 
-static int find_renames(struct diff_score *mx, int dst_cnt, int minimum_score, int copies)
+static int find_renames(struct diff_score *mx,
+			int dst_cnt,
+			int minimum_score,
+			int copies,
+			struct dir_rename_info *info,
+			struct strset *dirs_removed)
 {
 	int count = 0, i;
 
@@ -910,6 +937,9 @@ static int find_renames(struct diff_score *mx, int dst_cnt, int minimum_score, i
 			continue;
 		record_rename_pair(mx[i].dst, mx[i].src, mx[i].score);
 		count++;
+		update_dir_rename_counts(info, dirs_removed,
+					 rename_src[mx[i].src].p->one->path,
+					 rename_dst[mx[i].dst].p->two->path);
 	}
 	return count;
 }
@@ -981,6 +1011,8 @@ void diffcore_rename_extended(struct diff_options *options,
 	info.setup = 0;
 	assert(!dir_rename_count || strmap_empty(dir_rename_count));
 	want_copies = (detect_rename == DIFF_DETECT_COPY);
+	if (dirs_removed && (break_idx || want_copies))
+		BUG("dirs_removed incompatible with break/copy detection");
 	if (!minimum_score)
 		minimum_score = DEFAULT_RENAME_SCORE;
 
@@ -1074,14 +1106,15 @@ void diffcore_rename_extended(struct diff_options *options,
 
 		/* Preparation for basename-driven matching. */
 		trace2_region_enter("diff", "dir rename setup", options->repo);
-		initialize_dir_rename_info(&info);
+		initialize_dir_rename_info(&info,
+					   dirs_removed, dir_rename_count);
 		trace2_region_leave("diff", "dir rename setup", options->repo);
 
 		/* Utilize file basenames to quickly find renames. */
 		trace2_region_enter("diff", "basename matches", options->repo);
 		rename_count += find_basename_matches(options,
 						      min_basename_score,
-						      &info);
+						      &info, dirs_removed);
 		trace2_region_leave("diff", "basename matches", options->repo);
 
 		/*
@@ -1167,18 +1200,15 @@ void diffcore_rename_extended(struct diff_options *options,
 	/* cost matrix sorted by most to least similar pair */
 	STABLE_QSORT(mx, dst_cnt * NUM_CANDIDATE_PER_DST, score_compare);
 
-	rename_count += find_renames(mx, dst_cnt, minimum_score, 0);
+	rename_count += find_renames(mx, dst_cnt, minimum_score, 0,
+				     &info, dirs_removed);
 	if (want_copies)
-		rename_count += find_renames(mx, dst_cnt, minimum_score, 1);
+		rename_count += find_renames(mx, dst_cnt, minimum_score, 1,
+					     &info, dirs_removed);
 	free(mx);
 	trace2_region_leave("diff", "inexact renames", options->repo);
 
  cleanup:
-	/*
-	 * Now that renames have been computed, compute dir_rename_count */
-	if (dirs_removed && dir_rename_count)
-		compute_dir_rename_counts(&info, dirs_removed, dir_rename_count);
-
 	/* At this point, we have found some renames and copies and they
 	 * are recorded in rename_dst.  The original list is still in *q.
 	 */
-- 
gitgitgadget



* [PATCH v3 09/10] diffcore-rename: limit dir_rename_counts computation to relevant dirs
  2021-02-26  1:58   ` [PATCH v3 " Elijah Newren via GitGitGadget
                       ` (7 preceding siblings ...)
  2021-02-26  1:58     ` [PATCH v3 08/10] diffcore-rename: compute dir_rename_counts in stages Elijah Newren via GitGitGadget
@ 2021-02-26  1:58     ` Elijah Newren via GitGitGadget
  2021-02-26  1:58     ` [PATCH v3 10/10] diffcore-rename: compute dir_rename_guess from dir_rename_counts Elijah Newren via GitGitGadget
                       ` (2 subsequent siblings)
  11 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-26  1:58 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Elijah Newren, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

We are using dir_rename_counts to count, for each source directory, how
many of its files moved to each other directory.  We only need this
information for directories that disappeared, though, so we can return
early from update_dir_rename_counts() for other paths.

If dirs_removed is passed to diffcore_rename_extended(), then it
provides the relevant bits of information for us to limit this counting
to relevant dirs.  If dirs_removed is not passed, we would need to
compute some replacement in order to do this limiting.  Introduce a new
info->relevant_source_dirs variable for this purpose, even though at
this stage we will only set it to dirs_removed for simplicity.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 10 ++++++++++
 1 file changed, 10 insertions(+)
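
Illustration only, not part of the patch: a minimal standalone sketch of
the early exit described in the commit message above.  It assumes a
NULL-terminated string array as a stand-in for git's strset, and the
names set_contains(), chop_last_component() and
count_relevant_ancestors() are hypothetical.  The in-place chop is
modeled on dirname_munge().

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for a strset: a NULL-terminated array of names. */
static int set_contains(const char *const *set, const char *name)
{
	for (; *set; set++)
		if (!strcmp(*set, name))
			return 1;
	return 0;
}

/* Chop the last path component off in place ("a/b/c" becomes "a/b"). */
static void chop_last_component(char *path)
{
	char *slash = strrchr(path, '/');
	if (!slash)
		slash = path;
	*slash = '\0';
}

/*
 * Walk up the ancestor directories of old_path and count how many are
 * visited before an irrelevant one stops the walk.  A NULL 'relevant'
 * means "no filter", so every ancestor up to the toplevel is visited.
 */
static int count_relevant_ancestors(const char *old_path,
				    const char *const *relevant)
{
	char buf[256];
	int n = 0;

	/* Sketch-only precondition: the path fits in the local buffer. */
	assert(strlen(old_path) < sizeof(buf));
	strcpy(buf, old_path);

	while (1) {
		chop_last_component(buf);
		if (relevant && !set_contains(relevant, buf))
			break;	/* not a relevant source dir: stop early */
		n++;
		if (!*buf)
			break;	/* reached the toplevel directory */
	}
	return n;
}
```

With only "a/b" marked relevant, the walk for "a/b/c.txt" visits "a/b"
and stops at "a"; without a filter it visits "a/b", "a" and the toplevel.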

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 10f8f4a301e3..e5fa0cb555dd 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -371,6 +371,7 @@ struct dir_rename_info {
 	struct strintmap idx_map;
 	struct strmap dir_rename_guess;
 	struct strmap *dir_rename_count;
+	struct strset *relevant_source_dirs;
 	unsigned setup;
 };
 
@@ -442,7 +443,13 @@ static void update_dir_rename_counts(struct dir_rename_info *info,
 		return;
 
 	while (1) {
+		/* Get old_dir, skip if its directory isn't relevant. */
 		dirname_munge(old_dir);
+		if (info->relevant_source_dirs &&
+		    !strset_contains(info->relevant_source_dirs, old_dir))
+			break;
+
+		/* Get new_dir */
 		dirname_munge(new_dir);
 
 		/*
@@ -521,6 +528,9 @@ static void initialize_dir_rename_info(struct dir_rename_info *info,
 	strintmap_init_with_options(&info->idx_map, -1, NULL, 0);
 	strmap_init_with_options(&info->dir_rename_guess, NULL, 0);
 
+	/* Setup info->relevant_source_dirs */
+	info->relevant_source_dirs = dirs_removed;
+
 	/*
 	 * Loop setting up both info->idx_map, and doing setup of
 	 * info->dir_rename_count.
-- 
gitgitgadget



* [PATCH v3 10/10] diffcore-rename: compute dir_rename_guess from dir_rename_counts
  2021-02-26  1:58   ` [PATCH v3 " Elijah Newren via GitGitGadget
                       ` (8 preceding siblings ...)
  2021-02-26  1:58     ` [PATCH v3 09/10] diffcore-rename: limit dir_rename_counts computation to relevant dirs Elijah Newren via GitGitGadget
@ 2021-02-26  1:58     ` Elijah Newren via GitGitGadget
  2021-02-26 16:34     ` [PATCH v3 00/10] Optimization batch 8: use file basenames even more Derrick Stolee
  2021-02-27  0:30     ` [PATCH v4 " Elijah Newren via GitGitGadget
  11 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-26  1:58 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Elijah Newren, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

dir_rename_counts is a mapping of mappings; in particular, it has
   old_dir => { new_dir => count }
We want a simple mapping of
   old_dir => new_dir
based on which new_dir had the highest count for a given old_dir.
Compute this and store it in dir_rename_guess.

This is the final piece of the puzzle needed to make our guesses at
which directory files have been moved to when basenames aren't unique.

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:

                            Before                  After
    no-renames:       12.775 s ±  0.062 s    12.596 s ±  0.061 s
    mega-renames:    188.754 s ±  0.284 s   130.465 s ±  0.259 s
    just-one-mega:     5.599 s ±  0.019 s     3.958 s ±  0.010 s

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 45 +++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 41 insertions(+), 4 deletions(-)
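
Illustration only, not part of the patch: a minimal standalone sketch of
collapsing one old_dir's { new_dir => count } entries down to a single
best new_dir.  It assumes a plain array in place of git's strintmap; the
struct dir_count type and highest_count_dir() name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical: one "new_dir => count" entry for a given old_dir. */
struct dir_count {
	const char *new_dir;
	int count;
};

/*
 * Return the new_dir with the highest count, or NULL when there are no
 * entries with a positive count.
 */
static const char *highest_count_dir(const struct dir_count *counts,
				     size_t nr)
{
	int highest = 0;
	const char *best = NULL;
	size_t i;

	assert(counts || !nr);
	for (i = 0; i < nr; i++) {
		if (counts[i].count > highest) {
			highest = counts[i].count;
			best = counts[i].new_dir;
		}
	}
	return best;
}
```

Because the comparison is a strict '>', a tie goes to whichever entry is
visited first.  In this array sketch that is array order; with a hash map
the visiting order depends on the map's iteration order.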

diff --git a/diffcore-rename.c b/diffcore-rename.c
index e5fa0cb555dd..1fe902ed2af0 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -389,6 +389,24 @@ static void dirname_munge(char *filename)
 	*slash = '\0';
 }
 
+static const char *get_highest_rename_path(struct strintmap *counts)
+{
+	int highest_count = 0;
+	const char *highest_destination_dir = NULL;
+	struct hashmap_iter iter;
+	struct strmap_entry *entry;
+
+	strintmap_for_each_entry(counts, &iter, entry) {
+		const char *destination_dir = entry->key;
+		intptr_t count = (intptr_t)entry->value;
+		if (count > highest_count) {
+			highest_count = count;
+			highest_destination_dir = destination_dir;
+		}
+	}
+	return highest_destination_dir;
+}
+
 static void increment_count(struct dir_rename_info *info,
 			    char *old_dir,
 			    char *new_dir)
@@ -512,6 +530,8 @@ static void initialize_dir_rename_info(struct dir_rename_info *info,
 				       struct strset *dirs_removed,
 				       struct strmap *dir_rename_count)
 {
+	struct hashmap_iter iter;
+	struct strmap_entry *entry;
 	int i;
 
 	if (!dirs_removed) {
@@ -558,6 +578,23 @@ static void initialize_dir_rename_info(struct dir_rename_info *info,
 					 rename_dst[i].p->one->path,
 					 rename_dst[i].p->two->path);
 	}
+
+	/*
+	 * Now we collapse
+	 *    dir_rename_count: old_directory -> {new_directory -> count}
+	 * down to
+	 *    dir_rename_guess: old_directory -> best_new_directory
+	 * where best_new_directory is the one with the highest count.
+	 */
+	strmap_for_each_entry(info->dir_rename_count, &iter, entry) {
+		/* entry->key is source_dir */
+		struct strintmap *counts = entry->value;
+		char *best_newdir;
+
+		best_newdir = xstrdup(get_highest_rename_path(counts));
+		strmap_put(&info->dir_rename_guess, entry->key,
+			   best_newdir);
+	}
 }
 
 void partial_clear_dir_rename_count(struct strmap *dir_rename_count)
@@ -682,10 +719,10 @@ static int idx_possible_rename(char *filename, struct dir_rename_info *info)
 	 *       rename.
 	 *
 	 * This function, idx_possible_rename(), is only responsible for (4).
-	 * The conditions/steps in (1)-(3) will be handled via setting up
-	 * dir_rename_count and dir_rename_guess in a future
-	 * initialize_dir_rename_info() function.  Steps (0) and (5) are
-	 * handled by the caller of this function.
+	 * The conditions/steps in (1)-(3) are handled via setting up
+	 * dir_rename_count and dir_rename_guess in
+	 * initialize_dir_rename_info().  Steps (0) and (5) are handled by
+	 * the caller of this function.
 	 */
 	char *old_dir, *new_dir;
 	struct strbuf new_path = STRBUF_INIT;
-- 
gitgitgadget


* Re: [PATCH v3 02/10] diffcore-rename: add a new idx_possible_rename function
  2021-02-26  1:58     ` [PATCH v3 02/10] diffcore-rename: add a new idx_possible_rename function Elijah Newren via GitGitGadget
@ 2021-02-26 15:52       ` Derrick Stolee
  0 siblings, 0 replies; 61+ messages in thread
From: Derrick Stolee @ 2021-02-26 15:52 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget, git
  Cc: Elijah Newren, Junio C Hamano, Ævar Arnfjörð Bjarmason

On 2/25/2021 8:58 PM, Elijah Newren via GitGitGadget wrote:
> Anyway, that's a lot of background so that we can actually describe the
> new function.  Add an idx_possible_rename() function which combines the
> recently added dir_rename_guess and idx_map fields to provide the index
> within rename_dst of a potential match for a given file.

This paragraph could use rewording now that idx_possible_rename() is not
new to this commit, and dir_rename_guess is introduced here (but not
yet initialized). This would be a good time to introduce the
dir_rename_info struct and how it _will_ be populated.

> -static int idx_possible_rename(char *filename)
> +static int idx_possible_rename(char *filename, struct dir_rename_info *info)
>  {
...
> +	if (!info->setup)
> +		return -1;

It might be worth noting in the commit message that since
info.setup is always zero, this implementation is not run (yet).

In general, I'm finding it a lot easier to read the code in
the presented order. The commit messages need to be massaged to
match.

Thanks,
-Stolee



* Re: [PATCH v3 04/10] Move computation of dir_rename_count from merge-ort to diffcore-rename
  2021-02-26  1:58     ` [PATCH v3 04/10] Move computation of dir_rename_count from merge-ort to diffcore-rename Elijah Newren via GitGitGadget
@ 2021-02-26 15:55       ` Derrick Stolee
  0 siblings, 0 replies; 61+ messages in thread
From: Derrick Stolee @ 2021-02-26 15:55 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget, git
  Cc: Elijah Newren, Junio C Hamano, Ævar Arnfjörð Bjarmason

On 2/25/2021 8:58 PM, Elijah Newren via GitGitGadget wrote:
> +	for (i = 0; i < rename_dst_nr; ++i) {
> +		/* File not part of directory rename counts if not a rename */
> +		if (!rename_dst[i].is_rename)
> +			continue;
> +
...
> -		/* File not part of directory rename if it wasn't renamed */
> -		if (pair->status != 'R')
> -			continue;
> -
Thanks for updating to include this check!

-Stolee


* Re: [PATCH v3 00/10] Optimization batch 8: use file basenames even more
  2021-02-26  1:58   ` [PATCH v3 " Elijah Newren via GitGitGadget
                       ` (9 preceding siblings ...)
  2021-02-26  1:58     ` [PATCH v3 10/10] diffcore-rename: compute dir_rename_guess from dir_rename_counts Elijah Newren via GitGitGadget
@ 2021-02-26 16:34     ` Derrick Stolee
  2021-02-26 19:28       ` Elijah Newren
  2021-02-27  0:30     ` [PATCH v4 " Elijah Newren via GitGitGadget
  11 siblings, 1 reply; 61+ messages in thread
From: Derrick Stolee @ 2021-02-26 16:34 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget, git
  Cc: Elijah Newren, Junio C Hamano, Ævar Arnfjörð Bjarmason

On 2/25/2021 8:58 PM, Elijah Newren via GitGitGadget wrote:
> This series depends on en/diffcore-rename (a concatenation of what I was
> calling ort-perf-batch-6 and ort-perf-batch-7).
> 
> Changes since v2:
> 
>  * Rearrange the patches in the series to have a top-down ordering rather
>    than bottom-up -- as suggested by Stolee, Ævar, and Junio
>  * Several comments and style improvements suggested by Stolee
>  * Replace xstrfmt() with a few strbuf_add*() calls, as suggested by Stolee

I like the new layout. The code looks good. The only nit I have is that
some of the old commit messages read awkwardly with the new ordering.
These are not critical, so this version gets my:

Reviewed-by: Derrick Stolee <dstolee@microsoft.com>



* Re: [PATCH v3 00/10] Optimization batch 8: use file basenames even more
  2021-02-26 16:34     ` [PATCH v3 00/10] Optimization batch 8: use file basenames even more Derrick Stolee
@ 2021-02-26 19:28       ` Elijah Newren
  0 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren @ 2021-02-26 19:28 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Elijah Newren via GitGitGadget, Git Mailing List, Junio C Hamano,
	Ævar Arnfjörð Bjarmason

On Fri, Feb 26, 2021 at 8:34 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 2/25/2021 8:58 PM, Elijah Newren via GitGitGadget wrote:
> > This series depends on en/diffcore-rename (a concatenation of what I was
> > calling ort-perf-batch-6 and ort-perf-batch-7).
> >
> > Changes since v2:
> >
> >  * Rearrange the patches in the series to have a top-down ordering rather
> >    than bottom-up -- as suggested by Stolee, Ævar, and Junio
> >  * Several comments and style improvements suggested by Stolee
> >  * Replace xstrfmt() with a few strbuf_add*() calls, as suggested by Stolee

Oh, and I also forgot to mention that I applied Ævar's suggested
cleanup -- return-in-if-block-to-avoid-indented-else.

>  I like the new layout. The code looks good. The only nit I have is that
> some of the old commit messages read awkwardly with the new ordering.

Doh, I fixed some of those up, but apparently missed some.  I'll clean
those up too.

> These are not critical, so this version gets my:
>
> Reviewed-by: Derrick Stolee <dstolee@microsoft.com>

Thanks, I'll include that in my reroll with the cleaned up commit messages.


* [PATCH v4 00/10] Optimization batch 8: use file basenames even more
  2021-02-26  1:58   ` [PATCH v3 " Elijah Newren via GitGitGadget
                       ` (10 preceding siblings ...)
  2021-02-26 16:34     ` [PATCH v3 00/10] Optimization batch 8: use file basenames even more Derrick Stolee
@ 2021-02-27  0:30     ` Elijah Newren via GitGitGadget
  2021-02-27  0:30       ` [PATCH v4 01/10] diffcore-rename: use directory rename guided basename comparisons Elijah Newren via GitGitGadget
                         ` (10 more replies)
  11 siblings, 11 replies; 61+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-27  0:30 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Elijah Newren, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Elijah Newren

This series depends on en/diffcore-rename (a concatenation of what I was
calling ort-perf-batch-6 and ort-perf-batch-7).

Changes since v3:

 * Update the commit messages (one was out of date after the rearrangement),
   and include Stolee's Reviewed-by

Elijah Newren (10):
  diffcore-rename: use directory rename guided basename comparisons
  diffcore-rename: provide basic implementation of idx_possible_rename()
  diffcore-rename: add a mapping of destination names to their indices
  Move computation of dir_rename_count from merge-ort to diffcore-rename
  diffcore-rename: add function for clearing dir_rename_count
  diffcore-rename: move dir_rename_counts into dir_rename_info struct
  diffcore-rename: extend cleanup_dir_rename_info()
  diffcore-rename: compute dir_rename_counts in stages
  diffcore-rename: limit dir_rename_counts computation to relevant dirs
  diffcore-rename: compute dir_rename_guess from dir_rename_counts

 Documentation/gitdiffcore.txt |   2 +-
 diffcore-rename.c             | 449 ++++++++++++++++++++++++++++++++--
 diffcore.h                    |   7 +
 merge-ort.c                   | 144 +----------
 4 files changed, 449 insertions(+), 153 deletions(-)


base-commit: aeca14f748afc7fb5b65bca56ea2ebd970729814
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-844%2Fnewren%2Fort-perf-batch-8-v4
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-844/newren/ort-perf-batch-8-v4
Pull-Request: https://github.com/gitgitgadget/git/pull/844

Range-diff vs v3:

  1:  6afa9add40b9 !  1:  823d07532e00 diffcore-rename: use directory rename guided basename comparisons
     @@ Commit message
          min_basename_score threshold required for marking the two files as
          renames.
      
     -    This commit introduces an idx_possible_rename() function which will give
     +    This commit introduces an idx_possible_rename() function which will
          do this directory rename detection for us and give us the index within
          rename_dst of the resulting filename.  For now, this function is
          hardcoded to return -1 (not found) and just hooks up how its results
          would be used once we have a more complete implementation in place.
      
     +    Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
       ## Documentation/gitdiffcore.txt ##
  2:  40f57bcc2055 !  2:  2dde621d7de5 diffcore-rename: add a new idx_possible_rename function
     @@ Metadata
      Author: Elijah Newren <newren@gmail.com>
      
       ## Commit message ##
     -    diffcore-rename: add a new idx_possible_rename function
     +    diffcore-rename: provide basic implementation of idx_possible_rename()
      
     -    find_basename_matches() is great when both the remaining set of possible
     -    rename sources and the remaining set of possible rename destinations
     -    have exactly one file each with a given basename.  It allows us to match
     -    up files that have been moved to different directories without changing
     -    filenames.
     +    Add a new struct dir_rename_info with various values we need inside our
     +    idx_possible_rename() function introduced in the previous commit.  Add a
     +    basic implementation for this function showing how we plan to use the
     +    variables, but which will just return early with a value of -1 (not
     +    found) when those variables are not set up.
      
     -    When basenames are not unique, though, we want to be able to guess which
     -    directories the source files have been moved to.  Since this is the job
     -    of directory rename detection, we employ it.  However, since it is a
     -    directory rename detection idea, we also limit it to cases where we know
     -    there could have been a directory rename, i.e. where the source
     -    directory has been removed.  This has to be signalled by dirs_removed
     -    being non-NULL and containing an entry for the relevant directory.
     -    Since merge-ort.c is the only caller that currently does so, this
     -    optimization is only effective for merge-ort right now.  In the future,
     -    this condition could be reconsidered or we could modify other callers to
     -    pass the necessary strset.
     -
     -    Anyway, that's a lot of background so that we can actually describe the
     -    new function.  Add an idx_possible_rename() function which combines the
     -    recently added dir_rename_guess and idx_map fields to provide the index
     -    within rename_dst of a potential match for a given file.
     -
     -    Future commits will add checks after calling this function to compare
     -    the resulting 'likely rename' candidates to see if the two files meet
     -    the elevated min_basename_score threshold for marking them as actual
     -    renames.
     +    Future commits will do the work necessary to set up those other
     +    variables so that idx_possible_rename() does not always return -1.
      
     +    Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
       ## diffcore-rename.c ##
  3:  0e14961574ea !  3:  21b9cf1da30e diffcore-rename: add a mapping of destination names to their indices
     @@ Commit message
          dir_rename_guess; these will be more fully populated in subsequent
          commits.
      
     +    Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
       ## diffcore-rename.c ##
  4:  9b9d5b207b03 !  4:  3617b0209cc4 Move computation of dir_rename_count from merge-ort to diffcore-rename
     @@ Commit message
          preliminary computation of dir_rename_count after exact rename
          detection, followed by some updates after inexact rename detection.
      
     +    Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
       ## diffcore-rename.c ##
  5:  f286e89464ea !  5:  2baf39d82f3e diffcore-rename: add function for clearing dir_rename_count
     @@ Commit message
          for clearing, or partially clearing it out.  Add a
          partial_clear_dir_rename_count() function for this purpose.
      
     +    Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
       ## diffcore-rename.c ##
  6:  ab353f2e75eb !  6:  02f1f7c02d32 diffcore-rename: move dir_rename_counts into dir_rename_info struct
     @@ Commit message
          dir_rename_info struct.  Future commits will then make dir_rename_counts
          be computed in stages, and add computation of dir_rename_guess.
      
     +    Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
       ## diffcore-rename.c ##
  7:  bd50d9e53804 !  7:  9c3436840534 diffcore-rename: extend cleanup_dir_rename_info()
     @@ Commit message
          Extend cleanup_dir_rename_info() to handle these two different cases,
          cleaning up the relevant bits of information for each case.
      
     +    Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
       ## diffcore-rename.c ##
  8:  44cfae6505f2 !  8:  6bd398d3707e diffcore-rename: compute dir_rename_counts in stages
     @@ Commit message
          augment the counts via calling update_dir_rename_counts() after each
          basename-guide and inexact rename detection match is found.
      
     +    Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
       ## diffcore-rename.c ##
  9:  752aff3a7995 !  9:  46304aaebf5a diffcore-rename: limit dir_rename_counts computation to relevant dirs
     @@ Commit message
          info->relevant_source_dirs variable for this purpose, even though at
          this stage we will only set it to dirs_removed for simplicity.
      
     +    Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
       ## diffcore-rename.c ##
 10:  65f7bfb735f2 ! 10:  4be565c47208 diffcore-rename: compute dir_rename_guess from dir_rename_counts
     @@ Commit message
              mega-renames:    188.754 s ±  0.284 s   130.465 s ±  0.259 s
              just-one-mega:     5.599 s ±  0.019 s     3.958 s ±  0.010 s
      
     +    Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
       ## diffcore-rename.c ##

-- 
gitgitgadget


* [PATCH v4 01/10] diffcore-rename: use directory rename guided basename comparisons
  2021-02-27  0:30     ` [PATCH v4 " Elijah Newren via GitGitGadget
@ 2021-02-27  0:30       ` Elijah Newren via GitGitGadget
  2021-02-27  0:30       ` [PATCH v4 02/10] diffcore-rename: provide basic implementation of idx_possible_rename() Elijah Newren via GitGitGadget
                         ` (9 subsequent siblings)
  10 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-27  0:30 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Elijah Newren, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

A previous commit noted that it is very common for people to move files
across directories while keeping their filename the same.  The last few
commits took advantage of this and showed that we can accelerate rename
detection significantly using basenames; since files with the same
basename serve as likely rename candidates, we can check those first and
remove them from the rename candidate pool if they are sufficiently
similar.

Unfortunately, the previous optimization was limited by the fact that
the remaining basenames after exact rename detection are not always
unique.  Many repositories have hundreds of build files with the same
name (e.g. Makefile, .gitignore, build.gradle, etc.), and may even have
hundreds of source files with the same name.  (For example, the Linux
kernel has 100 setup.c, 87 irq.c, and 112 core.c files.  A repository at
$DAYJOB has a lot of ObjectFactory.java and Plugin.java files.)

For these files with non-unique basenames, we are faced with the task of
attempting to determine or guess which directory they may have been
relocated to.  Such a task is precisely the job of directory rename
detection.  However, there are two catches: (1) the directory rename
detection code has traditionally been part of the merge machinery rather
than diffcore-rename.c, and (2) directory rename detection currently
runs after regular rename detection is complete.  The 1st catch is just
an implementation issue that can be overcome by some code shuffling.
The 2nd requires us to add a further approximation: we only have access
to exact renames at this point, so we need to do directory rename
detection based on just exact renames.  In some cases we won't have
exact renames, in which case this extra optimization won't apply.  We
also choose to not apply the optimization unless we know that the
underlying directory was removed, which will require extra data to be
passed in to diffcore_rename_extended().  Also, even if we get a
prediction about which directory a file may have relocated to, we will
still need to check to see if there is a file in the predicted
directory, and then compare the two files to see if they meet the higher
min_basename_score threshold required for marking the two files as
renames.

This commit introduces an idx_possible_rename() function which will
do this directory rename detection for us and give us the index within
rename_dst of the resulting filename.  For now, this function is
hardcoded to return -1 (not found) and just hooks up how its results
would be used once we have a more complete implementation in place.

Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
---
 Documentation/gitdiffcore.txt |  2 +-
 diffcore-rename.c             | 42 ++++++++++++++++++++++++++++-------
 2 files changed, 35 insertions(+), 9 deletions(-)
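
Illustration only, not part of the patch: a minimal standalone sketch of
the "is this basename unique?" question that decides whether the direct
basename match applies or the directory rename heuristic is needed.  It
assumes a plain array of paths instead of the strintmap lookups used in
the patch; basename_of() and unique_basename_index() are hypothetical
names.

```c
#include <assert.h>
#include <string.h>

/* Portion of the path after the last '/', or the whole path if none. */
static const char *basename_of(const char *path)
{
	const char *slash = strrchr(path, '/');
	return slash ? slash + 1 : path;
}

/*
 * Return the index of the only path whose basename is 'base', or -1 if
 * no path or more than one path has that basename.
 */
static int unique_basename_index(const char *const *paths, int nr,
				 const char *base)
{
	int i, found = -1;

	assert(paths && base);
	for (i = 0; i < nr; i++) {
		if (strcmp(basename_of(paths[i]), base))
			continue;
		if (found != -1)
			return -1;	/* second hit: basename not unique */
		found = i;
	}
	return found;
}
```

Unlike the patch, which can tell "absent" apart from "present but not
unique" via strintmap_contains(), this sketch reports -1 for both cases.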

diff --git a/Documentation/gitdiffcore.txt b/Documentation/gitdiffcore.txt
index 80fcf9542441..8673a5c5b2f2 100644
--- a/Documentation/gitdiffcore.txt
+++ b/Documentation/gitdiffcore.txt
@@ -186,7 +186,7 @@ mark a file pair as a rename and stop considering other candidates for
 better matches.  At most, one comparison is done per file in this
 preliminary pass; so if there are several remaining ext.txt files
 throughout the directory hierarchy after exact rename detection, this
-preliminary step will be skipped for those files.
+preliminary step may be skipped for those files.
 
 Note.  When the "-C" option is used with `--find-copies-harder`
 option, 'git diff-{asterisk}' commands feed unmodified filepairs to
diff --git a/diffcore-rename.c b/diffcore-rename.c
index 41558185ae1d..b3055683bac2 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -379,6 +379,12 @@ static const char *get_basename(const char *filename)
 	return base ? base + 1 : filename;
 }
 
+static int idx_possible_rename(char *filename)
+{
+	/* Unconditionally return -1, "not found", for now */
+	return -1;
+}
+
 static int find_basename_matches(struct diff_options *options,
 				 int minimum_score)
 {
@@ -415,8 +421,6 @@ static int find_basename_matches(struct diff_options *options,
 	int i, renames = 0;
 	struct strintmap sources;
 	struct strintmap dests;
-	struct hashmap_iter iter;
-	struct strmap_entry *entry;
 
 	/*
 	 * The prefeteching stuff wants to know if it can skip prefetching
@@ -466,17 +470,39 @@ static int find_basename_matches(struct diff_options *options,
 	}
 
 	/* Now look for basename matchups and do similarity estimation */
-	strintmap_for_each_entry(&sources, &iter, entry) {
-		const char *base = entry->key;
-		intptr_t src_index = (intptr_t)entry->value;
+	for (i = 0; i < rename_src_nr; ++i) {
+		char *filename = rename_src[i].p->one->path;
+		const char *base = NULL;
+		intptr_t src_index;
 		intptr_t dst_index;
-		if (src_index == -1)
-			continue;
 
-		if (0 <= (dst_index = strintmap_get(&dests, base))) {
+		/*
+		 * If the basename is unique among remaining sources, then
+		 * src_index will equal 'i' and we can attempt to match it
+		 * to a unique basename in the destinations.  Otherwise,
+		 * use directory rename heuristics, if possible.
+		 */
+		base = get_basename(filename);
+		src_index = strintmap_get(&sources, base);
+		assert(src_index == -1 || src_index == i);
+
+		if (strintmap_contains(&dests, base)) {
 			struct diff_filespec *one, *two;
 			int score;
 
+			/* Find a matching destination, if possible */
+			dst_index = strintmap_get(&dests, base);
+			if (src_index == -1 || dst_index == -1) {
+				src_index = i;
+				dst_index = idx_possible_rename(filename);
+			}
+			if (dst_index == -1)
+				continue;
+
+			/* Ignore this dest if already used in a rename */
+			if (rename_dst[dst_index].is_rename)
+				continue; /* already used previously */
+
 			/* Estimate the similarity */
 			one = rename_src[src_index].p->one;
 			two = rename_dst[dst_index].p->two;
-- 
gitgitgadget



* [PATCH v4 02/10] diffcore-rename: provide basic implementation of idx_possible_rename()
  2021-02-27  0:30     ` [PATCH v4 " Elijah Newren via GitGitGadget
  2021-02-27  0:30       ` [PATCH v4 01/10] diffcore-rename: use directory rename guided basename comparisons Elijah Newren via GitGitGadget
@ 2021-02-27  0:30       ` Elijah Newren via GitGitGadget
  2021-02-27  0:30       ` [PATCH v4 03/10] diffcore-rename: add a mapping of destination names to their indices Elijah Newren via GitGitGadget
                         ` (8 subsequent siblings)
  10 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-27  0:30 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Elijah Newren, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

Add a new struct dir_rename_info with various values we need inside our
idx_possible_rename() function introduced in the previous commit.  Add a
basic implementation for this function showing how we plan to use the
variables, but which will just return early with a value of -1 (not
found) when those variables are not set up.

Future commits will do the work necessary to set up those other
variables so that idx_possible_rename() does not always return -1.
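
As a rough illustration of step (4) only, here is a standalone sketch
that does not use git's strmap/strintmap API.  The names
toy_idx_possible_rename and struct dir_guess are hypothetical, and the
linear scans stand in for the hash lookups used by the real code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for one dir_rename_guess entry. */
struct dir_guess {
	const char *old_dir;
	const char *new_dir;
};

/*
 * Toy version of step (4): apply the guessed directory rename to
 * 'filename' and return the index of the resulting path in 'dests',
 * or -1 if there is no guess or no such destination.
 */
static int toy_idx_possible_rename(const char *filename,
				   const struct dir_guess *guesses,
				   int nr_guesses,
				   const char **dests, int nr_dests)
{
	const char *slash = strrchr(filename, '/');
	size_t dir_len = slash ? (size_t)(slash - filename) : 0;
	const char *base = slash ? slash + 1 : filename;
	char new_path[256];
	int i, n;

	/* Find a guess whose old_dir equals the file's directory. */
	for (i = 0; i < nr_guesses; i++)
		if (strlen(guesses[i].old_dir) == dir_len &&
		    !strncmp(guesses[i].old_dir, filename, dir_len))
			break;
	if (i == nr_guesses)
		return -1; /* no guess for this directory */

	/* Build "<new_dir>/<basename>". */
	n = snprintf(new_path, sizeof(new_path), "%s/%s",
		     guesses[i].new_dir, base);
	assert(n > 0 && (size_t)n < sizeof(new_path));

	/* Look the candidate path up among the destinations. */
	for (i = 0; i < nr_dests; i++)
		if (!strcmp(dests[i], new_path))
			return i;
	return -1;
}
```

With a single guess "src/old" -> "lib/new", looking up
"src/old/Makefile" yields the index of "lib/new/Makefile" if that path
is among the destinations, and -1 otherwise.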

Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 100 +++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 94 insertions(+), 6 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index b3055683bac2..edb0effb6ef4 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -367,6 +367,19 @@ static int find_exact_renames(struct diff_options *options)
 	return renames;
 }
 
+struct dir_rename_info {
+	struct strintmap idx_map;
+	struct strmap dir_rename_guess;
+	struct strmap *dir_rename_count;
+	unsigned setup;
+};
+
+static char *get_dirname(const char *filename)
+{
+	char *slash = strrchr(filename, '/');
+	return slash ? xstrndup(filename, slash - filename) : xstrdup("");
+}
+
 static const char *get_basename(const char *filename)
 {
 	/*
@@ -379,14 +392,86 @@ static const char *get_basename(const char *filename)
 	return base ? base + 1 : filename;
 }
 
-static int idx_possible_rename(char *filename)
+static int idx_possible_rename(char *filename, struct dir_rename_info *info)
 {
-	/* Unconditionally return -1, "not found", for now */
-	return -1;
+	/*
+	 * Our comparison of files with the same basename (see
+	 * find_basename_matches() below), is only helpful when after exact
+	 * rename detection we have exactly one file with a given basename
+	 * among the rename sources and also only exactly one file with
+	 * that basename among the rename destinations.  When we have
+	 * multiple files with the same basename in either set, we do not
+	 * know which to compare against.  However, there are some
+	 * filenames that occur in large numbers (particularly
+	 * build-related filenames such as 'Makefile', '.gitignore', or
+	 * 'build.gradle' that potentially exist within every single
+	 * subdirectory), and for performance we want to be able to quickly
+	 * find renames for these files too.
+	 *
+	 * The reason basename comparisons are a useful heuristic is that it
+	 * is common for people to move files across directories while keeping
+	 * their filename the same.  If we had a way of determining or even
+	 * making a good educated guess about which directory these non-unique
+	 * basename files had been moved to, we could check it.
+	 * Luckily...
+	 *
+	 * When an entire directory is in fact renamed, we have two factors
+	 * helping us out:
+	 *   (a) the original directory disappeared giving us a hint
+	 *       about when we can apply an extra heuristic.
+	 *   (b) we often have several files within that directory and
+	 *       subdirectories that are renamed without changes
+	 * So, rules for a heuristic:
+	 *   (0) If the basename matches are non-unique (the condition under
+	 *       which this function is called) AND
+	 *   (1) the directory in which the file was found has disappeared
+	 *       (i.e. dirs_removed is non-NULL and has a relevant entry) THEN
+	 *   (2) use exact renames of files within the directory to determine
+	 *       where the directory is likely to have been renamed to.  IF
+	 *       there is at least one exact rename from within that
+	 *       directory, we can proceed.
+	 *   (3) If there are multiple places the directory could have been
+	 *       renamed to based on exact renames, ignore all but one of them.
+	 *       Just use the destination with the most renames going to it.
+	 *   (4) Check if applying that directory rename to the original file
+	 *       would result in a destination filename that is in the
+	 *       potential rename set.  If so, return the index of the
+	 *       destination file (the index within rename_dst).
+	 *   (5) Compare the original file and returned destination for
+	 *       similarity, and if they are sufficiently similar, record the
+	 *       rename.
+	 *
+	 * This function, idx_possible_rename(), is only responsible for (4).
+	 * The conditions/steps in (1)-(3) will be handled via setting up
+	 * dir_rename_count and dir_rename_guess in a future
+	 * initialize_dir_rename_info() function.  Steps (0) and (5) are
+	 * handled by the caller of this function.
+	 */
+	char *old_dir, *new_dir;
+	struct strbuf new_path = STRBUF_INIT;
+	int idx;
+
+	if (!info->setup)
+		return -1;
+
+	old_dir = get_dirname(filename);
+	new_dir = strmap_get(&info->dir_rename_guess, old_dir);
+	free(old_dir);
+	if (!new_dir)
+		return -1;
+
+	strbuf_addstr(&new_path, new_dir);
+	strbuf_addch(&new_path, '/');
+	strbuf_addstr(&new_path, get_basename(filename));
+
+	idx = strintmap_get(&info->idx_map, new_path.buf);
+	strbuf_release(&new_path);
+	return idx;
 }
 
 static int find_basename_matches(struct diff_options *options,
-				 int minimum_score)
+				 int minimum_score,
+				 struct dir_rename_info *info)
 {
 	/*
 	 * When I checked in early 2020, over 76% of file renames in linux
@@ -494,7 +579,7 @@ static int find_basename_matches(struct diff_options *options,
 			dst_index = strintmap_get(&dests, base);
 			if (src_index == -1 || dst_index == -1) {
 				src_index = i;
-				dst_index = idx_possible_rename(filename);
+				dst_index = idx_possible_rename(filename, info);
 			}
 			if (dst_index == -1)
 				continue;
@@ -677,8 +762,10 @@ void diffcore_rename(struct diff_options *options)
 	int num_destinations, dst_cnt;
 	int num_sources, want_copies;
 	struct progress *progress = NULL;
+	struct dir_rename_info info;
 
 	trace2_region_enter("diff", "setup", options->repo);
+	info.setup = 0;
 	want_copies = (detect_rename == DIFF_DETECT_COPY);
 	if (!minimum_score)
 		minimum_score = DEFAULT_RENAME_SCORE;
@@ -774,7 +861,8 @@ void diffcore_rename(struct diff_options *options)
 		/* Utilize file basenames to quickly find renames. */
 		trace2_region_enter("diff", "basename matches", options->repo);
 		rename_count += find_basename_matches(options,
-						      min_basename_score);
+						      min_basename_score,
+						      &info);
 		trace2_region_leave("diff", "basename matches", options->repo);
 
 		/*
-- 
gitgitgadget



* [PATCH v4 03/10] diffcore-rename: add a mapping of destination names to their indices
  2021-02-27  0:30     ` [PATCH v4 " Elijah Newren via GitGitGadget
  2021-02-27  0:30       ` [PATCH v4 01/10] diffcore-rename: use directory rename guided basename comparisons Elijah Newren via GitGitGadget
  2021-02-27  0:30       ` [PATCH v4 02/10] diffcore-rename: provide basic implementation of idx_possible_rename() Elijah Newren via GitGitGadget
@ 2021-02-27  0:30       ` Elijah Newren via GitGitGadget
  2021-02-27  0:30       ` [PATCH v4 04/10] Move computation of dir_rename_count from merge-ort to diffcore-rename Elijah Newren via GitGitGadget
                         ` (7 subsequent siblings)
  10 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-27  0:30 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Elijah Newren, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

Compute a mapping of full filename to the index within rename_dst where
that filename is found, and store it in idx_map.  idx_possible_rename()
needs this to quickly find an array entry in rename_dst given the
pathname.

While at it, add placeholder initializations for dir_rename_count and
dir_rename_guess; these will be more fully populated in subsequent
commits.

Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index edb0effb6ef4..8eeb8c73664c 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -380,6 +380,45 @@ static char *get_dirname(const char *filename)
 	return slash ? xstrndup(filename, slash - filename) : xstrdup("");
 }
 
+static void initialize_dir_rename_info(struct dir_rename_info *info)
+{
+	int i;
+
+	info->setup = 1;
+
+	strintmap_init_with_options(&info->idx_map, -1, NULL, 0);
+	strmap_init_with_options(&info->dir_rename_guess, NULL, 0);
+	info->dir_rename_count = NULL;
+
+	/*
+	 * Loop setting up info->idx_map.
+	 */
+	for (i = 0; i < rename_dst_nr; ++i) {
+		/*
+		 * For non-renamed files, make idx_map contain mapping of
+		 *   filename -> index (index within rename_dst, that is)
+		 */
+		if (!rename_dst[i].is_rename) {
+			char *filename = rename_dst[i].p->two->path;
+			strintmap_set(&info->idx_map, filename, i);
+		}
+	}
+}
+
+static void cleanup_dir_rename_info(struct dir_rename_info *info)
+{
+	if (!info->setup)
+		return;
+
+	/* idx_map */
+	strintmap_clear(&info->idx_map);
+
+	/* dir_rename_guess */
+	strmap_clear(&info->dir_rename_guess, 1);
+
+	/* Nothing to do for dir_rename_count, yet */
+}
+
 static const char *get_basename(const char *filename)
 {
 	/*
@@ -858,6 +897,11 @@ void diffcore_rename(struct diff_options *options)
 		remove_unneeded_paths_from_src(want_copies);
 		trace2_region_leave("diff", "cull after exact", options->repo);
 
+		/* Preparation for basename-driven matching. */
+		trace2_region_enter("diff", "dir rename setup", options->repo);
+		initialize_dir_rename_info(&info);
+		trace2_region_leave("diff", "dir rename setup", options->repo);
+
 		/* Utilize file basenames to quickly find renames. */
 		trace2_region_enter("diff", "basename matches", options->repo);
 		rename_count += find_basename_matches(options,
@@ -1026,6 +1070,7 @@ void diffcore_rename(struct diff_options *options)
 		if (rename_dst[i].filespec_to_free)
 			free_filespec(rename_dst[i].filespec_to_free);
 
+	cleanup_dir_rename_info(&info);
 	FREE_AND_NULL(rename_dst);
 	rename_dst_nr = rename_dst_alloc = 0;
 	FREE_AND_NULL(rename_src);
-- 
gitgitgadget



* [PATCH v4 04/10] Move computation of dir_rename_count from merge-ort to diffcore-rename
  2021-02-27  0:30     ` [PATCH v4 " Elijah Newren via GitGitGadget
                         ` (2 preceding siblings ...)
  2021-02-27  0:30       ` [PATCH v4 03/10] diffcore-rename: add a mapping of destination names to their indices Elijah Newren via GitGitGadget
@ 2021-02-27  0:30       ` Elijah Newren via GitGitGadget
  2021-02-27  0:30       ` [PATCH v4 05/10] diffcore-rename: add function for clearing dir_rename_count Elijah Newren via GitGitGadget
                         ` (6 subsequent siblings)
  10 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-27  0:30 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Elijah Newren, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

Move the computation of dir_rename_count from merge-ort.c to
diffcore-rename.c, making slight adjustments to the data structures
based on the move.  While the diffstat looks large, viewing this commit
with --color-moved makes it clear that only about 20 lines changed.

With this patch, the computation of dir_rename_count is still only done
after inexact rename detection, but subsequent commits will add a
preliminary computation of dir_rename_count after exact rename
detection, followed by some updates after inexact rename detection.
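
For readers unfamiliar with the counting, the walk that a single file
rename triggers can be sketched in isolation as below.  This is only an
approximation: implied_dir_renames and struct dir_pair are made-up
names, the dirs_removed check is left out, and the sketch copies
prefixes instead of overwriting slashes in place as dirname_munge()
does.

```c
#include <assert.h>
#include <string.h>

#define MAX_PAIRS 16
#define MAX_PATH_LEN 256

/* Hypothetical record of one implied directory rename. */
struct dir_pair {
	char old_dir[MAX_PATH_LEN];
	char new_dir[MAX_PATH_LEN];
};

/* Offset of the last component of path[0..len). */
static size_t last_component(const char *path, size_t len)
{
	while (len > 0 && path[len - 1] != '/')
		len--;
	return len;
}

/*
 * Record every directory rename implied by oldname -> newname.
 * Returns the number of pairs written to 'out' (at most 'max').
 */
static int implied_dir_renames(const char *oldname, const char *newname,
			       struct dir_pair *out, int max)
{
	size_t old_len = strlen(oldname), new_len = strlen(newname);
	int nr = 0, first = 1;

	assert(old_len < MAX_PATH_LEN && new_len < MAX_PATH_LEN);
	while (nr < max) {
		size_t old_comp = last_component(oldname, old_len);
		size_t new_comp = last_component(newname, new_len);

		/*
		 * The first component stripped is the basename, which may
		 * differ.  Every later one must match on both sides.
		 */
		if (!first &&
		    (old_len - old_comp != new_len - new_comp ||
		     memcmp(oldname + old_comp, newname + new_comp,
			    old_len - old_comp)))
			break;
		first = 0;

		/* Strip the component and the slash before it, if any. */
		old_len = old_comp ? old_comp - 1 : 0;
		new_len = new_comp ? new_comp - 1 : 0;

		memcpy(out[nr].old_dir, oldname, old_len);
		out[nr].old_dir[old_len] = '\0';
		memcpy(out[nr].new_dir, newname, new_len);
		out[nr].new_dir[new_len] = '\0';
		nr++;

		/* Stop once either side has reached the toplevel (""). */
		if (!old_len || !new_len)
			break;
	}
	return nr;
}
```

For "a/b/c/d/e/foo.c" -> "a/b/some/thing/else/e/foo.c" this records
a/b/c/d/e => a/b/some/thing/else/e and a/b/c/d => a/b/some/thing/else,
then stops because "d" and "else" differ.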

Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 138 +++++++++++++++++++++++++++++++++++++++++++++-
 diffcore.h        |   5 ++
 merge-ort.c       | 132 +-------------------------------------------
 3 files changed, 145 insertions(+), 130 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 8eeb8c73664c..39e23d57e7bc 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -380,6 +380,129 @@ static char *get_dirname(const char *filename)
 	return slash ? xstrndup(filename, slash - filename) : xstrdup("");
 }
 
+static void dirname_munge(char *filename)
+{
+	char *slash = strrchr(filename, '/');
+	if (!slash)
+		slash = filename;
+	*slash = '\0';
+}
+
+static void increment_count(struct strmap *dir_rename_count,
+			    char *old_dir,
+			    char *new_dir)
+{
+	struct strintmap *counts;
+	struct strmap_entry *e;
+
+	/* Get the {new_dirs -> counts} mapping using old_dir */
+	e = strmap_get_entry(dir_rename_count, old_dir);
+	if (e) {
+		counts = e->value;
+	} else {
+		counts = xmalloc(sizeof(*counts));
+		strintmap_init_with_options(counts, 0, NULL, 1);
+		strmap_put(dir_rename_count, old_dir, counts);
+	}
+
+	/* Increment the count for new_dir */
+	strintmap_incr(counts, new_dir, 1);
+}
+
+static void update_dir_rename_counts(struct strmap *dir_rename_count,
+				     struct strset *dirs_removed,
+				     const char *oldname,
+				     const char *newname)
+{
+	char *old_dir = xstrdup(oldname);
+	char *new_dir = xstrdup(newname);
+	char new_dir_first_char = new_dir[0];
+	int first_time_in_loop = 1;
+
+	while (1) {
+		dirname_munge(old_dir);
+		dirname_munge(new_dir);
+
+		/*
+		 * When renaming
+		 *   "a/b/c/d/e/foo.c" -> "a/b/some/thing/else/e/foo.c"
+		 * then this suggests that both
+		 *   a/b/c/d/e/ => a/b/some/thing/else/e/
+		 *   a/b/c/d/   => a/b/some/thing/else/
+		 * so we want to increment counters for both.  We do NOT,
+		 * however, also want to suggest that there was the following
+		 * rename:
+		 *   a/b/c/ => a/b/some/thing/
+		 * so we need to quit at that point.
+		 *
+		 * Note the when first_time_in_loop, we only strip off the
+		 * basename, and we don't care if that's different.
+		 */
+		if (!first_time_in_loop) {
+			char *old_sub_dir = strchr(old_dir, '\0')+1;
+			char *new_sub_dir = strchr(new_dir, '\0')+1;
+			if (!*new_dir) {
+				/*
+				 * Special case when renaming to root directory,
+				 * i.e. when new_dir == "".  In this case, we had
+				 * something like
+				 *    a/b/subdir => subdir
+				 * and so dirname_munge() sets things up so that
+				 *    old_dir = "a/b\0subdir\0"
+				 *    new_dir = "\0ubdir\0"
+				 * We didn't have a '/' to overwrite a '\0' onto
+				 * in new_dir, so we have to compare differently.
+				 */
+				if (new_dir_first_char != old_sub_dir[0] ||
+				    strcmp(old_sub_dir+1, new_sub_dir))
+					break;
+			} else {
+				if (strcmp(old_sub_dir, new_sub_dir))
+					break;
+			}
+		}
+
+		if (strset_contains(dirs_removed, old_dir))
+			increment_count(dir_rename_count, old_dir, new_dir);
+		else
+			break;
+
+		/* If we hit toplevel directory ("") for old or new dir, quit */
+		if (!*old_dir || !*new_dir)
+			break;
+
+		first_time_in_loop = 0;
+	}
+
+	/* Free resources we don't need anymore */
+	free(old_dir);
+	free(new_dir);
+}
+
+static void compute_dir_rename_counts(struct strmap *dir_rename_count,
+				      struct strset *dirs_removed)
+{
+	int i;
+
+	/* Set up dir_rename_count */
+	for (i = 0; i < rename_dst_nr; ++i) {
+		/* File not part of directory rename counts if not a rename */
+		if (!rename_dst[i].is_rename)
+			continue;
+
+		/*
+		 * Make dir_rename_count contain a map of a map:
+		 *   old_directory -> {new_directory -> count}
+		 * In other words, for every pair look at the directories for
+		 * the old filename and the new filename and count how many
+		 * times that pairing occurs.
+		 */
+		update_dir_rename_counts(dir_rename_count, dirs_removed,
+					 rename_dst[i].p->one->path,
+					 rename_dst[i].p->two->path);
+	}
+}
+
 static void initialize_dir_rename_info(struct dir_rename_info *info)
 {
 	int i;
@@ -790,7 +913,9 @@ static void remove_unneeded_paths_from_src(int detecting_copies)
 	rename_src_nr = new_num_src;
 }
 
-void diffcore_rename(struct diff_options *options)
+void diffcore_rename_extended(struct diff_options *options,
+			      struct strset *dirs_removed,
+			      struct strmap *dir_rename_count)
 {
 	int detect_rename = options->detect_rename;
 	int minimum_score = options->rename_score;
@@ -805,6 +930,7 @@ void diffcore_rename(struct diff_options *options)
 
 	trace2_region_enter("diff", "setup", options->repo);
 	info.setup = 0;
+	assert(!dir_rename_count || strmap_empty(dir_rename_count));
 	want_copies = (detect_rename == DIFF_DETECT_COPY);
 	if (!minimum_score)
 		minimum_score = DEFAULT_RENAME_SCORE;
@@ -999,6 +1125,11 @@ void diffcore_rename(struct diff_options *options)
 	trace2_region_leave("diff", "inexact renames", options->repo);
 
  cleanup:
+	/*
+	 * Now that renames have been computed, compute dir_rename_count */
+	if (dirs_removed && dir_rename_count)
+		compute_dir_rename_counts(dir_rename_count, dirs_removed);
+
 	/* At this point, we have found some renames and copies and they
 	 * are recorded in rename_dst.  The original list is still in *q.
 	 */
@@ -1082,3 +1213,8 @@ void diffcore_rename(struct diff_options *options)
 	trace2_region_leave("diff", "write back to queue", options->repo);
 	return;
 }
+
+void diffcore_rename(struct diff_options *options)
+{
+	diffcore_rename_extended(options, NULL, NULL);
+}
diff --git a/diffcore.h b/diffcore.h
index d2a63c5c71f4..db55d3853071 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -8,6 +8,8 @@
 
 struct diff_options;
 struct repository;
+struct strmap;
+struct strset;
 struct userdiff_driver;
 
 /* This header file is internal between diff.c and its diff transformers
@@ -161,6 +163,9 @@ void diff_q(struct diff_queue_struct *, struct diff_filepair *);
 
 void diffcore_break(struct repository *, int);
 void diffcore_rename(struct diff_options *);
+void diffcore_rename_extended(struct diff_options *options,
+			      struct strset *dirs_removed,
+			      struct strmap *dir_rename_count);
 void diffcore_merge_broken(void);
 void diffcore_pickaxe(struct diff_options *);
 void diffcore_order(const char *orderfile);
diff --git a/merge-ort.c b/merge-ort.c
index 603d30c52170..c4467e073b45 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -1302,131 +1302,6 @@ static char *handle_path_level_conflicts(struct merge_options *opt,
 	return new_path;
 }
 
-static void dirname_munge(char *filename)
-{
-	char *slash = strrchr(filename, '/');
-	if (!slash)
-		slash = filename;
-	*slash = '\0';
-}
-
-static void increment_count(struct strmap *dir_rename_count,
-			    char *old_dir,
-			    char *new_dir)
-{
-	struct strintmap *counts;
-	struct strmap_entry *e;
-
-	/* Get the {new_dirs -> counts} mapping using old_dir */
-	e = strmap_get_entry(dir_rename_count, old_dir);
-	if (e) {
-		counts = e->value;
-	} else {
-		counts = xmalloc(sizeof(*counts));
-		strintmap_init_with_options(counts, 0, NULL, 1);
-		strmap_put(dir_rename_count, old_dir, counts);
-	}
-
-	/* Increment the count for new_dir */
-	strintmap_incr(counts, new_dir, 1);
-}
-
-static void update_dir_rename_counts(struct strmap *dir_rename_count,
-				     struct strset *dirs_removed,
-				     const char *oldname,
-				     const char *newname)
-{
-	char *old_dir = xstrdup(oldname);
-	char *new_dir = xstrdup(newname);
-	char new_dir_first_char = new_dir[0];
-	int first_time_in_loop = 1;
-
-	while (1) {
-		dirname_munge(old_dir);
-		dirname_munge(new_dir);
-
-		/*
-		 * When renaming
-		 *   "a/b/c/d/e/foo.c" -> "a/b/some/thing/else/e/foo.c"
-		 * then this suggests that both
-		 *   a/b/c/d/e/ => a/b/some/thing/else/e/
-		 *   a/b/c/d/   => a/b/some/thing/else/
-		 * so we want to increment counters for both.  We do NOT,
-		 * however, also want to suggest that there was the following
-		 * rename:
-		 *   a/b/c/ => a/b/some/thing/
-		 * so we need to quit at that point.
-		 *
-		 * Note the when first_time_in_loop, we only strip off the
-		 * basename, and we don't care if that's different.
-		 */
-		if (!first_time_in_loop) {
-			char *old_sub_dir = strchr(old_dir, '\0')+1;
-			char *new_sub_dir = strchr(new_dir, '\0')+1;
-			if (!*new_dir) {
-				/*
-				 * Special case when renaming to root directory,
-				 * i.e. when new_dir == "".  In this case, we had
-				 * something like
-				 *    a/b/subdir => subdir
-				 * and so dirname_munge() sets things up so that
-				 *    old_dir = "a/b\0subdir\0"
-				 *    new_dir = "\0ubdir\0"
-				 * We didn't have a '/' to overwrite a '\0' onto
-				 * in new_dir, so we have to compare differently.
-				 */
-				if (new_dir_first_char != old_sub_dir[0] ||
-				    strcmp(old_sub_dir+1, new_sub_dir))
-					break;
-			} else {
-				if (strcmp(old_sub_dir, new_sub_dir))
-					break;
-			}
-		}
-
-		if (strset_contains(dirs_removed, old_dir))
-			increment_count(dir_rename_count, old_dir, new_dir);
-		else
-			break;
-
-		/* If we hit toplevel directory ("") for old or new dir, quit */
-		if (!*old_dir || !*new_dir)
-			break;
-
-		first_time_in_loop = 0;
-	}
-
-	/* Free resources we don't need anymore */
-	free(old_dir);
-	free(new_dir);
-}
-
-static void compute_rename_counts(struct diff_queue_struct *pairs,
-				  struct strmap *dir_rename_count,
-				  struct strset *dirs_removed)
-{
-	int i;
-
-	for (i = 0; i < pairs->nr; ++i) {
-		struct diff_filepair *pair = pairs->queue[i];
-
-		/* File not part of directory rename if it wasn't renamed */
-		if (pair->status != 'R')
-			continue;
-
-		/*
-		 * Make dir_rename_count contain a map of a map:
-		 *   old_directory -> {new_directory -> count}
-		 * In other words, for every pair look at the directories for
-		 * the old filename and the new filename and count how many
-		 * times that pairing occurs.
-		 */
-		update_dir_rename_counts(dir_rename_count, dirs_removed,
-					 pair->one->path,
-					 pair->two->path);
-	}
-}
-
 static void get_provisional_directory_renames(struct merge_options *opt,
 					      unsigned side,
 					      int *clean)
@@ -1435,9 +1310,6 @@ static void get_provisional_directory_renames(struct merge_options *opt,
 	struct strmap_entry *entry;
 	struct rename_info *renames = &opt->priv->renames;
 
-	compute_rename_counts(&renames->pairs[side],
-			      &renames->dir_rename_count[side],
-			      &renames->dirs_removed[side]);
 	/*
 	 * Collapse
 	 *    dir_rename_count: old_directory -> {new_directory -> count}
@@ -2162,7 +2034,9 @@ static void detect_regular_renames(struct merge_options *opt,
 
 	diff_queued_diff = renames->pairs[side_index];
 	trace2_region_enter("diff", "diffcore_rename", opt->repo);
-	diffcore_rename(&diff_opts);
+	diffcore_rename_extended(&diff_opts,
+				 &renames->dirs_removed[side_index],
+				 &renames->dir_rename_count[side_index]);
 	trace2_region_leave("diff", "diffcore_rename", opt->repo);
 	resolve_diffpair_statuses(&diff_queued_diff);
 
-- 
gitgitgadget



* [PATCH v4 05/10] diffcore-rename: add function for clearing dir_rename_count
  2021-02-27  0:30     ` [PATCH v4 " Elijah Newren via GitGitGadget
                         ` (3 preceding siblings ...)
  2021-02-27  0:30       ` [PATCH v4 04/10] Move computation of dir_rename_count from merge-ort to diffcore-rename Elijah Newren via GitGitGadget
@ 2021-02-27  0:30       ` Elijah Newren via GitGitGadget
  2021-02-27  0:30       ` [PATCH v4 06/10] diffcore-rename: move dir_rename_counts into dir_rename_info struct Elijah Newren via GitGitGadget
                         ` (5 subsequent siblings)
  10 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-27  0:30 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Elijah Newren, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

As we adjust the usage of dir_rename_count, we want to have a function
for clearing, or partially clearing, it out.  Add a
partial_clear_dir_rename_count() function for this purpose.

Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 12 ++++++++++++
 diffcore.h        |  2 ++
 merge-ort.c       | 12 +++---------
 3 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 39e23d57e7bc..7dd475ff9a9f 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -528,6 +528,18 @@ static void initialize_dir_rename_info(struct dir_rename_info *info)
 	}
 }
 
+void partial_clear_dir_rename_count(struct strmap *dir_rename_count)
+{
+	struct hashmap_iter iter;
+	struct strmap_entry *entry;
+
+	strmap_for_each_entry(dir_rename_count, &iter, entry) {
+		struct strintmap *counts = entry->value;
+		strintmap_clear(counts);
+	}
+	strmap_partial_clear(dir_rename_count, 1);
+}
+
 static void cleanup_dir_rename_info(struct dir_rename_info *info)
 {
 	if (!info->setup)
diff --git a/diffcore.h b/diffcore.h
index db55d3853071..c6ba64abd198 100644
--- a/diffcore.h
+++ b/diffcore.h
@@ -161,6 +161,8 @@ struct diff_filepair *diff_queue(struct diff_queue_struct *,
 				 struct diff_filespec *);
 void diff_q(struct diff_queue_struct *, struct diff_filepair *);
 
+void partial_clear_dir_rename_count(struct strmap *dir_rename_count);
+
 void diffcore_break(struct repository *, int);
 void diffcore_rename(struct diff_options *);
 void diffcore_rename_extended(struct diff_options *options,
diff --git a/merge-ort.c b/merge-ort.c
index c4467e073b45..467404cc0a35 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -351,17 +351,11 @@ static void clear_or_reinit_internal_opts(struct merge_options_internal *opti,
 
 	/* Free memory used by various renames maps */
 	for (i = MERGE_SIDE1; i <= MERGE_SIDE2; ++i) {
-		struct hashmap_iter iter;
-		struct strmap_entry *entry;
-
 		strset_func(&renames->dirs_removed[i]);
 
-		strmap_for_each_entry(&renames->dir_rename_count[i],
-				      &iter, entry) {
-			struct strintmap *counts = entry->value;
-			strintmap_clear(counts);
-		}
-		strmap_func(&renames->dir_rename_count[i], 1);
+		partial_clear_dir_rename_count(&renames->dir_rename_count[i]);
+		if (!reinitialize)
+			strmap_clear(&renames->dir_rename_count[i], 1);
 
 		strmap_func(&renames->dir_renames[i], 0);
 	}
-- 
gitgitgadget



* [PATCH v4 06/10] diffcore-rename: move dir_rename_counts into dir_rename_info struct
  2021-02-27  0:30     ` [PATCH v4 " Elijah Newren via GitGitGadget
                         ` (4 preceding siblings ...)
  2021-02-27  0:30       ` [PATCH v4 05/10] diffcore-rename: add function for clearing dir_rename_count Elijah Newren via GitGitGadget
@ 2021-02-27  0:30       ` Elijah Newren via GitGitGadget
  2021-02-27  0:30       ` [PATCH v4 07/10] diffcore-rename: extend cleanup_dir_rename_info() Elijah Newren via GitGitGadget
                         ` (4 subsequent siblings)
  10 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-27  0:30 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Elijah Newren, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

This continues the migration of the directory rename detection code into
diffcore-rename, now taking the simple step of combining it with the
dir_rename_info struct.  Future commits will then make dir_rename_counts
be computed in stages, and add computation of dir_rename_guess.
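
To give an idea of how the counts might later be collapsed into a single
guess per old directory (rule (3) in the idx_possible_rename() comment:
use the destination with the most renames going to it), here is a
minimal standalone sketch.  toy_best_guess and struct dir_count are
hypothetical names, and letting the first entry win a tie is an
assumption made for this sketch, not something this series specifies.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flattened form of one {new_directory -> count} entry. */
struct dir_count {
	const char *new_dir;
	int count;
};

/*
 * Return the new_dir with the highest count, or NULL if there are no
 * entries with a positive count.  On a tie, the earliest entry wins
 * (an assumption of this sketch).
 */
static const char *toy_best_guess(const struct dir_count *counts, size_t nr)
{
	const char *best = NULL;
	int best_count = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		assert(counts[i].count >= 0);
		if (counts[i].count > best_count) {
			best_count = counts[i].count;
			best = counts[i].new_dir;
		}
	}
	return best;
}
```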

Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 27 ++++++++++++++++-----------
 1 file changed, 16 insertions(+), 11 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 7dd475ff9a9f..a1ccf14001f5 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -388,7 +388,7 @@ static void dirname_munge(char *filename)
 	*slash = '\0';
 }
 
-static void increment_count(struct strmap *dir_rename_count,
+static void increment_count(struct dir_rename_info *info,
 			    char *old_dir,
 			    char *new_dir)
 {
@@ -396,20 +396,20 @@ static void increment_count(struct strmap *dir_rename_count,
 	struct strmap_entry *e;
 
 	/* Get the {new_dirs -> counts} mapping using old_dir */
-	e = strmap_get_entry(dir_rename_count, old_dir);
+	e = strmap_get_entry(info->dir_rename_count, old_dir);
 	if (e) {
 		counts = e->value;
 	} else {
 		counts = xmalloc(sizeof(*counts));
 		strintmap_init_with_options(counts, 0, NULL, 1);
-		strmap_put(dir_rename_count, old_dir, counts);
+		strmap_put(info->dir_rename_count, old_dir, counts);
 	}
 
 	/* Increment the count for new_dir */
 	strintmap_incr(counts, new_dir, 1);
 }
 
-static void update_dir_rename_counts(struct strmap *dir_rename_count,
+static void update_dir_rename_counts(struct dir_rename_info *info,
 				     struct strset *dirs_removed,
 				     const char *oldname,
 				     const char *newname)
@@ -463,7 +463,7 @@ static void update_dir_rename_counts(struct strmap *dir_rename_count,
 		}
 
 		if (strset_contains(dirs_removed, old_dir))
-			increment_count(dir_rename_count, old_dir, new_dir);
+			increment_count(info, old_dir, new_dir);
 		else
 			break;
 
@@ -479,12 +479,15 @@ static void update_dir_rename_counts(struct strmap *dir_rename_count,
 	free(new_dir);
 }
 
-static void compute_dir_rename_counts(struct strmap *dir_rename_count,
-				      struct strset *dirs_removed)
+static void compute_dir_rename_counts(struct dir_rename_info *info,
+				      struct strset *dirs_removed,
+				      struct strmap *dir_rename_count)
 {
 	int i;
 
-	/* Set up dir_rename_count */
+	info->setup = 1;
+	info->dir_rename_count = dir_rename_count;
+
 	for (i = 0; i < rename_dst_nr; ++i) {
 		/* File not part of directory rename counts if not a rename */
 		if (!rename_dst[i].is_rename)
@@ -497,7 +500,7 @@ static void compute_dir_rename_counts(struct strmap *dir_rename_count,
 		 * the old filename and the new filename and count how many
 		 * times that pairing occurs.
 		 */
-		update_dir_rename_counts(dir_rename_count, dirs_removed,
+		update_dir_rename_counts(info, dirs_removed,
 					 rename_dst[i].p->one->path,
 					 rename_dst[i].p->two->path);
 	}
@@ -551,7 +554,9 @@ static void cleanup_dir_rename_info(struct dir_rename_info *info)
 	/* dir_rename_guess */
 	strmap_clear(&info->dir_rename_guess, 1);
 
-	/* Nothing to do for dir_rename_count, yet */
+	/* dir_rename_count */
+	partial_clear_dir_rename_count(info->dir_rename_count);
+	strmap_clear(info->dir_rename_count, 1);
 }
 
 static const char *get_basename(const char *filename)
@@ -1140,7 +1145,7 @@ void diffcore_rename_extended(struct diff_options *options,
 	/*
 	 * Now that renames have been computed, compute dir_rename_count */
 	if (dirs_removed && dir_rename_count)
-		compute_dir_rename_counts(dir_rename_count, dirs_removed);
+		compute_dir_rename_counts(&info, dirs_removed, dir_rename_count);
 
 	/* At this point, we have found some renames and copies and they
 	 * are recorded in rename_dst.  The original list is still in *q.
-- 
gitgitgadget



* [PATCH v4 07/10] diffcore-rename: extend cleanup_dir_rename_info()
  2021-02-27  0:30     ` [PATCH v4 " Elijah Newren via GitGitGadget
                         ` (5 preceding siblings ...)
  2021-02-27  0:30       ` [PATCH v4 06/10] diffcore-rename: move dir_rename_counts into dir_rename_info struct Elijah Newren via GitGitGadget
@ 2021-02-27  0:30       ` Elijah Newren via GitGitGadget
  2021-02-27  0:30       ` [PATCH v4 08/10] diffcore-rename: compute dir_rename_counts in stages Elijah Newren via GitGitGadget
                         ` (3 subsequent siblings)
  10 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-27  0:30 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Elijah Newren, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

When diffcore_rename_extended() is passed a NULL dir_rename_count, we
still want to create a temporary one for use by find_basename_matches(),
and have it fully deallocated before diffcore_rename_extended() returns.
When diffcore_rename_extended() is passed a non-NULL dir_rename_count,
on the other hand, we want to fill that strmap with appropriate values
and return it.  For our interim purposes, though, we may also add
entries corresponding to directories that cannot have been renamed
because they still exist on both sides; those entries need to be
removed before returning.

Extend cleanup_dir_rename_info() to handle these two different cases,
cleaning up the relevant bits of information for each case.
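
[Editorial sketch, not part of the patch.]  The "keep" case below prunes
entries in two passes: first collect the keys to drop, then remove them,
so the container is never modified while it is being iterated.  The
following is a minimal standalone illustration of that pattern.  The
struct dir_table type and the helper names are hypothetical stand-ins
for the strmap/strset types used in diffcore-rename.c, chosen only to
keep the example self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_DIRS 16

/* Hypothetical stand-in for a strmap/strset: a flat list of keys. */
struct dir_table {
	const char *keys[MAX_DIRS];
	size_t nr;
};

static int table_contains(const struct dir_table *t, const char *key)
{
	size_t i;

	for (i = 0; i < t->nr; i++)
		if (!strcmp(t->keys[i], key))
			return 1;
	return 0;
}

static void table_remove(struct dir_table *t, const char *key)
{
	size_t i;

	for (i = 0; i < t->nr; i++) {
		if (strcmp(t->keys[i], key))
			continue;
		/* Fill the hole with the last entry; order is not kept. */
		t->keys[i] = t->keys[--t->nr];
		return;
	}
}

/*
 * Drop every entry of 'counts' whose key is not in 'dirs_removed'.
 * Pass 1 only collects keys; pass 2 does the removal, so 'counts' is
 * not modified while we are still walking it.  Returns the number of
 * entries removed.
 */
static size_t prune_unremoved_dirs(struct dir_table *counts,
				   const struct dir_table *dirs_removed)
{
	const char *to_remove[MAX_DIRS];
	size_t nr_remove = 0, i;

	assert(counts->nr <= MAX_DIRS);

	for (i = 0; i < counts->nr; i++)
		if (!table_contains(dirs_removed, counts->keys[i]))
			to_remove[nr_remove++] = counts->keys[i];

	for (i = 0; i < nr_remove; i++)
		table_remove(counts, to_remove[i]);

	return nr_remove;
}
```

With counts holding "a", "b" and "c" and dirs_removed holding only "b",
the sketch removes "a" and "c" and leaves "b" in place.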

Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 40 ++++++++++++++++++++++++++++++++++++----
 1 file changed, 36 insertions(+), 4 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index a1ccf14001f5..2cf9c47c6364 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -543,8 +543,15 @@ void partial_clear_dir_rename_count(struct strmap *dir_rename_count)
 	strmap_partial_clear(dir_rename_count, 1);
 }
 
-static void cleanup_dir_rename_info(struct dir_rename_info *info)
+static void cleanup_dir_rename_info(struct dir_rename_info *info,
+				    struct strset *dirs_removed,
+				    int keep_dir_rename_count)
 {
+	struct hashmap_iter iter;
+	struct strmap_entry *entry;
+	struct string_list to_remove = STRING_LIST_INIT_NODUP;
+	int i;
+
 	if (!info->setup)
 		return;
 
@@ -555,8 +562,33 @@ static void cleanup_dir_rename_info(struct dir_rename_info *info)
 	strmap_clear(&info->dir_rename_guess, 1);
 
 	/* dir_rename_count */
-	partial_clear_dir_rename_count(info->dir_rename_count);
-	strmap_clear(info->dir_rename_count, 1);
+	if (!keep_dir_rename_count) {
+		partial_clear_dir_rename_count(info->dir_rename_count);
+		strmap_clear(info->dir_rename_count, 1);
+		FREE_AND_NULL(info->dir_rename_count);
+		return;
+	}
+
+	/*
+	 * Although dir_rename_count was passed in
+	 * diffcore_rename_extended() and we want to keep it around and
+	 * return it to that caller, we first want to remove any data
+	 * associated with directories that weren't renamed.
+	 */
+	strmap_for_each_entry(info->dir_rename_count, &iter, entry) {
+		const char *source_dir = entry->key;
+		struct strintmap *counts = entry->value;
+
+		if (!strset_contains(dirs_removed, source_dir)) {
+			string_list_append(&to_remove, source_dir);
+			strintmap_clear(counts);
+			continue;
+		}
+	}
+	for (i = 0; i < to_remove.nr; ++i)
+		strmap_remove(info->dir_rename_count,
+			      to_remove.items[i].string, 1);
+	string_list_clear(&to_remove, 0);
 }
 
 static const char *get_basename(const char *filename)
@@ -1218,7 +1250,7 @@ void diffcore_rename_extended(struct diff_options *options,
 		if (rename_dst[i].filespec_to_free)
 			free_filespec(rename_dst[i].filespec_to_free);
 
-	cleanup_dir_rename_info(&info);
+	cleanup_dir_rename_info(&info, dirs_removed, dir_rename_count != NULL);
 	FREE_AND_NULL(rename_dst);
 	rename_dst_nr = rename_dst_alloc = 0;
 	FREE_AND_NULL(rename_src);
-- 
gitgitgadget



* [PATCH v4 08/10] diffcore-rename: compute dir_rename_counts in stages
  2021-02-27  0:30     ` [PATCH v4 " Elijah Newren via GitGitGadget
                         ` (6 preceding siblings ...)
  2021-02-27  0:30       ` [PATCH v4 07/10] diffcore-rename: extend cleanup_dir_rename_info() Elijah Newren via GitGitGadget
@ 2021-02-27  0:30       ` Elijah Newren via GitGitGadget
  2021-02-27  0:30       ` [PATCH v4 09/10] diffcore-rename: limit dir_rename_counts computation to relevant dirs Elijah Newren via GitGitGadget
                         ` (2 subsequent siblings)
  10 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-27  0:30 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Elijah Newren, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

Compute dir_rename_counts based just on exact renames to start, as that
can provide us useful information in find_basename_matches().  This is
done by moving the code from compute_dir_rename_counts() into
initialize_dir_rename_info(), resulting in it being computed earlier and
based just on exact renames.  Since that's an incomplete result, we
augment the counts by calling update_dir_rename_counts() each time a
basename-guided or inexact rename detection match is found.
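
[Editorial sketch, not part of the patch.]  The staged approach can be
pictured as one table of (old_dir, new_dir) counts that is seeded from
the exact renames and then bumped again whenever a later phase finds
another match.  The code below is a simplified illustration under
stated assumptions: struct dir_counts, record_rename() and
lookup_count() are hypothetical names, and only the immediate parent
directory of each path is counted, whereas update_dir_rename_counts()
in the patch also walks up through ancestor directories.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_PAIRS 32
#define MAX_PATH_LEN 256

/* One (old_dir, new_dir) pairing and how often it has been seen. */
struct dir_pair_count {
	char old_dir[MAX_PATH_LEN];
	char new_dir[MAX_PATH_LEN];
	int count;
};

struct dir_counts {
	struct dir_pair_count pairs[MAX_PAIRS];
	int nr;
};

/* Copy the directory part of 'path' into 'out' ("" for a toplevel file). */
static void parent_dir(const char *path, char *out, size_t outlen)
{
	const char *slash = strrchr(path, '/');
	size_t len = slash ? (size_t)(slash - path) : 0;

	assert(len < outlen);
	memcpy(out, path, len);
	out[len] = '\0';
}

/* Called once per detected rename, regardless of which phase found it. */
static void record_rename(struct dir_counts *dc,
			  const char *old_path, const char *new_path)
{
	char old_dir[MAX_PATH_LEN], new_dir[MAX_PATH_LEN];
	int i;

	parent_dir(old_path, old_dir, sizeof(old_dir));
	parent_dir(new_path, new_dir, sizeof(new_dir));

	for (i = 0; i < dc->nr; i++) {
		if (!strcmp(dc->pairs[i].old_dir, old_dir) &&
		    !strcmp(dc->pairs[i].new_dir, new_dir)) {
			dc->pairs[i].count++;
			return;
		}
	}

	assert(dc->nr < MAX_PAIRS);
	strcpy(dc->pairs[dc->nr].old_dir, old_dir);
	strcpy(dc->pairs[dc->nr].new_dir, new_dir);
	dc->pairs[dc->nr].count = 1;
	dc->nr++;
}

/* Returns 0 if the pairing has never been recorded. */
static int lookup_count(const struct dir_counts *dc,
			const char *old_dir, const char *new_dir)
{
	int i;

	for (i = 0; i < dc->nr; i++)
		if (!strcmp(dc->pairs[i].old_dir, old_dir) &&
		    !strcmp(dc->pairs[i].new_dir, new_dir))
			return dc->pairs[i].count;
	return 0;
}
```

The point of the staging is that the counts are already usable after
the first batch of record_rename() calls (the exact renames), and simply
become more complete as the later phases add to them.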

Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 110 +++++++++++++++++++++++++++++-----------------
 1 file changed, 70 insertions(+), 40 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 2cf9c47c6364..10f8f4a301e3 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -419,6 +419,28 @@ static void update_dir_rename_counts(struct dir_rename_info *info,
 	char new_dir_first_char = new_dir[0];
 	int first_time_in_loop = 1;
 
+	if (!info->setup)
+		/*
+		 * info->setup is 0 here in two cases: (1) all auxiliary
+		 * vars (like dirs_removed) were NULL so
+		 * initialize_dir_rename_info() returned early, or (2)
+		 * either break detection or copy detection are active so
+		 * that we never called initialize_dir_rename_info().  In
+		 * the former case, we don't have enough info to know if
+		 * directories were renamed (because dirs_removed lets us
+		 * know about a necessary prerequisite, namely if they were
+		 * removed), and in the latter, we don't care about
+		 * directory renames or find_basename_matches.
+		 *
+		 * This matters because both basename and inexact matching
+		 * will also call update_dir_rename_counts().  In either of
+		 * the above two cases info->dir_rename_counts will not
+		 * have been properly initialized which prevents us from
+		 * updating it, but in these two cases we don't care about
+		 * dir_rename_counts anyway, so we can just exit early.
+		 */
+		return;
+
 	while (1) {
 		dirname_munge(old_dir);
 		dirname_munge(new_dir);
@@ -479,45 +501,29 @@ static void update_dir_rename_counts(struct dir_rename_info *info,
 	free(new_dir);
 }
 
-static void compute_dir_rename_counts(struct dir_rename_info *info,
-				      struct strset *dirs_removed,
-				      struct strmap *dir_rename_count)
+static void initialize_dir_rename_info(struct dir_rename_info *info,
+				       struct strset *dirs_removed,
+				       struct strmap *dir_rename_count)
 {
 	int i;
 
-	info->setup = 1;
-	info->dir_rename_count = dir_rename_count;
-
-	for (i = 0; i < rename_dst_nr; ++i) {
-		/* File not part of directory rename counts if not a rename */
-		if (!rename_dst[i].is_rename)
-			continue;
-
-		/*
-		 * Make dir_rename_count contain a map of a map:
-		 *   old_directory -> {new_directory -> count}
-		 * In other words, for every pair look at the directories for
-		 * the old filename and the new filename and count how many
-		 * times that pairing occurs.
-		 */
-		update_dir_rename_counts(info, dirs_removed,
-					 rename_dst[i].p->one->path,
-					 rename_dst[i].p->two->path);
+	if (!dirs_removed) {
+		info->setup = 0;
+		return;
 	}
-}
-
-static void initialize_dir_rename_info(struct dir_rename_info *info)
-{
-	int i;
-
 	info->setup = 1;
 
+	info->dir_rename_count = dir_rename_count;
+	if (!info->dir_rename_count) {
+		info->dir_rename_count = xmalloc(sizeof(*dir_rename_count));
+		strmap_init(info->dir_rename_count);
+	}
 	strintmap_init_with_options(&info->idx_map, -1, NULL, 0);
 	strmap_init_with_options(&info->dir_rename_guess, NULL, 0);
-	info->dir_rename_count = NULL;
 
 	/*
-	 * Loop setting up both info->idx_map.
+	 * Loop setting up both info->idx_map, and doing setup of
+	 * info->dir_rename_count.
 	 */
 	for (i = 0; i < rename_dst_nr; ++i) {
 		/*
@@ -527,7 +533,20 @@ static void initialize_dir_rename_info(struct dir_rename_info *info)
 		if (!rename_dst[i].is_rename) {
 			char *filename = rename_dst[i].p->two->path;
 			strintmap_set(&info->idx_map, filename, i);
+			continue;
 		}
+
+		/*
+		 * For everything else (i.e. renamed files), make
+		 * dir_rename_count contain a map of a map:
+		 *   old_directory -> {new_directory -> count}
+		 * In other words, for every pair look at the directories for
+		 * the old filename and the new filename and count how many
+		 * times that pairing occurs.
+		 */
+		update_dir_rename_counts(info, dirs_removed,
+					 rename_dst[i].p->one->path,
+					 rename_dst[i].p->two->path);
 	}
 }
 
@@ -682,7 +701,8 @@ static int idx_possible_rename(char *filename, struct dir_rename_info *info)
 
 static int find_basename_matches(struct diff_options *options,
 				 int minimum_score,
-				 struct dir_rename_info *info)
+				 struct dir_rename_info *info,
+				 struct strset *dirs_removed)
 {
 	/*
 	 * When I checked in early 2020, over 76% of file renames in linux
@@ -810,6 +830,8 @@ static int find_basename_matches(struct diff_options *options,
 				continue;
 			record_rename_pair(dst_index, src_index, score);
 			renames++;
+			update_dir_rename_counts(info, dirs_removed,
+						 one->path, two->path);
 
 			/*
 			 * Found a rename so don't need text anymore; if we
@@ -893,7 +915,12 @@ static int too_many_rename_candidates(int num_destinations, int num_sources,
 	return 1;
 }
 
-static int find_renames(struct diff_score *mx, int dst_cnt, int minimum_score, int copies)
+static int find_renames(struct diff_score *mx,
+			int dst_cnt,
+			int minimum_score,
+			int copies,
+			struct dir_rename_info *info,
+			struct strset *dirs_removed)
 {
 	int count = 0, i;
 
@@ -910,6 +937,9 @@ static int find_renames(struct diff_score *mx, int dst_cnt, int minimum_score, i
 			continue;
 		record_rename_pair(mx[i].dst, mx[i].src, mx[i].score);
 		count++;
+		update_dir_rename_counts(info, dirs_removed,
+					 rename_src[mx[i].src].p->one->path,
+					 rename_dst[mx[i].dst].p->two->path);
 	}
 	return count;
 }
@@ -981,6 +1011,8 @@ void diffcore_rename_extended(struct diff_options *options,
 	info.setup = 0;
 	assert(!dir_rename_count || strmap_empty(dir_rename_count));
 	want_copies = (detect_rename == DIFF_DETECT_COPY);
+	if (dirs_removed && (break_idx || want_copies))
+		BUG("dirs_removed incompatible with break/copy detection");
 	if (!minimum_score)
 		minimum_score = DEFAULT_RENAME_SCORE;
 
@@ -1074,14 +1106,15 @@ void diffcore_rename_extended(struct diff_options *options,
 
 		/* Preparation for basename-driven matching. */
 		trace2_region_enter("diff", "dir rename setup", options->repo);
-		initialize_dir_rename_info(&info);
+		initialize_dir_rename_info(&info,
+					   dirs_removed, dir_rename_count);
 		trace2_region_leave("diff", "dir rename setup", options->repo);
 
 		/* Utilize file basenames to quickly find renames. */
 		trace2_region_enter("diff", "basename matches", options->repo);
 		rename_count += find_basename_matches(options,
 						      min_basename_score,
-						      &info);
+						      &info, dirs_removed);
 		trace2_region_leave("diff", "basename matches", options->repo);
 
 		/*
@@ -1167,18 +1200,15 @@ void diffcore_rename_extended(struct diff_options *options,
 	/* cost matrix sorted by most to least similar pair */
 	STABLE_QSORT(mx, dst_cnt * NUM_CANDIDATE_PER_DST, score_compare);
 
-	rename_count += find_renames(mx, dst_cnt, minimum_score, 0);
+	rename_count += find_renames(mx, dst_cnt, minimum_score, 0,
+				     &info, dirs_removed);
 	if (want_copies)
-		rename_count += find_renames(mx, dst_cnt, minimum_score, 1);
+		rename_count += find_renames(mx, dst_cnt, minimum_score, 1,
+					     &info, dirs_removed);
 	free(mx);
 	trace2_region_leave("diff", "inexact renames", options->repo);
 
  cleanup:
-	/*
-	 * Now that renames have been computed, compute dir_rename_count */
-	if (dirs_removed && dir_rename_count)
-		compute_dir_rename_counts(&info, dirs_removed, dir_rename_count);
-
 	/* At this point, we have found some renames and copies and they
 	 * are recorded in rename_dst.  The original list is still in *q.
 	 */
-- 
gitgitgadget



* [PATCH v4 09/10] diffcore-rename: limit dir_rename_counts computation to relevant dirs
  2021-02-27  0:30     ` [PATCH v4 " Elijah Newren via GitGitGadget
                         ` (7 preceding siblings ...)
  2021-02-27  0:30       ` [PATCH v4 08/10] diffcore-rename: compute dir_rename_counts in stages Elijah Newren via GitGitGadget
@ 2021-02-27  0:30       ` Elijah Newren via GitGitGadget
  2021-02-27  0:30       ` [PATCH v4 10/10] diffcore-rename: compute dir_rename_guess from dir_rename_counts Elijah Newren via GitGitGadget
  2021-03-09 21:52       ` [PATCH v4 00/10] Optimization batch 8: use file basenames even more Derrick Stolee
  10 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-27  0:30 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Elijah Newren, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

We are using dir_rename_counts to record, for each source directory,
how many files moved from it to each destination directory.  We only
need this information for directories that disappeared, though, so we
can return early from update_dir_rename_counts() for other paths.

If dirs_removed is passed to diffcore_rename_extended(), then it
provides the relevant bits of information for us to limit this counting
to relevant dirs.  If dirs_removed is not passed, we would need to
compute some replacement in order to do this limiting.  Introduce a new
info->relevant_source_dirs variable for this purpose, even though at
this stage we will only set it to dirs_removed for simplicity.
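
[Editorial sketch, not part of the patch.]  The filtering added here can
be illustrated by a loop that walks from a file's directory up toward
the toplevel and stops as soon as it reaches a directory that is not in
the relevant set.  This is a standalone approximation: dirs_to_update()
and set_contains() are hypothetical helpers, the relevant set is a plain
array rather than a strset, and a NULL set is taken to mean "no
filtering", mirroring the NULL check on info->relevant_source_dirs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Truncate 'path' at its last '/', leaving "" for a toplevel entry. */
static void strip_last_component(char *path)
{
	char *slash = strrchr(path, '/');

	if (!slash)
		slash = path;
	*slash = '\0';
}

static int set_contains(const char *const *set, size_t nr, const char *key)
{
	size_t i;

	for (i = 0; i < nr; i++)
		if (!strcmp(set[i], key))
			return 1;
	return 0;
}

/*
 * Count how many ancestor directories of 'old_path' would have their
 * rename counts updated.  The walk stops at the first directory that is
 * not in 'relevant'; its parents are not examined after that.
 */
static int dirs_to_update(const char *old_path,
			  const char *const *relevant, size_t nr)
{
	char dir[256];
	int updated = 0;

	assert(strlen(old_path) < sizeof(dir));
	strcpy(dir, old_path);

	while (1) {
		strip_last_component(dir);
		if (relevant && !set_contains(relevant, nr, dir))
			break;	/* not a relevant source dir: stop here */
		if (!*dir)
			break;	/* reached the toplevel */
		updated++;
	}
	return updated;
}
```

Stopping the whole walk, rather than skipping one level, matches the
"break" in the patch.  A plausible reading is that a directory which
still exists implies its parents still exist too, so there is nothing
further up worth counting.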

Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index 10f8f4a301e3..e5fa0cb555dd 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -371,6 +371,7 @@ struct dir_rename_info {
 	struct strintmap idx_map;
 	struct strmap dir_rename_guess;
 	struct strmap *dir_rename_count;
+	struct strset *relevant_source_dirs;
 	unsigned setup;
 };
 
@@ -442,7 +443,13 @@ static void update_dir_rename_counts(struct dir_rename_info *info,
 		return;
 
 	while (1) {
+		/* Get old_dir, skip if its directory isn't relevant. */
 		dirname_munge(old_dir);
+		if (info->relevant_source_dirs &&
+		    !strset_contains(info->relevant_source_dirs, old_dir))
+			break;
+
+		/* Get new_dir */
 		dirname_munge(new_dir);
 
 		/*
@@ -521,6 +528,9 @@ static void initialize_dir_rename_info(struct dir_rename_info *info,
 	strintmap_init_with_options(&info->idx_map, -1, NULL, 0);
 	strmap_init_with_options(&info->dir_rename_guess, NULL, 0);
 
+	/* Setup info->relevant_source_dirs */
+	info->relevant_source_dirs = dirs_removed;
+
 	/*
 	 * Loop setting up both info->idx_map, and doing setup of
 	 * info->dir_rename_count.
-- 
gitgitgadget



* [PATCH v4 10/10] diffcore-rename: compute dir_rename_guess from dir_rename_counts
  2021-02-27  0:30     ` [PATCH v4 " Elijah Newren via GitGitGadget
                         ` (8 preceding siblings ...)
  2021-02-27  0:30       ` [PATCH v4 09/10] diffcore-rename: limit dir_rename_counts computation to relevant dirs Elijah Newren via GitGitGadget
@ 2021-02-27  0:30       ` Elijah Newren via GitGitGadget
  2021-03-09 21:52       ` [PATCH v4 00/10] Optimization batch 8: use file basenames even more Derrick Stolee
  10 siblings, 0 replies; 61+ messages in thread
From: Elijah Newren via GitGitGadget @ 2021-02-27  0:30 UTC (permalink / raw)
  To: git
  Cc: Derrick Stolee, Elijah Newren, Junio C Hamano,
	Ævar Arnfjörð Bjarmason, Elijah Newren,
	Elijah Newren

From: Elijah Newren <newren@gmail.com>

dir_rename_counts holds a mapping of a mapping; in particular, it has
   old_dir => { new_dir => count }
We want a simple mapping of
   old_dir => new_dir
based on which new_dir had the highest count for a given old_dir.
Compute this and store it in dir_rename_guess.
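
[Editorial sketch, not part of the patch.]  Collapsing the inner map to
a single guess is an argmax over the destination counts.  The standalone
version below uses a hypothetical struct dest_count array in place of
the strintmap that get_highest_rename_path() iterates over.  Like the
patch, it starts from a highest count of 0 and uses a strict comparison,
so it returns NULL when there are no entries with a positive count; a
caller would need to be prepared for that.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one entry of the inner new_dir -> count map. */
struct dest_count {
	const char *new_dir;
	int count;
};

/*
 * Return the new_dir with the highest count, or NULL if there is no
 * entry with a count above zero.  On a tie, the entry seen first wins.
 */
static const char *best_destination(const struct dest_count *counts,
				    size_t nr)
{
	int highest = 0;
	const char *best = NULL;
	size_t i;

	assert(counts || !nr);

	for (i = 0; i < nr; i++) {
		if (counts[i].count > highest) {
			highest = counts[i].count;
			best = counts[i].new_dir;
		}
	}
	return best;
}
```

In this array-based sketch a tie goes to the earlier element.  With a
hash-based map, as in the patch, which of several tied destinations is
seen first depends on iteration order.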

This is the final piece of the puzzle needed to make our guesses at
which directory files have been moved to when basenames aren't unique.

For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
performance work; instrument with trace2_region_* calls", 2020-10-28),
this change improves the performance as follows:

                            Before                  After
    no-renames:       12.775 s ±  0.062 s    12.596 s ±  0.061 s
    mega-renames:    188.754 s ±  0.284 s   130.465 s ±  0.259 s
    just-one-mega:     5.599 s ±  0.019 s     3.958 s ±  0.010 s

Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
---
 diffcore-rename.c | 45 +++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 41 insertions(+), 4 deletions(-)

diff --git a/diffcore-rename.c b/diffcore-rename.c
index e5fa0cb555dd..1fe902ed2af0 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -389,6 +389,24 @@ static void dirname_munge(char *filename)
 	*slash = '\0';
 }
 
+static const char *get_highest_rename_path(struct strintmap *counts)
+{
+	int highest_count = 0;
+	const char *highest_destination_dir = NULL;
+	struct hashmap_iter iter;
+	struct strmap_entry *entry;
+
+	strintmap_for_each_entry(counts, &iter, entry) {
+		const char *destination_dir = entry->key;
+		intptr_t count = (intptr_t)entry->value;
+		if (count > highest_count) {
+			highest_count = count;
+			highest_destination_dir = destination_dir;
+		}
+	}
+	return highest_destination_dir;
+}
+
 static void increment_count(struct dir_rename_info *info,
 			    char *old_dir,
 			    char *new_dir)
@@ -512,6 +530,8 @@ static void initialize_dir_rename_info(struct dir_rename_info *info,
 				       struct strset *dirs_removed,
 				       struct strmap *dir_rename_count)
 {
+	struct hashmap_iter iter;
+	struct strmap_entry *entry;
 	int i;
 
 	if (!dirs_removed) {
@@ -558,6 +578,23 @@ static void initialize_dir_rename_info(struct dir_rename_info *info,
 					 rename_dst[i].p->one->path,
 					 rename_dst[i].p->two->path);
 	}
+
+	/*
+	 * Now we collapse
+	 *    dir_rename_count: old_directory -> {new_directory -> count}
+	 * down to
+	 *    dir_rename_guess: old_directory -> best_new_directory
+	 * where best_new_directory is the one with the highest count.
+	 */
+	strmap_for_each_entry(info->dir_rename_count, &iter, entry) {
+		/* entry->key is source_dir */
+		struct strintmap *counts = entry->value;
+		char *best_newdir;
+
+		best_newdir = xstrdup(get_highest_rename_path(counts));
+		strmap_put(&info->dir_rename_guess, entry->key,
+			   best_newdir);
+	}
 }
 
 void partial_clear_dir_rename_count(struct strmap *dir_rename_count)
@@ -682,10 +719,10 @@ static int idx_possible_rename(char *filename, struct dir_rename_info *info)
 	 *       rename.
 	 *
 	 * This function, idx_possible_rename(), is only responsible for (4).
-	 * The conditions/steps in (1)-(3) will be handled via setting up
-	 * dir_rename_count and dir_rename_guess in a future
-	 * initialize_dir_rename_info() function.  Steps (0) and (5) are
-	 * handled by the caller of this function.
+	 * The conditions/steps in (1)-(3) are handled via setting up
+	 * dir_rename_count and dir_rename_guess in
+	 * initialize_dir_rename_info().  Steps (0) and (5) are handled by
+	 * the caller of this function.
 	 */
 	char *old_dir, *new_dir;
 	struct strbuf new_path = STRBUF_INIT;
-- 
gitgitgadget


* Re: [PATCH v4 00/10] Optimization batch 8: use file basenames even more
  2021-02-27  0:30     ` [PATCH v4 " Elijah Newren via GitGitGadget
                         ` (9 preceding siblings ...)
  2021-02-27  0:30       ` [PATCH v4 10/10] diffcore-rename: compute dir_rename_guess from dir_rename_counts Elijah Newren via GitGitGadget
@ 2021-03-09 21:52       ` Derrick Stolee
  10 siblings, 0 replies; 61+ messages in thread
From: Derrick Stolee @ 2021-03-09 21:52 UTC (permalink / raw)
  To: Elijah Newren via GitGitGadget, git
  Cc: Elijah Newren, Junio C Hamano, Ævar Arnfjörð Bjarmason

On 2/26/2021 7:30 PM, Elijah Newren via GitGitGadget wrote:
> This series depends on en/diffcore-rename (a concatenation of what I was
> calling ort-perf-batch-6 and ort-perf-batch-7).
> 
> Changes since v3:

I'm very late in doing this, but I reviewed the range-diff and found
this version to be satisfactory. Thanks!

-Stolee


Thread overview: 61+ messages
2021-02-14  7:58 [PATCH 00/10] Optimization batch 8: use file basenames even more Elijah Newren via GitGitGadget
2021-02-14  7:58 ` [PATCH 01/10] Move computation of dir_rename_count from merge-ort to diffcore-rename Elijah Newren via GitGitGadget
2021-02-14  7:58 ` [PATCH 02/10] diffcore-rename: add functions for clearing dir_rename_count Elijah Newren via GitGitGadget
2021-02-14  7:58 ` [PATCH 03/10] diffcore-rename: move dir_rename_counts into a dir_rename_info struct Elijah Newren via GitGitGadget
2021-02-14  7:58 ` [PATCH 04/10] diffcore-rename: extend cleanup_dir_rename_info() Elijah Newren via GitGitGadget
2021-02-14  7:58 ` [PATCH 05/10] diffcore-rename: compute dir_rename_counts in stages Elijah Newren via GitGitGadget
2021-02-14  7:58 ` [PATCH 06/10] diffcore-rename: add a mapping of destination names to their indices Elijah Newren via GitGitGadget
2021-02-14  7:59 ` [PATCH 07/10] diffcore-rename: add a dir_rename_guess field to dir_rename_info Elijah Newren via GitGitGadget
2021-02-14  7:59 ` [PATCH 08/10] diffcore-rename: add a new idx_possible_rename function Elijah Newren via GitGitGadget
2021-02-14  7:59 ` [PATCH 09/10] diffcore-rename: limit dir_rename_counts computation to relevant dirs Elijah Newren via GitGitGadget
2021-02-14  7:59 ` [PATCH 10/10] diffcore-rename: use directory rename guided basename comparisons Elijah Newren via GitGitGadget
2021-02-23 23:43 ` [PATCH v2 00/10] Optimization batch 8: use file basenames even more Elijah Newren via GitGitGadget
2021-02-23 23:43   ` [PATCH v2 01/10] Move computation of dir_rename_count from merge-ort to diffcore-rename Elijah Newren via GitGitGadget
2021-02-24 15:25     ` Derrick Stolee
2021-02-24 18:50       ` Elijah Newren
2021-02-23 23:43   ` [PATCH v2 02/10] diffcore-rename: add functions for clearing dir_rename_count Elijah Newren via GitGitGadget
2021-02-23 23:44   ` [PATCH v2 03/10] diffcore-rename: move dir_rename_counts into a dir_rename_info struct Elijah Newren via GitGitGadget
2021-02-23 23:44   ` [PATCH v2 04/10] diffcore-rename: extend cleanup_dir_rename_info() Elijah Newren via GitGitGadget
2021-02-24 15:37     ` Derrick Stolee
2021-02-25  2:16     ` Ævar Arnfjörð Bjarmason
2021-02-25  2:26       ` Ævar Arnfjörð Bjarmason
2021-02-25  2:34       ` Junio C Hamano
2021-02-23 23:44   ` [PATCH v2 05/10] diffcore-rename: compute dir_rename_counts in stages Elijah Newren via GitGitGadget
2021-02-24 15:43     ` Derrick Stolee
2021-02-23 23:44   ` [PATCH v2 06/10] diffcore-rename: add a mapping of destination names to their indices Elijah Newren via GitGitGadget
2021-02-23 23:44   ` [PATCH v2 07/10] diffcore-rename: add a dir_rename_guess field to dir_rename_info Elijah Newren via GitGitGadget
2021-02-23 23:44   ` [PATCH v2 08/10] diffcore-rename: add a new idx_possible_rename function Elijah Newren via GitGitGadget
2021-02-24 17:35     ` Derrick Stolee
2021-02-25  1:13       ` Elijah Newren
2021-02-23 23:44   ` [PATCH v2 09/10] diffcore-rename: limit dir_rename_counts computation to relevant dirs Elijah Newren via GitGitGadget
2021-02-23 23:44   ` [PATCH v2 10/10] diffcore-rename: use directory rename guided basename comparisons Elijah Newren via GitGitGadget
2021-02-24 17:44     ` Derrick Stolee
2021-02-24 17:50   ` [PATCH v2 00/10] Optimization batch 8: use file basenames even more Derrick Stolee
2021-02-25  1:38     ` Elijah Newren
2021-02-26  1:58   ` [PATCH v3 " Elijah Newren via GitGitGadget
2021-02-26  1:58     ` [PATCH v3 01/10] diffcore-rename: use directory rename guided basename comparisons Elijah Newren via GitGitGadget
2021-02-26  1:58     ` [PATCH v3 02/10] diffcore-rename: add a new idx_possible_rename function Elijah Newren via GitGitGadget
2021-02-26 15:52       ` Derrick Stolee
2021-02-26  1:58     ` [PATCH v3 03/10] diffcore-rename: add a mapping of destination names to their indices Elijah Newren via GitGitGadget
2021-02-26  1:58     ` [PATCH v3 04/10] Move computation of dir_rename_count from merge-ort to diffcore-rename Elijah Newren via GitGitGadget
2021-02-26 15:55       ` Derrick Stolee
2021-02-26  1:58     ` [PATCH v3 05/10] diffcore-rename: add function for clearing dir_rename_count Elijah Newren via GitGitGadget
2021-02-26  1:58     ` [PATCH v3 06/10] diffcore-rename: move dir_rename_counts into dir_rename_info struct Elijah Newren via GitGitGadget
2021-02-26  1:58     ` [PATCH v3 07/10] diffcore-rename: extend cleanup_dir_rename_info() Elijah Newren via GitGitGadget
2021-02-26  1:58     ` [PATCH v3 08/10] diffcore-rename: compute dir_rename_counts in stages Elijah Newren via GitGitGadget
2021-02-26  1:58     ` [PATCH v3 09/10] diffcore-rename: limit dir_rename_counts computation to relevant dirs Elijah Newren via GitGitGadget
2021-02-26  1:58     ` [PATCH v3 10/10] diffcore-rename: compute dir_rename_guess from dir_rename_counts Elijah Newren via GitGitGadget
2021-02-26 16:34     ` [PATCH v3 00/10] Optimization batch 8: use file basenames even more Derrick Stolee
2021-02-26 19:28       ` Elijah Newren
2021-02-27  0:30     ` [PATCH v4 " Elijah Newren via GitGitGadget
2021-02-27  0:30       ` [PATCH v4 01/10] diffcore-rename: use directory rename guided basename comparisons Elijah Newren via GitGitGadget
2021-02-27  0:30       ` [PATCH v4 02/10] diffcore-rename: provide basic implementation of idx_possible_rename() Elijah Newren via GitGitGadget
2021-02-27  0:30       ` [PATCH v4 03/10] diffcore-rename: add a mapping of destination names to their indices Elijah Newren via GitGitGadget
2021-02-27  0:30       ` [PATCH v4 04/10] Move computation of dir_rename_count from merge-ort to diffcore-rename Elijah Newren via GitGitGadget
2021-02-27  0:30       ` [PATCH v4 05/10] diffcore-rename: add function for clearing dir_rename_count Elijah Newren via GitGitGadget
2021-02-27  0:30       ` [PATCH v4 06/10] diffcore-rename: move dir_rename_counts into dir_rename_info struct Elijah Newren via GitGitGadget
2021-02-27  0:30       ` [PATCH v4 07/10] diffcore-rename: extend cleanup_dir_rename_info() Elijah Newren via GitGitGadget
2021-02-27  0:30       ` [PATCH v4 08/10] diffcore-rename: compute dir_rename_counts in stages Elijah Newren via GitGitGadget
2021-02-27  0:30       ` [PATCH v4 09/10] diffcore-rename: limit dir_rename_counts computation to relevant dirs Elijah Newren via GitGitGadget
2021-02-27  0:30       ` [PATCH v4 10/10] diffcore-rename: compute dir_rename_guess from dir_rename_counts Elijah Newren via GitGitGadget
2021-03-09 21:52       ` [PATCH v4 00/10] Optimization batch 8: use file basenames even more Derrick Stolee

Code repositories for project(s) associated with this inbox:

	https://80x24.org/mirrors/git.git
