From: "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com>
To: git@vger.kernel.org
Cc: "Derrick Stolee" <stolee@gmail.com>,
"Elijah Newren" <newren@gmail.com>,
"Junio C Hamano" <gitster@pobox.com>,
"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
"Elijah Newren" <newren@gmail.com>,
"Elijah Newren" <newren@gmail.com>
Subject: [PATCH v4 02/10] diffcore-rename: provide basic implementation of idx_possible_rename()
Date: Sat, 27 Feb 2021 00:30:40 +0000 [thread overview]
Message-ID: <2dde621d7de596d7aa0bb31245b04683de2fa3d9.1614385849.git.gitgitgadget@gmail.com> (raw)
In-Reply-To: <pull.844.v4.git.1614385849.gitgitgadget@gmail.com>
From: Elijah Newren <newren@gmail.com>
Add a new struct dir_rename_info with various values we need inside our
idx_possible_rename() function introduced in the previous commit. Add a
basic implementation for this function showing how we plan to use the
variables, but which will just return early with a value of -1 (not
found) when those variables are not set up.
Future commits will do the work necessary to set up those other
variables so that idx_possible_rename() does not always return -1.
Reviewed-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Elijah Newren <newren@gmail.com>
---
diffcore-rename.c | 100 +++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 94 insertions(+), 6 deletions(-)
diff --git a/diffcore-rename.c b/diffcore-rename.c
index b3055683bac2..edb0effb6ef4 100644
--- a/diffcore-rename.c
+++ b/diffcore-rename.c
@@ -367,6 +367,19 @@ static int find_exact_renames(struct diff_options *options)
return renames;
}
+struct dir_rename_info {
+ struct strintmap idx_map;
+ struct strmap dir_rename_guess;
+ struct strmap *dir_rename_count;
+ unsigned setup;
+};
+
+static char *get_dirname(const char *filename)
+{
+ char *slash = strrchr(filename, '/');
+ return slash ? xstrndup(filename, slash - filename) : xstrdup("");
+}
+
static const char *get_basename(const char *filename)
{
/*
@@ -379,14 +392,86 @@ static const char *get_basename(const char *filename)
return base ? base + 1 : filename;
}
-static int idx_possible_rename(char *filename)
+static int idx_possible_rename(char *filename, struct dir_rename_info *info)
{
- /* Unconditionally return -1, "not found", for now */
- return -1;
+ /*
+ * Our comparison of files with the same basename (see
+ * find_basename_matches() below), is only helpful when after exact
+ * rename detection we have exactly one file with a given basename
+ * among the rename sources and also only exactly one file with
+ * that basename among the rename destinations. When we have
+ * multiple files with the same basename in either set, we do not
+ * know which to compare against. However, there are some
+ * filenames that occur in large numbers (particularly
+ * build-related filenames such as 'Makefile', '.gitignore', or
+ * 'build.gradle' that potentially exist within every single
+ * subdirectory), and for performance we want to be able to quickly
+ * find renames for these files too.
+ *
+ * The reason basename comparisons are a useful heuristic was that it
+ * is common for people to move files across directories while keeping
+ * their filename the same. If we had a way of determining or even
+ * making a good educated guess about which directory these non-unique
+ * basename files had moved the file to, we could check it.
+ * Luckily...
+ *
+ * When an entire directory is in fact renamed, we have two factors
+ * helping us out:
+ * (a) the original directory disappeared giving us a hint
+ * about when we can apply an extra heuristic.
+ * (a) we often have several files within that directory and
+ * subdirectories that are renamed without changes
+ * So, rules for a heuristic:
+ * (0) If there basename matches are non-unique (the condition under
+ * which this function is called) AND
+ * (1) the directory in which the file was found has disappeared
+ * (i.e. dirs_removed is non-NULL and has a relevant entry) THEN
+ * (2) use exact renames of files within the directory to determine
+ * where the directory is likely to have been renamed to. IF
+ * there is at least one exact rename from within that
+ * directory, we can proceed.
+ * (3) If there are multiple places the directory could have been
+ * renamed to based on exact renames, ignore all but one of them.
+ * Just use the destination with the most renames going to it.
+ * (4) Check if applying that directory rename to the original file
+ * would result in a destination filename that is in the
+ * potential rename set. If so, return the index of the
+ * destination file (the index within rename_dst).
+ * (5) Compare the original file and returned destination for
+ * similarity, and if they are sufficiently similar, record the
+ * rename.
+ *
+ * This function, idx_possible_rename(), is only responsible for (4).
+ * The conditions/steps in (1)-(3) will be handled via setting up
+ * dir_rename_count and dir_rename_guess in a future
+ * initialize_dir_rename_info() function. Steps (0) and (5) are
+ * handled by the caller of this function.
+ */
+ char *old_dir, *new_dir;
+ struct strbuf new_path = STRBUF_INIT;
+ int idx;
+
+ if (!info->setup)
+ return -1;
+
+ old_dir = get_dirname(filename);
+ new_dir = strmap_get(&info->dir_rename_guess, old_dir);
+ free(old_dir);
+ if (!new_dir)
+ return -1;
+
+ strbuf_addstr(&new_path, new_dir);
+ strbuf_addch(&new_path, '/');
+ strbuf_addstr(&new_path, get_basename(filename));
+
+ idx = strintmap_get(&info->idx_map, new_path.buf);
+ strbuf_release(&new_path);
+ return idx;
}
static int find_basename_matches(struct diff_options *options,
- int minimum_score)
+ int minimum_score,
+ struct dir_rename_info *info)
{
/*
* When I checked in early 2020, over 76% of file renames in linux
@@ -494,7 +579,7 @@ static int find_basename_matches(struct diff_options *options,
dst_index = strintmap_get(&dests, base);
if (src_index == -1 || dst_index == -1) {
src_index = i;
- dst_index = idx_possible_rename(filename);
+ dst_index = idx_possible_rename(filename, info);
}
if (dst_index == -1)
continue;
@@ -677,8 +762,10 @@ void diffcore_rename(struct diff_options *options)
int num_destinations, dst_cnt;
int num_sources, want_copies;
struct progress *progress = NULL;
+ struct dir_rename_info info;
trace2_region_enter("diff", "setup", options->repo);
+ info.setup = 0;
want_copies = (detect_rename == DIFF_DETECT_COPY);
if (!minimum_score)
minimum_score = DEFAULT_RENAME_SCORE;
@@ -774,7 +861,8 @@ void diffcore_rename(struct diff_options *options)
/* Utilize file basenames to quickly find renames. */
trace2_region_enter("diff", "basename matches", options->repo);
rename_count += find_basename_matches(options,
- min_basename_score);
+ min_basename_score,
+ &info);
trace2_region_leave("diff", "basename matches", options->repo);
/*
--
gitgitgadget
next prev parent reply other threads:[~2021-02-27 0:32 UTC|newest]
Thread overview: 61+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-02-14 7:58 [PATCH 00/10] Optimization batch 8: use file basenames even more Elijah Newren via GitGitGadget
2021-02-14 7:58 ` [PATCH 01/10] Move computation of dir_rename_count from merge-ort to diffcore-rename Elijah Newren via GitGitGadget
2021-02-14 7:58 ` [PATCH 02/10] diffcore-rename: add functions for clearing dir_rename_count Elijah Newren via GitGitGadget
2021-02-14 7:58 ` [PATCH 03/10] diffcore-rename: move dir_rename_counts into a dir_rename_info struct Elijah Newren via GitGitGadget
2021-02-14 7:58 ` [PATCH 04/10] diffcore-rename: extend cleanup_dir_rename_info() Elijah Newren via GitGitGadget
2021-02-14 7:58 ` [PATCH 05/10] diffcore-rename: compute dir_rename_counts in stages Elijah Newren via GitGitGadget
2021-02-14 7:58 ` [PATCH 06/10] diffcore-rename: add a mapping of destination names to their indices Elijah Newren via GitGitGadget
2021-02-14 7:59 ` [PATCH 07/10] diffcore-rename: add a dir_rename_guess field to dir_rename_info Elijah Newren via GitGitGadget
2021-02-14 7:59 ` [PATCH 08/10] diffcore-rename: add a new idx_possible_rename function Elijah Newren via GitGitGadget
2021-02-14 7:59 ` [PATCH 09/10] diffcore-rename: limit dir_rename_counts computation to relevant dirs Elijah Newren via GitGitGadget
2021-02-14 7:59 ` [PATCH 10/10] diffcore-rename: use directory rename guided basename comparisons Elijah Newren via GitGitGadget
2021-02-23 23:43 ` [PATCH v2 00/10] Optimization batch 8: use file basenames even more Elijah Newren via GitGitGadget
2021-02-23 23:43 ` [PATCH v2 01/10] Move computation of dir_rename_count from merge-ort to diffcore-rename Elijah Newren via GitGitGadget
2021-02-24 15:25 ` Derrick Stolee
2021-02-24 18:50 ` Elijah Newren
2021-02-23 23:43 ` [PATCH v2 02/10] diffcore-rename: add functions for clearing dir_rename_count Elijah Newren via GitGitGadget
2021-02-23 23:44 ` [PATCH v2 03/10] diffcore-rename: move dir_rename_counts into a dir_rename_info struct Elijah Newren via GitGitGadget
2021-02-23 23:44 ` [PATCH v2 04/10] diffcore-rename: extend cleanup_dir_rename_info() Elijah Newren via GitGitGadget
2021-02-24 15:37 ` Derrick Stolee
2021-02-25 2:16 ` Ævar Arnfjörð Bjarmason
2021-02-25 2:26 ` Ævar Arnfjörð Bjarmason
2021-02-25 2:34 ` Junio C Hamano
2021-02-23 23:44 ` [PATCH v2 05/10] diffcore-rename: compute dir_rename_counts in stages Elijah Newren via GitGitGadget
2021-02-24 15:43 ` Derrick Stolee
2021-02-23 23:44 ` [PATCH v2 06/10] diffcore-rename: add a mapping of destination names to their indices Elijah Newren via GitGitGadget
2021-02-23 23:44 ` [PATCH v2 07/10] diffcore-rename: add a dir_rename_guess field to dir_rename_info Elijah Newren via GitGitGadget
2021-02-23 23:44 ` [PATCH v2 08/10] diffcore-rename: add a new idx_possible_rename function Elijah Newren via GitGitGadget
2021-02-24 17:35 ` Derrick Stolee
2021-02-25 1:13 ` Elijah Newren
2021-02-23 23:44 ` [PATCH v2 09/10] diffcore-rename: limit dir_rename_counts computation to relevant dirs Elijah Newren via GitGitGadget
2021-02-23 23:44 ` [PATCH v2 10/10] diffcore-rename: use directory rename guided basename comparisons Elijah Newren via GitGitGadget
2021-02-24 17:44 ` Derrick Stolee
2021-02-24 17:50 ` [PATCH v2 00/10] Optimization batch 8: use file basenames even more Derrick Stolee
2021-02-25 1:38 ` Elijah Newren
2021-02-26 1:58 ` [PATCH v3 " Elijah Newren via GitGitGadget
2021-02-26 1:58 ` [PATCH v3 01/10] diffcore-rename: use directory rename guided basename comparisons Elijah Newren via GitGitGadget
2021-02-26 1:58 ` [PATCH v3 02/10] diffcore-rename: add a new idx_possible_rename function Elijah Newren via GitGitGadget
2021-02-26 15:52 ` Derrick Stolee
2021-02-26 1:58 ` [PATCH v3 03/10] diffcore-rename: add a mapping of destination names to their indices Elijah Newren via GitGitGadget
2021-02-26 1:58 ` [PATCH v3 04/10] Move computation of dir_rename_count from merge-ort to diffcore-rename Elijah Newren via GitGitGadget
2021-02-26 15:55 ` Derrick Stolee
2021-02-26 1:58 ` [PATCH v3 05/10] diffcore-rename: add function for clearing dir_rename_count Elijah Newren via GitGitGadget
2021-02-26 1:58 ` [PATCH v3 06/10] diffcore-rename: move dir_rename_counts into dir_rename_info struct Elijah Newren via GitGitGadget
2021-02-26 1:58 ` [PATCH v3 07/10] diffcore-rename: extend cleanup_dir_rename_info() Elijah Newren via GitGitGadget
2021-02-26 1:58 ` [PATCH v3 08/10] diffcore-rename: compute dir_rename_counts in stages Elijah Newren via GitGitGadget
2021-02-26 1:58 ` [PATCH v3 09/10] diffcore-rename: limit dir_rename_counts computation to relevant dirs Elijah Newren via GitGitGadget
2021-02-26 1:58 ` [PATCH v3 10/10] diffcore-rename: compute dir_rename_guess from dir_rename_counts Elijah Newren via GitGitGadget
2021-02-26 16:34 ` [PATCH v3 00/10] Optimization batch 8: use file basenames even more Derrick Stolee
2021-02-26 19:28 ` Elijah Newren
2021-02-27 0:30 ` [PATCH v4 " Elijah Newren via GitGitGadget
2021-02-27 0:30 ` [PATCH v4 01/10] diffcore-rename: use directory rename guided basename comparisons Elijah Newren via GitGitGadget
2021-02-27 0:30 ` Elijah Newren via GitGitGadget [this message]
2021-02-27 0:30 ` [PATCH v4 03/10] diffcore-rename: add a mapping of destination names to their indices Elijah Newren via GitGitGadget
2021-02-27 0:30 ` [PATCH v4 04/10] Move computation of dir_rename_count from merge-ort to diffcore-rename Elijah Newren via GitGitGadget
2021-02-27 0:30 ` [PATCH v4 05/10] diffcore-rename: add function for clearing dir_rename_count Elijah Newren via GitGitGadget
2021-02-27 0:30 ` [PATCH v4 06/10] diffcore-rename: move dir_rename_counts into dir_rename_info struct Elijah Newren via GitGitGadget
2021-02-27 0:30 ` [PATCH v4 07/10] diffcore-rename: extend cleanup_dir_rename_info() Elijah Newren via GitGitGadget
2021-02-27 0:30 ` [PATCH v4 08/10] diffcore-rename: compute dir_rename_counts in stages Elijah Newren via GitGitGadget
2021-02-27 0:30 ` [PATCH v4 09/10] diffcore-rename: limit dir_rename_counts computation to relevant dirs Elijah Newren via GitGitGadget
2021-02-27 0:30 ` [PATCH v4 10/10] diffcore-rename: compute dir_rename_guess from dir_rename_counts Elijah Newren via GitGitGadget
2021-03-09 21:52 ` [PATCH v4 00/10] Optimization batch 8: use file basenames even more Derrick Stolee
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: http://vger.kernel.org/majordomo-info.html
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=2dde621d7de596d7aa0bb31245b04683de2fa3d9.1614385849.git.gitgitgadget@gmail.com \
--to=gitgitgadget@gmail.com \
--cc=avarab@gmail.com \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=newren@gmail.com \
--cc=stolee@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
Code repositories for project(s) associated with this public inbox
https://80x24.org/mirrors/git.git
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).