git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com>
To: git@vger.kernel.org
Cc: Derrick Stolee <dstolee@microsoft.com>,
	Jonathan Tan <jonathantanmy@google.com>,
	Taylor Blau <me@ttaylorr.com>, Junio C Hamano <gitster@pobox.com>,
	Jeff King <peff@peff.net>, Elijah Newren <newren@gmail.com>,
	Derrick Stolee <stolee@gmail.com>,
	Elijah Newren <newren@gmail.com>
Subject: [PATCH v5 0/6] Optimization batch 7: use file basenames to guide rename detection
Date: Sun, 14 Feb 2021 07:51:45 +0000	[thread overview]
Message-ID: <pull.843.v5.git.1613289112.gitgitgadget@gmail.com> (raw)
In-Reply-To: <pull.843.v4.git.1613031350.gitgitgadget@gmail.com>

This series depends on ort-perf-batch-6. It appears Junio has appended an
earlier round of this series on that one and called the combined series
en/diffcore-rename. I'm still resubmitting the two separately to preserve
the threaded discussion in the archives and because gitgitgadget can provide
proper range-diffs that way.

This series uses file basenames (portion of the path after final '/',
including extension) in a basic fashion to guide rename detection.

Changes since v4:

 * add wording to make it clearer that we are considering remaining
   basenames after exact rename detection
 * add three minor optimizations to patch 3. (All three will have to be
   undone by the next series, but this series is probably clearer with
   them.)
 * a typo fix or two
 * v2 of ort-perf-batch-6 added some changes around consistency of
   rename_src_nr; make similar changes in using this variable in
   find_basename_changes() for consistency
 * fix the testcase so the expected comments about the change in behavior
   only show up after we change the behavior
 * attempt a rewrite of the commit message for the new testcase, who knows
   if I'll get it right this time.

Elijah Newren (6):
  t4001: add a test comparing basename similarity and content similarity
  diffcore-rename: compute basenames of source and dest candidates
  diffcore-rename: complete find_basename_matches()
  diffcore-rename: guide inexact rename detection based on basenames
  gitdiffcore doc: mention new preliminary step for rename detection
  merge-ort: call diffcore_rename() directly

 Documentation/gitdiffcore.txt |  20 ++++
 diffcore-rename.c             | 190 +++++++++++++++++++++++++++++++++-
 merge-ort.c                   |  66 ++++++++++--
 t/t4001-diff-rename.sh        |  24 +++++
 4 files changed, 289 insertions(+), 11 deletions(-)


base-commit: dd6595b45640ee9894293e8b729ef9a254564a49
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-843%2Fnewren%2Fort-perf-batch-7-v5
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-843/newren/ort-perf-batch-7-v5
Pull-Request: https://github.com/gitgitgadget/git/pull/843

Range-diff vs v4:

 1:  3e6af929d135 ! 1:  6848422e47e8 t4001: add a test comparing basename similarity and content similarity
     @@ Commit message
      
          Add a simple test where a removed file is similar to two different added
          files; one of them has the same basename, and the other has a slightly
     -    higher content similarity.  Without break detection, filename similarity
     -    of 100% trumps content similarity for pairing up related files.  For
     -    any filename similarity less than 100%, the opposite is true -- content
     -    similarity is all that matters.  Add a testcase that documents this.
     +    higher content similarity.  In the current test, content similarity is
     +    weighted higher than filename similarity.
      
     -    Subsequent commits will add a new rule that includes an inbetween state,
     -    where a mixture of filename similarity and content similarity are
     -    weighed, and which will change the outcome of this testcase.
     +    Subsequent commits will add a new rule that weighs a mixture of filename
     +    similarity and content similarity in a manner that will change the
     +    outcome of this testcase.
      
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
     @@ t/t4001-diff-rename.sh: test_expect_success 'diff-tree -l0 defaults to a big ren
      +	git add file.txt file.md &&
      +	git commit -a -m "rename" &&
      +	git diff-tree -r -M --name-status HEAD^ HEAD >actual &&
     -+	# subdir/file.txt is 89% similar to file.md, 78% similar to file.txt,
     -+	# but since same basenames are checked first...
     ++	# subdir/file.txt is 88% similar to file.md and 78% similar to file.txt
      +	cat >expected <<-\EOF &&
      +	R088	subdir/file.txt	file.md
      +	A	file.txt
 2:  4fff9b1ff57b ! 2:  73baae2535d0 diffcore-rename: compute basenames of all source and dest candidates
     @@ Metadata
      Author: Elijah Newren <newren@gmail.com>
      
       ## Commit message ##
     -    diffcore-rename: compute basenames of all source and dest candidates
     +    diffcore-rename: compute basenames of source and dest candidates
      
     -    We want to make use of unique basenames to help inform rename detection,
     -    so that more likely pairings can be checked first.  (src/moduleA/foo.txt
     -    and source/module/A/foo.txt are likely related if there are no other
     -    'foo.txt' files among the deleted and added files.)  Add a new function,
     -    not yet used, which creates a map of the unique basenames within
     -    rename_src and another within rename_dst, together with the indices
     -    within rename_src/rename_dst where those basenames show up.  Non-unique
     -    basenames still show up in the map, but have an invalid index (-1).
     +    We want to make use of unique basenames among remaining source and
     +    destination files to help inform rename detection, so that more likely
     +    pairings can be checked first.  (src/moduleA/foo.txt and
     +    source/module/A/foo.txt are likely related if there are no other
     +    'foo.txt' files among the remaining deleted and added files.)  Add a new
     +    function, not yet used, which creates a map of the unique basenames
     +    within rename_src and another within rename_dst, together with the
     +    indices within rename_src/rename_dst where those basenames show up.
     +    Non-unique basenames still show up in the map, but have an invalid index
     +    (-1).
      
          This function was inspired by the fact that in real world repositories,
          files are often moved across directories without changing names.  Here
     @@ diffcore-rename.c: static int find_exact_renames(struct diff_options *options)
      +static const char *get_basename(const char *filename)
      +{
      +	/*
     -+	 * gitbasename() has to worry about special drivers, multiple
     ++	 * gitbasename() has to worry about special drives, multiple
      +	 * directory separator characters, trailing slashes, NULL or
      +	 * empty strings, etc.  We only work on filenames as stored in
      +	 * git, and thus get to ignore all those complications.
     @@ diffcore-rename.c: static int find_exact_renames(struct diff_options *options)
      +
      +MAYBE_UNUSED
      +static int find_basename_matches(struct diff_options *options,
     -+				 int minimum_score,
     -+				 int num_src)
     ++				 int minimum_score)
      +{
      +	int i;
      +	struct strintmap sources;
      +	struct strintmap dests;
      +
     -+	/* Create maps of basename -> fullname(s) for sources and dests */
     ++	/*
     ++	 * Create maps of basename -> fullname(s) for remaining sources and
     ++	 * dests.
     ++	 */
      +	strintmap_init_with_options(&sources, -1, NULL, 0);
      +	strintmap_init_with_options(&dests, -1, NULL, 0);
     -+	for (i = 0; i < num_src; ++i) {
     ++	for (i = 0; i < rename_src_nr; ++i) {
      +		char *filename = rename_src[i].p->one->path;
      +		const char *base;
      +
 3:  dc26881e4ed3 ! 3:  ece76429dc35 diffcore-rename: complete find_basename_matches()
     @@ Commit message
          This means we are adding a set of preliminary additional comparisons,
          but for each file we only compare it with at most one other file.  For
          example, if there was a include/media/device.h that was deleted and a
     -    src/module/media/device.h that was added, and there were no other
     -    device.h files added or deleted between the commits being compared,
     -    then these two files would be compared in the preliminary step.
     +    src/module/media/device.h that was added, and there are no other
     +    device.h files in the remaining sets of added and deleted files after
     +    exact rename detection, then these two files would be compared in the
     +    preliminary step.
      
          This commit does not yet actually employ this new optimization, it
          merely adds a function which can be used for this purpose.  The next
     @@ Commit message
          Signed-off-by: Elijah Newren <newren@gmail.com>
      
       ## diffcore-rename.c ##
     -@@ diffcore-rename.c: static int find_basename_matches(struct diff_options *options,
     - 				 int minimum_score,
     - 				 int num_src)
     +@@ diffcore-rename.c: MAYBE_UNUSED
     + static int find_basename_matches(struct diff_options *options,
     + 				 int minimum_score)
       {
      -	int i;
      +	/*
     @@ diffcore-rename.c: static int find_basename_matches(struct diff_options *options
      +	int i, renames = 0;
       	struct strintmap sources;
       	struct strintmap dests;
     - 
     ++	struct hashmap_iter iter;
     ++	struct strmap_entry *entry;
     ++
      +	/*
      +	 * The prefeteching stuff wants to know if it can skip prefetching
      +	 * blobs that are unmodified...and will then do a little extra work
     @@ diffcore-rename.c: static int find_basename_matches(struct diff_options *options
      +	 * unmodified would be a small waste.
      +	 */
      +	int skip_unmodified = 0;
     -+
     - 	/* Create maps of basename -> fullname(s) for sources and dests */
     - 	strintmap_init_with_options(&sources, -1, NULL, 0);
     - 	strintmap_init_with_options(&dests, -1, NULL, 0);
     + 
     + 	/*
     + 	 * Create maps of basename -> fullname(s) for remaining sources and
      @@ diffcore-rename.c: static int find_basename_matches(struct diff_options *options,
       			strintmap_set(&dests, base, i);
       	}
       
      -	/* TODO: Make use of basenames source and destination basenames */
      +	/* Now look for basename matchups and do similarity estimation */
     -+	for (i = 0; i < num_src; ++i) {
     -+		char *filename = rename_src[i].p->one->path;
     -+		const char *base = NULL;
     -+		intptr_t src_index;
     ++	strintmap_for_each_entry(&sources, &iter, entry) {
     ++		const char *base = entry->key;
     ++		intptr_t src_index = (intptr_t)entry->value;
      +		intptr_t dst_index;
     -+
     -+		/* Find out if this basename is unique among sources */
     -+		base = get_basename(filename);
     -+		src_index = strintmap_get(&sources, base);
      +		if (src_index == -1)
     -+			continue; /* not a unique basename; skip it */
     -+		assert(src_index == i);
     ++			continue;
      +
     -+		if (strintmap_contains(&dests, base)) {
     ++		if (0 <= (dst_index = strintmap_get(&dests, base))) {
      +			struct diff_filespec *one, *two;
      +			int score;
      +
     -+			/* Find out if this basename is unique among dests */
     -+			dst_index = strintmap_get(&dests, base);
     -+			if (dst_index == -1)
     -+				continue; /* not a unique basename; skip it */
     -+
     -+			/* Ignore this dest if already used in a rename */
     -+			if (rename_dst[dst_index].is_rename)
     -+				continue; /* already used previously */
     -+
      +			/* Estimate the similarity */
      +			one = rename_src[src_index].p->one;
      +			two = rename_dst[dst_index].p->two;
 4:  2493f4b2f55d ! 4:  122902e2706f diffcore-rename: guide inexact rename detection based on basenames
     @@ Commit message
          files based on basenames.  As a quick reminder (see the last two commit
          messages for more details), this means for example that
          docs/extensions.txt and docs/config/extensions.txt are considered likely
     -    renames if there are no 'extensions.txt' files elsewhere among the added
     -    and deleted files, and if a similarity check confirms they are similar,
     -    then they are marked as a rename without looking for a better similarity
     -    match among other files.  This is a behavioral change, as covered in
     -    more detail in the previous commit message.
     +    renames if there are no remaining 'extensions.txt' files elsewhere among
     +    the added and deleted files, and if a similarity check confirms they are
     +    similar, then they are marked as a rename without looking for a better
     +    similarity match among other files.  This is a behavioral change, as
     +    covered in more detail in the previous commit message.
      
          We do not use this heuristic together with either break or copy
          detection.  The point of break detection is to say that filename
     @@ diffcore-rename.c: static const char *get_basename(const char *filename)
       
      -MAYBE_UNUSED
       static int find_basename_matches(struct diff_options *options,
     - 				 int minimum_score,
     - 				 int num_src)
     + 				 int minimum_score)
     + {
      @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
       	if (minimum_score == MAX_SCORE)
       		goto cleanup;
       
     -+	num_sources = rename_src_nr;
     -+
     +-	/* Calculate how many renames are left */
     +-	num_destinations = (rename_dst_nr - rename_count);
     +-	remove_unneeded_paths_from_src(want_copies);
     + 	num_sources = rename_src_nr;
     + 
      +	if (want_copies || break_idx) {
      +		/*
      +		 * Cull sources:
     @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
      +		/* Utilize file basenames to quickly find renames. */
      +		trace2_region_enter("diff", "basename matches", options->repo);
      +		rename_count += find_basename_matches(options,
     -+						      min_basename_score,
     -+						      rename_src_nr);
     ++						      min_basename_score);
      +		trace2_region_leave("diff", "basename matches", options->repo);
      +
      +		/*
     @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
      +		trace2_region_leave("diff", "cull basename", options->repo);
      +	}
      +
     - 	/*
     --	 * Calculate how many renames are left
     -+	 * Calculate how many rename destinations are left
     - 	 */
     - 	num_destinations = (rename_dst_nr - rename_count);
     --	remove_unneeded_paths_from_src(want_copies);
     --	num_sources = rename_src_nr;
     ++	/* Calculate how many rename destinations are left */
     ++	num_destinations = (rename_dst_nr - rename_count);
      +	num_sources = rename_src_nr; /* rename_src_nr reflects lower number */
     - 
     ++
       	/* All done? */
       	if (!num_destinations || !num_sources)
     + 		goto cleanup;
      @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
       		struct diff_score *m;
       
     @@ diffcore-rename.c: void diffcore_rename(struct diff_options *options)
      
       ## t/t4001-diff-rename.sh ##
      @@ t/t4001-diff-rename.sh: test_expect_success 'basename similarity vs best similarity' '
     - 	# subdir/file.txt is 89% similar to file.md, 78% similar to file.txt,
     - 	# but since same basenames are checked first...
     + 	git add file.txt file.md &&
     + 	git commit -a -m "rename" &&
     + 	git diff-tree -r -M --name-status HEAD^ HEAD >actual &&
     +-	# subdir/file.txt is 88% similar to file.md and 78% similar to file.txt
     ++	# subdir/file.txt is 88% similar to file.md, 78% similar to file.txt,
     ++	# but since same basenames are checked first...
       	cat >expected <<-\EOF &&
      -	R088	subdir/file.txt	file.md
      -	A	file.txt
 5:  4e86ed3f29d4 ! 5:  6f5584f61350 gitdiffcore doc: mention new preliminary step for rename detection
     @@ Documentation/gitdiffcore.txt: a similarity score different from the default of
      +preliminary "match same filename" step uses a bit higher threshold to
      +mark a file pair as a rename and stop considering other candidates for
      +better matches.  At most, one comparison is done per file in this
     -+preliminary pass; so if there are several ext.txt files throughout the
     -+directory hierarchy that were added and deleted, this preliminary step
     -+will be skipped for those files.
     ++preliminary pass; so if there are several remaining ext.txt files
     ++throughout the directory hierarchy after exact rename detection, this
     ++preliminary step will be skipped for those files.
      +
       Note.  When the "-C" option is used with `--find-copies-harder`
       option, 'git diff-{asterisk}' commands feed unmodified filepairs to
 6:  fedb3d323d94 = 6:  aeca14f748af merge-ort: call diffcore_rename() directly

-- 
gitgitgadget

  parent reply	other threads:[~2021-02-14  7:54 UTC|newest]

Thread overview: 71+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-02-06 22:52 [PATCH 0/3] Optimization batch 7: use file basenames to guide rename detection Elijah Newren via GitGitGadget
2021-02-06 22:52 ` [PATCH 1/3] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
2021-02-06 22:52 ` [PATCH 2/3] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-06 22:52 ` [PATCH 3/3] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-07 14:38   ` Derrick Stolee
2021-02-07 19:51     ` Junio C Hamano
2021-02-08  8:38       ` Elijah Newren
2021-02-08 11:43         ` Derrick Stolee
2021-02-08 16:25           ` Elijah Newren
2021-02-08 17:37         ` Junio C Hamano
2021-02-08 22:00           ` Elijah Newren
2021-02-08 23:43             ` Junio C Hamano
2021-02-08 23:52               ` Elijah Newren
2021-02-08  8:27     ` Elijah Newren
2021-02-08 11:31       ` Derrick Stolee
2021-02-08 16:09         ` Elijah Newren
2021-02-07  5:19 ` [PATCH 0/3] Optimization batch 7: use file basenames to guide rename detection Junio C Hamano
2021-02-07  6:05   ` Elijah Newren
2021-02-09 11:32 ` [PATCH v2 0/4] " Elijah Newren via GitGitGadget
2021-02-09 11:32   ` [PATCH v2 1/4] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
2021-02-09 13:17     ` Derrick Stolee
2021-02-09 16:56       ` Elijah Newren
2021-02-09 17:02         ` Derrick Stolee
2021-02-09 17:42           ` Elijah Newren
2021-02-09 11:32   ` [PATCH v2 2/4] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-09 13:25     ` Derrick Stolee
2021-02-09 17:17       ` Elijah Newren
2021-02-09 17:34         ` Derrick Stolee
2021-02-09 11:32   ` [PATCH v2 3/4] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-09 13:33     ` Derrick Stolee
2021-02-09 17:41       ` Elijah Newren
2021-02-09 18:59         ` Junio C Hamano
2021-02-09 11:32   ` [PATCH v2 4/4] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
2021-02-09 12:59     ` Derrick Stolee
2021-02-09 17:03       ` Junio C Hamano
2021-02-09 17:44         ` Elijah Newren
2021-02-10 15:15   ` [PATCH v3 0/5] Optimization batch 7: use file basenames to guide " Elijah Newren via GitGitGadget
2021-02-10 15:15     ` [PATCH v3 1/5] t4001: add a test comparing basename similarity and content similarity Elijah Newren via GitGitGadget
2021-02-13  1:15       ` Junio C Hamano
2021-02-13  4:50         ` Elijah Newren
2021-02-13 23:56           ` Junio C Hamano
2021-02-14  1:24             ` Elijah Newren
2021-02-14  1:32               ` Junio C Hamano
2021-02-14  3:14                 ` Elijah Newren
2021-02-10 15:15     ` [PATCH v3 2/5] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
2021-02-13  1:32       ` Junio C Hamano
2021-02-10 15:15     ` [PATCH v3 3/5] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-13  1:48       ` Junio C Hamano
2021-02-13 18:34         ` Elijah Newren
2021-02-13 23:55           ` Junio C Hamano
2021-02-14  3:08             ` Elijah Newren
2021-02-10 15:15     ` [PATCH v3 4/5] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-13  1:49       ` Junio C Hamano
2021-02-10 15:15     ` [PATCH v3 5/5] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
2021-02-10 16:41       ` Junio C Hamano
2021-02-10 17:20         ` Elijah Newren
2021-02-11  8:15     ` [PATCH v4 0/6] Optimization batch 7: use file basenames to guide " Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 1/6] t4001: add a test comparing basename similarity and content similarity Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 2/6] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 3/6] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 4/6] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 5/6] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 6/6] merge-ort: call diffcore_rename() directly Elijah Newren via GitGitGadget
2021-02-13  1:53       ` [PATCH v4 0/6] Optimization batch 7: use file basenames to guide rename detection Junio C Hamano
2021-02-14  7:51       ` Elijah Newren via GitGitGadget [this message]
2021-02-14  7:51         ` [PATCH v5 1/6] t4001: add a test comparing basename similarity and content similarity Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 2/6] diffcore-rename: compute basenames of source and dest candidates Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 3/6] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 4/6] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 5/6] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 6/6] merge-ort: call diffcore_rename() directly Elijah Newren via GitGitGadget

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=pull.843.v5.git.1613289112.gitgitgadget@gmail.com \
    --to=gitgitgadget@gmail.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jonathantanmy@google.com \
    --cc=me@ttaylorr.com \
    --cc=newren@gmail.com \
    --cc=peff@peff.net \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).