git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Elijah Newren <newren@gmail.com>
To: Derrick Stolee <stolee@gmail.com>
Cc: Elijah Newren via GitGitGadget <gitgitgadget@gmail.com>,
	Git Mailing List <git@vger.kernel.org>,
	Derrick Stolee <dstolee@microsoft.com>,
	Jonathan Tan <jonathantanmy@google.com>,
	Taylor Blau <me@ttaylorr.com>, Junio C Hamano <gitster@pobox.com>,
	Jeff King <peff@peff.net>
Subject: Re: [PATCH 3/3] diffcore-rename: guide inexact rename detection based on basenames
Date: Mon, 8 Feb 2021 00:27:00 -0800	[thread overview]
Message-ID: <CABPp-BHPgUHFFzTd7suhqj=zEXQ61vxKU6X9gZvow5a=TLg3iw@mail.gmail.com> (raw)
In-Reply-To: <9fbed0f9-032e-3f99-8467-f8a9cfa2d8f1@gmail.com>

Hi,

On Sun, Feb 7, 2021 at 6:38 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 2/6/21 5:52 PM, Elijah Newren via GitGitGadget wrote:
> > From: Elijah Newren <newren@gmail.com>
> >
> > Make use of the new find_basename_matches() function added in the last
> > two patches, to find renames more rapidly in cases where we can match up
> > files based on basenames.
>
> This is a valuable heuristic.
>
> > For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
> > performance work; instrument with trace2_region_* calls", 2020-10-28),
> > this change improves the performance as follows:
> >
> >                             Before                  After
> >     no-renames:       13.815 s ±  0.062 s    13.138 s ±  0.086 s
> >     mega-renames:   1799.937 s ±  0.493 s   169.488 s ±  0.494 s
> >     just-one-mega:    51.289 s ±  0.019 s     5.061 s ±  0.017 s
>
> These numbers are very impressive.
>
> Before I get too deep into reviewing these patches, I do want
> to make it clear that the speed-up is coming at the cost of
> a behavior change. We are restricting the "best match" search
> to be first among files with common base name (although maybe
> I would use 'suffix'?). If we search for a rename among all
> additions and deletions ending the ".txt" we might find a
> similarity match that is 60% and declare that a rename, even
> if there is a ".txt" -> ".md" pair that has a 70% match.

I'm glad you all are open to possible behavioral changes, but I was
proposing a much smaller behavioral change that is quite different
than what you have suggested here.  Perhaps my wording was poor; I
apologize for forgetting that "basename" has different meanings in
different contexts.  Let me try again; I am not treating the filename
extension as special in any manner here; by "basename" I just mean the
portion of the path ignoring any leading directories.  Thus
    src/foo.txt
might be a good match against
    source/foo.txt
but this optimization as a preliminary step would not consider
matching src/foo.txt against any of
    source/bar.txt
    source/foo.md
since the basenames ('bar.txt' and 'foo.md') do not match our original
file's basename ('foo.txt').

Of course, if this preliminary optimization step fails to find another
"foo.txt" to match src/foo.txt against (or finds more than one and
thus doesn't compare against any of them), then the fallback inexact
rename detection matrix might match it against either of those two
latter paths, as it always has.

> This could be documented in a test case, to demonstrate that
> we are making this choice explicitly.
>
> For example, here is a test that passes now, but would
> start failing with your patches (if I understand them
> correctly):
>
> diff --git a/t/t4001-diff-rename.sh b/t/t4001-diff-rename.sh
> index c16486a9d41..e4c71fcf3be 100755
> --- a/t/t4001-diff-rename.sh
> +++ b/t/t4001-diff-rename.sh
> @@ -262,4 +262,21 @@ test_expect_success 'diff-tree -l0 defaults to a big rename limit, not zero' '
>         grep "myotherfile.*myfile" actual
>  '
>
> +test_expect_success 'multiple similarity choices' '
> +       test_write_lines line1 line2 line3 line4 line5 \
> +                        line6 line7 line8 line9 line10 >delete.txt &&
> +       git add delete.txt &&
> +       git commit -m "base txt" &&
> +
> +       rm delete.txt &&
> +       test_write_lines line1 line2 line3 line4 line5 \
> +                         line6 line7 line8 >add.txt &&
> +       test_write_lines line1 line2 line3 line4 line5 \
> +                         line6 line7 line8 line9 >add.md &&
> +       git add add.txt add.md &&
> +       git commit -a -m "rename" &&
> +       git diff-tree -M HEAD HEAD^ >actual &&
> +       grep "add.md    delete.txt" actual
> +'
> +
>  test_done
>
> Personally, I'm fine with making this assumption. All of
> our renames are based on heuristics, so any opportunity
> to reduce the number of content comparisons is a win in
> my mind. We also don't report a rename unless there _is_
> an add/delete pair that is sufficiently close in content.
>
> So, in this way, we are changing the optimization function
> that is used to determine the "best" rename available. It
> might be good to update documentation for how we choose
> renames:

Seems reasonable; I'll add some commentary below on the rules...

>
>   An add/delete pair is marked as a rename based on the
>   following similarity function:
>

0. Unless break detection is on, files with the same fullname are
considered the same file even if their content is completely
different.  (With break detection turned on, we can have e.g. both
src/foo.txt -> src/bar.txt and otherdir/baz.txt -> src/foo.txt, i.e.
src/foo.txt can be both a source and a destination of a rename.)

[The merge machinery never turns break detection on, but
diffcore-rename is used by git diff, git log, etc. too, so if we're
documenting the rules we should cover all the cases.]

>   1. If the blob content is identical, then those files
>      are marked as a rename. (Should we break ties here
>      based on the basename?)

find_identical_files() already breaks ties based on basename_same(),
yes.  So there's another area of the code that uses basenames to guide
rename detection already, just in a much more limited fashion.

>   2. Among pairs whose content matches the minimum
>      similarity limit, we optimize for:
>
>      i. among files with the same basename (trailer
>         after final '.') select pairs with highest
>         similarity.

This is an interesting idea, but is not what I implemented.  It is
possible that your suggestion is also a useful optimization; it'd be
hard to know without trying.  However, as noted in optimization batch
8 that I'll be submitting later, I'm worried about having any
optimization pre-steps doing more than O(1) comparisons per path (and
here you suggest comparing each .txt file with all other .txt files);
doing that can interact badly with optimization batch 9.
Additionally, unless we do something to avoid re-comparing files again
when doing the later all-unmatched-files-against-each-other check,
then worst case behavior can approach twice as slow as the original
code.

Anyway, the explanation I'd use for the optimization I've added in
this series is:

       i. if looking through the two sets (of add pairs, and of delete
pairs), there is exactly one file with the same basename from each
set, and they have the minimum similarity, then mark them as a rename

Optimization batch 8 will extend this particular rule.

Optimization batches 9 and 10 will optimize the rename detection more,
but instead of using rule changes, will instead pass in a list of
"irrelevant" sources that can be skipped.  The trick is in determining
source files that are irrelevant and why.  I'm not sure if we want to
also mention in the rules that different areas of the code (the merge
machinery, log --follow, etc.) can make the rename detection focus
just on some "relevant" subset of files.  (Which will also touch on
optimization batch 12.)

>     ii. if no files with the same basename have the
>         minimum similarity, then select pairs with
>         highest similarity across all filenames.

Yes, this will remain as the fallback at the very end.

> The above was written quickly as an attempt, so it will
> require careful editing to actually make sense to end
> users.

Yeah, and we probably also need to mention copy detection above
somehow too, and add more precise wording about how break detection is
involved.

  parent reply	other threads:[~2021-02-08  8:29 UTC|newest]

Thread overview: 71+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-02-06 22:52 [PATCH 0/3] Optimization batch 7: use file basenames to guide rename detection Elijah Newren via GitGitGadget
2021-02-06 22:52 ` [PATCH 1/3] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
2021-02-06 22:52 ` [PATCH 2/3] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-06 22:52 ` [PATCH 3/3] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-07 14:38   ` Derrick Stolee
2021-02-07 19:51     ` Junio C Hamano
2021-02-08  8:38       ` Elijah Newren
2021-02-08 11:43         ` Derrick Stolee
2021-02-08 16:25           ` Elijah Newren
2021-02-08 17:37         ` Junio C Hamano
2021-02-08 22:00           ` Elijah Newren
2021-02-08 23:43             ` Junio C Hamano
2021-02-08 23:52               ` Elijah Newren
2021-02-08  8:27     ` Elijah Newren [this message]
2021-02-08 11:31       ` Derrick Stolee
2021-02-08 16:09         ` Elijah Newren
2021-02-07  5:19 ` [PATCH 0/3] Optimization batch 7: use file basenames to guide rename detection Junio C Hamano
2021-02-07  6:05   ` Elijah Newren
2021-02-09 11:32 ` [PATCH v2 0/4] " Elijah Newren via GitGitGadget
2021-02-09 11:32   ` [PATCH v2 1/4] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
2021-02-09 13:17     ` Derrick Stolee
2021-02-09 16:56       ` Elijah Newren
2021-02-09 17:02         ` Derrick Stolee
2021-02-09 17:42           ` Elijah Newren
2021-02-09 11:32   ` [PATCH v2 2/4] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-09 13:25     ` Derrick Stolee
2021-02-09 17:17       ` Elijah Newren
2021-02-09 17:34         ` Derrick Stolee
2021-02-09 11:32   ` [PATCH v2 3/4] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-09 13:33     ` Derrick Stolee
2021-02-09 17:41       ` Elijah Newren
2021-02-09 18:59         ` Junio C Hamano
2021-02-09 11:32   ` [PATCH v2 4/4] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
2021-02-09 12:59     ` Derrick Stolee
2021-02-09 17:03       ` Junio C Hamano
2021-02-09 17:44         ` Elijah Newren
2021-02-10 15:15   ` [PATCH v3 0/5] Optimization batch 7: use file basenames to guide " Elijah Newren via GitGitGadget
2021-02-10 15:15     ` [PATCH v3 1/5] t4001: add a test comparing basename similarity and content similarity Elijah Newren via GitGitGadget
2021-02-13  1:15       ` Junio C Hamano
2021-02-13  4:50         ` Elijah Newren
2021-02-13 23:56           ` Junio C Hamano
2021-02-14  1:24             ` Elijah Newren
2021-02-14  1:32               ` Junio C Hamano
2021-02-14  3:14                 ` Elijah Newren
2021-02-10 15:15     ` [PATCH v3 2/5] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
2021-02-13  1:32       ` Junio C Hamano
2021-02-10 15:15     ` [PATCH v3 3/5] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-13  1:48       ` Junio C Hamano
2021-02-13 18:34         ` Elijah Newren
2021-02-13 23:55           ` Junio C Hamano
2021-02-14  3:08             ` Elijah Newren
2021-02-10 15:15     ` [PATCH v3 4/5] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-13  1:49       ` Junio C Hamano
2021-02-10 15:15     ` [PATCH v3 5/5] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
2021-02-10 16:41       ` Junio C Hamano
2021-02-10 17:20         ` Elijah Newren
2021-02-11  8:15     ` [PATCH v4 0/6] Optimization batch 7: use file basenames to guide " Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 1/6] t4001: add a test comparing basename similarity and content similarity Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 2/6] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 3/6] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 4/6] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 5/6] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 6/6] merge-ort: call diffcore_rename() directly Elijah Newren via GitGitGadget
2021-02-13  1:53       ` [PATCH v4 0/6] Optimization batch 7: use file basenames to guide rename detection Junio C Hamano
2021-02-14  7:51       ` [PATCH v5 " Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 1/6] t4001: add a test comparing basename similarity and content similarity Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 2/6] diffcore-rename: compute basenames of source and dest candidates Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 3/6] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 4/6] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 5/6] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 6/6] merge-ort: call diffcore_rename() directly Elijah Newren via GitGitGadget

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CABPp-BHPgUHFFzTd7suhqj=zEXQ61vxKU6X9gZvow5a=TLg3iw@mail.gmail.com' \
    --to=newren@gmail.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=gitster@pobox.com \
    --cc=jonathantanmy@google.com \
    --cc=me@ttaylorr.com \
    --cc=peff@peff.net \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).