git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Elijah Newren <newren@gmail.com>
To: Derrick Stolee <stolee@gmail.com>
Cc: Junio C Hamano <gitster@pobox.com>,
	Elijah Newren via GitGitGadget <gitgitgadget@gmail.com>,
	Git Mailing List <git@vger.kernel.org>,
	Derrick Stolee <dstolee@microsoft.com>,
	Jonathan Tan <jonathantanmy@google.com>,
	Taylor Blau <me@ttaylorr.com>, Jeff King <peff@peff.net>
Subject: Re: [PATCH 3/3] diffcore-rename: guide inexact rename detection based on basenames
Date: Mon, 8 Feb 2021 08:25:00 -0800	[thread overview]
Message-ID: <CABPp-BF-YB4gjAPFWExY01ffLY1oLdUhih99AER9Sy0xaA1VfQ@mail.gmail.com> (raw)
In-Reply-To: <57d30e7d-7727-8d98-e3ef-bcfeebf9edd3@gmail.com>

On Mon, Feb 8, 2021 at 3:44 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 2/8/2021 3:38 AM, Elijah Newren wrote:
> > On Sun, Feb 7, 2021 at 11:51 AM Junio C Hamano <gitster@pobox.com> wrote:
> >>
> >> Derrick Stolee <stolee@gmail.com> writes:
> >>
> >>> Before I get too deep into reviewing these patches, I do want
> >>> to make it clear that the speed-up is coming at the cost of
> >>> a behavior change. We are restricting the "best match" search
> >>> to be first among files with common base name (although maybe
> >>> I would use 'suffix'?). If we search for a rename among all
> >>> additions and deletions ending the ".txt" we might find a
> >>> similarity match that is 60% and declare that a rename, even
> >>> if there is a ".txt" -> ".md" pair that has a 70% match.
> >>
> >> Yes, my initial reaction to the idea was that "yuck, our rename
> >> detection lost its purity".  diffcore-rename strived to base its
> >> decision purely on content similarity, primarily because it is one
> >> of the oldest part of Git where the guiding principle has always
> >> been that the content is the king.  I think my aversion to the "all
> >> of my neighbors are relocating, so should I move to the same place"
> >> (aka "directory rename") comes from a similar place, but in a sense
> >> this was worse.
> >>
> >> At least, until I got over the initial bump.  I do not think the
> >> suffix match is necessarily a bad idea, but it adds more "magically
> >> doing a wrong thing" failure modes (e.g. the ".txt" to ".md" example
> >> would probably have more variants that impact the real life
> >> projects; ".C" vs ".cc" vs ".cxx" vs ".cpp" immediately comes to
> >> mind), and a tool that silently does a wrong thing because it uses
> >> more magic would be a tool that is hard to explain why it did the
> >> wrong thing when it does.
> >
> > Stolee explained a new algorithm different than what I have proposed,
>
> Yes, sorry for adding noise. The point stands that we are changing
> the behavior in some cases, so that must be agreed upon. What you
> are _actually_ proposing is a much smaller change than I thought,
> but it is still worth pointing out the behavior change.

Again, no need to apologize; if my commit messages weren't clear, they
need to be fixed.  I am much more surprised here by your repeated
point that it's worth pointing out the behavior change.  I totally
agree with that, and it's why I spent two paragraphs on the second
commit message explicitly covering this and listing four enumerated
reasons to argue for the change.  So, I thought I had done that, but
it apparently isn't very clear...and I'm left wondering how to clarify
it further.  So, a question for you: How should I change it?  Should I
just modify the third commit message to re-highlight that it does
change behavior and refer to the second commit message for details?
Because repeating the point is the only way I can think of to make it
clearer.  Is there anything else I can or should do?

> > I think based on the apparent different meanings of "basename" that
> > exist.  I tried to clarify that in response to his email, but I wanted
> > to clarify one additional thing here too:
> >> diffcore-rename has not in the past based its decision solely on
> > content similarity.  It only does that when break detection is on.
> > Otherwise, 0% content similarity is trumped by sufficient filename
> > similarity (namely, with a filename similarity of 100%).  If the
> > filename similarity wasn't sufficiently high (anything less than an
> > exact match), then it completely ignored filename similarity and
> > looked only at content similarity.  It thus jumped from one extreme to
> > another.
>
> This idea of optimizing first for 100% filename similarity is a
> good perspective on Git's rename detection algorithm. The canonical
> example of this 100% filename similarity is a rename cycle:
>
>         A -> B
>         B -> C
>         C -> A
>
> Even if the OIDs are distinct and exactly match across these renames,
> we see that there are no adds or deletes, so we do not even trigger
> rename detection and report A, B, and C as edited instead.
>
> A "rename path" (not cycle) such as:
>
>         A -> B
>         B -> C
>
> does trigger rename detection, but B will never be considered. Instead,
> "A -> C" will be checked for similarity to see if it is within the
> threshold.
>
> Of course, I am _not_ advocating that we change this behavior. These
> situations are incredibly rare and we should not sacrifice performance
> in the typical case to handle them.
>
> > My optimization is adding an in-between state.  When the basename (the
> > part of the path excluding the leading directory) matches the basename
> > of another file (and those basenames are unique on each side), then
> > compare content similarity and mark the files as a rename if the two
> > are sufficiently similar.  It is thus a position that considers both
> > filename similarity (basename match) and content similarity together.
> >
> >>> This could be documented in a test case, to demonstrate that
> >>> we are making this choice explicitly.
> >>
> >> Yes.  I wonder if we can solve it by requiring a lot better than
> >> minimum match when trying the "suffix match" first, or something?
> >
> > This may still be a useful idea, and was something I had considered,
> > but more in the context of more generic filename similarity
> > comparisons.  We could still discuss it even when basenames match, but
> > basenames matching seems strong enough to me that I wasn't sure extra
> > configuration knobs were warranted.
>
> I think this is a complication that we might not want to add to the
> heuristic, at least not at first. We might want to have a follow-up
> that adjusts that value to be higher. A natural way would be through
> a config option, so users can select something incredibly high like
> 99%. Another option would be to take a minimum that is halfway between
> the existing similarity minimum and 100%.
>
> >> Provided if we agree that it is a good idea to insert this between
> >> "exact contents match" and "full matrix", I have one question to
> >> Elijah on what the code does.
> >>
> >> To me, it seems that the "full matrix" part still uses the remaining
> >> src and dst candidates fully.  But if "A.txt" and "B.txt" are still
> >> surviving in the src/dst at that stage, shouldn't we be saying that
> >> "no way these can be similar enough---we've checked in the middle
> >> stage where only the ones with the same suffix are considered and
> >> this pair didn't turn into a rename"?
> >
> > This is a very good point.  A.txt and B.txt will not have been
> > compared previously since their basenames do not match, but the basic
> > idea is still possible.  For example, A.txt could have been compared
> > to source/some-module/A.txt.  And I don't do anything in the final
> > "full matrix" stage to avoid re-comparing those two files again.
> > However, it is worth noting that A.txt will have been compared to at
> > most one other file, not N files.  And thus while we are wasting some
> > re-comparisons, it is at most O(N) duplicated comparisons, not O(N^2).
> > I thought about that, but decided to not bother, based on the
> > following thinking:
> >
> > 1) The most expensive comparison is the first one, because when we do
> > that one, we first have to populate the list of integers that lines in
> > the file hash to.  Subsequent comparisons are relatively cheap since
> > this list of integers has already been computed.
> >
> > 2) This would only save us from at most N comparisons in the N x M
> > matrix (since no file in this optimization is compared to more than
> > one other)
> >
> > 3) Checking if two files have previously been compared requires more
> > code, in what is already a tight nested loop.  My experience
> > attempting to modify that tight loop for extra conditions (e.g. don't
> > compare files that are too large), is that it's easy to accidentally
> > make the code slower.  In fact, this is in part what led to the
> > addition of the remove_unneed_paths_from_src() function.
>
> Even storing a single bit to say "these were already compared" takes
> quadratic space. The hope is to not have quadratic behavior if it
> can be avoided.
>
> > 4) There were plenty of other interesting ideas and maybe I was a tad lazy.  :-)
> >
> > I think removing these already-compared cases could be done, but I
> > just avoided it.  If we were to do the "attempt to match files with
> > the same extension" optimization that Stolee outlines/invents above,
> > then we'd definitely need to consider it.  Otherwise, it's just a
> > minor additional optimization that someone could add to my patches.
>
> The more I think about it, the less my idea makes sense. I'm sorry
> for adding noise to the thread.
>
> Thanks,
> -Stolee

  reply	other threads:[~2021-02-08 16:28 UTC|newest]

Thread overview: 71+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-02-06 22:52 [PATCH 0/3] Optimization batch 7: use file basenames to guide rename detection Elijah Newren via GitGitGadget
2021-02-06 22:52 ` [PATCH 1/3] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
2021-02-06 22:52 ` [PATCH 2/3] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-06 22:52 ` [PATCH 3/3] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-07 14:38   ` Derrick Stolee
2021-02-07 19:51     ` Junio C Hamano
2021-02-08  8:38       ` Elijah Newren
2021-02-08 11:43         ` Derrick Stolee
2021-02-08 16:25           ` Elijah Newren [this message]
2021-02-08 17:37         ` Junio C Hamano
2021-02-08 22:00           ` Elijah Newren
2021-02-08 23:43             ` Junio C Hamano
2021-02-08 23:52               ` Elijah Newren
2021-02-08  8:27     ` Elijah Newren
2021-02-08 11:31       ` Derrick Stolee
2021-02-08 16:09         ` Elijah Newren
2021-02-07  5:19 ` [PATCH 0/3] Optimization batch 7: use file basenames to guide rename detection Junio C Hamano
2021-02-07  6:05   ` Elijah Newren
2021-02-09 11:32 ` [PATCH v2 0/4] " Elijah Newren via GitGitGadget
2021-02-09 11:32   ` [PATCH v2 1/4] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
2021-02-09 13:17     ` Derrick Stolee
2021-02-09 16:56       ` Elijah Newren
2021-02-09 17:02         ` Derrick Stolee
2021-02-09 17:42           ` Elijah Newren
2021-02-09 11:32   ` [PATCH v2 2/4] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-09 13:25     ` Derrick Stolee
2021-02-09 17:17       ` Elijah Newren
2021-02-09 17:34         ` Derrick Stolee
2021-02-09 11:32   ` [PATCH v2 3/4] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-09 13:33     ` Derrick Stolee
2021-02-09 17:41       ` Elijah Newren
2021-02-09 18:59         ` Junio C Hamano
2021-02-09 11:32   ` [PATCH v2 4/4] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
2021-02-09 12:59     ` Derrick Stolee
2021-02-09 17:03       ` Junio C Hamano
2021-02-09 17:44         ` Elijah Newren
2021-02-10 15:15   ` [PATCH v3 0/5] Optimization batch 7: use file basenames to guide " Elijah Newren via GitGitGadget
2021-02-10 15:15     ` [PATCH v3 1/5] t4001: add a test comparing basename similarity and content similarity Elijah Newren via GitGitGadget
2021-02-13  1:15       ` Junio C Hamano
2021-02-13  4:50         ` Elijah Newren
2021-02-13 23:56           ` Junio C Hamano
2021-02-14  1:24             ` Elijah Newren
2021-02-14  1:32               ` Junio C Hamano
2021-02-14  3:14                 ` Elijah Newren
2021-02-10 15:15     ` [PATCH v3 2/5] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
2021-02-13  1:32       ` Junio C Hamano
2021-02-10 15:15     ` [PATCH v3 3/5] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-13  1:48       ` Junio C Hamano
2021-02-13 18:34         ` Elijah Newren
2021-02-13 23:55           ` Junio C Hamano
2021-02-14  3:08             ` Elijah Newren
2021-02-10 15:15     ` [PATCH v3 4/5] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-13  1:49       ` Junio C Hamano
2021-02-10 15:15     ` [PATCH v3 5/5] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
2021-02-10 16:41       ` Junio C Hamano
2021-02-10 17:20         ` Elijah Newren
2021-02-11  8:15     ` [PATCH v4 0/6] Optimization batch 7: use file basenames to guide " Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 1/6] t4001: add a test comparing basename similarity and content similarity Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 2/6] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 3/6] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 4/6] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 5/6] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 6/6] merge-ort: call diffcore_rename() directly Elijah Newren via GitGitGadget
2021-02-13  1:53       ` [PATCH v4 0/6] Optimization batch 7: use file basenames to guide rename detection Junio C Hamano
2021-02-14  7:51       ` [PATCH v5 " Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 1/6] t4001: add a test comparing basename similarity and content similarity Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 2/6] diffcore-rename: compute basenames of source and dest candidates Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 3/6] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 4/6] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 5/6] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 6/6] merge-ort: call diffcore_rename() directly Elijah Newren via GitGitGadget

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CABPp-BF-YB4gjAPFWExY01ffLY1oLdUhih99AER9Sy0xaA1VfQ@mail.gmail.com \
    --to=newren@gmail.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=gitster@pobox.com \
    --cc=jonathantanmy@google.com \
    --cc=me@ttaylorr.com \
    --cc=peff@peff.net \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).