git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Elijah Newren <newren@gmail.com>
To: Junio C Hamano <gitster@pobox.com>
Cc: Derrick Stolee <stolee@gmail.com>,
	Elijah Newren via GitGitGadget <gitgitgadget@gmail.com>,
	Git Mailing List <git@vger.kernel.org>,
	Derrick Stolee <dstolee@microsoft.com>,
	Jonathan Tan <jonathantanmy@google.com>,
	Taylor Blau <me@ttaylorr.com>, Jeff King <peff@peff.net>
Subject: Re: [PATCH 3/3] diffcore-rename: guide inexact rename detection based on basenames
Date: Mon, 8 Feb 2021 14:00:30 -0800	[thread overview]
Message-ID: <CABPp-BH_DWEE-3M96e=PPNwDqeYPaax9s1kBDhS8a6GtxsW=Mg@mail.gmail.com> (raw)
In-Reply-To: <xmqq4kim7964.fsf@gitster.c.googlers.com>

Hi,

On Mon, Feb 8, 2021 at 9:37 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> Elijah Newren <newren@gmail.com> writes:
>
> > idea is still possible.  For example, A.txt could have been compared
> > to source/some-module/A.txt.  And I don't do anything in the final
> > "full matrix" stage to avoid re-comparing those two files again.
> > However, it is worth noting that A.txt will have been compared to at
> > most one other file, not N files.
>
> Sorry, but where does this "at most one other file" come from?  "It
> is rare to remove source/some-other-module/A.txt at the same time
> while the above is happening"?  If so, yes, that sounds like a
> sensible thing.

It comes from the current implementation.  If both src/module1/A.txt
and src/module2/A.txt were removed, then I don't have a unique 'A.txt'
that was deleted.  In such a case, all 'A.txt' files are excluded from
this optimization step -- both sources and destinations.  This
optimization only kicks in for basenames where there was exactly one
of them deleted somewhere, and exactly one of them added somewhere.

One could consider trying to compare all deleted 'A.txt' files with
all added 'A.txt' files.  I tried that, but it interacts badly with
optimization batch 9 and tossed it aside; I will not be submitting
such a method.

Naturally, this "unique basename" limitation presents problems for
file basenames like .gitignore, Makefile, build.gradle, or even
ObjectFactory.java, setup.c, etc. that tend to appear in several
locations throughout the tree.  As of this series, we have to fall
back to the full N x M matrix comparison to detect the renames for
such non-unique basenames.  The next series I am planning on
submitting will do something smarter for those files while still
ensuring that the preliminary step only compares any given file to at
most one other file.

> > 1) The most expensive comparison is the first one,...
>
> Yes. we keep the spanhash table across comparison.
>
> > 2) This would only save us from at most N comparisons in the N x M
> > matrix (since no file in this optimization is compared to more than
> > one other)
>
> True, but doesn't rename_src[] and rename_dst[] entries have the
> original pathname, where you can see A.txt and some-module/A.txt
> share the same filename part cheaply?  Is that more expensive than
> comparing spanhash tables?

For a small enough number of renames, no, it won't be more expensive.
But I don't want to optimize for low numbers of renames; the code is
fast enough for those.  And with a large enough number of renames,
yes, the basename comparisons in aggregate will be more expensive than
the number of spanhash array comparisons you avoid redoing.  The
preliminary step from this optimization at most only did O(N) spanhash
comparisons, because it would only compare any given file to at most
one other file.  (Any file that didn't have a matching basename on the
other side, or wasn't a unique basename, wouldn't have been compared
to anything.)  So, at most, we save O(N) spanhash comparisons.  In
order to avoid repeating those O(N) comparisons, you are adding O(NxM)
basename comparisons.  Once M is large enough, the O(NxM) basename
comparisons you added will be more expensive than the O(N) spanhash
comparisons you are saving.  Recall that my testcase used N and M of
approximately 26,000.  The real world repository I based it on had
over 30K renames.  And if I know of a repository with 30K renames with
only 50K files (at the time), I think we shouldn't be using that as an
upper bound either.

> Having asked these, I do think it is not worth pursuing, especially
> because I agree with Derrick that this "we see a new file whose name
> is the same as the one deleted from a different directory, so if
> they are similar enough, let's declare victory and not bother
> finding a better match" needs to be used with higher similarity bar
> than the normal one.

You say you agree with Stolee, but that's not what I understood Stolee
as saying at all.  He said he thought it wasn't worth the complication
of trying to use a different value for the basename minimum similarity
than the normal minimum similarity, at least not at first.  He
suggested we could add that in the future at some time, and then
talked a bit about how to add it if we do.

> If -M60 says "only consider pairs that are
> with at least 60% similarity index", finding one at 60% similarity
> and stopping at it only because the pair looks to move a file from
> one directory to another directory while retaining the same name,
> rejecting other paring, feels a bit too crude a heuristics.  And if
> we require higher similarity levels to short-circuit, the later full
> matrix stage won't be helped with "we must have already rejected"
> logic.  A.txt and some-module/A.txt may not have been similar enough
> to short-circuit and reject others in the earlier part, but the
> full-matrix part work at a lower bar, which may consider the pair
> good enough to keep as match candidates.

I'm sorry, but I'm not following you.  As best I can tell, you seem to
be suggesting that if we were to use a higher similarity bar for
checking same-basename files, that such a difference would end up not
accelerating the diffcore-rename algorithm at all?  Is that correct?
If not, I don't understand what you're saying.

If by chance my restatement is an accurate summary of your claim, then
allow me to disabuse you of your assumptions here; you're way off.  I
wrote find_basename_matches() to take a similarity score, so that it
could take a different one than is used elsewhere in the algorithm.  I
didn't think it was necessary, but it does make it easy to test your
hypothesis.  Here are some results:

Original, not using basename-guided rename detection:
    no-renames:       13.815 s ±  0.062 s
    mega-renames:   1799.937 s ±  0.493 s
    just-one-mega:    51.289 s ±  0.019 s

Using basename_min_score = minimum_score, i.e. 50%:
    no-renames:       13.428 s ±  0.119 s
    mega-renames:    172.137 s ±  0.958 s
    just-one-mega:     5.154 s ±  0.025 s

Using basename_min_score = 0.5 * (minimum_score + MAX_SCORE), i.e. 75%:
    no-renames:       13.543 s ±  0.094 s
    mega-renames:    189.598 s ±  0.726 s
    just-one-mega:     5.647 s ±  0.016 s

Using basename_min_score = 0.1 * (minimum_score + 9*MAX_SCORE), i.e. 95%:
    no-renames:       13.733 s ±  0.086 s
    mega-renames:    353.479 s ±  2.574 s
    just-one-mega:    10.351 s ±  0.030 s


So, when we bump the bar for basename similarity much past your
hypothetical 60% all the way up to 75% (i.e. just take a simple
average of minimum score and MAX_SCORE), we see almost identical
speedups (factor of 9 or so instead of 10 or so).  And even when we go
to the extreme of requiring a 95% or greater similarity in order to
pair up basenames, we still see a speed-up factor of 5-6; that's less
than the factor of 10 we could get by allowing basename_min_score to
match minimum_score at 50%, but it's still _most_ of the speedup.

Granted, this is just one testcase.  It's going to vary a lot between
testcases and repositories and how far back or forward in history you
are rebasing or merging, etc.  The fact that this particular testcase
was obtained by doing a "git mv drivers/ pilots/" in the linux kernel
and then finding a topic to rebase across that rename boundary makes
it a bit special.  But....even if we were only able to pair 50% of the
files due to basename similarity, that would save 75% of the spanhash
comparisons.  Even in git.git where only 16% of the renames change the
basename, if we could pair up 16% of the files based on basenames it'd
save roughly 30% of the spanhash comparisons.  The numbers are
probably a lot better than either of those, though.  Since 76% of
renames in the linux kernel don't change the basename, 64% of the
renames in gcc don't, over 79% of the renames in the gecko repository
don't, and over 89% of the renames in the WebKit repository don't, I
think this is a really valuable heuristic to use.

  reply	other threads:[~2021-02-08 22:02 UTC|newest]

Thread overview: 71+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-02-06 22:52 [PATCH 0/3] Optimization batch 7: use file basenames to guide rename detection Elijah Newren via GitGitGadget
2021-02-06 22:52 ` [PATCH 1/3] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
2021-02-06 22:52 ` [PATCH 2/3] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-06 22:52 ` [PATCH 3/3] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-07 14:38   ` Derrick Stolee
2021-02-07 19:51     ` Junio C Hamano
2021-02-08  8:38       ` Elijah Newren
2021-02-08 11:43         ` Derrick Stolee
2021-02-08 16:25           ` Elijah Newren
2021-02-08 17:37         ` Junio C Hamano
2021-02-08 22:00           ` Elijah Newren [this message]
2021-02-08 23:43             ` Junio C Hamano
2021-02-08 23:52               ` Elijah Newren
2021-02-08  8:27     ` Elijah Newren
2021-02-08 11:31       ` Derrick Stolee
2021-02-08 16:09         ` Elijah Newren
2021-02-07  5:19 ` [PATCH 0/3] Optimization batch 7: use file basenames to guide rename detection Junio C Hamano
2021-02-07  6:05   ` Elijah Newren
2021-02-09 11:32 ` [PATCH v2 0/4] " Elijah Newren via GitGitGadget
2021-02-09 11:32   ` [PATCH v2 1/4] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
2021-02-09 13:17     ` Derrick Stolee
2021-02-09 16:56       ` Elijah Newren
2021-02-09 17:02         ` Derrick Stolee
2021-02-09 17:42           ` Elijah Newren
2021-02-09 11:32   ` [PATCH v2 2/4] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-09 13:25     ` Derrick Stolee
2021-02-09 17:17       ` Elijah Newren
2021-02-09 17:34         ` Derrick Stolee
2021-02-09 11:32   ` [PATCH v2 3/4] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-09 13:33     ` Derrick Stolee
2021-02-09 17:41       ` Elijah Newren
2021-02-09 18:59         ` Junio C Hamano
2021-02-09 11:32   ` [PATCH v2 4/4] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
2021-02-09 12:59     ` Derrick Stolee
2021-02-09 17:03       ` Junio C Hamano
2021-02-09 17:44         ` Elijah Newren
2021-02-10 15:15   ` [PATCH v3 0/5] Optimization batch 7: use file basenames to guide " Elijah Newren via GitGitGadget
2021-02-10 15:15     ` [PATCH v3 1/5] t4001: add a test comparing basename similarity and content similarity Elijah Newren via GitGitGadget
2021-02-13  1:15       ` Junio C Hamano
2021-02-13  4:50         ` Elijah Newren
2021-02-13 23:56           ` Junio C Hamano
2021-02-14  1:24             ` Elijah Newren
2021-02-14  1:32               ` Junio C Hamano
2021-02-14  3:14                 ` Elijah Newren
2021-02-10 15:15     ` [PATCH v3 2/5] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
2021-02-13  1:32       ` Junio C Hamano
2021-02-10 15:15     ` [PATCH v3 3/5] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-13  1:48       ` Junio C Hamano
2021-02-13 18:34         ` Elijah Newren
2021-02-13 23:55           ` Junio C Hamano
2021-02-14  3:08             ` Elijah Newren
2021-02-10 15:15     ` [PATCH v3 4/5] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-13  1:49       ` Junio C Hamano
2021-02-10 15:15     ` [PATCH v3 5/5] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
2021-02-10 16:41       ` Junio C Hamano
2021-02-10 17:20         ` Elijah Newren
2021-02-11  8:15     ` [PATCH v4 0/6] Optimization batch 7: use file basenames to guide " Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 1/6] t4001: add a test comparing basename similarity and content similarity Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 2/6] diffcore-rename: compute basenames of all source and dest candidates Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 3/6] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 4/6] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 5/6] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
2021-02-11  8:15       ` [PATCH v4 6/6] merge-ort: call diffcore_rename() directly Elijah Newren via GitGitGadget
2021-02-13  1:53       ` [PATCH v4 0/6] Optimization batch 7: use file basenames to guide rename detection Junio C Hamano
2021-02-14  7:51       ` [PATCH v5 " Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 1/6] t4001: add a test comparing basename similarity and content similarity Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 2/6] diffcore-rename: compute basenames of source and dest candidates Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 3/6] diffcore-rename: complete find_basename_matches() Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 4/6] diffcore-rename: guide inexact rename detection based on basenames Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 5/6] gitdiffcore doc: mention new preliminary step for rename detection Elijah Newren via GitGitGadget
2021-02-14  7:51         ` [PATCH v5 6/6] merge-ort: call diffcore_rename() directly Elijah Newren via GitGitGadget

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CABPp-BH_DWEE-3M96e=PPNwDqeYPaax9s1kBDhS8a6GtxsW=Mg@mail.gmail.com' \
    --to=newren@gmail.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=gitster@pobox.com \
    --cc=jonathantanmy@google.com \
    --cc=me@ttaylorr.com \
    --cc=peff@peff.net \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).