git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Elijah Newren <newren@gmail.com>
To: Derrick Stolee <stolee@gmail.com>
Cc: Elijah Newren via GitGitGadget <gitgitgadget@gmail.com>,
	Git Mailing List <git@vger.kernel.org>,
	Derrick Stolee <dstolee@microsoft.com>,
	Jonathan Tan <jonathantanmy@google.com>,
	Taylor Blau <me@ttaylorr.com>, Junio C Hamano <gitster@pobox.com>,
	Jeff King <peff@peff.net>, Karsten Blees <blees@dcon.de>
Subject: Re: [PATCH 2/2] diffcore-rename: filter rename_src list when possible
Date: Wed, 3 Feb 2021 09:12:47 -0800	[thread overview]
Message-ID: <CABPp-BG8mu4qGQCz5sMWqJHSb9o1ipk9FgxUS34SbArhQcPutQ@mail.gmail.com> (raw)
In-Reply-To: <13feb106-c3a7-a26d-0e6e-013aa45c58d4@gmail.com>

On Wed, Feb 3, 2021 at 4:00 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 2/3/2021 12:49 AM, Elijah Newren via GitGitGadget wrote:
> > From: Elijah Newren <newren@gmail.com>
> >
> > We have to look at each entry in rename_src a total of rename_dst_nr
> > times.  When we're not detecting copies, any exact renames or ignorable
> > rename paths will just be skipped over.  While checking that these can
> > be skipped over is a relatively cheap check, it's still a waste of time
> > to do that check more than once, let alone rename_dst_nr times.  When
> > rename_src_nr is a few thousand times bigger than the number of relevant
> > sources (such as when cherry-picking a commit that only touched a
> > handful of files, but from a side of history that has different names
> > for some high level directories), this time can add up.
> >
> > First make an initial pass over the rename_src array and move all the
> > relevant entries to the front, so that we can iterate over just those
> > relevant entries.
> >
> > For the testcases mentioned in commit 557ac0350d ("merge-ort: begin
> > performance work; instrument with trace2_region_* calls", 2020-10-28),
> > this change improves the performance as follows:
> >
> >                             Before                  After
> >     no-renames:       14.119 s ±  0.101 s    13.815 s ±  0.062 s
> >     mega-renames:   1802.044 s ±  0.828 s  1799.937 s ±  0.493 s
> >     just-one-mega:    51.391 s ±  0.028 s    51.289 s ±  0.019 s
> >
> > Signed-off-by: Elijah Newren <newren@gmail.com>
> > ---
> >  diffcore-rename.c | 67 ++++++++++++++++++++++++++++++++++++++++-------
> >  1 file changed, 57 insertions(+), 10 deletions(-)
> >
> > diff --git a/diffcore-rename.c b/diffcore-rename.c
> > index e3047da3aaf..2a8e7b84b9c 100644
> > --- a/diffcore-rename.c
> > +++ b/diffcore-rename.c
> > @@ -454,6 +454,55 @@ static int find_renames(struct diff_score *mx, int dst_cnt, int minimum_score, i
> >       return count;
> >  }
> >
> > +static int remove_unneeded_paths_from_src(int num_src,
> > +                                       int detecting_copies)
> > +{
> > +     int i, new_num_src;
> > +
> > +     /*
> > +      * Note on reasons why we cull unneeded sources but not destinations:
> > +      *   1) Pairings are stored in rename_dst (not rename_src), which we
> > +      *      need to keep around.  So, we just can't cull rename_dst even
> > +      *      if we wanted to.  But doing so wouldn't help because...
> > +      *
> > +      *   2) There is a matrix pairwise comparison that follows the
> > +      *      "Performing inexact rename detection" progress message.
> > +      *      Iterating over the destinations is done in the outer loop,
> > +      *      hence we only iterate over each of those once and we can
> > +      *      easily skip the outer loop early if the destination isn't
> > +      *      relevant.  That's only one check per destination path to
> > +      *      skip.
> > +      *
> > +      *      By contrast, the sources are iterated in the inner loop; if
> > +      *      we check whether a source can be skipped, then we'll be
> > +      *      checking it N separate times, once for each destination.
> > +      *      We don't want to have to iterate over known-not-needed
> > +      *      sources N times each, so avoid that by removing the sources
> > +      *      from rename_src here.
> > +      */
> > +     if (detecting_copies)
> > +             return num_src; /* nothing to remove */
> > +     if (break_idx)
> > +             return num_src; /* culling incompatbile with break detection */
>
> While the comments here are certainly descriptive, they could probably
> be sidelined to "return when culling is incompatible" with this before
> the big comment:

Yeah, I guess at this stage a bigger explanation isn't needed.  But I
think I'd end up just adding these comments back in later at an
unrelated time if I were to take them out at this point.  By the time
merge-ort is finished, I'll have over a dozen additional
optimizations, many of them in diffcore-rename.c.  There will be files
for which four different optimizations are all applicable, but they
are mutually exclusive so we have to pick one of the four and exclude
the other three.  Best results come from knowing which one provides
the best speedup and why (and which is second if the best doesn't
apply, etc.).  So what I'll present as four optimizations originally
felt to me like 10 or so different ones, because I didn't start
knowing the best, meaning I kept adding later additional optimizations
that fed the appropriate data in all the right places so that I could
tweak which one(s) took priority and get the best speedups.  You'll
get the cleaner version to review that pretends I knew the right
answers all along.  (Not to mention that I think I tried a lot more
optimization ideas that failed than the number I tried that succeeded,
and longer comments of mine sometimes highlight details that could
have prevented the need to attempt some of the failed ideas.)  Anyway,
long story short: at some point along the way, the more detailed
description felt very appropriate, but I don't have a clear breakpoint
for where that is and the longer description most logically goes along
with this early patch so I squashed it in here.  *shrug*

>         if (detecting_copies || break_idx)
>                 return num_src;
>
> This is a very minor nit pick, though.

Moving these conditions above the comment makes sense, yes.  I guess
combining the two if conditions here could also be done, though the
reasoning is different as noted by the comments so I would have to
toss the comments.  Since a later optimization will need to add an
additional condition that is only relevant for one of the two reasons,
I would just need to split again later and I think I would want the
comments back at that time.  So I think the split and keeping the
comments seems more natural to me.

> > +
> > +     for (i = 0, new_num_src = 0; i < num_src; i++) {
> > +             /*
> > +              * renames are stored in rename_dst, so if a rename has
> > +              * already been detected using this source, we can just
> > +              * remove the source knowing rename_dst has its info.
> > +              */
> > +             if (rename_src[i].p->one->rename_used)
> > +                     continue;
> > +
> > +             if (new_num_src < i)
> > +                     memcpy(&rename_src[new_num_src], &rename_src[i],
> > +                            sizeof(struct diff_rename_src));
>
> 'struct diff_rename_src' is currently just a pointer and a short, so
> this is a very small amount of data, but this conditional move doesn't
> hurt.
>
> > +             new_num_src++;
> > +     }
> > +
> > +     return new_num_src;
> > +}
> > +
>
>
> >  void diffcore_rename(struct diff_options *options)
> >  {
> >       int detect_rename = options->detect_rename;
> > @@ -463,10 +512,11 @@ void diffcore_rename(struct diff_options *options)
> >       struct diff_score *mx;
> >       int i, j, rename_count, skip_unmodified = 0;
> >       int num_destinations, dst_cnt;
> > -     int num_sources;
> > +     int num_sources, want_copies;
> >       struct progress *progress = NULL;
> >
> >       trace2_region_enter("diff", "setup", options->repo);
> > +     want_copies = (detect_rename == DIFF_DETECT_COPY);
>
> I was looking at this and thinking "why wasn't this in the previous
> patch?" I see that you delete the "if (detect_rename != DIFF_DETECT_COPY)"
> below in favor of the call to the helper.
>
> _If_ you were re-rolling, then maybe it would be slightly cleaner
> to add 'want_copies' to the previous patch?

Oh, yeah, I think that does make sense.  And you highlight some good
reasons to re-roll below, so I'll do that.

> >       /*
> > -      * Calculate how many renames are left (but all the source
> > -      * files still remain as options for rename/copies!)
> > +      * Calculate how many renames are left
> >        */
> >       num_destinations = (rename_dst_nr - rename_count);
> > -     num_sources = rename_src_nr;
> > -     if (detect_rename != DIFF_DETECT_COPY)
> > -             num_sources -= rename_count;
> > +     num_sources = remove_unneeded_paths_from_src(rename_src_nr, want_copies);
>
> So here we update the local variable...
>
> >
> >       /* All done? */
> >       if (!num_destinations || !num_sources)
> > @@ -573,13 +620,13 @@ void diffcore_rename(struct diff_options *options)
> >               for (j = 0; j < NUM_CANDIDATE_PER_DST; j++)
> >                       m[j].dst = -1;
> >
> > -             for (j = 0; j < rename_src_nr; j++) {
> > +             for (j = 0; j < num_sources; j++) {
>
> ...and drop the use of the static global. However, don't we need to
> update the global itself? There are other loops limited by
> rename_src_nr that should only work on this smaller list, right?
>
> In particular, it seems that the prefetch() method would benefit
> from looking at this smaller list. It fetches the blob contents in
> a partial clone inside of estimate_similarity(). When there are a
> lot of exact renames but even one possible inexact rename, it seems
> to currently download the content for all of the blobs including the
> exactly-matched ones.

Ah, indeed.  The prefetch() method didn't exist at the time I wrote
this patch, but there's always the possibility of other reasons.  It
turns out I did make rename_src_nr match num_sources, but that diff
hunk got placed in a later series.  I agree with you that it should go
here instead.

> For interesting tests around this prefetch() call, I think the
> 'diff with rename detection batches blobs' test in t4067 should help.
>
> >                       struct diff_filespec *one = rename_src[j].p->one;
> >                       struct diff_score this_src;
> >
> > -                     if (one->rename_used &&
> > -                         detect_rename != DIFF_DETECT_COPY)
> > -                             continue;
> > +                     assert(!one->rename_used ||
> > +                            detect_rename == DIFF_DETECT_COPY ||
> > +                            break_idx);
>
> Looks like you could use 'want_copies' here (and in the previous
> patch).

Yep, good catch.  I'll make the change.

> Thanks,
> -Stolee

Thanks for reviewing!

  parent reply	other threads:[~2021-02-03 17:14 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-02-03  5:49 [PATCH 0/2] Optimization batch 6: make full use of exact renames Elijah Newren via GitGitGadget
2021-02-03  5:49 ` [PATCH 1/2] diffcore-rename: no point trying to find a match better than exact Elijah Newren via GitGitGadget
2021-02-03 11:44   ` Derrick Stolee
2021-02-03 16:31     ` Elijah Newren
2021-02-03 18:46     ` Junio C Hamano
2021-02-03 19:10       ` Elijah Newren
2021-02-03  5:49 ` [PATCH 2/2] diffcore-rename: filter rename_src list when possible Elijah Newren via GitGitGadget
     [not found]   ` <13feb106-c3a7-a26d-0e6e-013aa45c58d4@gmail.com>
2021-02-03 17:12     ` Elijah Newren [this message]
2021-02-03 19:12   ` Junio C Hamano
2021-02-03 19:19     ` Elijah Newren
2021-02-03 20:03 ` [PATCH v2 0/2] Optimization batch 6: make full use of exact renames Elijah Newren via GitGitGadget
2021-02-03 20:03   ` [PATCH v2 1/2] diffcore-rename: no point trying to find a match better than exact Elijah Newren via GitGitGadget
2021-02-03 20:03   ` [PATCH v2 2/2] diffcore-rename: filter rename_src list when possible Elijah Newren via GitGitGadget
2021-02-13  1:04     ` Junio C Hamano
2021-02-13  4:24       ` Elijah Newren
2021-02-13  1:06     ` Junio C Hamano
2021-02-13  4:43       ` Elijah Newren
2021-02-03 21:56   ` [PATCH v2 0/2] Optimization batch 6: make full use of exact renames Junio C Hamano
2021-02-03 23:06     ` Elijah Newren
2021-02-03 23:26       ` Junio C Hamano
2021-02-03 23:36       ` Jeff King
2021-02-04  0:05         ` Elijah Newren
2021-02-14  7:34   ` [PATCH v3 " Elijah Newren via GitGitGadget
2021-02-14  7:35     ` [PATCH v3 1/2] diffcore-rename: no point trying to find a match better than exact Elijah Newren via GitGitGadget
2021-02-14  7:35     ` [PATCH v3 2/2] diffcore-rename: filter rename_src list when possible Elijah Newren via GitGitGadget

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CABPp-BG8mu4qGQCz5sMWqJHSb9o1ipk9FgxUS34SbArhQcPutQ@mail.gmail.com \
    --to=newren@gmail.com \
    --cc=blees@dcon.de \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=gitster@pobox.com \
    --cc=jonathantanmy@google.com \
    --cc=me@ttaylorr.com \
    --cc=peff@peff.net \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).