git@vger.kernel.org mailing list mirror (one of many)
From: Elijah Newren <newren@gmail.com>
To: Jeff King <peff@peff.net>
Cc: "Git Mailing List" <git@vger.kernel.org>,
	"Derrick Stolee" <stolee@gmail.com>,
	"Junio C Hamano" <gitster@pobox.com>,
	"Linus Torvalds" <torvalds@linux-foundation.org>,
	"Jonathan Tan" <jonathantanmy@google.com>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Subject: Re: [RFC] Bump {diff,merge}.renameLimit ?
Date: Mon, 12 Jul 2021 13:23:38 -0700
Message-ID: <CABPp-BEdUmxXVCx=5pb0=LN-0YBtrEB-wngPC5vys6fjVctgaQ@mail.gmail.com>
In-Reply-To: <YOx43fqGyQMy7+tR@coredump.intra.peff.net>

On Mon, Jul 12, 2021 at 10:16 AM Jeff King <peff@peff.net> wrote:
>
> tl;dr Bumping the limit seems like a good idea to me.

:-)

> On Sat, Jul 10, 2021 at 05:28:43PM -0700, Elijah Newren wrote:
>
> > * My colleagues happily raised merge.renameLimit beyond 32767 when the
> >   artificial cap was removed.  10 minute waits were annoying, but much
> >   less so than having to manually cherry-pick commits (especially given
> >   the risk of getting it wrong).[13]
>
> One tricky thing here is that waiting 10 minutes may be worth it _if the
> rename detection finds something_. If it doesn't, then it's just
> annoying.
>
> I do think progress meters help a bit there, because then at least the
> user understands what's going on. I'll go into more detail in the
> sub-thread there. :)

Another thing that helps here (at least for merges rather than diffs)
is that we can determine a priori whether individual renames will have
no effect on the merge result and just exclude those particular
renames from the detection machinery.  Well, at least we can do this
with the new merge-ort backend.  That should dramatically reduce the
annoyance factor once we make it the default.

...
> Yeah, it is definitely time to revisit the default numbers. I think at
> one point we talked about letting it run for N wallclock seconds before
> giving up, but we've been hesitant to introduce that kind of time-based
> limit, because it ends up with non-deterministic results (plus you don't
> realize you're not going to finish until you've already wasted a bunch
> of time, whereas the static limits can avoid even beginning the work).

Yeah, I'm kinda glad we didn't go that route; seems problematic to me.

...
> I don't remember my methodology at this point, but perhaps it was based
> on blobs in the graph, not just one tree, like:
>
>   $ git rev-list --objects v2.6.25 |
>     git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
>     awk '
>       /^blob/ { sum += $2; total += 1 }
>       END { print sum / total }
>     '
>   27535.8
>
> I suspect the difference versus a single tree is that there is a
> quadratic-ish property going on with file size: the bigger the file, the
> more likely it is to be touched (so total storage is closer to bytes^2).

Ah, gotcha.  That makes sense.

> Looking at single-tree blob sizes is probably better though, as rename
> detection will happen between two single trees.

Agreed.

> > * I think the median file size is a better predictor of rename
> >   performance than mean file size, and median file size is ~2.5x smaller
> >   than the mean[18].
>
> There you might get hit with the quadratic-update thing again, though.
> The big files are more likely to be touched, so could be weighted more
> (though are they more likely to have been added/deleted/renamed? Who
> knows).

I'll agree that big files are more likely to be updated, but I don't
think renames are weighted towards bigger files.  In fact, I wrote a
quick script to look at the sizes of all the renamed files in the
history of v2.6.25, and the mean (8034.1) and median (3866) sizes of
the renamed files in that history are comparable to the mean
(11150.3) and median (4198) file sizes in the v2.6.25 tree.

I re-did the calculations using v5.5, and found that the mean
(12495.1) and median (3702) sizes of renamed files in all Linux
history up to that point were again a bit less than the mean (13449.2)
and median (3860) size of a file in the final v5.5 tree.

Granted, this is a bit hand-wavy (what about creations or deletions?
Is there too much bias from measuring rename sizes over all history,
which I needed in order to get enough data for statistics, while
taking regular file sizes only from the end tree?), but I think it
provides a pretty good first-order approximation suggesting that
mean/median sizes of files involved in rename detection will be
similar to the mean/median sizes of other files within the relevant
trees.
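
For anyone who wants to reproduce single-tree numbers like these, here
is a rough sketch. It is not the script used for the figures above;
`size_stats` is a made-up helper name, and the sketch assumes the blob
size is the fourth whitespace-separated field of `git ls-tree -r -l`
output.

```shell
# size_stats: read one size (in bytes) per line on stdin,
# print the mean and median of those sizes.
size_stats() {
  sort -n | awk '
    { v[NR] = $1; sum += $1 }
    END {
      if (NR == 0) { print "no input"; exit 1 }
      # Odd count: middle element; even count: average of the two middle ones.
      median = (NR % 2) ? v[(NR + 1) / 2] : (v[NR / 2] + v[NR / 2 + 1]) / 2
      printf "mean=%.1f median=%.1f\n", sum / NR, median
    }'
}

# Hypothetical usage inside a clone of linux.git (skips submodule
# entries, whose size column is "-"):
#   git ls-tree -r -l v5.5 | awk '$2 == "blob" { print $4 }' | size_stats
```

Sorting first keeps the awk part trivial at the cost of holding every
size in memory, which is fine for a single tree.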

> I don't think file size matters all _that_ much, though, as it has a
> linear relationship to time spent. Whereas the number of entries is
> quadratic. And of course the whole experiment is ball-parking in the
> first place. We're looking for order-of-magnitude approximations, I'd
> think.

I agree that the number of entries is what's important; in fact,
that's why I think the median file size is more important than the
mean file size:

In the case of the linux kernel, since the mean is 2.5x bigger than
the median file size...
  => diffcore-rename checks file sizes before comparing content
  => Files more than 2x different in size can't be more than 50% similar
  => Therefore, files of the mean size will not be compared to files
     of the median size (or less)
  => Therefore, we automatically know that files of mean size will not
     be compared to more than half the files.
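
The size argument above can be sketched as a filter. This is a
simplification for illustration, assuming the default 50% similarity
threshold; `comparable_count` is a hypothetical helper, not how
diffcore-rename is actually structured.

```shell
# comparable_count SIZE: read one size per line on stdin and count how
# many are within a factor of 2 of SIZE, i.e. could still reach 50%
# similarity with a file of that size.
comparable_count() {
  awk -v s="$1" '
    {
      x = $1 + 0; t = s + 0          # force numeric comparison
      lo = (x < t) ? x : t
      hi = (x < t) ? t : x
      # A pair differing by more than 2x in size is skipped outright.
      if (2 * lo >= hi) n++
    }
    END { print n + 0 }'
}
```

With sizes 100, 1000, 4000, 5000, 9000 and 30000 on stdin and a SIZE
of 10000, only 5000 and 9000 pass the check, so it prints 2.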

> > * The feedback about the limit is better today than when we last changed
> >   the limits, and folks can configure a higher limit relatively easily.
> >   Many already have.
>
> I can't remember the last time I saw the limit kick in in practice, but
> then I don't generally work with super-large repos (and my workflows
> typically do not encourage merging across big segments of history).
> Nor do I remember the topic coming up on the list after the last bump.
> So maybe that means that people are happily bumping the limits
> themselves via config.

It might also mean that you're missing more emails than you used to,
or just forgot them.  :-)

e.g.:
https://lore.kernel.org/git/20171129201154.192379-1-jonathantanmy@google.com/
https://lore.kernel.org/git/20171113201600.24878-1-newren@gmail.com/

But I do certainly suspect it's come up less often than it would have before.
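
For reference, raising the limits by hand is a small config change.
The value below is only a placeholder, not a proposed default:

```ini
[diff]
	renameLimit = 10000
[merge]
	renameLimit = 10000
```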

> But I don't think that's really an argument against at least a moderate
> bump. If it helps even a few people avoid having to learn about the
> config, that's time saved. And it's a trivial code change on our end.

:-)
