[RFC] Bump {diff,merge}.renameLimit ?

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* [RFC] Bump {diff,merge}.renameLimit ?
@ 2021-07-11  0:28 Elijah Newren
  2021-07-11 16:42 ` Ævar Arnfjörð Bjarmason
  2021-07-12 17:16 ` Jeff King
  0 siblings, 2 replies; 9+ messages in thread
From: Elijah Newren @ 2021-07-11  0:28 UTC (permalink / raw)
  To: Git Mailing List
  Cc: Jeff King, Derrick Stolee, Junio C Hamano, Linus Torvalds,
	Jonathan Tan, Ævar Arnfjörð Bjarmason

[CC'ing folks I've seen comment on these limits.]

Hi everyone,

I'm considering bumping {diff,merge}.renameLimit, which control the
quadratic portion of rename/copy detection.  Should they be bumped?
If so, moderately higher, or much higher?

I lean towards a moderate bump for diff.renameLimit, and preferably
more than just a moderate bump for merge.renameLimit.  I have
calculations for what "moderate" translates to, based on a number of
assumptions.  But there's several reasons to break with past
guideposts for how these limits were picked.  See below for various
arguments in each of the directions.

So...thoughts?

Thanks,
Elijah

==> Arguments for bumping MUCH higher:

* Linus said the real reason for the renameLimit was excessive memory
  usage (not perf)[1].  But Junio dropped the memory requirements to
  linear in commit 6d24ad971c81 (Optimize rename detection for a huge
  diff, 2008-01-29)

* Linus oft recommends setting diff.renameLimit=0 [2,3,4,5,6,7,8,9,10,11]
  (which maps to 32767 [12]).

* My colleagues happily raised merge.renameLimit beyond 32767 when the
  artificial cap was removed.  10 minute waits were annoying, but much
  less so than having to manually cherry-pick commits (especially given
  the risk of getting it wrong).[13]

==> Arguments for bumping MODERATELY higher:

* We have bumped the limits twice before (in 2008 and 2011), both times
  stating performance as the limiting factor.  Processors are faster
  today than then.[14,15]

* Peff's computations for performance in the last two bumps used "the
  average size of a file in the linux-2.6 repository"[16], for which I
  assume average==mean, but the file selected was actually ~2x larger
  than the mean file size according to my calculations[17].

* I think the median file size is a better predictor of rename
  performance than mean file size, and median file size is ~2.5x smaller
  than the mean[18].

==> Arguments for not bumping either limit:

* The feedback about the limit is better today than when we last changed
  the limits, and folks can configure a higher limit relatively easily.
  Many already have.

* This issue won't come up nearly as much any more once we switch the
  default merge backend to ort, due to my performance work[19] (many
  renames can be outright skipped without affecting merge quality, and
  many others can be detected in linear time -- the cherry-picks that
  used to require merge.renameLimit=48941 and took 10 minutes can now
  complete in less than a second with the default merge.renameLimit of
  1000.)

* It'd take too long to read all the footnotes in this email, so screw
  it.  :-)

==> Footnotes:

[1] https://lore.kernel.org/git/AANLkTimKp+Z==QXJg2Bagot+Df4REeANuxwVi7bpPCXr@mail.gmail.com/

[2] https://lore.kernel.org/git/alpine.LFD.0.999.0710161030430.6887@woody.linux-foundation.org/
 2007-10-16
[3] https://lore.kernel.org/git/alpine.LFD.2.00.0811032021210.3419@nehalem.linux-foundation.org/
 2008-11-03
[4] https://lore.kernel.org/git/AANLkTimKp+Z==QXJg2Bagot+Df4REeANuxwVi7bpPCXr@mail.gmail.com/
 2011-02-18
[5] https://lore.kernel.org/lkml/alpine.LFD.0.999.0710111944120.6887@woody.linux-foundation.org/
 2007-10-11
[6] https://lore.kernel.org/lkml/alpine.LFD.1.10.0808052157400.15995@nehalem.linux-foundation.org/
 2008-08-05
[7] https://lore.kernel.org/lkml/alpine.LFD.2.01.0909160813400.4950@localhost.localdomain/
  2009-09-16
[8] https://lore.kernel.org/lkml/CA+55aFw1GVvszqoC_f0RAvG5t1xj0CSYLhLU=y0gQ+_54Gsomw@mail.gmail.com/
 2011-10-25
[9] https://lore.kernel.org/lkml/CA+55aFyWefZ1jJLMJKXhy0Qif-iBmjG6n-evcbvkbWS5mDrs0g@mail.gmail.com/
 2015-02-16
[10] https://lore.kernel.org/lkml/CA+55aFxODGv7-AvnqFmxrXBcS2w0XzHuZ7UuRi3EMQz4-oeLJA@mail.gmail.com/
 2018-04-11
[11] https://lore.kernel.org/lkml/CAHk-=wg=CTtNrxPeFzkDw053dY3urchiyxevHnUXHhTGbK=9OQ@mail.gmail.com/
 2020-06-03

[12] 89973554b52c (diffcore-rename: make diff-tree -l0 mean -l<large>,
     2017-11-29)

[13] https://lore.kernel.org/git/20171110173956.25105-3-newren@gmail.com/

[14] 50705915eae8 (bump rename limit defaults, 2008-04-30)
[15] 92c57e5c1d29 (bump rename limit defaults (again), 2011-02-19)

[16] https://lore.kernel.org/git/20080211113516.GB6344@coredump.intra.peff.net/

[17] Calculated and compared as follows (num files, mean size, size Peff used):
  $ git ls-tree -rl v2.6.25 | wc -l
  23810
  $ git ls-tree -rl v2.6.25 | awk '{sum += $4} END{print sum/23810}'
  11150.3
  $ git show v2.6.25:arch/m68k/Kconfig | wc -c
  20977

[18] Calculated as 4198 as follows (note: 11905 = 23810/2):
  $ git ls-tree -rl v2.6.25 | sort -n -k 4 | head -n 11905 | tail -n 1
  100644 blob 29510dc515109ad5dd8a16b5936f1f6086ae417c    4198
Documentation/lguest/lguest.txt

[19] See "Overall Results" from https://github.com/gitgitgadget/git/pull/990

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC] Bump {diff,merge}.renameLimit ?
  2021-07-11  0:28 [RFC] Bump {diff,merge}.renameLimit ? Elijah Newren
@ 2021-07-11 16:42 ` Ævar Arnfjörð Bjarmason
  2021-07-12 15:23   ` Elijah Newren
  2021-07-12 17:16 ` Jeff King
  1 sibling, 1 reply; 9+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-07-11 16:42 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Git Mailing List, Jeff King, Derrick Stolee, Junio C Hamano,
	Linus Torvalds, Jonathan Tan

On Sat, Jul 10 2021, Elijah Newren wrote:

> I'm considering bumping {diff,merge}.renameLimit, which control the
> quadratic portion of rename/copy detection.  Should they be bumped?
> If so, moderately higher, or much higher?
>
> I lean towards a moderate bump for diff.renameLimit, and preferably
> more than just a moderate bump for merge.renameLimit.  I have
> calculations for what "moderate" translates to, based on a number of
> assumptions.  But there's several reasons to break with past
> guideposts for how these limits were picked.  See below for various
> arguments in each of the directions.
>
> So...thoughts?

I think the most relevant is something you didn't state: That when this
limit was introduced (well, diff.*, not merge.*) there was no progress
output in git.

We should err entirely on producing consistent and predictable results,
and not change how git works when we it hits some arbitrary limit. To
the extent that this is needed it's sufficient to opt-in to it, i.e. we
do/should show a progress bar, advice() etc. showing why we're doing
this much work, so those users can adjust the limit (or not).

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC] Bump {diff,merge}.renameLimit ?
  2021-07-11 16:42 ` Ævar Arnfjörð Bjarmason
@ 2021-07-12 15:23   ` Elijah Newren
  2021-07-12 16:48     ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 9+ messages in thread
From: Elijah Newren @ 2021-07-12 15:23 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Git Mailing List, Jeff King, Derrick Stolee, Junio C Hamano,
	Linus Torvalds, Jonathan Tan

Hi Ævar,

Thanks for reading and commenting.  You certainly brought a new angle
to the question...

On Sun, Jul 11, 2021 at 10:00 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> On Sat, Jul 10 2021, Elijah Newren wrote:
>
> > I'm considering bumping {diff,merge}.renameLimit, which control the
> > quadratic portion of rename/copy detection.  Should they be bumped?
> > If so, moderately higher, or much higher?
> >
> > I lean towards a moderate bump for diff.renameLimit, and preferably
> > more than just a moderate bump for merge.renameLimit.  I have
> > calculations for what "moderate" translates to, based on a number of
> > assumptions.  But there's several reasons to break with past
> > guideposts for how these limits were picked.  See below for various
> > arguments in each of the directions.
> >
> > So...thoughts?
>
> I think the most relevant is something you didn't state: That when this
> limit was introduced (well, diff.*, not merge.*) there was no progress
> output in git.

I am convinced that good progress output is very important.  I've
submitted multiple patches for progress output specifically for rename
detection[1]

However, I am not convinced that the lack of progress output in git
when this limit was introduced is the most relevant thing.  If it
were, then the lively thread when Peff posted his past series to both
introduce the progress output for rename detection and simultaneously
bump the limits probably would have spurred comments about not needing
both[2].

> We should err entirely on producing consistent and predictable results,
> and not change how git works when we it hits some arbitrary limit. To
> the extent that this is needed it's sufficient to opt-in to it, i.e. we
> do/should show a progress bar, advice() etc. showing why we're doing
> this much work, so those users can adjust the limit (or not).

So I've read and re-read your response multiple times, but I am still
not sure what you're advocating for.  I think you're either advocating
for rename detection to be turned off by default, or for a new
"unlimited" mode to be introduced and be the default (maybe even
redefining what the value of "0" means in order to implement this),
but I can't tell which.  Could you clarify?

[1] In particular:
   d6861d0258df (progress: fix progress meters when dealing with lots
of work, 2017-11-13)
   9268cf4a2ef6 (sequencer: show rename progress during cherry picks,
2017-11-13)
   81c4bf02964e (diffcore-rename: reduce jumpiness in progress
counters, 2020-12-11)

[2] See https://lore.kernel.org/git/20110219101936.GB20577@sigill.intra.peff.net/

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC] Bump {diff,merge}.renameLimit ?
  2021-07-12 15:23   ` Elijah Newren
@ 2021-07-12 16:48     ` Ævar Arnfjörð Bjarmason
  2021-07-12 17:39       ` Jeff King
  0 siblings, 1 reply; 9+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-07-12 16:48 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Git Mailing List, Jeff King, Derrick Stolee, Junio C Hamano,
	Linus Torvalds, Jonathan Tan


On Mon, Jul 12 2021, Elijah Newren wrote:

> Hi Ævar,
>
> Thanks for reading and commenting.  You certainly brought a new angle
> to the question...
>
> On Sun, Jul 11, 2021 at 10:00 AM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>>
>> On Sat, Jul 10 2021, Elijah Newren wrote:
>>
>> > I'm considering bumping {diff,merge}.renameLimit, which control the
>> > quadratic portion of rename/copy detection.  Should they be bumped?
>> > If so, moderately higher, or much higher?
>> >
>> > I lean towards a moderate bump for diff.renameLimit, and preferably
>> > more than just a moderate bump for merge.renameLimit.  I have
>> > calculations for what "moderate" translates to, based on a number of
>> > assumptions.  But there's several reasons to break with past
>> > guideposts for how these limits were picked.  See below for various
>> > arguments in each of the directions.
>> >
>> > So...thoughts?
>>
>> I think the most relevant is something you didn't state: That when this
>> limit was introduced (well, diff.*, not merge.*) there was no progress
>> output in git.
>
> I am convinced that good progress output is very important.  I've
> submitted multiple patches for progress output specifically for rename
> detection[1]
>
> However, I am not convinced that the lack of progress output in git
> when this limit was introduced is the most relevant thing.  If it
> were, then the lively thread when Peff posted his past series to both
> introduce the progress output for rename detection and simultaneously
> bump the limits probably would have spurred comments about not needing
> both[2].

I see I had a dead-end reply in that thread in 2011 (didn't spot the
progress patch there).

...

>> We should err entirely on producing consistent and predictable results,
>> and not change how git works when we it hits some arbitrary limit. To
>> the extent that this is needed it's sufficient to opt-in to it, i.e. we
>> do/should show a progress bar, advice() etc. showing why we're doing
>> this much work, so those users can adjust the limit (or not).

> So I've read and re-read your response multiple times, but I am still
> not sure what you're advocating for.  I think you're either advocating
> for rename detection to be turned off by default, or for a new
> "unlimited" mode to be introduced and be the default (maybe even
> redefining what the value of "0" means in order to implement this),
> but I can't tell which.  Could you clarify?

I'm advocating for an "unlimited" default as long as we have
progress/advice or whatever other output would direct users for whom
it's very slow to tweaking the setting (or not).

Anyway, yes some may disagree with this stance. I'm not saying it's
demonstrably obvious that we should have "unlimited".

I do think that it's much better if git behaves that way, i.e. that we
don't have arbitrary limits that completely change behavior once they're
tripped.

Better to spend more CPU, and if it's too slow for someone they can
tweak the limits.

> [1] In particular:
>    d6861d0258df (progress: fix progress meters when dealing with lots
> of work, 2017-11-13)
>    9268cf4a2ef6 (sequencer: show rename progress during cherry picks,
> 2017-11-13)
>    81c4bf02964e (diffcore-rename: reduce jumpiness in progress
> counters, 2020-12-11)
>
> [2] See https://lore.kernel.org/git/20110219101936.GB20577@sigill.intra.peff.net/


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC] Bump {diff,merge}.renameLimit ?
  2021-07-11  0:28 [RFC] Bump {diff,merge}.renameLimit ? Elijah Newren
  2021-07-11 16:42 ` Ævar Arnfjörð Bjarmason
@ 2021-07-12 17:16 ` Jeff King
  2021-07-12 20:23   ` Elijah Newren
  1 sibling, 1 reply; 9+ messages in thread
From: Jeff King @ 2021-07-12 17:16 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Git Mailing List, Derrick Stolee, Junio C Hamano, Linus Torvalds,
	Jonathan Tan, Ævar Arnfjörð Bjarmason

tl;dr Bumping the limit seems like a good idea to me.

On Sat, Jul 10, 2021 at 05:28:43PM -0700, Elijah Newren wrote:

> * My colleagues happily raised merge.renameLimit beyond 32767 when the
>   artificial cap was removed.  10 minute waits were annoying, but much
>   less so than having to manually cherry-pick commits (especially given
>   the risk of getting it wrong).[13]

One tricky thing here is that waiting 10 minutes may be worth it _if the
rename detection finds something_. If it doesn't, then it's just
annoying.

I do think progress meters help a bit there, because then at least the
user understands what's going on. I'll go into more detail in the
sub-thread there. :)

> ==> Arguments for bumping MODERATELY higher:
> 
> * We have bumped the limits twice before (in 2008 and 2011), both times
>   stating performance as the limiting factor.  Processors are faster
>   today than then.[14,15]

Yeah, it is definitely time to revisit the default numbers. I think at
one point we talked about letting it run for N wallclock seconds before
giving up, but we've been hesitant to introduce that kind of time-based
limit, because it ends up with non-deterministic results (plus you don't
realize you're not going to finish until you've already wasted a bunch
of time, whereas the static limits can avoid even beginning the work).

> * Peff's computations for performance in the last two bumps used "the
>   average size of a file in the linux-2.6 repository"[16], for which I
>   assume average==mean, but the file selected was actually ~2x larger
>   than the mean file size according to my calculations[17].
> [...]
> [17] Calculated and compared as follows (num files, mean size, size Peff used):
>   $ git ls-tree -rl v2.6.25 | wc -l
>   23810
>   $ git ls-tree -rl v2.6.25 | awk '{sum += $4} END{print sum/23810}'
>   11150.3
>   $ git show v2.6.25:arch/m68k/Kconfig | wc -c
>   20977

I don't remember my methodology at this point, but perhaps it was based
on blobs in the graph, not just one tree, like:

  $ git rev-list --objects v2.6.25 |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk '
      /^blob/ { sum += $2; total += 1 }
      END { print sum / total }
    '
  27535.8

I suspect the difference versus a single tree is that there is a
quadratic-ish property going on with file size: the bigger the file, the
more likely it is to be touched (so total storage is closer to bytes^2).

Looking at single-tree blob sizes is probably better though, as rename
detection will happen between two single trees.

> * I think the median file size is a better predictor of rename
>   performance than mean file size, and median file size is ~2.5x smaller
>   than the mean[18].

There you might get hit with the quadratic-update thing again, though.
The big files are more likely to be touched, so could be weighted more
(though are they more likely to have been added/delete/renamed? Who
knows).

I don't think file size matters all _that_ much, though, as it has a
linear relationship to time spent. Whereas the number of entries is
quadratic. And of course the whole experiment is ball-parking in the
first place. We're looking for order-of-magnitude approximations, I'd
think.

> * The feedback about the limit is better today than when we last changed
>   the limits, and folks can configure a higher limit relatively easily.
>   Many already have.

I can't remember the last time I saw the limit kick in in practice, but
then I don't generally work with super-large repos (and my workflows
typically do not encourage merging across big segments of history).
Nor do I remember the topic coming up on the list after the last bump.
So maybe that means that people are happily bumping the limits
themselves via config.

But I don't think that's really an argument against at least a moderate
bump. If it helps even a few people avoid having to learn about the
config, that's time saved. And it's a trivial code change on our end.

-Peff

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC] Bump {diff,merge}.renameLimit ?
  2021-07-12 16:48     ` Ævar Arnfjörð Bjarmason
@ 2021-07-12 17:39       ` Jeff King
  0 siblings, 0 replies; 9+ messages in thread
From: Jeff King @ 2021-07-12 17:39 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Elijah Newren, Git Mailing List, Derrick Stolee, Junio C Hamano,
	Linus Torvalds, Jonathan Tan

On Mon, Jul 12, 2021 at 06:48:50PM +0200, Ævar Arnfjörð Bjarmason wrote:

> > So I've read and re-read your response multiple times, but I am still
> > not sure what you're advocating for.  I think you're either advocating
> > for rename detection to be turned off by default, or for a new
> > "unlimited" mode to be introduced and be the default (maybe even
> > redefining what the value of "0" means in order to implement this),
> > but I can't tell which.  Could you clarify?
> 
> I'm advocating for an "unlimited" default as long as we have
> progress/advice or whatever other output would direct users for whom
> it's very slow to tweaking the setting (or not).
> 
> Anyway, yes some may disagree with this stance. I'm not saying it's
> demonstrably obvious that we should have "unlimited".
> 
> I do think that it's much better if git behaves that way, i.e. that we
> don't have arbitrary limits that completely change behavior once they're
> tripped.
> 
> Better to spend more CPU, and if it's too slow for someone they can
> tweak the limits.

This seems like a reasonable view to me for merges. As noted, we already
show progress for rename detection there, so at least the use knows that
we're working.

It might be nice to provide similar advice to what we show now when the
limit kicks in.  I.e., now we say "we turned off rename detection
because this merge is huge; you can bump this config to enable it". But
we may want to say "woah, this rename detection is taking a long time;
you can skip it by tweaking this config". Probably with a much higher
limit than the progress meter kicking in (I'm thinking something like
30+ seconds).

For diffs we don't ever show progress. There are two things that make
that a bit hairy:

  1. Most diff operations use the pager, and we send stderr there. That
     gets weird with our "\r" overwriting.

  2. Traversing via git-log, etc, will show a series of diffs, which
     means an unbounded number of progress meters. And potentially we're
     doing work that doesn't match what the user is looking at in the
     pager (e.g., they are looking at commit X, but we are computing
     X~30 while the pager has buffered all points in between; showing a
     progress meter for X~30 is distracting at that point).

I posted a series back in 2011 to try to deal with these. It's in the
sub-thread at:

  https://lore.kernel.org/git/20110324174556.GA30661@sigill.intra.peff.net/

It's also been part of my "rebase forward and merge into my custom
build" strategy for the past 10+ years (even though I had totally
forgotten it existed). So you can get working current patches by
fetching:

  https://github/com/peff/git jk/rename-progress

They do work, and running:

  git -c diff.renamelimit=65535 diff v2.6.12 --raw

in linux.git is very satisfying. I think one sticking point is that
sending separate lines to stderr (outside of the pager) only looks good
if your pager is careful not to take over the terminal until there's
something to show. "less" does this, and I'd expect most traditional
pagers to do so. But there may be those that don't, so we'd probably
need another config option to give an escape hatch.

-Peff

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC] Bump {diff,merge}.renameLimit ?
  2021-07-12 17:16 ` Jeff King
@ 2021-07-12 20:23   ` Elijah Newren
  2021-07-12 20:58     ` Felipe Contreras
  2021-07-12 21:41     ` Jeff King
  0 siblings, 2 replies; 9+ messages in thread
From: Elijah Newren @ 2021-07-12 20:23 UTC (permalink / raw)
  To: Jeff King
  Cc: Git Mailing List, Derrick Stolee, Junio C Hamano, Linus Torvalds,
	Jonathan Tan, Ævar Arnfjörð Bjarmason

On Mon, Jul 12, 2021 at 10:16 AM Jeff King <peff@peff.net> wrote:
>
> tl;dr Bumping the limit seems like a good idea to me.

:-)

> On Sat, Jul 10, 2021 at 05:28:43PM -0700, Elijah Newren wrote:
>
> > * My colleagues happily raised merge.renameLimit beyond 32767 when the
> >   artificial cap was removed.  10 minute waits were annoying, but much
> >   less so than having to manually cherry-pick commits (especially given
> >   the risk of getting it wrong).[13]
>
> One tricky thing here is that waiting 10 minutes may be worth it _if the
> rename detection finds something_. If it doesn't, then it's just
> annoying.
>
> I do think progress meters help a bit there, because then at least the
> user understands what's going on. I'll go into more detail in the
> sub-thread there. :)

Another thing that helps here (at least for merges rather than diffs)
is that we can determine a priori whether individual renames will have
no effect on the merge result and just exclude those particular
renames from the detection machinery.  Well, at least we can do this
with the new merge-ort backend.  That should dramatically reduce the
annoyance factor once we make it the default.

...
> Yeah, it is definitely time to revisit the default numbers. I think at
> one point we talked about letting it run for N wallclock seconds before
> giving up, but we've been hesitant to introduce that kind of time-based
> limit, because it ends up with non-deterministic results (plus you don't
> realize you're not going to finish until you've already wasted a bunch
> of time, whereas the static limits can avoid even beginning the work).

Yeah, I'm kinda glad we didn't go that route; seems problematic to me.

...
> I don't remember my methodology at this point, but perhaps it was based
> on blobs in the graph, not just one tree, like:
>
>   $ git rev-list --objects v2.6.25 |
>     git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
>     awk '
>       /^blob/ { sum += $2; total += 1 }
>       END { print sum / total }
>     '
>   27535.8
>
> I suspect the difference versus a single tree is that there is a
> quadratic-ish property going on with file size: the bigger the file, the
> more likely it is to be touched (so total storage is closer to bytes^2).

Ah, gotcha.  That makes sense.

> Looking at single-tree blob sizes is probably better though, as rename
> detection will happen between two single trees.

Agreed.

> > * I think the median file size is a better predictor of rename
> >   performance than mean file size, and median file size is ~2.5x smaller
> >   than the mean[18].
>
> There you might get hit with the quadratic-update thing again, though.
> The big files are more likely to be touched, so could be weighted more
> (though are they more likely to have been added/delete/renamed? Who
> knows).

I'll agree that big files are more likely to be updated, but I don't
think renames are weighted towards bigger files.  In fact, I wrote a
quick script to look at the sizes of all the renamed files in the
history of v2.6.25, and the mean (8034.1) and median (3866) of the
renamed files sizes in that history are comparable to the mean
(11150.3) and median (4198) of the files sizes in the v2.6.25 tree.

I re-did the calculations using v5.5, and found that the mean
(12495.1) and median (3702) sizes of renames in all linux history up
to that point again were a bit less than the mean (13449.2) and median
(3860) file size of a file in the final v5.5 tree.

Granted, this is a bit hand-wavy (what about creations or deletions?
Is there too much bias from the fact that I did rename sizes over all
history (due to needing enough to get statistics) while just grabbing
regular file sizes just in the end tree?), but I think it provides
pretty good first order approximation suggesting that mean/median
sizes of files involved in rename detection will be similar to the
mean/median sizes of other files within the relevant trees.

> I don't think file size matters all _that_ much, though, as it has a
> linear relationship to time spent. Whereas the number of entries is
> quadratic. And of course the whole experiment is ball-parking in the
> first place. We're looking for order-of-magnitude approximations, I'd
> think.

I agree that the number of entries is what's important; in fact,
that's why I think the median file size is more important than the
mean file size:

In the case of the linux kernel, since the mean is 2.5x bigger than
the median file size...
  => diffcore-rename checks file sizes before comparing content
  => Files more than 2x different in size can't be more than 50% similar
  => Therefore, files of the mean size will not be compared to files
of the median size (or less)
  => Therefore, we automatically know that files of mean size will not
be compared to more than half the files.

> > * The feedback about the limit is better today than when we last changed
> >   the limits, and folks can configure a higher limit relatively easily.
> >   Many already have.
>
> I can't remember the last time I saw the limit kick in in practice, but
> then I don't generally work with super-large repos (and my workflows
> typically do not encourage merging across big segments of history).
> Nor do I remember the topic coming up on the list after the last bump.
> So maybe that means that people are happily bumping the limits
> themselves via config.

It might also mean that you're missing more emails than you used to,
or just forgot them.  :-)

e.g.:
https://lore.kernel.org/git/20171129201154.192379-1-jonathantanmy@google.com/
https://lore.kernel.org/git/20171113201600.24878-1-newren@gmail.com/

But I do certainly suspect it's come up less often than it would have before.

> But I don't think that's really an argument against at least a moderate
> bump. If it helps even a few people avoid having to learn about the
> config, that's time saved. And it's a trivial code change on our end.

:-)

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC] Bump {diff,merge}.renameLimit ?
  2021-07-12 20:23   ` Elijah Newren
@ 2021-07-12 20:58     ` Felipe Contreras
  2021-07-12 21:41     ` Jeff King
  1 sibling, 0 replies; 9+ messages in thread
From: Felipe Contreras @ 2021-07-12 20:58 UTC (permalink / raw)
  To: Elijah Newren, Jeff King
  Cc: Git Mailing List, Derrick Stolee, Junio C Hamano, Linus Torvalds,
	Jonathan Tan, Ævar Arnfjörð Bjarmason

Elijah Newren wrote:
> On Mon, Jul 12, 2021 at 10:16 AM Jeff King <peff@peff.net> wrote:

> > > * I think the median file size is a better predictor of rename
> > >   performance than mean file size, and median file size is ~2.5x smaller
> > >   than the mean[18].
> >
> > There you might get hit with the quadratic-update thing again, though.
> > The big files are more likely to be touched, so could be weighted more
> > (though are they more likely to have been added/delete/renamed? Who
> > knows).
> 
> I'll agree that big files are more likely to be updated, but I don't
> think renames are weighted towards bigger files.  In fact, I wrote a
> quick script to look at the sizes of all the renamed files in the
> history of v2.6.25, and the mean (8034.1) and median (3866) of the
> renamed files sizes in that history are comparable to the mean
> (11150.3) and median (4198) of the files sizes in the v2.6.25 tree.
> 
> I re-did the calculations using v5.5, and found that the mean
> (12495.1) and median (3702) sizes of renames in all linux history up
> to that point again were a bit less than the mean (13449.2) and median
> (3860) file size of a file in the final v5.5 tree.
> 
> Granted, this is a bit hand-wavy (what about creations or deletions?
> Is there too much bias from the fact that I did rename sizes over all
> history (due to needing enough to get statistics) while just grabbing
> regular file sizes just in the end tree?), but I think it provides
> pretty good first order approximation suggesting that mean/median
> sizes of files involved in rename detection will be similar to the
> mean/median sizes of other files within the relevant trees.
> 
> > I don't think file size matters all _that_ much, though, as it has a
> > linear relationship to time spent. Whereas the number of entries is
> > quadratic. And of course the whole experiment is ball-parking in the
> > first place. We're looking for order-of-magnitude approximations, I'd
> > think.
> 
> I agree that the number of entries is what's important; in fact,
> that's why I think the median file size is more important than the
> mean file size:

That is almost always the case (except in unskewed distributions where
the mean is equal to the median).

Another option instead of an opaque configuration like 'renamelimit'
--which is almost entirely arbitrary for most users--would be to have
'renamelevel'. A renamelevel of 5 would be the median, so that's already
more meaningul than any value of renamelimit.

A renamelevel of 9 would be the equivalent of the 9th decile, so that
would catch 90% of renames.

If the distribution follows a Pareto distribution (which is often the
case), the formula to calculate the different deciles is trivial, but it
would also be possible to hard-code all the different levels.

-- 
Felipe Contreras

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [RFC] Bump {diff,merge}.renameLimit ?
  2021-07-12 20:23   ` Elijah Newren
  2021-07-12 20:58     ` Felipe Contreras
@ 2021-07-12 21:41     ` Jeff King
  1 sibling, 0 replies; 9+ messages in thread
From: Jeff King @ 2021-07-12 21:41 UTC (permalink / raw)
  To: Elijah Newren
  Cc: Git Mailing List, Derrick Stolee, Junio C Hamano, Linus Torvalds,
	Jonathan Tan, Ævar Arnfjörð Bjarmason

On Mon, Jul 12, 2021 at 01:23:38PM -0700, Elijah Newren wrote:

> > > * The feedback about the limit is better today than when we last changed
> > >   the limits, and folks can configure a higher limit relatively easily.
> > >   Many already have.
> >
> > I can't remember the last time I saw the limit kick in in practice, but
> > then I don't generally work with super-large repos (and my workflows
> > typically do not encourage merging across big segments of history).
> > Nor do I remember the topic coming up on the list after the last bump.
> > So maybe that means that people are happily bumping the limits
> > themselves via config.
> 
> It might also mean that you're missing more emails than you used to,
> or just forgot them.  :-)
> 
> e.g.:
> https://lore.kernel.org/git/20171129201154.192379-1-jonathantanmy@google.com/
> https://lore.kernel.org/git/20171113201600.24878-1-newren@gmail.com/
> 
> But I do certainly suspect it's come up less often than it would have before.

Oh, indeed. Thanks for some counterexamples. :)

-Peff

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2021-07-12 21:41 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-11  0:28 [RFC] Bump {diff,merge}.renameLimit ? Elijah Newren
2021-07-11 16:42 ` Ævar Arnfjörð Bjarmason
2021-07-12 15:23   ` Elijah Newren
2021-07-12 16:48     ` Ævar Arnfjörð Bjarmason
2021-07-12 17:39       ` Jeff King
2021-07-12 17:16 ` Jeff King
2021-07-12 20:23   ` Elijah Newren
2021-07-12 20:58     ` Felipe Contreras
2021-07-12 21:41     ` Jeff King

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).