git@vger.kernel.org mailing list mirror (one of many)
From: Elijah Newren <newren@gmail.com>
To: Git Mailing List <git@vger.kernel.org>
Cc: "Jeff King" <peff@peff.net>, "Derrick Stolee" <stolee@gmail.com>,
	"Junio C Hamano" <gitster@pobox.com>,
	"Linus Torvalds" <torvalds@linux-foundation.org>,
	"Jonathan Tan" <jonathantanmy@google.com>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Subject: [RFC] Bump {diff,merge}.renameLimit ?
Date: Sat, 10 Jul 2021 17:28:43 -0700
Message-ID: <CABPp-BFzp3TCWiF1QAVSfywDLYrz=GOQszVM-sw5p0rSB8RWvw@mail.gmail.com>

[CC'ing folks I've seen comment on these limits.]

Hi everyone,

I'm considering bumping {diff,merge}.renameLimit, which control the
quadratic portion of rename/copy detection.  Should they be bumped?
If so, moderately higher, or much higher?
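
To make "quadratic" concrete, here is a rough sketch of how the work
grows with the number of candidate files.  The helper names are
hypothetical (they are not part of git), and the comparison against
limit * limit reflects my understanding of the check in
diffcore-rename.c, so treat it as an assumption rather than a
specification:

```shell
# Hypothetical helpers for illustration only; not part of git.

# Worst-case number of pairwise similarity comparisons.
# $1 = candidate rename sources, $2 = candidate rename destinations
rename_pairs() {
  echo $(( $1 * $2 ))
}

# Succeeds when sources * destinations fits within limit * limit
# (assumed form of the renameLimit check).
# $1 = sources, $2 = destinations, $3 = renameLimit
within_rename_limit() {
  [ $(( $1 * $2 )) -le $(( $3 * $3 )) ]
}
```

For example, `rename_pairs 1000 1000` prints 1000000, so doubling the
limit allows roughly four times as many comparisons.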

I lean towards a moderate bump for diff.renameLimit, and preferably
more than just a moderate bump for merge.renameLimit.  I have
calculations for what "moderate" translates to, based on a number of
assumptions.  But there are several reasons to break with past
guideposts for how these limits were picked.  See below for various
arguments in each of the directions.

So...thoughts?

Thanks,
Elijah


==> Arguments for bumping MUCH higher:

* Linus said the real reason for the renameLimit was excessive memory
  usage (not performance)[1].  But Junio dropped the memory
  requirements to linear in commit 6d24ad971c81 (Optimize rename
  detection for a huge diff, 2008-01-29).

* Linus oft recommends setting diff.renameLimit=0 [2,3,4,5,6,7,8,9,10,11]
  (which maps to 32767 [12]).

* My colleagues happily raised merge.renameLimit beyond 32767 when the
  artificial cap was removed.  10 minute waits were annoying, but much
  less so than having to manually cherry-pick commits (especially given
  the risk of getting it wrong).[13]


==> Arguments for bumping MODERATELY higher:

* We have bumped the limits twice before (in 2008 and 2011), both times
  citing performance as the limiting factor.  Processors are faster
  today than they were then.[14,15]

* Peff's computations for performance in the last two bumps used "the
  average size of a file in the linux-2.6 repository"[16], for which I
  assume average==mean, but the file selected was actually ~2x larger
  than the mean file size according to my calculations[17].

* I think the median file size is a better predictor of rename
  performance than mean file size, and median file size is ~2.5x smaller
  than the mean[18].


==> Arguments for not bumping either limit:

* The feedback about the limit is better today than when we last changed
  the limits, and folks can configure a higher limit relatively easily.
  Many already have.

* This issue won't come up nearly as much any more once we switch the
  default merge backend to ort, due to my performance work[19] (many
  renames can be outright skipped without affecting merge quality, and
  many others can be detected in linear time -- the cherry-picks that
  used to require merge.renameLimit=48941 and took 10 minutes can now
  complete in less than a second with the default merge.renameLimit of
  1000.)

* It'd take too long to read all the footnotes in this email, so screw
  it.  :-)
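
For anyone who just wants a higher limit today, a minimal sketch of
setting both values with git config follows.  The function name and
the numbers in the usage line are placeholders for illustration, not
recommended defaults:

```shell
# Hypothetical helper: write both rename limits to a given config file.
# $1 = config file (e.g. ~/.gitconfig), $2 = diff limit, $3 = merge limit
set_rename_limits() {
  git config --file "$1" diff.renameLimit "$2" &&
  git config --file "$1" merge.renameLimit "$3"
}

# Usage (illustrative values only):
#   set_rename_limits ~/.gitconfig 2000 10000
```

If merge.renameLimit is left unset, merges fall back to the value of
diff.renameLimit, so setting only the diff limit affects both.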


==> Footnotes:

[1] https://lore.kernel.org/git/AANLkTimKp+Z==QXJg2Bagot+Df4REeANuxwVi7bpPCXr@mail.gmail.com/

[2] https://lore.kernel.org/git/alpine.LFD.0.999.0710161030430.6887@woody.linux-foundation.org/
 2007-10-16
[3] https://lore.kernel.org/git/alpine.LFD.2.00.0811032021210.3419@nehalem.linux-foundation.org/
 2008-11-03
[4] https://lore.kernel.org/git/AANLkTimKp+Z==QXJg2Bagot+Df4REeANuxwVi7bpPCXr@mail.gmail.com/
 2011-02-18
[5] https://lore.kernel.org/lkml/alpine.LFD.0.999.0710111944120.6887@woody.linux-foundation.org/
 2007-10-11
[6] https://lore.kernel.org/lkml/alpine.LFD.1.10.0808052157400.15995@nehalem.linux-foundation.org/
 2008-08-05
[7] https://lore.kernel.org/lkml/alpine.LFD.2.01.0909160813400.4950@localhost.localdomain/
 2009-09-16
[8] https://lore.kernel.org/lkml/CA+55aFw1GVvszqoC_f0RAvG5t1xj0CSYLhLU=y0gQ+_54Gsomw@mail.gmail.com/
 2011-10-25
[9] https://lore.kernel.org/lkml/CA+55aFyWefZ1jJLMJKXhy0Qif-iBmjG6n-evcbvkbWS5mDrs0g@mail.gmail.com/
 2015-02-16
[10] https://lore.kernel.org/lkml/CA+55aFxODGv7-AvnqFmxrXBcS2w0XzHuZ7UuRi3EMQz4-oeLJA@mail.gmail.com/
 2018-04-11
[11] https://lore.kernel.org/lkml/CAHk-=wg=CTtNrxPeFzkDw053dY3urchiyxevHnUXHhTGbK=9OQ@mail.gmail.com/
 2020-06-03

[12] 89973554b52c (diffcore-rename: make diff-tree -l0 mean -l<large>,
     2017-11-29)

[13] https://lore.kernel.org/git/20171110173956.25105-3-newren@gmail.com/

[14] 50705915eae8 (bump rename limit defaults, 2008-04-30)
[15] 92c57e5c1d29 (bump rename limit defaults (again), 2011-02-19)

[16] https://lore.kernel.org/git/20080211113516.GB6344@coredump.intra.peff.net/

[17] Calculated and compared as follows (num files, mean size, size Peff used):
  $ git ls-tree -rl v2.6.25 | wc -l
  23810
  $ git ls-tree -rl v2.6.25 | awk '{sum += $4} END{print sum/23810}'
  11150.3
  $ git show v2.6.25:arch/m68k/Kconfig | wc -c
  20977

[18] Calculated as 4198 as follows (note: 11905 = 23810/2):
  $ git ls-tree -rl v2.6.25 | sort -n -k 4 | head -n 11905 | tail -n 1
  100644 blob 29510dc515109ad5dd8a16b5936f1f6086ae417c    4198	Documentation/lguest/lguest.txt
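
The commands in [17] and [18] hard-code the file count (23810 and
11905).  A sketch that derives the count from the input instead is
below; the function names are hypothetical, and both read
`git ls-tree -rl` output on stdin, taking the size from column 4 as
the commands above do:

```shell
# Hypothetical helpers; both expect `git ls-tree -rl <rev>` output on stdin.

# Mean blob size: sum of column 4 divided by the number of lines seen.
mean_size() {
  awk '{ sum += $4; n++ } END { if (n) print sum / n }'
}

# Median blob size: sort by column 4, then print the middle entry
# (the lower of the two middle entries when the count is even,
# matching the head/tail approach in [18]).
median_size() {
  sort -n -k 4 |
  awk '{ size[NR] = $4 } END { if (NR) print size[int((NR + 1) / 2)] }'
}

# Usage:
#   git ls-tree -rl v2.6.25 | mean_size
#   git ls-tree -rl v2.6.25 | median_size
```

Entries without a numeric size (such as submodules, which show "-")
would count as 0 here, so filter them out first if a tree contains any.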

[19] See "Overall Results" from https://github.com/gitgitgadget/git/pull/990
