git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Elijah Newren <newren@gmail.com>
To: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Cc: Elijah Newren via GitGitGadget <gitgitgadget@gmail.com>,
	Git Mailing List <git@vger.kernel.org>,
	Bagas Sanjaya <bagasdotme@gmail.com>,
	Eric Sunshine <sunshine@sunshineco.com>,
	Derrick Stolee <stolee@gmail.com>, Jeff King <peff@peff.net>
Subject: Re: [PATCH v2 2/4] doc: clarify documentation for rename/copy limits
Date: Wed, 14 Jul 2021 15:56:14 -0700	[thread overview]
Message-ID: <CABPp-BFiw2PA6qz0uSCPXoSY9HDNc1NuwCxUJBnLU00mACAFqg@mail.gmail.com> (raw)
In-Reply-To: <87mtqozg2q.fsf@evledraar.gmail.com>

On Wed, Jul 14, 2021 at 3:27 PM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> On Wed, Jul 14 2021, Elijah Newren wrote:
>
> > On Wed, Jul 14, 2021 at 12:47 AM Ævar Arnfjörð Bjarmason
> > <avarab@gmail.com> wrote:
> >>
> >> On Wed, Jul 14 2021, Elijah Newren via GitGitGadget wrote:
> >>
...
> >> ...and this we should really have some wording to the effect of:
> >>
> >>     ..., defaults to XYZ. The exact (or even approximate) default of XYZ
> >>     should not be relied upon, and may be changed (or these limits even
> >>     removed) in future versions of git.
> >>
> >> I.e. let's distinguish a "this is how it works now, FYI" from a forward
> >> promise that it's going to work like that forever.
> >
> > Perhaps I could simplify that to "...currently defaults to XYZ"?  Does
> > that capture your intent well enough?
>
> Yes, and it's not as needlessly verbose :)

Cool, I'll include that change then.

> > I agree very much that this shouldn't be a hard promise.  If it was,
> > we've broken it repeatedly over the years.  Not only have we modified
> > the default numbers multiple times, there have also been multiple
> > times when we've been able to change how many renames could be handled
> > without changing the default numbers for the rename limits.  (This was
> > done by adding or improving special cases that allow us to exclude
> > paths from the exhaustive comparison; exact rename detection that
> > someone else added and my more recent improvements to exact rename
> > detection are two simple examples).
> >
> >> Which also leads me to:
> >>
> >> >  merge.renames::
> >> >       Whether Git detects renames.  If set to "false", rename detection
> >> > diff --git a/Documentation/diff-options.txt b/Documentation/diff-options.txt
> >> > index 32e6dee5ac3..11e08c3fd36 100644
> >> > --- a/Documentation/diff-options.txt
> >> > +++ b/Documentation/diff-options.txt
> >> > @@ -588,11 +588,12 @@ When used together with `-B`, omit also the preimage in the deletion part
> >> >  of a delete/create pair.
> >> >
> >> >  -l<num>::
> >> > -     The `-M` and `-C` options require O(n^2) processing time where n
> >> > -     is the number of potential rename/copy targets.  This
> >> > -     option prevents rename/copy detection from running if
> >> > +     The `-M` and `-C` options have an exhaustive portion that
> >> > +     requires O(n^2) processing time where n is the number of
> >> > +     potential rename/copy targets.  This option prevents the
> >> > +     exhaustive portion of rename/copy detection from running if
> >> >       the number of rename/copy targets exceeds the specified
> >> > -     number.
> >> > +     number.  Defaults to diff.renameLimit.
> >>
> >> This mention of O(n^2). This is not an issue with your patch/series, nd
> >> this is an improvement.
> >>
> >> But as a side-comment: Do we explain somewhere how exactly this
> >> {diff,merge}.renmeLimit=N works for values of N? It's probably fine to
> >> continue to handwave it away, but is there anything that say tells a
> >> user what happens if they have 400 files in their repository in a "x"
> >> directory and "git mv x y"? Will it work, but if they add a 401th file
> >> it won't for diffs?
> >>
> >> I *think* given a skimming of previous discussions that it's "no",
> >> i.e. that this just kicks in for these expensive cases, and e.g. for
> >> such a case of moving the same tree OID from "x" to "y" we'd
> >> short-circuit things, but maybe I'm just making things up at this point.
> >
> > Tree OIDs aren't used by the rename detection logic, but you're close.
> >
> >> Feel free to ignore me here, I guess this amounts to a request that it
> >> would be great if we could point to some docs about how this works. It's
> >> probably too much detail for Documentation/config/{merge,diff}.txt, so
> >> maybe a section in either of git-{diff,merge}.txt ?
> >
> > If git-merge.txt includes it, then rebase, cherry-pick, revert
> > probably should too.  (and maybe the -3 option of git-am)
> >
> > Anyway...
>
> Some of this we do via includes, but I also think we should have more
> "concept docs", like gitglossary, gitrevisions, so perhaps
> gitrenames(5)? Anyway, not for now...
>
> > In the "git mv dir-x dir-y" case, if no other changes are made to
> > those files, then the renames would be detectable via blobs having the
> > same hashes (i.e. exact renames).  So all these renames would be found
> > regardless of the number of files and without exhaustive detection
> > even kicking in.
> >
> > If there were thousands of such files, and every one of them were also
> > modified, then the basename-guided rename detection which operates in
> > linear time would still find all the renames, without the exhaustive
> > detection even kicking in (see commits bd24aa2f97a0 (diffcore-rename:
> > guide inexact rename detection based on basenames, 2021-02-14) and
> > 37a251436447 (diffcore-rename: use directory rename guided basename
> > comparisons, 2021-02-27)).
> >
> > If this were a merge, and there were thousands of such files renamed
> > and modified, AND they were all further renamed to change their
> > basename, BUT the original directory was unmodified on the other side
> > of history, then we wouldn't need any of the renames for the purposes
> > of the merge and they'd all be excluded from the exhaustive rename
> > detection.  Sure, we wouldn't find renames for those files, but that
> > wouldn't change the result of the merge.  So, again, no exhaustive
> > rename detection.  (See commit 174791f0fb23 (merge-ort: use
> > relevant_sources to filter possible rename sources, 2021-03-11))
> >
> > (Rebases & cherry-picks also have an advantage of caching renames from
> > previous picks on the upstream side of history to avoid needing to
> > re-detect renames on that side...though it only caches renames that
> > are either relevant or exact.)
> >
> > Basically, we bail on exhaustive rename detection if, after the linear
> > or faster steps have run (excluding previously cached renames, exact
> > rename detection, skipping irrelevant renames, basename guided rename
> > detection), the number of remaining unpaired source files times the
> > number of remaining unpaired destination files is greater than
> > rename_limit * rename_limit.  (If we can assume the numbers match,
> > i.e. N = number of remaining unpaired sources = number of remaining
> > unpaired destinations, then that check simplifies to bailing once N >
> > rename_limit)
> >
> >
> > I'm not sure if these details are really going to help the users,
> > especially if more special cases are added (and more have been
> > proposed by Johannes and became an Outreachy project, though that one
> > didn't get off the ground).  And this explanation is not even fully
> > detailed; I'm still hand waving here in at least four different ways
> > (mostly in ignoring special cases for different fast steps, but also
> > by ignoring copy detection that itself messes with the explanation in
> > multiple ways).
> >
> > But maybe someone else can figure out a way to hand wave my hand
> > waving down to a simple enough explanation, while also covering copy
> > detection, and managing to do so in a way that doesn't mislead.  If
> > so, we can include that in the docs somewhere.
>
> I think it would be good to briefly explain it in the sense of not
> explaining it, i.e. give the user who ran into this some idea of what
> they're looking at. Perhaps something like (and maybe this is too
> long...);
>
>     [...] these limits are limits to the number of "steps" in an an
>     algorithm that's subject to change, and whose number of steps will
>     depend on your input data.
>
>     They don't correspond to any easily tangible thing, e.g. a limit of
>     100 has no correspondence with detecting all renames in a commit
>     that changed 10 files, but not a commit that changed 11 files.
>
>     The limits were picked to limit runtime to an approximate desired
>     value, given what was thought to be Representative data from a
>     repository similar to linux.git.
>
>     If you find your diff/merge take X amount of time doing rename
>     detection, then generally speaking doubling the limit will about
>     double the time we spend on it.

...*quadruple* the time we spend on it.

>     So they're an indirect a way to get to a runtime limit of (what was
>     it, 2 and 10 seconds?).
>
>     The reason for this configuration not being a time limit is so that
>     git can guarantee identical results for the same input data across
>     different systems, given the same version and configuration.
>     I.e. the common case of running your attempt at a diff/merge twice
>     in a row, a time limit would produce unpredictable results.

Seems pretty lengthy.  Thinking about it for a bit, perhaps we could
just say something like:


After preliminary steps that can detect subsets of renames quickly, an
exhaustive fallback algorithm is used that compares all remaining
unpaired sources with all remaining unpaired destinations looking for
similar files that can be marked as renames.  (Or, in the case of copy
detection, it compares all original sources with all remaining
unpaired destinations).  With N sources and N destinations, this
algorithm is O(N^2).  The rename_limit says to skip this exhaustive
step if N is greater than the rename_limit.  (More precisely, it says
to skip the exhaustive step if number_sources * number_destinations >
rename_limit * rename_limit, which handles the case where there are a
different number of sources and destinations.)

  reply	other threads:[~2021-07-14 22:56 UTC|newest]

Thread overview: 44+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-07-11  0:46 [PATCH 0/3] Improve the documentation and warnings dealing with rename/copy limits Elijah Newren via GitGitGadget
2021-07-11  0:46 ` [PATCH 1/3] doc: clarify documentation for " Elijah Newren via GitGitGadget
2021-07-11  4:37   ` Bagas Sanjaya
2021-07-11  4:52     ` Elijah Newren
2021-07-12 15:03   ` Derrick Stolee
2021-07-12 21:27     ` Junio C Hamano
2021-07-11  0:46 ` [PATCH 2/3] doc: document the special handling of -l0 Elijah Newren via GitGitGadget
2021-07-11  4:54   ` Eric Sunshine
2021-07-11  4:54     ` Elijah Newren
2021-07-11  0:46 ` [PATCH 3/3] diff: correct warning message when renameLimit exceeded Elijah Newren via GitGitGadget
2021-07-12 15:09   ` Derrick Stolee
2021-07-12 18:13     ` Elijah Newren
2021-07-14  0:47       ` Junio C Hamano
2021-07-14  1:06         ` Elijah Newren
2021-07-14  1:10           ` Junio C Hamano
2021-07-14  1:22             ` Elijah Newren
2021-07-14  5:17               ` Junio C Hamano
2021-07-14 15:09                 ` Elijah Newren
2021-07-14  1:12 ` [PATCH v2 0/4] Rename/copy limits -- docs, warnings, and new defaults Elijah Newren via GitGitGadget
2021-07-14  1:12   ` [PATCH v2 1/4] diff: correct warning message when renameLimit exceeded Elijah Newren via GitGitGadget
2021-07-14  1:12   ` [PATCH v2 2/4] doc: clarify documentation for rename/copy limits Elijah Newren via GitGitGadget
2021-07-14  7:37     ` Ævar Arnfjörð Bjarmason
2021-07-14 16:30       ` Elijah Newren
2021-07-14 22:08         ` Ævar Arnfjörð Bjarmason
2021-07-14 22:56           ` Elijah Newren [this message]
2021-07-14  1:12   ` [PATCH v2 3/4] doc: document the special handling of -l0 Elijah Newren via GitGitGadget
2021-07-14 16:45     ` Jeff King
2021-07-14 17:17       ` Elijah Newren
2021-07-14 17:33         ` Jeff King
2021-07-14 19:32           ` Elijah Newren
2021-07-14  1:12   ` [PATCH v2 4/4] Bump rename limit defaults (yet again) Elijah Newren via GitGitGadget
2021-07-14 16:43     ` Jeff King
2021-07-14 17:32       ` Elijah Newren
2021-07-14 17:57         ` Jeff King
2021-07-14 20:03           ` Elijah Newren
2021-07-14 20:47             ` Jeff King
2021-07-15  0:45   ` [PATCH v3 0/4] Rename/copy limits -- docs, warnings, and new defaults Elijah Newren via GitGitGadget
2021-07-15  0:45     ` [PATCH v3 1/4] diff: correct warning message when renameLimit exceeded Elijah Newren via GitGitGadget
2021-07-15  0:45     ` [PATCH v3 2/4] doc: clarify documentation for rename/copy limits Elijah Newren via GitGitGadget
2021-07-15  0:45     ` [PATCH v3 3/4] diffcore-rename: treat a rename_limit of 0 as unlimited Elijah Newren via GitGitGadget
2021-07-15 23:17       ` Junio C Hamano
2021-07-15  0:45     ` [PATCH v3 4/4] Bump rename limit defaults (yet again) Elijah Newren via GitGitGadget
2021-07-15 13:36     ` [PATCH v3 0/4] Rename/copy limits -- docs, warnings, and new defaults Derrick Stolee
2021-07-15 23:20     ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CABPp-BFiw2PA6qz0uSCPXoSY9HDNc1NuwCxUJBnLU00mACAFqg@mail.gmail.com \
    --to=newren@gmail.com \
    --cc=avarab@gmail.com \
    --cc=bagasdotme@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=peff@peff.net \
    --cc=stolee@gmail.com \
    --cc=sunshine@sunshineco.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).