git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / mirror / code / Atom feed
From: Elijah Newren <newren@gmail.com>
To: Junio C Hamano <gitster@pobox.com>
Cc: Elijah Newren via GitGitGadget <gitgitgadget@gmail.com>,
	Git Mailing List <git@vger.kernel.org>,
	Derrick Stolee <dstolee@microsoft.com>,
	Jonathan Tan <jonathantanmy@google.com>,
	Taylor Blau <me@ttaylorr.com>
Subject: Re: [PATCH 0/7] Optimization batch 11: avoid repeatedly detecting same renames
Date: Wed, 24 Mar 2021 16:25:48 -0700	[thread overview]
Message-ID: <CABPp-BGMhyn1ricXzx539n-09+BYRHPeruNd4MG2PyQzWaRKow@mail.gmail.com> (raw)
In-Reply-To: <xmqqv99gw6n3.fsf@gitster.g>

On Wed, Mar 24, 2021 at 3:04 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Elijah Newren via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > === Basic Optimization idea ===
> >
> > This series avoids repeatedly detecting the same renames in a sequence of
> > merges such as a rebase or cherry-pick of several commits. When there are
> > many renames between the old base and the new base, traditionally all those
> > renames are re-detected for every commit that is transplanted. This
> > optimization avoids redoing that work.
>
> Unless this section is easily understandable, the readers have no
> incentive to read on, but the above is a bit too hand wavy.

Oh, yeah, it's very hand wavy.  I figured the commit messages were the
right place to include the details, and just wanted to give a flavor
of the idea in the cover letter.

> > This one adds a fourth (remember-renames), with some interesting properties:
> >
> >  * unlike basename-guided rename detection, there are no behavioral changes
> >    (there is no heuristic involved)[2].
> >
> >  * like skip-because-irrelevant, this optimization does not apply to all git
> >    commands using the rename machinery. In fact, this one is even more
> >    restrictive since it is ONLY useful for rebases and cherry-picks (not
> >    even merges), and only for second and later commits in a linear series.
>
> So, is it correct to understand that one case this would help is
> this scenario?
>
>  ---o---o---o---X---o---o---o---O ours
>      \
>       A---B---C topic
>
> where there is a side branch A--B--C that touched some files, while
> on our side, there is a commit X that is unknown to the side branch
> that renamed these files.  Now we want to transplant the side topic
> to the tip of our history, replaying the changes A--B--C made to
> these files under their original name to the corresponding files
> that have been renamed.
>
> And each step in this "rebase" is a 3-way merge of commits A, B and
> C onto HEAD, using the parent of the commit being cherrk-picked as a
> virtual common ancestor.  Which means

You generated nearly the same description and diagram I used in the
commit message (the one in 3/7) describing this.  :-)

>  - To transplant A (i.e. the first step), we'd compare the diff of
>    A^..O (i.e. what our side did, including the renames done at X)
>    and diff of A^..A (i.e. what the first commit did in the range),
>    and the former does quite a lot of rename detection.
>
>  - After transplanting B (i.e. the second step), then we'd compare
>    the diff of A^..A' (where A' is A cherry-picked on O, i.e. the

Close, but for transplanting B we do the diff of B^..A', not A^...A'.
(And in this diagram, B^ is A.)  That's critical below...

>    result of the previous step).  If we are lucky, O..A' did not
>    rename anything so the renames done in A^..O (i.e. what we
>    detected during the first step) and A^..A' (i.e. what we should
>    be computing for this second step) should be quite similar.

Again, B^..A' rather than A^..A'.

Luck is not involved here.  If O..A' did rename anything, it's one of
two reasons:

- There were conflicts when trying to transplant A, and when we stop
for conflict resolution, the user added some renames at that point.
- There were renames in A^..A.

In the first case, the presence of conflicts means we drop the cache
and this optimization doesn't try to kick in.  In the second case,
those renames in A' came from A.  Even without this optimization,
since those renames in A' came from A, doing rename detection on A..A'
wouldn't re-detect them and transplanting wouldn't try to reapply
them, so they just aren't relevant anymore -- with or without this
optimization.

>    If we assume that the "quite similar" is good enough, then we can
>    blindly reuse the record of "<path in A^> correspnds to <path in
>    O>" as if it were "<path in A^> corresponds to <path in A'>".

Again, B^ rather than A^ on the last line.

I disagree with the use of the term "blindly" here.  As spelled out in
the third commit message, the transplant of A involved a three-way
content merge of the form:
    A^:oldfile
    O:newfile
    A:oldfile
and produce a new result:
    A':newfile

The point of rename detection is to determine what files are similar
enough to use in a three-way content merge.  In particular, we'd use
rename detection when transplanting B to notice the oldfile -> newfile
rename so that we can do a three-way content merge of the form:
    A:oldfile
    A':newfile
    B:oldfile
and produce a new result:
    B':newfile

But, instead of asking rename detection whether A:oldfile and
A':newfile are similar enough to use together in a three-way content
merge, we could ask ourselves -- do we have any _other_ reason to
believe these files are similar enough to be used in a three-way
content merge?  And the answer that comes back is: these files were
*already* involved in the same three-way content merge -- the one that
A':newfile came from.  It was a three-way content merge with no
conflicts.  (Because when conflicts are triggered we turn this
optimization off.)

>  - Do the same for C, pretending that renames discovered between A^
>    and O is identical to the renames between A^ and B' (i.e. the
>    result of cherry-picking A--B on top of O).

Now you've changed your off-by-one mistake to an off-by-two mistake;
the rename detection is between C^ and B', not A^ and B'.  I think
this error might be critical to why you used terms like "pretend" and
"blindly" and "lucky".  I agree that it would require
luck/blindness/pretending to assume that the renames between A^ and O
are identical to those between A^ and B', but that's not what the
original algorithm would have been using for computing renames; it
would be using C^ and B'.

It's actually quite difficult to generate a case where this
optimization gets a possibly different result.  It requires there were
changes to the content on both sides of history that merge cleanly,
and in particular that need a significant size reduction of the file
by the unrenamed side of history.  If you take the changes on the
*renamed* side of history, which represent <50% changes since it was
detected as a rename, those same changes need to represent a >50%
change when applied to the smaller file.  This is discussed in the
third commit message, as noted in the cover letter:

>> [2] Well, almost no changes. There's technically a very narrow way that this
>> could change the behavior; see the really long "Technically," bullet point
>> in patch 3 for discussion of this.

  reply	other threads:[~2021-03-24 23:26 UTC|newest]

Thread overview: 61+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-03-24 21:32 Elijah Newren via GitGitGadget
2021-03-24 21:32 ` [PATCH 1/7] merge-ort: add data structures for in-memory caching of rename detection Elijah Newren via GitGitGadget
2021-03-24 21:32 ` [PATCH 2/7] merge-ort: populate caches of rename detection results Elijah Newren via GitGitGadget
2021-03-24 21:32 ` [PATCH 3/7] merge-ort: add code to check for whether cached renames can be reused Elijah Newren via GitGitGadget
2021-03-24 21:32 ` [PATCH 4/7] merge-ort: avoid accidental API mis-use Elijah Newren via GitGitGadget
2021-03-24 21:32 ` [PATCH 5/7] merge-ort: preserve cached renames for the appropriate side Elijah Newren via GitGitGadget
2021-03-24 21:32 ` [PATCH 6/7] merge-ort: add helper functions for using cached renames Elijah Newren via GitGitGadget
2021-03-24 21:32 ` [PATCH 7/7] merge-ort, diffcore-rename: employ cached renames when possible Elijah Newren via GitGitGadget
2021-03-24 22:04 ` [PATCH 0/7] Optimization batch 11: avoid repeatedly detecting same renames Junio C Hamano
2021-03-24 23:25   ` Elijah Newren [this message]
2021-03-25 18:59     ` Junio C Hamano
2021-03-29 22:34       ` Elijah Newren
2021-03-30 12:07         ` Derrick Stolee
2021-05-04  2:12 ` [PATCH v2 00/13] " Elijah Newren via GitGitGadget
2021-05-04  2:12   ` [PATCH v2 01/13] t6423: rename file within directory that other side renamed Elijah Newren via GitGitGadget
2021-05-04  2:12   ` [PATCH v2 02/13] Documentation/technical: describe remembering renames optimization Elijah Newren via GitGitGadget
2021-05-04  2:12   ` [PATCH v2 03/13] fast-rebase: change assert() to BUG() Elijah Newren via GitGitGadget
2021-05-04  2:12   ` [PATCH v2 04/13] fast-rebase: write conflict state to working tree, index, and HEAD Elijah Newren via GitGitGadget
2021-05-17 13:32     ` Derrick Stolee
2021-05-18  3:42       ` Elijah Newren
2021-05-18 13:54         ` Derrick Stolee
2021-05-04  2:12   ` [PATCH v2 05/13] t6429: testcases for remembering renames Elijah Newren via GitGitGadget
2021-05-04  2:12   ` [PATCH v2 06/13] merge-ort: add data structures for in-memory caching of rename detection Elijah Newren via GitGitGadget
2021-05-17 13:41     ` Derrick Stolee
2021-05-18  3:55       ` Elijah Newren
2021-05-18 13:57         ` Derrick Stolee
2021-05-04  2:12   ` [PATCH v2 07/13] merge-ort: populate caches of rename detection results Elijah Newren via GitGitGadget
2021-05-17 13:51     ` Derrick Stolee
2021-05-20  0:48       ` Elijah Newren
2021-05-04  2:12   ` [PATCH v2 08/13] merge-ort: add code to check for whether cached renames can be reused Elijah Newren via GitGitGadget
2021-05-17 14:01     ` Derrick Stolee
2021-05-04  2:12   ` [PATCH v2 09/13] merge-ort: avoid accidental API mis-use Elijah Newren via GitGitGadget
2021-05-17 14:10     ` Derrick Stolee
2021-05-04  2:12   ` [PATCH v2 10/13] merge-ort: preserve cached renames for the appropriate side Elijah Newren via GitGitGadget
2021-05-04  2:12   ` [PATCH v2 11/13] merge-ort: add helper functions for using cached renames Elijah Newren via GitGitGadget
2021-05-04  2:12   ` [PATCH v2 12/13] merge-ort: handle interactions of caching and rename/rename(1to1) cases Elijah Newren via GitGitGadget
2021-05-17 14:16     ` Derrick Stolee
2021-05-04  2:12   ` [PATCH v2 13/13] merge-ort, diffcore-rename: employ cached renames when possible Elijah Newren via GitGitGadget
2021-05-17 14:23     ` Derrick Stolee
2021-05-20  0:36       ` Elijah Newren
2021-05-22 11:17         ` Derrick Stolee
2021-05-14 17:37   ` [PATCH v2 00/13] Optimization batch 11: avoid repeatedly detecting same renames Elijah Newren
2021-05-14 21:04     ` Derrick Stolee
2021-05-20  6:09   ` [PATCH v3 " Elijah Newren via GitGitGadget
2021-05-20  6:09     ` [PATCH v3 01/13] t6423: rename file within directory that other side renamed Elijah Newren via GitGitGadget
2021-05-20  6:09     ` [PATCH v3 02/13] Documentation/technical: describe remembering renames optimization Elijah Newren via GitGitGadget
2021-05-20 11:32       ` Bagas Sanjaya
2021-05-20 15:14         ` Kerry, Richard
2021-05-20 16:34         ` Elijah Newren
2021-05-20  6:09     ` [PATCH v3 03/13] fast-rebase: change assert() to BUG() Elijah Newren via GitGitGadget
2021-05-20  6:09     ` [PATCH v3 04/13] fast-rebase: write conflict state to working tree, index, and HEAD Elijah Newren via GitGitGadget
2021-05-20  6:09     ` [PATCH v3 05/13] t6429: testcases for remembering renames Elijah Newren via GitGitGadget
2021-05-20  6:09     ` [PATCH v3 06/13] merge-ort: add data structures for in-memory caching of rename detection Elijah Newren via GitGitGadget
2021-05-20  6:09     ` [PATCH v3 07/13] merge-ort: populate caches of rename detection results Elijah Newren via GitGitGadget
2021-05-20  6:09     ` [PATCH v3 08/13] merge-ort: add code to check for whether cached renames can be reused Elijah Newren via GitGitGadget
2021-05-20  6:09     ` [PATCH v3 09/13] merge-ort: avoid accidental API mis-use Elijah Newren via GitGitGadget
2021-05-20  6:09     ` [PATCH v3 10/13] merge-ort: preserve cached renames for the appropriate side Elijah Newren via GitGitGadget
2021-05-20  6:09     ` [PATCH v3 11/13] merge-ort: add helper functions for using cached renames Elijah Newren via GitGitGadget
2021-05-20  6:09     ` [PATCH v3 12/13] merge-ort: handle interactions of caching and rename/rename(1to1) cases Elijah Newren via GitGitGadget
2021-05-20  6:09     ` [PATCH v3 13/13] merge-ort, diffcore-rename: employ cached renames when possible Elijah Newren via GitGitGadget
2021-05-22 11:17     ` [PATCH v3 00/13] Optimization batch 11: avoid repeatedly detecting same renames Derrick Stolee

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CABPp-BGMhyn1ricXzx539n-09+BYRHPeruNd4MG2PyQzWaRKow@mail.gmail.com \
    --to=newren@gmail.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=gitster@pobox.com \
    --cc=jonathantanmy@google.com \
    --cc=me@ttaylorr.com \
    --subject='Re: [PATCH 0/7] Optimization batch 11: avoid repeatedly detecting same renames' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

git@vger.kernel.org list mirror (unofficial, one of many)

This inbox may be cloned and mirrored by anyone:

	git clone --mirror https://public-inbox.org/git
	git clone --mirror http://ou63pmih66umazou.onion/git
	git clone --mirror http://czquwvybam4bgbro.onion/git
	git clone --mirror http://hjrcffqmbrq6wope.onion/git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V1 git git/ https://public-inbox.org/git \
		git@vger.kernel.org
	public-inbox-index git

Example config snippet for mirrors.
Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.version-control.git
	nntp://7fh6tueqddpjyxjmgtdiueylzoqt6pt7hec3pukyptlmohoowvhde4yd.onion/inbox.comp.version-control.git
	nntp://ie5yzdi7fg72h7s4sdcztq5evakq23rdt33mfyfcddc5u3ndnw24ogqd.onion/inbox.comp.version-control.git
	nntp://4uok3hntl7oi7b4uf4rtfwefqeexfzil2w6kgk2jn5z2f764irre7byd.onion/inbox.comp.version-control.git
	nntp://news.gmane.io/gmane.comp.version-control.git
 note: .onion URLs require Tor: https://www.torproject.org/

code repositories for project(s) associated with this inbox:

	https://80x24.org/mirrors/git.git

AGPL code for this site: git clone https://public-inbox.org/public-inbox.git