mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Derrick Stolee <>
To: Elijah Newren <>
Cc: Git Mailing List <>,
Subject: Re: [PATCH v2 00/20] fundamentals of merge-ort implementation
Date: Mon, 9 Nov 2020 14:51:36 -0500	[thread overview]
Message-ID: <> (raw)
In-Reply-To: <>

On 11/9/20 12:13 PM, Elijah Newren wrote:> Actually, this was pretty enlightening.  I think I know about what's
> happening...
> First, a few years ago, Ben said that merges in the Microsoft repos
> took about an hour[1]:
> "For the repro that I have been using this drops the merge time from ~1 hour to
> ~5 minutes and the unmerged entries goes down from ~40,000 to 1."
> The change he made to drop it that far was to turn off rename detection.
> [1]
> Keep that in mind, especially since your times are actually
> significantly less than 5 minutes...

Yes, the other thing to keep in mind is that this is
a Scalar repo with the default cone-mode sparse-checkout
of only the files at root. For this repo, that means that
there are only ~10 files actually present.

I wanted to remove any working directory updates/checks
from the performance check as possible.

>> $ /_git/git/summarize-perf git rebase --onto to from test
>> Successfully rebased and updated refs/heads/test.
>> Accumulated times:
>>     8.511 : <unmeasured> (74.9%)
> Wild guess: This is setup_git_directory() loading your ~3 million entry index.

I think there is also some commit walking happening, but
it shouldn't be too much. 'from' and 'to' are not very
far away.

> Did you include two runs of recursive and two runs of ort just to show
> that the timings were stable and thus there wasn't warm or cold disk
> cache issues affecting things?  If so, good plan.  (If there was
> another reason, let me know; I missed it.)

For the rebase, I did "--onto to from test" and "--onto from to test"
to show both directions of the rebase. The merge I did twice for the
cache issues ;)

> .004s on label:incore_nonrecursive -- that's the actual merge
> operation.  This was a trivial rebase, and the merging took just 4
> milliseconds.  But the overall run took 11.442 seconds because working
> with 3M+ entries in the index just takes forever, and my code didn't
> touch any on-disk formats, certainly not the index format.
> _All_ of my optimization work was on the merging piece, not the stuff
> outside.  But for what you're testing here, it appears to be
> irrelevant compared to the overhead.

OK, so since we already disable rename detection through config,
the machinery that you are changing is already fast with the old
algorithm in these trivial cases.

To actually show any benefits, we would need to disable rename
detection or use a larger change.
>> And here are timings for a simple merge. Two files at root were changed in the
>> commits I made, but there are also some larger changes from the commit history.
>> These should all be seen as "this tree updated in one of the two, so take that
>> tree".
> Ahah!  That's a microsoft-specific optimization you guys made in the
> recursive strategy, yes? 

I'm not aware of any logic we have that's different from core Git.
The config we use [1] includes "merge.stat = false" and "merge.renames
= false" but otherwise seems to be using stock Git.


I'm CC'ing Jeff Hostetler to see if he knows anything about a custom
merge algorithm in microsoft/git.

> It does NOT exist in upstream git.  It's
> also one that is nearly incompatible with rename detection; it turns
> out you can only do that optimization in the face of rename detection
> if you do a HUGE amount of specialized work and tracking in order to
> determine when it's safe _despite_ needing to detect renames. 

Perhaps merge.renames=false is enough to trigger this logic already?

> I
> thought that optimization was totally incompatible with rename
> detection for a long time; I tried it a couple times while working on
> ort and watched it break all kinds of rename tests...but I eventually
> discovered some tricks involving a lot of work to be able to run that
> optimization.

I will try to keep this in mind.

> So, you aren't comparing upstream "recursive" to "ort", you're
> comparing a tweaked version of recursive, and one that is incompatible
> with how recursive's rename detection work.  In fact, just to be clear
> in case you go looking, I suspect that this tweak is to be found
> within unpack_trees.c (which recursive relies on heavily).
> Further, you've set it up so there are only a few files changed after
> unpack_trees returns.
> In total, you have: (1) turned off rename detection (most my
> optimizations are for improving this factor, meaning I can't show an
> advantage), (2) you took advantage of no rename detection to implement
> trivial-tree merges (thus killing the main second advantage my
> algorithm has), and (3) you are looking at a case with a tiny number
> of changes for the merge algorithm to process (thus killing a third
> optimization that removes quadratic performance).  Those are my three
> big optimizations, and you've made them all irrelevant.  In fact,
> you're in an area I would have been worried that ort would do _worse_
> than recursive.  I track an awful lot of things and there is overhead
> in checking and filling all that information in; if there are only a
> few entries to merge, then all that information was a waste to collect
> and ort might be slower than recursive.  But then again, that should
> be a case where both algorithms are "nearly instantaneous" (or would
> be if it weren't for your 3M+ index entry repo causing run_builtin()'s
> call to setup_git_directory() in git.c to take a huge amount of time
> before the builtin is even called.)

Thanks for your time isolating this case. I appreciate knowing exactly
which portions of the merge algorithm are being touched and which are
> 5 seconds.  I do have to hand it to Ben and anyone else involved,
> though.  From 1 hour down to 5 seconds is pretty good, even if it was
> done by hacks (turning off rename detection, and then implementing
> trivial-tree merging that would have broken rename detection).  I
> suspect that whoever did that work might have found the unconditional
> discarding and re-reading of the index and fixed it as well?

As you can probably tell from my general confusion, I had nothing
to do with it. ;)

> Heh, yeah 0.002 seconds for ..label:incore_recursive.  Only 2
> milliseconds to create the actual merge tree.  That does suggest you
> might have fun with 'git log -p --remerge-diff'; if you can redo
> merges in 2 milliseconds, showing them in git log output is very
> reasonable.  :-)

Yeah, 'git merge-tree' is very fast for these cases, so I assumed
that something else was going on for that command.

> Could we have some fun, though?  What if you have some merge or rebase
> involving lots of changes, and you turn rename detection back on, and
> you disable that trivial-tree resolution optimization that breaks
> recursive's rename detection handling...and then compare recursive and
> ort?  (It might be easiest to just compare upstream recursive rather
> than the one with all the microsoft changes to make sure you undid
> whatever trivial tree handling work exists.)

I can try these kinds of cases, but it won't be today. I'm on kid duty
today, and answering emails in between running around with them.

> For example, my testcase in the linux kernel was finding a series of a
> few dozen patches I could rebase back to an older version, but
> tweaking the "older" version by renaming drivers/ -> pilots/ (with
> about 26K files under that directory, that meant about 26K renames).
> So, I got to see rebasing of dozens of real changes across a massive
> rename boundary -- and the massive rename boundary also guaranteed
> there were lots of entries for the merge algorithm to deal with.
> In the end, though, 4 milliseconds for the rebase and 2 milliseconds
> for the merge, with the rest all being overhead of interfacing to the
> index and working tree actually seems pretty good to me.  I'm just
> curious if we can check how things work for more involved cases.

I'm definitely interested in identifying how your algorithm improves
over the previous cases, and perhaps re-enabling rename detection for
merges is enough of a benefit to justify the new one.

Eventually, I hope to actually engage with your patches in the form
of review. Just trying to build a mental model for what's going on


  reply	other threads:[~2020-11-09 19:51 UTC|newest]

Thread overview: 84+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-11-02 20:43 [PATCH v2 00/20] fundamentals of merge-ort implementation Elijah Newren
2020-11-02 20:43 ` [PATCH v2 01/20] merge-ort: setup basic internal data structures Elijah Newren
2020-11-06 22:05   ` Jonathan Tan
2020-11-06 22:45     ` Elijah Newren
2020-11-09 20:55       ` Jonathan Tan
2020-11-02 20:43 ` [PATCH v2 02/20] merge-ort: add some high-level algorithm structure Elijah Newren
2020-11-02 20:43 ` [PATCH v2 03/20] merge-ort: port merge_start() from merge-recursive Elijah Newren
2020-11-11 13:52   ` Derrick Stolee
2020-11-11 16:22     ` Elijah Newren
2020-11-02 20:43 ` [PATCH v2 04/20] merge-ort: use histogram diff Elijah Newren
2020-11-11 13:54   ` Derrick Stolee
2020-11-11 16:47     ` Elijah Newren
2020-11-11 16:51       ` Derrick Stolee
2020-11-11 17:03         ` Elijah Newren
2020-11-02 20:43 ` [PATCH v2 05/20] merge-ort: add an err() function similar to one from merge-recursive Elijah Newren
2020-11-11 13:58   ` Derrick Stolee
2020-11-11 17:07     ` Elijah Newren
2020-11-11 17:10       ` Derrick Stolee
2020-11-02 20:43 ` [PATCH v2 06/20] merge-ort: implement a very basic collect_merge_info() Elijah Newren
2020-11-06 22:19   ` Jonathan Tan
2020-11-06 23:10     ` Elijah Newren
2020-11-09 20:59       ` Jonathan Tan
2020-11-11 14:38   ` Derrick Stolee
2020-11-11 17:02     ` Elijah Newren
2020-11-02 20:43 ` [PATCH v2 07/20] merge-ort: avoid repeating fill_tree_descriptor() on the same tree Elijah Newren
2020-11-11 14:51   ` Derrick Stolee
2020-11-11 17:13     ` Elijah Newren
2020-11-11 17:21       ` Eric Sunshine
2020-11-02 20:43 ` [PATCH v2 08/20] merge-ort: compute a few more useful fields for collect_merge_info Elijah Newren
2020-11-06 22:52   ` Jonathan Tan
2020-11-06 23:41     ` Elijah Newren
2020-11-09 22:04       ` Jonathan Tan
2020-11-09 23:05         ` Elijah Newren
2020-11-02 20:43 ` [PATCH v2 09/20] merge-ort: record stage and auxiliary info for every path Elijah Newren
2020-11-06 22:58   ` Jonathan Tan
2020-11-07  0:26     ` Elijah Newren
2020-11-09 22:09       ` Jonathan Tan
2020-11-09 23:08         ` Elijah Newren
2020-11-11 15:26   ` Derrick Stolee
2020-11-11 18:16     ` Elijah Newren
2020-11-11 22:06       ` Elijah Newren
2020-11-12 18:23         ` Derrick Stolee
2020-11-12 18:39       ` Derrick Stolee
2020-11-02 20:43 ` [PATCH v2 10/20] merge-ort: avoid recursing into identical trees Elijah Newren
2020-11-11 15:31   ` Derrick Stolee
2020-11-02 20:43 ` [PATCH v2 11/20] merge-ort: add a preliminary simple process_entries() implementation Elijah Newren
2020-11-11 19:51   ` Jonathan Tan
2020-11-12  1:48     ` Elijah Newren
2020-11-02 20:43 ` [PATCH v2 12/20] merge-ort: have process_entries operate in a defined order Elijah Newren
2020-11-11 16:09   ` Derrick Stolee
2020-11-11 18:58     ` Elijah Newren
2020-11-02 20:43 ` [PATCH v2 13/20] merge-ort: step 1 of tree writing -- record basenames, modes, and oids Elijah Newren
2020-11-11 20:01   ` Jonathan Tan
2020-11-11 20:24     ` Elijah Newren
2020-11-12 20:39       ` Jonathan Tan
2020-11-02 20:43 ` [PATCH v2 14/20] merge-ort: step 2 of tree writing -- function to create tree object Elijah Newren
2020-11-11 20:47   ` Jonathan Tan
2020-11-11 21:21     ` Elijah Newren
2020-11-02 20:43 ` [PATCH v2 15/20] merge-ort: step 3 of tree writing -- handling subdirectories as we go Elijah Newren
2020-11-12 20:15   ` Jonathan Tan
2020-11-12 22:30     ` Elijah Newren
2020-11-24 20:19       ` Elijah Newren
2020-11-25  2:07         ` Jonathan Tan
2020-11-26 18:13           ` Elijah Newren
2020-11-30 18:41             ` Jonathan Tan
2020-11-02 20:43 ` [PATCH v2 16/20] merge-ort: basic outline for merge_switch_to_result() Elijah Newren
2020-11-02 20:43 ` [PATCH v2 17/20] merge-ort: add implementation of checkout() Elijah Newren
2020-11-02 20:43 ` [PATCH v2 18/20] tree: enable cmp_cache_name_compare() to be used elsewhere Elijah Newren
2020-11-02 20:43 ` [PATCH v2 19/20] merge-ort: add implementation of record_unmerged_index_entries() Elijah Newren
2020-11-02 20:43 ` [PATCH v2 20/20] merge-ort: free data structures in merge_finalize() Elijah Newren
2020-11-03 14:49 ` [PATCH v2 00/20] fundamentals of merge-ort implementation Derrick Stolee
2020-11-03 16:36   ` Elijah Newren
2020-11-07  6:06     ` Elijah Newren
2020-11-07 15:02       ` Derrick Stolee
2020-11-07 19:39         ` Elijah Newren
2020-11-09 12:30           ` Derrick Stolee
2020-11-09 17:13             ` Elijah Newren
2020-11-09 19:51               ` Derrick Stolee [this message]
2020-11-09 22:44                 ` Elijah Newren
2020-11-11 17:08 ` Derrick Stolee
2020-11-11 18:35   ` Elijah Newren
2020-11-11 20:48     ` Derrick Stolee
2020-11-11 21:18       ` Elijah Newren
2020-11-29  7:43 [PATCH " Elijah Newren via GitGitGadget
2020-12-04 20:47 ` [PATCH v2 " Elijah Newren via GitGitGadget

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

  List information:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \ \ \ \ \ \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).