git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Derrick Stolee <stolee@gmail.com>
To: Elijah Newren via GitGitGadget <gitgitgadget@gmail.com>,
	git@vger.kernel.org
Cc: Jonathan Tan <jonathantanmy@google.com>,
	Derrick Stolee <dstolee@gmail.com>, Taylor Blau <me@ttaylorr.com>,
	Elijah Newren <newren@gmail.com>
Subject: Re: [PATCH v3 0/5] Optimization batch 13: partial clone optimizations for merge-ort
Date: Tue, 22 Jun 2021 12:10:34 -0400	[thread overview]
Message-ID: <3a397e04-88a1-1205-a465-75dc2fd7e93d@gmail.com> (raw)
In-Reply-To: <pull.969.v3.git.1624349082.gitgitgadget@gmail.com>

On 6/22/2021 4:04 AM, Elijah Newren via GitGitGadget wrote:
> This series optimizes blob downloading in merges for partial clones. It can
> apply on master. It's independent of ort-perf-batch-12.

As promised, I completed a performance evaluation of this series as well
as ort-perf-batch-12 (and all earlier batches) against our microsoft/git
fork and running in one of our large monorepos that has over 2 million
files at HEAD. Here are my findings.

In my comparisons, I compare the recursive merge strategy with renames
disabled against the ORT strategy (which always has renames enabled). When
I enabled renames for the recursive merge I saw the partial clone logic
kick in and start downloading many files in every case, so I dropped that
from consideration.


Experiment #1: Large RI/FI merges
---------------------------------

Most of the merge commits in this repo's history are merges of several
long-lived branches as code is merged across organizational boundaries. I
focused on the merge commits in the first-parent history, which mean these
are the merges that brought the latest changes from several areas into the
canonical version of the software.

These merges are all automated merges created by libgit2's implementation
of the recursive merge strategy. Since they are created on the server,
these will not have any merge conflicts.

They are still interesting because the sheer number of files that change
can be large. This is a pain point for the recursive merge because many
index entries need to update with the merge. For ORT, some of the updates
are simple because only one side changed a certain subtree (the
organizational boundaries also correspond to the directory structure in
many cases).

Across these merges I tested, ORT was _always_ faster and was consistent
with the recursive strategy. Even more interesting was the fact that the
recursive strategy had very slow outliers while the ORT strategy was much
more consistent:

     Recursive     ORT
-----------------------
MAX     34.97s    4.74s
P90     30.04s    4.50s
P75     15.35s    3.74s
P50      7.22s    3.39s
P10      3.61s    3.08s

(I'm not testing ORT with the sparse-index yet. A significant portion of
this 3 second lower bound is due to reading and writing the index file
with 2 million entries. I _am_ using sparse-checkout with only the files
at root, which minimizes the time required to update the working directory
with any changed files.)

For these merges, ORT is a clear win.


Experiment #2: User-created merges
----------------------------------

To find merges that might be created by actual user cases, I ran
'git rev-list --grep="^Merge branch"' to get merges that had default
messages from 'git merge' runs. (The merges from Experiment #1 had other
automated names that did not appear in this search.)

Here, the differences are less striking, but still valuable:

     Recursive     ORT
-----------------------
MAX     10.61s   6.27s
P75      8.81s   3.92s
P50      4.32s   3.21s
P10      3.53s   2.95s

The ORT strategy had more variance in these examples, though still not as
much as the recursive strategy. Here the variance is due to conflicting
files needing content merges, which usually were automatically resolved.

This version of the experiment provided interesting observations in a few
cases:

1. One case had the recursive merge strategy result in a root tree that
   disagreed with what the user committed, but the ORT strategy _did_ the
   correct resolution. Likely, this is due to the rename detection and
   resolution. The user probably had to manually resolve the merge to
   match their expected renames since we turn off merge.renames in their
   config.

2. I watched for the partial clone logic to kick in and download blobs.
   Some of these were inevitable: we need the blobs to resolve edit/edit
   conflicts. Most cases none were downloaded at all, so this series is
   working as advertised. There _was_ a case where the inexact rename
   detection requested a large list of files (~2900 in three batches) but
   _then_ said "inexact rename detection was skipped due to too many
   files". This is a case that would be nice to resolve in this series. I
   will try to find exactly where in the code this is being triggered and
   report back.

3. As I mentioned, I was using sparse-checkout to limit the size of the
   working directory. In one case of a conflict that could not be
   automatically resolved, the ORT strategy output this error:

   error: could not open '<X>': No such file or directory

   It seems we are looking for a file on disk without considering if it
   might have the SKIP_WORKTREE bit on in the index. I don't think this is
   an issue for this series, but might require a follow-up on top of the
   other ORT work.


Conclusions
-----------

I continue to be excited about the ORT strategy and will likely be
focusing on it in a month or so to integrate it with the sparse-index. I
think we would be interested in making the ORT strategy a new default for
Scalar, but we might really want it to respect merge.renames=false if only
so we can deploy the settings in stages (first, change the strategy, then
enable renames as an independent step) so we can isolate concerns.


Thanks!
-Stolee

  parent reply	other threads:[~2021-06-22 16:10 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-06-05  1:27 [PATCH 0/5] Optimization batch 13: partial clone optimizations for merge-ort Elijah Newren via GitGitGadget
2021-06-05  1:28 ` [PATCH 1/5] promisor-remote: output trace2 statistics for number of objects fetched Elijah Newren via GitGitGadget
2021-06-09 21:12   ` Derrick Stolee
2021-06-05  1:28 ` [PATCH 2/5] t6421: add tests checking for excessive object downloads during merge Elijah Newren via GitGitGadget
2021-06-09 21:16   ` Derrick Stolee
2021-06-05  1:28 ` [PATCH 3/5] diffcore-rename: allow different missing_object_cb functions Elijah Newren via GitGitGadget
2021-06-05  1:28 ` [PATCH 4/5] diffcore-rename: use a different prefetch for basename comparisons Elijah Newren via GitGitGadget
2021-06-05  1:28 ` [PATCH 5/5] merge-ort: add prefetching for content merges Elijah Newren via GitGitGadget
2021-06-15 22:41 ` [PATCH v2 0/5] Optimization batch 13: partial clone optimizations for merge-ort Elijah Newren via GitGitGadget
2021-06-15 22:41   ` [PATCH v2 1/5] promisor-remote: output trace2 statistics for number of objects fetched Elijah Newren via GitGitGadget
2021-06-15 22:41   ` [PATCH v2 2/5] t6421: add tests checking for excessive object downloads during merge Elijah Newren via GitGitGadget
2021-06-17  4:49     ` Junio C Hamano
2021-06-15 22:41   ` [PATCH v2 3/5] diffcore-rename: allow different missing_object_cb functions Elijah Newren via GitGitGadget
2021-06-15 22:41   ` [PATCH v2 4/5] diffcore-rename: use a different prefetch for basename comparisons Elijah Newren via GitGitGadget
2021-06-15 22:41   ` [PATCH v2 5/5] merge-ort: add prefetching for content merges Elijah Newren via GitGitGadget
2021-06-17  5:04     ` Junio C Hamano
2021-06-22  8:02       ` Elijah Newren
2021-06-16 17:54   ` [PATCH v2 0/5] Optimization batch 13: partial clone optimizations for merge-ort Derrick Stolee
2021-06-17  5:05   ` Junio C Hamano
2021-06-22  8:04   ` [PATCH v3 " Elijah Newren via GitGitGadget
2021-06-22  8:04     ` [PATCH v3 1/5] promisor-remote: output trace2 statistics for number of objects fetched Elijah Newren via GitGitGadget
2021-06-22  8:04     ` [PATCH v3 2/5] t6421: add tests checking for excessive object downloads during merge Elijah Newren via GitGitGadget
2021-06-22  8:04     ` [PATCH v3 3/5] diffcore-rename: allow different missing_object_cb functions Elijah Newren via GitGitGadget
2021-06-22  8:04     ` [PATCH v3 4/5] diffcore-rename: use a different prefetch for basename comparisons Elijah Newren via GitGitGadget
2021-06-22  8:04     ` [PATCH v3 5/5] merge-ort: add prefetching for content merges Elijah Newren via GitGitGadget
2021-06-22 16:10     ` Derrick Stolee [this message]
2021-06-22 18:45       ` [PATCH v3 0/5] Optimization batch 13: partial clone optimizations for merge-ort Elijah Newren
2021-06-23  2:14         ` Derrick Stolee
2021-06-23  8:11           ` Elijah Newren
2021-06-23 17:31             ` Elijah Newren

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3a397e04-88a1-1205-a465-75dc2fd7e93d@gmail.com \
    --to=stolee@gmail.com \
    --cc=dstolee@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=jonathantanmy@google.com \
    --cc=me@ttaylorr.com \
    --cc=newren@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).