git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Johannes Schindelin <Johannes.Schindelin@gmx.de>
To: Alban Gruin <alban.gruin@gmail.com>
Cc: Elijah Newren <newren@gmail.com>, git@vger.kernel.org
Subject: Re: Comparing rebase --am with --interactive via p3400
Date: Fri, 31 Jan 2020 22:23:33 +0100 (CET)	[thread overview]
Message-ID: <nycvar.QRO.7.76.6.2001312216410.46@tvgsbejvaqbjf.bet> (raw)
In-Reply-To: <a00e1689-ec7c-4039-a2e9-f72d452ae4ff@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 8064 bytes --]

Hi Alban,

On Fri, 27 Dec 2019, Alban Gruin wrote:

> Le 01/02/2019 à 07:04, Johannes Schindelin a écrit :
>
> > as discussed at the Contributors' Summit, I ran p3400 as-is (i.e. with the
> > --am backend) and then with --keep-empty to force the interactive backend
> > to be used. Here are the best of 10, on my relatively powerful Windows 10
> > laptop, with current `master`.
> >
> > With regular rebase --am:
> >
> > 3400.2: rebase on top of a lot of unrelated changes             5.32(0.06+0.15)
> > 3400.4: rebase a lot of unrelated changes without split-index   33.08(0.04+0.18)
> > 3400.6: rebase a lot of unrelated changes with split-index      30.29(0.03+0.18)
> >
> > with --keep-empty to force the interactive backend:
> >
> > 3400.2: rebase on top of a lot of unrelated changes             3.92(0.03+0.18)
> > 3400.4: rebase a lot of unrelated changes without split-index   33.92(0.03+0.22)
> > 3400.6: rebase a lot of unrelated changes with split-index      38.82(0.03+0.16)
> >
> > I then changed it to -m to test the current scripted version, trying to
> > let it run overnight, but my laptop eventually went to sleep and the tests
> > were not even done. I'll let them continue and report back.
> >
> > My conclusion after seeing these numbers is: the interactive rebase is
> > really close to the performance of the --am backend. So to me, it makes a
> > total lot of sense to switch --merge over to it, and to make --merge the
> > default. We still should investigate why the split-index performance is so
> > significantly worse, though.
> >
> > Ciao,
> > Dscho
> >
>
> I investigated a bit on this.  From a quick glance at a callgrind trace,
> I can see that ce_write_entry() is called 20 601[1] times with `git am',
> but 739 802 times with the sequencer when the split-index is enabled.
>
> For reference, here are the timings, measured on my Linux machine, on a
> tmpfs, with git.git as the repo:
>
> `rebase --am':
> > 3400.2: rebase on top of a lot of unrelated changes             0.29(0.24+0.03)
> > 3400.4: rebase a lot of unrelated changes without split-index   6.77(6.51+0.22)
> > 3400.6: rebase a lot of unrelated changes with split-index      4.43(4.29+0.13)
> `rebase --quiet':
> > 3400.2: rebase on top of a lot of unrelated changes             0.24(0.21+0.02)
> > 3400.4: rebase a lot of unrelated changes without split-index   5.60(5.32+0.27)
> > 3400.6: rebase a lot of unrelated changes with split-index      5.67(5.40+0.26)
>
> This comes from two things:
>
> 1. There is not enough shared entries in the index with the sequencer.
>
> do_write_index() is called only by do_write_locked_index() with `--am',
> but is also called by write_shared_index() with the sequencer once for
> every other commit.  As the latter is only called by
> write_locked_index(), which means that too_many_not_shared_entries()
> returns true for the sequencer, but never for `--am'.
>
> Removing the call to discard_index() in do_pick_commit() (as in the
> first attached patch) solve this particular issue, but this would
> require a more thorough analysis to see if it is actually safe to do.

Indeed. I offered these insights in #git-devel (slightly edited):

This `discard_index()` is in an awfully central location. I am rather
certain that it would cause problems to just remove it.

Looking at `do_merge()`: it explicitly discards and re-reads the index if
we had to spawn a `git merge` process (which we do if a strategy other
than `recursive` was specified, or if it is an octopus merge). But I am
wary of other code paths that might not be as careful.

I see that `do_exec()` is similarly careful.

One thing I cannot fail to notice: we do not re-read a changed index
after running the `prepare-commit-msg` hook, or for that matter, any other
hook. That could even be an old regression from the conversion of the
interactive rebase to using the sequencer rather than a shell script.

Further, `reset_merge()` seems to spawn `git reset --merge` without
bothering to re-read the possibly modified index. Its callees are
`rollback_single_pick()`, `skip_single_pick()` and `sequencer_rollback()`,
none of which seem to be careful, either, about checking whether the index
was modified in the meantime.

Technically, the in-memory index should also be discarded
in `apply_autostash()`, but so far, we do not have any callers of that
function, I don't think, that wants to do anything but release resources
and exit.

The `run_git_checkout()` function discards, as intended. I
am not quite sure whether it needs to, though, unless the `.git/index`
file _was_ modified (it _is_ possible, after all, to run `git rebase -i
HEAD`, and I do have a use case for that where one of my scripts generates
a todo script, sort of a `git cherry-pick --rebase-merges`, because
`cherry-pick` does not support that mode).

The `continue_single_pick()` function spawns a `git
commit` which could potentially modify the index through a hook, but the
first call site does not care and the second one guards against that
(erroring out...).

My biggest concern is with the `run_git_commit()` function: it does not
re-read a potentially-modified index (think of hooks).

We will need to be very careful with this `discard_index()`, I think, and
in my opinion there is a great opportunity here for cleaning up a little:
rather than discarding and re-reading the in-memory index without seeing
whether the on-disk index has changed at all appears a bit wasteful to me.

This could be refactored into a function that only discards and re-reads
the index if the mtime of `.git/index` changed. That function could then
also be taught to detect when the in-memory index has unwritten changes:
that would constitute a bug.

Ciao,
Dscho

>
> After this, ce_write() is still called much more by the sequencer.
>
> Here are the results of `rebase --quiet' without discarding the index:
>
> > 3400.2: rebase on top of a lot of unrelated changes             0.23(0.19+0.04)
> > 3400.4: rebase a lot of unrelated changes without split-index   5.14(4.95+0.18)
> > 3400.6: rebase a lot of unrelated changes with split-index      5.02(4.87+0.15)
> The performance of the rebase is better in the two cases.
>
>
> 2. The base index is dropped by unpack_trees_start() and unpack_trees().
>
> Now, write_shared_index() is no longer called and write_locked_index()
> is less expensive than before according to callgrind.  But
> ce_write_entry() is still called 749 302 times (which is even more than
> before.)
>
> The only place where ce_write_entry() is called is in a loop in
> do_write_index().  The number of iterations is dictated by the size of
> the cache, and there is a trace2 probe dumping this value.
>
> For `--am', the value goes like this: 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4,
> 4, 4, 5, 5, 5, 5, … up until 101.
>
> For the sequencer, it goes like this: 1, 1, 3697, 3697, 3698, 3698,
> 3699, 3699, … up until 3796.
>
> The size of the cache is set in prepare_to_write_split_index().  It
> grows if a cache entry has no index (most of them should have one by
> now), or if the split index has no base index (with `--am', the split
> index always has a base.)  This comes from unpack_trees_start() -- it
> creates a new index, and unpack_trees() does not carry the base index,
> hence the size of the cache.
>
> The second attached patch (which is broken for the non-interactive
> rebase case) demonstrates what we could expect for the split-index case
> if we fix this:
>
> > 3400.2: rebase on top of a lot of unrelated changes             0.24(0.21+0.03)
> > 3400.4: rebase a lot of unrelated changes without split-index   5.81(5.62+0.17)
> > 3400.6: rebase a lot of unrelated changes with split-index      4.76(4.54+0.20)
> So, for everything related to the index, I think that’s it.
>
> [1] Numbers may vary, but they should remain in the same order of magnitude.
>
> Cheers,
> Alban
>
>

  parent reply	other threads:[~2020-01-31 21:23 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-02-01  6:04 Comparing rebase --am with --interactive via p3400 Johannes Schindelin
2019-02-01  7:22 ` Johannes Schindelin
2019-02-01  9:26 ` Elijah Newren
2019-12-27 21:11 ` Alban Gruin
2019-12-27 22:45   ` Elijah Newren
2019-12-29 17:25     ` Alban Gruin
2020-01-02 20:17       ` Johannes Schindelin
2020-01-31 21:23   ` Johannes Schindelin [this message]
2020-04-01 11:33     ` Alban Gruin
2020-04-01 14:00       ` Phillip Wood
2020-04-04 20:33         ` Johannes Schindelin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=nycvar.QRO.7.76.6.2001312216410.46@tvgsbejvaqbjf.bet \
    --to=johannes.schindelin@gmx.de \
    --cc=alban.gruin@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=newren@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).