git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / Atom feed
From: Johannes Schindelin <Johannes.Schindelin@gmx.de>
To: Elijah Newren <newren@gmail.com>
Cc: Git Mailing List <git@vger.kernel.org>,
	Junio C Hamano <gitster@pobox.com>,
	Alban Gruin <alban.gruin@gmail.com>
Subject: Re: [RFC PATCH 3/3] git-rebase.sh: make git-rebase--interactive the default
Date: Mon, 21 Jan 2019 16:55:44 +0100 (STD)
Message-ID: <nycvar.QRO.7.76.6.1901211635080.41@tvgsbejvaqbjf.bet> (raw)
In-Reply-To: <CABPp-BGh7yW69QwxQb13K0HM38NKmQif3A6C6UULEKYnkEJ5vA@mail.gmail.com>

Hi Elijah,

while looking for the patches I promised to review much earlier, I
realized that I never answered to this, even older mail. Sorry about that.

[On a positive note: it only has been, what, exactly 7 months since you
sent it. My record is, I believe, somewhat around 5 years before I
responded to a mail.]

On Thu, 21 Jun 2018, Elijah Newren wrote:

> Thanks for all the food for thought.  This is awesome.

:-)

> On Thu, Jun 21, 2018 at 3:57 AM, Johannes Schindelin
> <Johannes.Schindelin@gmx.de> wrote:
> > On Wed, 20 Jun 2018, Elijah Newren wrote:
> >> On Sun, Jun 17, 2018 at 2:44 PM, Johannes Schindelin
> >> <Johannes.Schindelin@gmx.de> wrote:
> >>
> >> > I was really referring to speed. But I have to admit that I do not have
> >> > any current numbers.
> >> >
> >> > Another issue just hit me, though: rebase --am does not need to look at as
> >> > many Git objects as rebase --merge or rebase -i. Therefore, GVFS users
> >> > will still want to use --am wherever possible, to avoid "hydrating"
> >> > many objects during their rebase.
> >>
> >> What is it that makes rebase --am need fewer Git objects than rebase
> >> --merge or rebase -i?
> >
> > Multiple things. The most obvious thing: --cherry-pick. When you call `git
> > rebase --am`, it does not try to exclude the patches by looking at
> > `rev-list --cherry-pick --right-only <upstream>..<head>`
> >
> > This really bit us in GVFS Git because we have several thousand developers
> > working on the same code base, and you probably read the numbers elsewhere
> > how many commits are introduced while a developer goes on so much as a
> > week-long vacation.
> 
> Ooh, that's interesting.  Here's a crazy idea: I'm curious if this is
> essentially a workaround that we could expunge.

That is indeed a very, very compelling idea.

> In particular, am-based rebases will drop empty commits (that is, either
> commits that are empty to start, or that after being applied are empty).
> It never grew a --no-allow-empty option either.  If rebase -i/-m were to
> behave the same, for consistency sake, we could drop the `rev-list
> --cherry-pick ...` call.  Granted, that would result in a slight
> difference in behavior (it could result in conflicts in some cases
> instead of an automatically empty commit), but may well be worth it for
> consistencies' sake between rebase backends.  As a side effect of making
> the backends consistent, it may also provide a nice performance boost
> for the GVFS case by removing the need to do the --cherry-pick thing.

I very much like the consistency thing. I am slightly worried about the
ramifications, though.

For example, when I did not use GitGitGadget yet to submit patches, I used
one of my machines to keep a record of still-in-flight branches, and to
verify whether a branch had been merged, I simply used `git rebase -i
upstream/master`. If that showed a `noop`, I was sure that it was merged.

While I no longer have *this* particular use case, I take it as an
indication that Git users out there are probably using the interactive
rebase in similar manners.

So we would *in the least* have to keep this as an option, and we would
have to work with a deprecation notice for a couple of major versions
before we flip the default.

Having said that, I really like the performance implication of switching
the default to skip the --cherry-pick, *in particular* with an eye toward
VFS for Git (TAFKA GVFS).

> > Next, rename detection.
> >
> > I think that bit us even harder because it tries to look at all the files
> > that have been changed in upstream in the meantime, which can be *a lot*.
> > And if you know that moving files outside of your cozy little sparse
> > checkout is unlikely (or moving from the outside into your checkout), then
> > all this hydration is pretty wasteful. That's why we had to switch off
> > rename detection in GVFS Git.
> 
> Actually to be clear, for each side of the merge, rename detection
> only needs to look at files that were not present in both the
> merge-base and the branch.  Files which were modified but have the
> same name are not involved (break detection is not active.).  But yes,
> I understand how you can still get an awful lot of files, which makes
> it worse when you then go and compare all of them in an O(N*M)
> fashion.

Yes.

It is made worse by the fact that Git never sprouted an option to tell it
about *known* renames, even if that would make tons of sense. I mean, it
is nice that Git tries to detect renames. Really nice. In some cases it
fails, though, and in other cases it just takes a lot of time, so it would
be a real boon if there was a way to skip Git's rename detection. It'd be
similar in spirit to way you can side-step Git's binary file detection, by
using .gitattributes. (Obviously, the rename thing would need to use a
different facility than .gitattributes.)

> >> My guess at what objects are needed by each type:
> >>
> >> At a high level, rebase --am for each commit will need to compare the
> >> commit to its parent to generate a diff (which thus involves walking
> >> over the objects in both the commit and its parent, though it should
> >> be able to skip over subtrees that are equal), and then will need to
> >> look at all the objects in the target commit on which it needs to
> >> apply the patch (in order to properly fill the index for a starting
> >> point, and used later when creating a new commit).
> >
> > No, in --am mode, it does not even need to look at all the objects in the
> > target commits. Only those objects that correspond to the files that are
> > touched by the diff.
> 
> Sorry, I mis-spoke.  Unless you've done something special with GVFS, I
> believe the --am mode would indeed need to read all the _tree_ objects
> in the target commits, but any _file_ object corresponding to a path
> not modified by the other side need not be loaded since we can proceed
> simply knowing its sha1sum.

That *might* be good enough with the "largest repository on the planet".
Emprically, we determined pretty early on that it is not the commits and
trees that make the initial clone unbearably large. It's the blobs.

> > In a massive code base, such as Windows', this makes a huge difference.
> > Even comparing the number of files touched by the patches that are to be
> > rebased to the number of files that were touched in upstream since you
> > rebased last is ridiculous. And a three-way merge needs to consider that
> > latter set of files.
> 
> Actually, the three-way merge isn't all that different.  For a path
> that exists in both branches and the merge base, it first compares
> sha1sums (thus it does need to read all the tree objects), and then if
> one side has not modified a certain path, it knows the merge result is
> the sha1sum from the other side.  Thus, it won't need to read files
> that were changed upstream so long as that path wasn't changed on your
> side.

True.

> In fact, --merge and --interactive have a slight performance advantage
> here: if upstream didn't modify the path, then it doesn't need to read
> the file from your side to do any diff or comparison.  It can just use
> whatever sha1sum you have.  In contrast --am mode will need to read
> the relevant file more than once.  Since it first creates a series of
> patches, it will have to read the file from your commit and its
> parent, create a diff, and then later apply that diff to the target
> version, and since the target version matches the merge base it will
> end up just recovering the version of the file on your side that you
> started with anyway.  (Since that file is already local, though, this
> small performance advantage would currently be lost in the noise of
> the other problems.)

I had not thought of that. You're right, with largish patch series,
interactive rebase might actually be faster than --am based rebase. I have
not even tested this, I guess I really should first look at numbers before
I speak further :0)

> >> If the application of the diff fails, it falls back to a three-way
> >> merge, though the three-way merge shouldn't need any additional objects.
> >
> > The three-way merge needs to reconcile the diff between branch point and
> > your changes to the diff between branch point and upstream's changes. The
> > latter can be a lot bigger than the former, in which case --am's
> > complexity (O(former)) is much nicer than --merge's (O(former u latter)).
> 
> Yeah, renames mess up that whole "a path that exists in both branches
> and the merge base" requirement that I stipulated above, and widens
> the scope considerably and affects performance rather markedly.  On
> the positive side though, the more I read your description, the more I
> think my rename performance work may give a speedup for the GVFS
> usecase far in excess of the factor of 30 I saw for my usecase.  It
> may still not quite match --am mode's outright avoidance of rename
> detection, but I still think it should be pretty impressive.  Time
> will tell...

I am very hopeful.

And I also think that one fallout of your work on that front could be to
teach Git a way to skip/overrule rename detection. After all, it is not
*all* that common that files are renamed. Take Git's commit history, for
example. Apart from the huge builtin/ move (for which rename detection
will fail after many iterations after that move), there were only sporadic
renames (such as sha1_file.c -> sha1-file.c). In "real" repositories,
refactorings are also rather rare, for the same reason: they simply cause
a lot of trouble with a lot of concurrent development going on. If we had
such a facility (say, via notes) to tell Git: look, don't even bother to
try to detect renames, at least in *commit range*, just use this here data
base that I prepared for you, we would not only be able to speed up
operations such as rebase, we also would have an easy way to solve the
problem where the rename detection simply fails after sustained
development because both sides diverged *so much* that Git cannot detect
any similarity.

Another thought that occurred to me: in the context of GitGitGadget, where
I cannot rely on the `amlog` notes (because there are too many
inaccuracies in that information), I was looking for another way to
identify commits in `pu` that corresponds to the original commits (i.e.
the commits by the authors themselves, in GitGitGadget's PRs). Sadly, I
lack the time to work on this with any seriousness, I can only spend a
couple hours here and there on it, but I *think* that there might be a
chance to use something like MinHash and/or Locally Sensitive Hashes to
not only determine near-identical commits, but also correspondences
between commits and mails on the Git mailing list containing patches. And
the very same technology should also be able to speed up rename detection,
much more robustly (and with a much better time complexity) than what we
have now. Did you look in that direction already?

Ciao,
Dscho

  reply index

Thread overview: 130+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-06-07  4:58 RFC: rebase inconsistency in 2.18 -- course of action? Elijah Newren
2018-06-07  5:04 ` [PATCH] t3401: add directory rename testcases for rebase and am Elijah Newren
2018-06-25 16:17   ` Elijah Newren
2018-06-07  5:05 ` [PATCH] apply: fix grammar error in comment Elijah Newren
2018-06-07  5:05 ` [PATCH] t5407: fix test to cover intended arguments Elijah Newren
2018-06-07  5:06 ` [PATCH] git-rebase--merge: modernize "git-$cmd" to "git $cmd" Elijah Newren
2018-06-07  5:24   ` Elijah Newren
2018-06-27  7:46   ` [PATCH v2] " Elijah Newren
2018-07-12 15:49     ` Johannes Schindelin
2018-07-13 16:13       ` Elijah Newren
2018-07-14 22:20         ` Johannes Schindelin
2018-06-07  5:06 ` [PATCH 1/2] t3422: new testcases for checking when incompatible options passed Elijah Newren
2018-06-07  5:06   ` [PATCH 2/2] git-rebase: error out " Elijah Newren
2018-06-10 19:40     ` Phillip Wood
2018-06-11 15:19       ` Elijah Newren
2018-06-11 15:59         ` Phillip Wood
2018-06-11 15:49       ` Elijah Newren
2018-06-12  9:45         ` Phillip Wood
2018-06-17  5:58   ` [RFC PATCH v2 0/7] Document/fix/warn about rebase incompatibilities and inconsistences Elijah Newren
2018-06-17  5:58     ` [RFC PATCH v2 1/7] git-rebase.txt: document incompatible options Elijah Newren
2018-06-17 15:38       ` Phillip Wood
2018-06-20 17:09         ` Elijah Newren
2018-06-17 17:15       ` Eric Sunshine
2018-06-20 17:10         ` Elijah Newren
2018-06-17  5:58     ` [RFC PATCH v2 2/7] git-rebase.sh: update help messages a bit Elijah Newren
2018-06-17  5:58     ` [RFC PATCH v2 3/7] t3422: new testcases for checking when incompatible options passed Elijah Newren
2018-06-17  5:58     ` [RFC PATCH v2 4/7] git-rebase: error out " Elijah Newren
2018-06-17  5:58     ` [RFC PATCH v2 5/7] git-rebase.txt: document behavioral inconsistencies between modes Elijah Newren
2018-06-17  5:58     ` [RFC PATCH v2 6/7] git-rebase.txt: address confusion between --no-ff vs --force-rebase Elijah Newren
2018-06-17  5:58     ` [RFC PATCH v2 7/7] git-rebase: make --allow-empty-message the default Elijah Newren
2018-06-17 15:30       ` Phillip Wood
2018-06-20 16:47         ` Elijah Newren
2018-06-17 15:41     ` [RFC PATCH v2 0/7] Document/fix/warn about rebase incompatibilities and inconsistences Phillip Wood
2018-06-21 15:00     ` [PATCH v3 0/7] Document/fix/warn about rebase incompatibilities and inconsistencies Elijah Newren
2018-06-21 15:00       ` [PATCH v3 1/7] git-rebase.txt: document incompatible options Elijah Newren
2018-06-21 19:47         ` Junio C Hamano
2018-06-21 20:33           ` Elijah Newren
2018-06-22  8:32         ` SZEDER Gábor
2018-06-21 15:00       ` [PATCH v3 2/7] git-rebase.sh: update help messages a bit Elijah Newren
2018-06-21 19:58         ` Junio C Hamano
2018-06-21 20:34           ` Elijah Newren
2018-06-21 15:00       ` [PATCH v3 3/7] t3422: new testcases for checking when incompatible options passed Elijah Newren
2018-06-21 20:04         ` Junio C Hamano
2018-06-21 20:37           ` Elijah Newren
2018-06-21 21:18           ` Eric Sunshine
2018-06-21 15:00       ` [PATCH v3 4/7] git-rebase: error out " Elijah Newren
2018-06-21 20:29         ` Junio C Hamano
2018-06-21 20:41           ` Elijah Newren
2018-06-21 15:00       ` [PATCH v3 5/7] git-rebase.txt: document behavioral inconsistencies between modes Elijah Newren
2018-06-21 20:44         ` Junio C Hamano
2018-06-21 21:25           ` Elijah Newren
2018-06-21 15:00       ` [PATCH v3 6/7] git-rebase.txt: address confusion between --no-ff vs --force-rebase Elijah Newren
2018-06-21 20:46         ` Junio C Hamano
2018-06-21 21:22           ` Elijah Newren
2018-06-21 15:00       ` [RFC PATCH v3 7/7] git-rebase: make --allow-empty-message the default Elijah Newren
2018-06-25 16:12       ` [PATCH v4 0/9] Document/fix/warn about rebase incompatibilities and inconsistencies Elijah Newren
2018-06-25 16:12         ` [PATCH v4 1/9] git-rebase.txt: document incompatible options Elijah Newren
2018-06-25 16:12         ` [PATCH v4 2/9] git-rebase.sh: update help messages a bit Elijah Newren
2018-06-25 16:12         ` [PATCH v4 3/9] t3422: new testcases for checking when incompatible options passed Elijah Newren
2018-06-26 18:17           ` Junio C Hamano
2018-06-27  6:50             ` Elijah Newren
2018-06-25 16:12         ` [PATCH v4 4/9] git-rebase: error out " Elijah Newren
2018-06-25 16:12         ` [PATCH v4 5/9] git-rebase.txt: address confusion between --no-ff vs --force-rebase Elijah Newren
2018-06-25 16:12         ` [PATCH v4 6/9] directory-rename-detection.txt: technical docs on abilities and limitations Elijah Newren
2018-06-25 16:12         ` [PATCH v4 7/9] git-rebase.txt: document behavioral differences between modes Elijah Newren
2018-06-25 16:12         ` [PATCH v4 8/9] t3401: add directory rename testcases for rebase and am Elijah Newren
2018-06-25 16:13         ` [RFC PATCH v4 9/9] git-rebase: make --allow-empty-message the default Elijah Newren
2018-06-27  7:23         ` [PATCH v5 0/9] Document/fix/warn about rebase incompatibilities and inconsistencies Elijah Newren
2018-06-27  7:23           ` [PATCH v5 1/9] git-rebase.txt: document incompatible options Elijah Newren
2018-06-27  7:23           ` [PATCH v5 2/9] git-rebase.sh: update help messages a bit Elijah Newren
2018-06-27  7:23           ` [PATCH v5 3/9] t3422: new testcases for checking when incompatible options passed Elijah Newren
2018-06-27  7:23           ` [PATCH v5 4/9] git-rebase: error out " Elijah Newren
2018-06-27  7:23           ` [PATCH v5 5/9] git-rebase.txt: address confusion between --no-ff vs --force-rebase Elijah Newren
2018-06-27  7:23           ` [PATCH v5 6/9] directory-rename-detection.txt: technical docs on abilities and limitations Elijah Newren
2018-06-27  7:23           ` [PATCH v5 7/9] git-rebase.txt: document behavioral differences between modes Elijah Newren
2018-06-27  7:23           ` [PATCH v5 8/9] t3401: add directory rename testcases for rebase and am Elijah Newren
2018-06-27  7:23           ` [RFC PATCH v5 9/9] git-rebase: make --allow-empty-message the default Elijah Newren
2018-09-12  2:42             ` 2.19.0 regression: leaving editor with empty commit message doesn't stop rebase [was: Re: [RFC PATCH v5 9/9] git-rebase: make --allow-empty-message the default] SZEDER Gábor
2018-09-12  6:21               ` Elijah Newren
2018-09-12 21:18               ` [PATCH] sequencer: fix --allow-empty-message behavior, make it smarter Elijah Newren
2018-09-13 20:24                 ` Junio C Hamano
2018-06-29 13:20           ` [PATCH v5 0/9] Document/fix/warn about rebase incompatibilities and inconsistencies Phillip Wood
2018-06-07  5:07 ` [PATCH] git-rebase.sh: handle keep-empty like all other options Elijah Newren
2018-06-10 19:26   ` Phillip Wood
2018-06-11 14:42     ` Elijah Newren
2018-06-11 15:16       ` Phillip Wood
2018-06-11 16:08         ` Elijah Newren
2018-06-12  9:53           ` Phillip Wood
2018-06-07  5:08 ` [PATCH 1/2] t3418: add testcase showing problems with rebase -i and strategy options Elijah Newren
2018-06-07  5:08   ` [PATCH 2/2] Fix use of strategy options with interactive rebases Elijah Newren
2018-06-27  7:36   ` [PATCH v2 0/2] " Elijah Newren
2018-06-27  7:36     ` [PATCH v2 1/2] t3418: add testcase showing problems with rebase -i and strategy options Elijah Newren
2018-06-27  7:45       ` Eric Sunshine
2018-06-27  7:49         ` Elijah Newren
2018-06-27 16:54           ` Junio C Hamano
2018-06-27 17:27             ` Elijah Newren
2018-06-27 18:17               ` Johannes Sixt
2018-06-27 18:24                 ` Eric Sunshine
2018-06-27 18:34                 ` [PATCH v2 1/2] t3418: add testcase showing problems with rebase SZEDER Gábor
2018-06-27 19:20                 ` [PATCH v2 1/2] t3418: add testcase showing problems with rebase -i and strategy options Junio C Hamano
2018-06-27 19:23               ` Junio C Hamano
2018-06-27  7:36     ` [PATCH v2 2/2] Fix use of strategy options with interactive rebases Elijah Newren
2018-06-27 15:48     ` [PATCH v3 0/2] " Elijah Newren
2018-06-27 15:48       ` [PATCH v3 1/2] t3418: add testcase showing problems with rebase -i and strategy options Elijah Newren
2018-06-27 18:28         ` SZEDER Gábor
2018-07-11 10:56           ` SZEDER Gábor
2018-07-11 19:12             ` Elijah Newren
2018-06-27 15:48       ` [PATCH v3 2/2] Fix use of strategy options with interactive rebases Elijah Newren
2018-07-12 15:41         ` Johannes Schindelin
2018-07-13 15:51           ` Elijah Newren
2018-06-07 17:13 ` [RFC PATCH 0/3] Send more rebases through interactive machinery Elijah Newren
2018-06-07 17:13   ` [RFC PATCH 1/3] git-rebase, sequencer: add a --quiet option for the " Elijah Newren
2018-06-09 21:45     ` Johannes Schindelin
2018-06-09 23:05       ` Elijah Newren
2018-06-07 17:13   ` [RFC PATCH 2/3] rebase: Implement --merge via git-rebase--interactive Elijah Newren
2018-06-09 22:04     ` Johannes Schindelin
2018-06-10  1:14       ` Elijah Newren
2018-06-11  9:54         ` Phillip Wood
2018-06-07 17:13   ` [RFC PATCH 3/3] git-rebase.sh: make git-rebase--interactive the default Elijah Newren
2018-06-09 22:11     ` Johannes Schindelin
2018-06-10  1:31       ` Elijah Newren
2018-06-17 21:44         ` Johannes Schindelin
2018-06-20 16:27           ` Elijah Newren
2018-06-21 10:57             ` Johannes Schindelin
2018-06-21 15:36               ` Elijah Newren
2019-01-21 15:55                 ` Johannes Schindelin [this message]
2019-01-21 19:02                   ` Elijah Newren
2019-01-22 15:37                     ` Johannes Schindelin
2018-06-11 17:44 ` RFC: rebase inconsistency in 2.18 -- course of action? Junio C Hamano
2018-06-11 20:17   ` Elijah Newren

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=nycvar.QRO.7.76.6.1901211635080.41@tvgsbejvaqbjf.bet \
    --to=johannes.schindelin@gmx.de \
    --cc=alban.gruin@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=newren@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

git@vger.kernel.org list mirror (unofficial, one of many)

Archives are clonable:
	git clone --mirror https://public-inbox.org/git
	git clone --mirror http://ou63pmih66umazou.onion/git
	git clone --mirror http://czquwvybam4bgbro.onion/git
	git clone --mirror http://hjrcffqmbrq6wope.onion/git

Example config snippet for mirrors

Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.version-control.git
	nntp://ou63pmih66umazou.onion/inbox.comp.version-control.git
	nntp://czquwvybam4bgbro.onion/inbox.comp.version-control.git
	nntp://hjrcffqmbrq6wope.onion/inbox.comp.version-control.git
	nntp://news.gmane.io/gmane.comp.version-control.git

 note: .onion URLs require Tor: https://www.torproject.org/

AGPL code for this site: git clone https://public-inbox.org/public-inbox.git