Re: [PATCH v2 8/8] merge-tree: provide an easy way to access which files have conflicts

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

From: Johannes Schindelin <Johannes.Schindelin@gmx.de>
To: Elijah Newren <newren@gmail.com>
Cc: Elijah Newren via GitGitGadget <gitgitgadget@gmail.com>,
	Git Mailing List <git@vger.kernel.org>,
	Christian Couder <chriscool@tuxfamily.org>,
	Taylor Blau <me@ttaylorr.com>,
	Johannes Altmanninger <aclopte@gmail.com>
Subject: Re: [PATCH v2 8/8] merge-tree: provide an easy way to access which files have conflicts
Date: Tue, 22 Feb 2022 14:03:46 +0100 (CET)	[thread overview]
Message-ID: <nycvar.QRO.7.76.6.2202221345510.11118@tvgsbejvaqbjf.bet> (raw)
In-Reply-To: <CABPp-BHvXrP0sTTmuTYfACoJTCcm9+wk_f441nj4TstrmQdqMQ@mail.gmail.com>

Hi Elijah,

I meant to answer this mail much earlier, oh well. Sorry. I will only
answer the still open questions below, clipping the quoted text to save
every reader some time.

On Fri, 7 Jan 2022, Elijah Newren wrote:

> On Fri, Jan 7, 2022 at 11:36 AM Johannes Schindelin
> <Johannes.Schindelin@gmx.de> wrote:
>
> > So my (somewhat hacky) patches for a proof-of-concept produced
> > _either_ the hash of the tree on `stdout`, _or_ a header saying that
> > there were conflicts followed by a NUL-separated list of file names.
>
> Did you really check that it only produced one of these?  If you were
> using ort, you wrote blob and tree objects to disk, even if you didn't
> print their hash on stdout.

D'oh. No, I had not checked that.

> > Mind you, I did not even get to the point of analyzing things even more
> > deeply. My partner in crime and I only got to comparing the `merge-ort`
> > way to the libgit2-based way, trying to get them to compare as much
> > apples-to-apples as possible [*1*], and we found that even the time to
> > spawn the Git process (~1-3ms, with all overhead counted in) is _quite_
> > noticeable, at server-side scale.
>
> 1-3ms?  I thought it was a good bit more than that.

The absolute number really depends on a lot of factors, i.e. what is
relevant is the relative difference between in-process vs spawning a new
process.

> If process execution overhead is such a big problem, perhaps you could
> use these new functions via libgit.a instead of invoking a git process
> and get the best of both worlds?

For now, I solved this by combining multiple Git process invocations into
a single one, as described here:

> > In this particular instance, there is a silver-lining: the
> > libgit2-based merge is not actually recursive. It is a three-way
> > merge. Which means that we first have to determine a merge base. In
> > our case, this is done by spawning a Git process anyway, so one of my
> > ideas to move forward is to fold that merge-base logic into `git
> > merge-tree`, too.
>
> The merge-base logic is already part of merge-tree in my patches.
> What exactly did you do with merge-tree?  Were you using the existing
> one and feeding it with a merge-base as an input, or did you write
> your own that was more like Christian's that expected a merge base?

For an apples-to-apples comparison, I had to stay with the very same merge
base logic as before, i.e. originally I added a new option to `merge-tree`
that would take a merge base and then take a non-recursive path. In my
current version, it is merely an option that even determines said merge
base in the same process (and yes, it leaks memory, but that does not
matter because the process is short-lived anyway).

> > Anyway, the short short is: whenever we can avoid unnecessary work, we
> > should do so. In the context of this patch, I would say that we should
> > avoid writing out a tree (and avoid printing its hash to `stdout`) if
> > there are merge conflicts.
>
> We can avoid printing its hash to `stdout`, but it'd take significant
> work to avoid writing the tree to an object store, and it cannot be
> done at the merge-tree level, it'd require replumbing some bits of
> merge-ort (and making some already complex codepaths a bit more
> complex, but that's a price I'm willing to pay for significant
> performance wins).

Yes, let's leave things as-are. It is not worth the trouble.

> We could make use of the tmp_objdir API that recently merged (see
> b3cecf49ea ("tmp-objdir: new API for creating temporary writable
> databases", 2021-12-06)).  We could put that tmp-objdir on /dev/shm or
> other ramdisk, and if the merge is clean, migrate the contents into
> the real object store.  Perhaps we could even pack those objects first
> if there are a large number of them, but If it's not clean, we can
> just discard the tmp-objdir. Also, as a further variant on this
> alternative... packing these objects before migrating if there are a
> sufficient number of them.  Now, this is rather unlikely to be needed
> in general by merge-tree, because you only need to write new objects
> (thus representing files modified on both sides, or whatever leading
> trees are needed to reference the updated paths).  However, it might
> matter for big enough repos with large enough numbers of changes on
> both sides.  And it'd align nicely with my idea for server-side
> rebases (where implementing this is on my TODO list), because
> server-side rebases are much more likely to generate a large number of
> objects.
>
> But if you really want to learn about avoiding object writes...
>
> If you really want to only write tree and blob objects when the merge
> is clean, then as far as I can tell you have two options in regards to
> the blobs: (1) you'll need to keep all files from three-way content
> merges simultaneously in memory until you've determined if the result
> is clean, so that you can then write the merged contents out as blobs
> at the end.  Or (2) doing all the three-way content merges and keeping
> track of whether the result for each is clean, and if they all turn
> out to be clean, then redo every single one of those three-way content
> merges afterwards so that you can actually write out the merged-result
> to disk that time.
>
> I think (2) would cost you a lot more work than you'd save, and I
> worry that (1) might risk using large amounts of memory in the big
> repositories if there are lots of changes on both sides.  While that
> may be uncommon, I've seen folks try to merge things with lots of
> changes on both sides, and you do have the server side to worry about
> after all.
>
> There are similar issues with the fact that trees are written as they
> are processed as well.  Those would also require re-running afterwards
> to re-generate the trees from the list of relevant-files-and-trees we
> operate on.
>
> However, if you are really curious about trying this out despite the
> fact that I think you might be causing more work than you're avoiding
> (or potentially requiring a lot more RAM), look for calls to
> write_tree() (there are precisely two in merge-ort.c, one for
> intermediate trees and one for the toplevel tree) and
> write_object_file() (there are precisely two in merge-ort.c, one
> within write_tree() for writing tree objects, and one in
> handle_content_merge() for writing blob objects).

Thank you for this thorough analysis.

I am a big fan of crossing bridges when they are reached, and not miles
before that. So _iff_ it turns out that the speed, or the potential
cluttering with objects, should present a problem in the future, I am
inclined to follow the tmp-objdir route you described above (thank you for
pointing it out, I had not made the connection to this here scenario).

> > Footnote *1*: I did not _quite_ get to the point of comparing the
> > `merge-ort` merges to the libgit2 ones, unfortunately. I was on my way to
> > add code to respect `merge.renames = false` so that we could _truly_
> > compare the `merge-ort` merges to the libgit2 merges (we really will want
> > to verify that the output is identical, before even considering to enable
> > recursive merges on the server side, and then only after studying the time
> > difference), and then had to take off due to the holidays. If you already
> > have that need to be able to turn off rename-detection on your radar, even
> > if only for a transitional period, I would be _so_ delighted.
>
> Well, I had no intention of submitting it (and still don't), but I did
> implement it a while back for folks who have needs for it in a
> transitional period.  It's a pretty simple change.
>
> https://github.com/newren/git/commit/5330a9de77f56f20e546acc65c924fc783f092e6
> https://github.com/newren/git/commit/2e7e8d79b0995d352558608b6308060fbc055fd1

Thank you _so_ much for that. It was super helpful, and it allowed me to
gather enough evidence to justify continuing to work on this code.

Ciao,
Dscho

next prev parent reply	other threads:[~2022-02-22 13:03 UTC|newest]

Thread overview: 57+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-12-31  5:03 [PATCH 0/8] RFC: Server side merges (no ref updating, no commit creating, no touching worktree or index) Elijah Newren via GitGitGadget
2021-12-31  5:03 ` [PATCH 1/8] merge-tree: rename merge_trees() to trivial_merge_trees() Elijah Newren via GitGitGadget
2021-12-31  5:03 ` [PATCH 2/8] merge-tree: move logic for existing merge into new function Elijah Newren via GitGitGadget
2022-01-01 20:11   ` Johannes Altmanninger
2022-01-01 20:17     ` Elijah Newren
2021-12-31  5:03 ` [PATCH 3/8] merge-tree: add option parsing and initial shell for real merge function Elijah Newren via GitGitGadget
2021-12-31  5:04 ` [PATCH 4/8] merge-tree: implement real merges Elijah Newren via GitGitGadget
2022-01-01 20:08   ` Johannes Altmanninger
2022-01-01 21:11     ` Elijah Newren
2022-01-03 12:23   ` Fabian Stelzer
2022-01-03 16:37     ` Elijah Newren
2021-12-31  5:04 ` [PATCH 5/8] merge-ort: split out a separate display_update_messages() function Elijah Newren via GitGitGadget
2022-01-03 12:15   ` Fabian Stelzer
2022-01-03 12:25     ` Fabian Stelzer
2021-12-31  5:04 ` [PATCH 6/8] merge-ort: allow update messages to be written to different file stream Elijah Newren via GitGitGadget
2022-01-01 20:08   ` Johannes Altmanninger
2022-01-01 20:19     ` Elijah Newren
2021-12-31  5:04 ` [PATCH 7/8] merge-tree: support saving merge messages to a separate file Elijah Newren via GitGitGadget
2022-01-03 12:31   ` Fabian Stelzer
2022-01-03 16:51     ` Elijah Newren
2022-01-03 17:22       ` Fabian Stelzer
2022-01-03 19:46         ` Elijah Newren
2022-01-04 13:05           ` Fabian Stelzer
2022-01-03 12:35   ` Fabian Stelzer
2022-01-03 16:55     ` Elijah Newren
2021-12-31  5:04 ` [PATCH 8/8] merge-tree: provide an easy way to access which files have conflicts Elijah Newren via GitGitGadget
2022-01-05 17:27 ` [PATCH v2 0/8] RFC: Server side merges (no ref updating, no commit creating, no touching worktree or index) Elijah Newren via GitGitGadget
2022-01-05 17:27   ` [PATCH v2 1/8] merge-tree: rename merge_trees() to trivial_merge_trees() Elijah Newren via GitGitGadget
2022-01-05 17:27   ` [PATCH v2 2/8] merge-tree: move logic for existing merge into new function Elijah Newren via GitGitGadget
2022-01-05 17:27   ` [PATCH v2 3/8] merge-tree: add option parsing and initial shell for real merge function Elijah Newren via GitGitGadget
2022-01-05 17:27   ` [PATCH v2 4/8] merge-tree: implement real merges Elijah Newren via GitGitGadget
2022-01-07 15:30     ` Johannes Schindelin
2022-01-07 17:26       ` Elijah Newren
2022-01-07 18:22         ` Johannes Schindelin
2022-01-07 19:15           ` Elijah Newren
2022-01-07 20:56         ` Junio C Hamano
2022-01-11 13:39           ` Johannes Schindelin
2022-01-07 18:12     ` Christian Couder
2022-01-07 19:09       ` Elijah Newren
2022-01-05 17:27   ` [PATCH v2 5/8] merge-ort: split out a separate display_update_messages() function Elijah Newren via GitGitGadget
2022-01-05 17:27   ` [PATCH v2 6/8] merge-ort: allow update messages to be written to different file stream Elijah Newren via GitGitGadget
2022-01-05 17:27   ` [PATCH v2 7/8] merge-tree: support saving merge messages to a separate file Elijah Newren via GitGitGadget
2022-01-07 18:07     ` Johannes Schindelin
2022-01-08  1:02       ` Elijah Newren
2022-01-05 17:27   ` [PATCH v2 8/8] merge-tree: provide an easy way to access which files have conflicts Elijah Newren via GitGitGadget
2022-01-05 19:09     ` Ramsay Jones
2022-01-05 19:17       ` Elijah Newren
2022-01-07 19:36     ` Johannes Schindelin
2022-01-07 22:12       ` Elijah Newren
2022-02-22 13:03         ` Johannes Schindelin [this message]
2022-01-08  1:28       ` Elijah Newren
2022-02-22 13:05         ` Johannes Schindelin
2022-01-05 20:18   ` [PATCH v2 0/8] RFC: Server side merges (no ref updating, no commit creating, no touching worktree or index) Junio C Hamano
2022-01-05 22:35     ` Elijah Newren
2022-01-07 18:46   ` Christian Couder
2022-01-07 19:59     ` Elijah Newren
2022-01-07 21:26       ` René Scharfe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=nycvar.QRO.7.76.6.2202221345510.11118@tvgsbejvaqbjf.bet \
    --to=johannes.schindelin@gmx.de \
    --cc=aclopte@gmail.com \
    --cc=chriscool@tuxfamily.org \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=me@ttaylorr.com \
    --cc=newren@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).