Merging split files

git@vger.kernel.org mailing list mirror (one of many)

* Merging split files
From: Stephen Bash @ 2011-03-18 13:22 UTC
  To: git

Hi all-

In our previous release foo.cxx contained both the base class and a few subclasses.  Since then the number of subclasses has grown, and we've split foo.cxx (base and sub-classes) into foo-base.cxx (base class) and foo-defs.cxx (sub-classes).  Since the release, we've had a few bug fixes in foo.cxx on the maintenance branch, and need to merge those back to development.  When I did the merge Git identified foo.cxx as moved to foo-defs.cxx, which worked for most changes, but a few needed to be in foo-base.cxx.  In this case it was a pretty trivial manual resolution, but is there a method for handling merges of split files?

Thanks,
Stephen


* Re: Merging split files
From: Jeff King @ 2011-03-29 15:16 UTC
  To: Stephen Bash; +Cc: git

On Fri, Mar 18, 2011 at 09:22:36AM -0400, Stephen Bash wrote:

> In our previous release foo.cxx contained both the base class and a
> few subclasses.  Since then the number of subclasses has grown, and
> we've split foo.cxx (base and sub-classes) into foo-base.cxx (base
> class) and foo-defs.cxx (sub-classes).  Since the release, we've had a
> few bug fixes in foo.cxx on the maintenance branch, and need to merge
> those back to development.  When I did the merge Git identified
> foo.cxx as moved to foo-defs.cxx, which worked for most changes, but a
> few needed to be in foo-base.cxx.  In this case it was a pretty
> trivial manual resolution, but is there a method for handling merges
> of split files?

I don't think there is currently a good way to do this automatically.

The problem is that the closest merge-recursive gets to understanding
content movement is that it considers whole file renames. So it sees
"foo.cxx became foo-defs.cxx", and applies changes to foo.cxx to
foo-defs.cxx, but it has no clue that foo-base.cxx is involved. So at the very
least, it would need to represent "foo.cxx has split into foo-base.cxx
and foo-defs.cxx", which is not something it can currently handle. But
more than that, you want to know _which_ parts moved to each file.
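
A rough Python sketch of why a split defeats whole-file rename pairing. This is an illustration only, not Git's code: `similarity` and `best_rename` are made-up names, difflib's ratio stands in for Git's real similarity score, and the file contents and the 0.5 threshold are assumptions.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Line-based similarity in [0, 1]; a stand-in for Git's real score."""
    return SequenceMatcher(None, a.splitlines(), b.splitlines()).ratio()


def best_rename(deleted, added, threshold=0.5):
    """Pair one deleted file with at most one added file.

    `added` maps path -> contents. Returns (path, score) or None.
    """
    scored = [(similarity(deleted, text), path) for path, text in added.items()]
    if not scored:
        return None
    score, path = max(scored)
    return (path, score) if score >= threshold else None


# Hypothetical contents: a 3-line base class plus six one-line subclasses.
base_part = ["class Base {", "  void run();", "};"]
sub_part = ["class Sub%d : Base { void f%d(); };" % (i, i) for i in range(6)]
old_foo = "\n".join(base_part + sub_part)
added = {
    "foo-base.cxx": "\n".join(base_part),
    "foo-defs.cxx": "\n".join(sub_part),
}

# Only the single best-scoring destination is reported.
print(best_rename(old_foo, added))
```

Because only one destination survives, every change made to foo.cxx gets routed to the larger half of the split, whether or not the changed lines actually live there.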

So I think the most flexible thing is to forget file renames at all.
They are just a rough version of the general idea of content movement.
In theory, we should be able to see that the content we changed in
foo.cxx no longer exists, and then start looking for similar content
elsewhere. Not similar _files_, but for the chunk of content that is
changed between the merge base and the maintenance branch (plus some
surrounding context), find where that bit of content went. And then try
to merge our changes into that new bit of content.
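
To make "find where that bit of content went" concrete, here is a minimal sketch under stated assumptions: files are lists of lines, the chunk is the pre-image of the changed hunk plus context, and a brute-force sliding window scored with difflib is good enough. `locate_chunk` and the sample contents are hypothetical; nothing like this exists in merge-recursive.

```python
from difflib import SequenceMatcher


def locate_chunk(chunk, files):
    """Find the window most similar to `chunk` across all files.

    `files` maps path -> list of lines. Returns (score, path, start)
    for the best window, or None if `files` is empty.
    """
    best = None
    n = len(chunk)
    for path, lines in files.items():
        # Slide a window of len(chunk) lines over each candidate file.
        for start in range(max(1, len(lines) - n + 1)):
            window = lines[start:start + n]
            score = SequenceMatcher(None, chunk, window).ratio()
            if best is None or score > best[0]:
                best = (score, path, start)
    return best


# Hypothetical post-split contents on the development branch.
files = {
    "foo-defs.cxx": ['#include "foo-base.h"', "int Sub::size() {",
                     "  return n * 2;", "}"],
    "foo-base.cxx": ['#include "foo-base.h"', "", "int Base::size() {",
                     "  return n;", "}"],
}
# Pre-image of a hunk that the maintenance branch changed in foo.cxx.
chunk = ["int Base::size() {", "  return n;", "}"]

print(locate_chunk(chunk, files))
```

The search ignores which file Git considers the rename target; the hunk goes wherever its surrounding text is found.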

One problem is that when it fails, it fails pretty hard. With file
renames, your changes at least usually end up in the right file (your
present problem excluded), and you get some textual mess to clean up.
But with content-level renaming, I suspect in conflict cases we would
end up with no clue where the result goes (because the conflict means we
can't easily match up the content for similarity), and have to stick it
in the deleted file. On the other hand, it might simply work to keep
expanding the amount of context we consider for content similarity until
we find a match, which eventually would end up considering the whole
file, and generalize to a file rename.
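
The expanding-context idea could be prototyped roughly like this. It is a sketch only: the doubling schedule, the 0.6 threshold, and the names `best_window_score` and `find_with_growing_context` are all assumptions, and difflib again stands in for whatever similarity measure a real implementation would use.

```python
from difflib import SequenceMatcher


def best_window_score(chunk, lines):
    """Best similarity of `chunk` against any same-sized window of `lines`."""
    n = len(chunk)
    return max(
        SequenceMatcher(None, chunk, lines[s:s + n]).ratio()
        for s in range(max(1, len(lines) - n + 1))
    )


def find_with_growing_context(base, lo, hi, files, threshold=0.6):
    """Locate the file that now holds base[lo:hi], widening context as needed.

    `base` is the merge-base file as a list of lines and `files` maps
    path -> list of lines. Returns (path, context, score) or None.
    """
    if not files:
        return None
    context = 1
    while True:
        start, end = max(0, lo - context), min(len(base), hi + context)
        chunk = base[start:end]
        score, path = max(
            (best_window_score(chunk, lines), path)
            for path, lines in files.items()
        )
        if score >= threshold:
            return path, context, score
        if start == 0 and end == len(base):
            # The whole file was considered: this is the file-rename case.
            return None
        context *= 2


# Hypothetical conflict: the other side rewrote lines 4-6 around our change.
base = ["line %d" % i for i in range(10)]
moved = base[:4] + ["rewritten 4", "rewritten 5", "rewritten 6"] + base[7:]
files = {"foo-base.cxx": moved, "foo-defs.cxx": ["other a", "other b"]}

print(find_with_growing_context(base, 5, 6, files))
```

With little context the rewritten lines dominate and nothing matches; as the window grows, the unchanged neighbours pull the score over the threshold.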

Implementing that inside of merge-recursive is likely to be pretty nasty
(even the current file-rename code is already pretty nasty). But it may
be possible to prototype something that runs after we hit the conflicted
state, like mergetool.

I definitely think it's an interesting area to work in, but I would have
to give it a lot of thought.

-Peff


* Re: Merging split files
From: Stephen Bash @ 2011-03-29 16:33 UTC
  To: Jeff King; +Cc: git

Jeff-

Thanks for taking the time to think about this.  More inline...

----- Original Message -----
> From: "Jeff King" <peff@peff.net>
> To: "Stephen Bash" <bash@genarts.com>
> Cc: git@vger.kernel.org
> Sent: Tuesday, March 29, 2011 11:16:23 AM
> Subject: Re: Merging split files
>
> On Fri, Mar 18, 2011 at 09:22:36AM -0400, Stephen Bash wrote:
> 
> > In our previous release foo.cxx contained both the base class and a
> > few subclasses. Since then the number of subclasses has grown, and
> > we've split foo.cxx (base and sub-classes) into foo-base.cxx (base
> > class) and foo-defs.cxx (sub-classes). Since the release, we've had a
> > few bug fixes in foo.cxx on the maintenance branch, and need to merge
> > those back to development. When I did the merge Git identified
> > foo.cxx as moved to foo-defs.cxx, which worked for most changes, but a
> > few needed to be in foo-base.cxx. In this case it was a pretty
> > trivial manual resolution, but is there a method for handling merges
> > of split files?
> 
> I don't think there is currently a good way to do this automatically.
> 
> The problem is that the closest merge-recursive gets to understanding
> content movement is that it considers whole file renames. ...
> 
> So I think the most flexible thing is to forget file renames at all.

I agree that would be the best solution long term. ("Git doesn't track files, Git tracks content".  Think I heard that somewhere before...)

That being said, the back seat drivers in the office here (i.e. me and everyone else that knows almost nothing about the internals of merge-recursive!) thought maybe a middle ground is to teach merge-recursive to do copy detection along with rename detection.  Then the algorithm would have a (relatively small?) list of candidate files to check for hunks.  You still have to deal with the similarity score in some corner cases, but hopefully since all we want is candidate files the process is relatively insensitive to the similarity threshold.

Am I way off the deep end now?  I'm not lying when I say I know *nothing* about the merge implementations.

> I definitely think it's an interesting area to work in, but I would
> have to give it a lot of thought.

It's a "corner case" that I seem to have run into a lot in my work experience, so if the Git community can actually make a good solution work it will be a major win in my book.

Thanks again!

Stephen


* Re: Merging split files
From: Jeff King @ 2011-03-29 18:15 UTC
  To: Stephen Bash; +Cc: git

On Tue, Mar 29, 2011 at 12:33:17PM -0400, Stephen Bash wrote:

> > The problem is that the closest merge-recursive gets to understanding
> > content movement is that it considers whole file renames. ...
> > 
> > So I think the most flexible thing is to forget file renames at all.
> 
> I agree that would be the best solution long term. ("Git doesn't track
> files, Git tracks content".  Think I heard that somewhere before...)

Exactly. :) I think that is a tricky project, though, and in the
meantime, I wouldn't be opposed to a more file-based solution if it
generates good results.

> That being said, the back seat drivers in the office here (i.e. me and
> everyone else that knows almost nothing about the internals of
> merge-recursive!) thought maybe a middle ground is to teach
> merge-recursive to do copy detection along with rename detection.  Then
> the algorithm would have a (relatively small?) list of candidate files
> to check for hunks.  You still have to deal with the similarity score
> in some corner cases, but hopefully since all we want is candidate
> files the process is relatively insensitive to the similarity
> threshold.

This was something I gave some thought to recently in this other thread:

  http://thread.gmane.org/gmane.comp.version-control.git/169944

though I came to the conclusion in that case that break-rewriting was a
much better match for that particular case. Namely, we see that content
has been renamed, so we make sure to merge changes to the "original"
content with each other, no matter whether the changes happened in the
renamed path or the original. And similarly, we merge changes left over
from any "new" content that has replaced the original (which, in the
pure rename case, is just empty, but with break-rewriting we might have
some dissimilar content at the old path). We know that the "new" content
can't be related to the "old" content, because to find a rename, we
would have to have triggered the "break" by finding that the content is
dissimilar.

Copy detection has to deal with that, but harder. :) I see two major
challenges:

  1. One source file may go to multiple destinations. So instead of
     saying "oops, I should be doing the merge with this other, renamed
     content", you have to pick a best one (either through heuristic, or
     even per-hunk by trying each hunk in turn). And this means you're
     interacting deeply with the content-level 3-way merger. I haven't
     looked at that code at all, so I don't know how feasible that is.

     And you have to accept that you may pick wrong, or even that there
     may be no right answer. If I do "cp foo bar; cp foo baz; rm foo",
     and then modify "foo" on another branch, the choice of merging
     changes to "bar" versus "baz" is going to be arbitrary.

  2. Because it's a copy and not a rename, your source file may still
     exist and be a candidate for applying content to. And that violates
     the break-rewrite rename logic I mentioned above, which is that old
     content goes with old content and new content goes with new
     content. We're not sure which the source file is for a given hunk.

     I think that may not be a big deal, though. We already have to deal
     with the hard part in (1), which is finding _which_ copy is the
     right place for a given bit of content. So this may just simplify
     to adding the source file (if it still exists) as another possible
     place to merge changes to, and it is another case of (1) (though
     obviously we should prefer merging to the original pathname if it
     is still there, rather than a copy).
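
Both challenges can be sketched as a per-hunk destination picker. This is a hypothetical illustration, not the merge code: `window_score`, `pick_destination`, and the 0.8 threshold are assumptions, and ties are reported rather than resolved.

```python
from difflib import SequenceMatcher


def window_score(chunk, lines):
    """Best similarity of `chunk` against any same-sized window of `lines`."""
    n = len(chunk)
    return max(
        SequenceMatcher(None, chunk, lines[s:s + n]).ratio()
        for s in range(max(1, len(lines) - n + 1))
    )


def pick_destination(chunk, source_path, files, threshold=0.8):
    """Choose where one hunk should be merged.

    `files` maps path -> list of lines for the source file (if it still
    exists) and every copy candidate. Returns (path, ambiguous); path is
    None when nothing is similar enough.
    """
    scores = {path: window_score(chunk, lines) for path, lines in files.items()}
    top = max(scores.values(), default=0.0)
    if top < threshold:
        return None, False
    winners = sorted(path for path, score in scores.items() if score == top)
    if source_path in winners:
        # Challenge 2: prefer the original pathname when it is still there.
        return source_path, False
    # Challenge 1: several equally good copies means an arbitrary choice.
    return winners[0], len(winners) > 1


foo = ["a", "b", "c", "d"]
chunk = ["b", "c"]

# "cp foo bar; cp foo baz; rm foo": both copies match equally well.
print(pick_destination(chunk, "foo", {"bar": foo, "baz": foo}))
# The source still exists alongside a copy: the source wins.
print(pick_destination(chunk, "foo", {"foo": foo, "bar": foo}))
```

Returning the ambiguity flag keeps the arbitrary choice visible, so a caller could fall back to a conflict instead of silently picking one copy.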

> Am I way off the deep end now?  I'm not lying when I say I know
> *nothing* about the merge implementations.

No, I don't think you're off the deep end. But then, I don't know that
much about the merge code, either. :)

-Peff
