git@vger.kernel.org mailing list mirror (one of many)
* git-subtree split misbehaviour with a commit having empty ls-tree for the specified subdir
@ 2019-11-22 16:55 Ed Maste
  2019-12-18  0:17 ` Tom Clarkson
  0 siblings, 1 reply; 10+ messages in thread
From: Ed Maste @ 2019-11-22 16:55 UTC (permalink / raw)
  To: git

I encountered an issue while trying to use git subtree with the
FreeBSD svn->git mirror: I found that when "git subtree split"
encounters a commit with an empty "git ls-tree" for the subdirectory
being split, it ends up recording the original parent as the new
parent in the split history that's being created. This then leads to
unrelated history appearing in the split subtree.

Below is a shell script that demonstrates the issue - this is not the
precise case that I encountered in the FreeBSD repo, but the behaviour
is identical (and it doesn't take nearly 10 minutes to run). Running
the script and then "git log" of the commit printed by the final (git
subtree) command includes the unrelated history in dir2/.

It looks like this comes from the cache_set "$rev" "$rev" in
process_split_commit() added in 39f5fff0d53. This is under the
suspicious-looking "ugly. is there no better way to tell if this is a
subtree vs. a mainline commit? Does it matter" comment. However, I
don't yet understand enough of git-subtree's operation to propose a
fix.

--repro.sh--
#!/bin/sh

rm -rf subrepo-issue
mkdir -p subrepo-issue
cd subrepo-issue

git init .
mkdir -p dir1 dir2
touch dir1/file1 dir2/file2
git add dir1 dir2
git commit -m 'initial commit'
echo 'file2' > dir2/file2
git commit -m 'file2 modified' dir2/file2
git rm dir1/file1
git commit -m 'remove file1'
mkdir -p dir1
touch dir1/file1
git add dir1
git commit -m 'restore file1'
echo 'file1' > dir1/file1
git commit -m 'file1 modified' dir1/file1
git subtree split --prefix=dir1/
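
A minimal sketch, using only core git, of how to locate the commits that trigger the condition described above: it rebuilds the repro history in a scratch directory and prints every commit whose "git ls-tree" for the prefix is empty. The helper name empty_prefix_commits, the mktemp scratch directory and the demo identity are assumptions made for illustration; none of this is part of git-subtree.

```shell
#!/bin/sh
set -e
# Throwaway identity so the commits succeed on an unconfigured machine.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

# Rebuild the repro history in a scratch directory.
repo=$(mktemp -d)
cd "$repo"
git init -q .
mkdir -p dir1 dir2
touch dir1/file1 dir2/file2
git add dir1 dir2
git commit -q -m 'initial commit'
echo 'file2' > dir2/file2
git commit -q -m 'file2 modified' dir2/file2
git rm -q dir1/file1
git commit -q -m 'remove file1'
mkdir -p dir1
touch dir1/file1
git add dir1
git commit -q -m 'restore file1'
echo 'file1' > dir1/file1
git commit -q -m 'file1 modified' dir1/file1

# Hypothetical helper: print the subject of each commit reachable from
# HEAD whose ls-tree for the given prefix produces no output.
empty_prefix_commits() {
    prefix=$1
    for rev in $(git rev-list HEAD); do
        if [ -z "$(git ls-tree "$rev" "$prefix")" ]; then
            git log -1 --format=%s "$rev"
        fi
    done
}

empty_prefix_commits dir1/
```

On this history the only commit reported is 'remove file1', which is the point where the split history picks up the unrelated parent.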


* Re: git-subtree split misbehaviour with a commit having empty ls-tree for the specified subdir
  2019-11-22 16:55 git-subtree split misbehaviour with a commit having empty ls-tree for the specified subdir Ed Maste
@ 2019-12-18  0:17 ` Tom Clarkson
  2019-12-18 10:23   ` Ed Maste
  0 siblings, 1 reply; 10+ messages in thread
From: Tom Clarkson @ 2019-12-18  0:17 UTC (permalink / raw)
  To: Ed Maste; +Cc: git



> On 23 Nov 2019, at 3:55 am, Ed Maste <emaste@freebsd.org> wrote:
> 
> I encountered an issue while trying to use git subtree with the
> FreeBSD svn->git mirror: I found that when "git subtree split"
> encounters a commit with an empty "git ls-tree" for the subdirectory
> being split, it ends up recording the original parent as the new
> parent in the split history that's being created. This then leads to
> unrelated history appearing in the split subtree.
> 
> Below is a shell script that demonstrates the issue - this is not the
> precise case that I encountered in the FreeBSD repo, but the behaviour
> is identical (and it doesn't take nearly 10 minutes to run). Running
> the script and then "git log" of the commit printed by the final (git
> subtree) command includes the unrelated history in dir2/.
> 
> It looks like this comes from the cache_set "$rev" "$rev" in
> process_split_commit() added in 39f5fff0d53. This is under the
> suspicious-looking "ugly. is there no better way to tell if this is a
> subtree vs. a mainline commit? Does it matter" comment. However, I
> don't yet understand enough of git-subtree's operation to propose a
> fix.
> 
> --repro.sh--
> #!/bin/sh
> 
> rm -rf subrepo-issue
> mkdir -p subrepo-issue
> cd subrepo-issue
> 
> git init .
> mkdir -p dir1 dir2
> touch dir1/file1 dir2/file2
> git add dir1 dir2
> git commit -m 'initial commit'
> echo 'file2' > dir2/file2
> git commit -m 'file2 modified' dir2/file2
> git rm dir1/file1
> git commit -m 'remove file1'
> mkdir -p dir1
> touch dir1/file1
> git add dir1
> git commit -m 'restore file1'
> echo 'file1' > dir1/file1
> git commit -m 'file1 modified' dir1/file1
> git subtree split --prefix=dir1/
> 


The algorithm I am looking at to replace the file-based mainline detection is:

 - If subtree root is unknown (as on the initial split), everything is mainline.

 - If subtree root is reachable and mainline root is not, it’s a subtree commit.

 - Otherwise, treat as mainline. This will also pick up commits from other subtrees but they hopefully won’t contain the subtree folder. I don’t think there is an unambiguous way to distinguish a subtree merge from a regular merge - the message produced is pretty generic. It may be possible to check reachability of all known subtrees, but that adds a fair bit of complexity.

That leaves us with the question of how to record the empty mainline commits. The most correct result for your repro is probably four commits (add/delete everything/restore/modify), but I can see that falling over in a scenario where deleting a subtree is more like unlinking a library than editing that library to do nothing.

Is it sufficiently correct for your scenario to treat ‘restore file1’ as the initial subtree commit?
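
The three rules above can be sketched with "git merge-base --is-ancestor" as the reachability test. This is only an illustration under stated assumptions: the function name classify_commit, its argument order and the toy history built with plumbing commands are all hypothetical, and the actual patch may decide reachability differently.

```shell
#!/bin/sh
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

repo=$(mktemp -d)
cd "$repo"
git init -q .

# Hypothetical helper: classify_commit <rev> <subtree-root> <mainline-root>
# prints "subtree" or "mainline" following the three rules in the list.
classify_commit() {
    rev=$1 subtree_root=$2 mainline_root=$3
    # Rule 1: no known subtree root (initial split), so everything is mainline.
    if [ -z "$subtree_root" ]; then
        echo mainline
        return 0
    fi
    # Rule 2: subtree root reachable and mainline root not reachable.
    if git merge-base --is-ancestor "$subtree_root" "$rev" &&
       ! git merge-base --is-ancestor "$mainline_root" "$rev"; then
        echo subtree
    else
        # Rule 3: anything else is treated as mainline.
        echo mainline
    fi
}

# Toy history: two unrelated roots joined by one merge, all on the empty tree.
tree=$(git mktree </dev/null)
main_root=$(git commit-tree "$tree" -m 'mainline root')
sub_root=$(git commit-tree "$tree" -m 'subtree root')
main_tip=$(git commit-tree "$tree" -p "$main_root" -m 'mainline work')
sub_tip=$(git commit-tree "$tree" -p "$sub_root" -m 'subtree work')
merge=$(git commit-tree "$tree" -p "$main_tip" -p "$sub_tip" -m 'merge subtree')

classify_commit "$sub_tip"  "$sub_root" "$main_root"   # prints "subtree"
classify_commit "$main_tip" "$sub_root" "$main_root"   # prints "mainline"
classify_commit "$merge"    "$sub_root" "$main_root"   # prints "mainline"
classify_commit "$main_tip" ""          "$main_root"   # prints "mainline"
```

Note that the merge commit lands in the "otherwise" bucket because both roots are reachable from it, which matches the point that a subtree merge cannot be told apart from a regular merge by this test alone.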



* Re: git-subtree split misbehaviour with a commit having empty ls-tree for the specified subdir
  2019-12-18  0:17 ` Tom Clarkson
@ 2019-12-18 10:23   ` Ed Maste
  2019-12-19  0:57     ` Tom Clarkson
  0 siblings, 1 reply; 10+ messages in thread
From: Ed Maste @ 2019-12-18 10:23 UTC (permalink / raw)
  To: Tom Clarkson; +Cc: git

On Tue, 17 Dec 2019 at 19:17, Tom Clarkson <tqclarkson@icloud.com> wrote:
>
> The algorithm I am looking at to replace the file based mainline detection is
>
>  - If subtree root is unknown (as on the initial split), everything is mainline.
>
>  - If subtree root is reachable and mainline root is not, it’s a subtree commit
>
>  - Otherwise, treat as mainline. This will also pick up commits from other subtrees but they hopefully won’t contain the subtree folder. I don’t think there is an unambiguous way to distinguish a subtree merge from a regular merge - the message produced is pretty generic. It may be possible to check reachability of all known subtrees, but that adds a fair bit of complexity.
>
> That leaves us with the question of how to record the empty mainline commits. The most correct result for your repro is probably four commits (add/delete everything/restore/modify), but I can see that falling over in a scenario where deleting a subtree is more like unlinking a library than editing that library to do nothing.
>
> Is it sufficiently correct for your scenario to treat ‘restore file1’ as the initial subtree commit?

My reproduction scenario is really a demonstration of the real issue I
encountered. Running the initial "subtree split" on the real repo
takes about 40 minutes so I wanted something trivial that shows the
same issue. In the demonstration case (i.e., actually removing and
readding the subtree) I think it's reasonable to start with the commit
that added it back.

Overall I think your proposed algorithm is reasonable (even though I
think it won't address some of the cases in our repo). Will your
algorithm allow us to pass $dir to git rev-list, for the initial
split?

My actual issue stems from the way svn2git converted some odd svn
history, and is described in more detail on the freebsd-git mailing
list at https://lists.freebsd.org/pipermail/freebsd-git/2019-November/000218.html.

Perhaps we can have some command-line options to provide metadata for
cases that cannot be inferred? The cases in our repo come from svn2git
creating subtree merges to represent updates from vendor code. AFAIK
these should be basically identical to what subtree creates, except
that we don't have any of the metadata it adds.

For a concrete example (from the repo at
https://github.com/freebsd/freebsd), 7f3a50b3b9f8 is a mainline commit
that added a new subtree, from 9ee787636908. I think that if I could
inform subtree split that 9ee787636908 is the root it would work for
me.


* Re: git-subtree split misbehaviour with a commit having empty ls-tree for the specified subdir
  2019-12-18 10:23   ` Ed Maste
@ 2019-12-19  0:57     ` Tom Clarkson
  2019-12-20 15:56       ` Ed Maste
  0 siblings, 1 reply; 10+ messages in thread
From: Tom Clarkson @ 2019-12-19  0:57 UTC (permalink / raw)
  To: Ed Maste; +Cc: git



> On 19 Dec 2019, at 4:58 am, Ed Maste <emaste@freebsd.org> wrote:
> 
> On Tue, 17 Dec 2019 at 19:17, Tom Clarkson <tqclarkson@icloud.com> wrote:
>> 
>> The algorithm I am looking at to replace the file based mainline detection is
>> 
>> - If subtree root is unknown (as on the initial split), everything is mainline.
>> 
>> - If subtree root is reachable and mainline root is not, it’s a subtree commit
>> 
>> - Otherwise, treat as mainline. This will also pick up commits from other subtrees but they hopefully won’t contain the subtree folder. I don’t think there is an unambiguous way to distinguish a subtree merge from a regular merge - the message produced is pretty generic. It may be possible to check reachability of all known subtrees, but that adds a fair bit of complexity.
>> 
>> That leaves us with the question of how to record the empty mainline commits. The most correct result for your repro is probably four commits (add/delete everything/restore/modify), but I can see that falling over in a scenario where deleting a subtree is more like unlinking a library than editing that library to do nothing.
>> 
>> Is it sufficiently correct for your scenario to treat ‘restore file1’ as the initial subtree commit?
> 
> My reproduction scenario is really a demonstration of the real issue I
> encountered. Running the initial "subtree split" on the real repo
> takes about 40 minutes so I wanted something trivial that shows the
> same issue. In the demonstration case (i.e., actually removing and
> readding the subtree) I think it's reasonable to start with the commit
> that added it back.
> 
> Overall I think your proposed algorithm is reasonable (even though I
> think it won't address some of the cases in our repo). Will your
> algorithm allow us to pass $dir to git rev-list, for the initial
> split?

Is this just for performance reasons? As I understand it that was left out because it would exclude relevant commits on an existing subtree, but it could make sense as an optimization for the first split of a large repo.

> My actual issue stems from the way svn2git converted some odd svn
> history, and is described in more detail on the freebsd-git mailing
> list at https://lists.freebsd.org/pipermail/freebsd-git/2019-November/000218.html.
> 
> Perhaps we can have some command-line options to provide metadata for
> cases that cannot be inferred? The cases in our repo come from svn2git
> creating subtree merges to represent updates from vendor code. AFAIK
> these should be basically identical to what subtree creates, except
> that we don't have any of the metadata it adds.

The existing --onto option comes pretty close - it marks everything in the rev-list of $onto as a subtree commit to be used as-is.

For more flexibility, I think allowing more manipulation of the cache is the way to go - $cachedir is currently based on process id, but I don’t see any reason it can’t be based on prefix instead. So the process becomes something like:

# clear the cache - shouldn't usually be necessary, but it's a universal debugging step.
git subtree clear-cache --prefix=dir

# ref and all its parents are before subtree add. Treat any children as initial commits.
git subtree ignore --prefix=dir ref

# ref and all its parents are known subtree commits to be included without transformation.
git subtree existing --prefix=dir ref

# Override an arbitrary mapping, either for performance or because that commit is problematic 
git subtree map --prefix=dir mainline-ref subtree-ref

# Run the existing algorithm, but skipping anything defined manually
git subtree split --prefix=dir


> For a concrete example (from the repo at
> https://github.com/freebsd/freebsd), 7f3a50b3b9f8 is a mainline commit
> that added a new subtree, from 9ee787636908. I think that if I could
> inform subtree split that 9ee787636908 is the root it would work for
> me.

Aside from the metadata, that one is a bit different from a standard subtree add in that it copies three folders from the subtree repo rather than the root - so the contents of contrib/elftoolchain will never exactly match the actual elftoolchain repo, and 9ee787636908 is neither mainline nor subtree as subtree split understands it.

If you ignore 9ee787636908, the resulting subtree will be fairly clean, but won’t have much of a relationship to the external repo.

If you treat 9ee787636908 as an existing subtree, the second commit on your subtree will be based on 7f3a50b3b9f8, which deletes most of the contents of the subtree. You should still be able to merge in updates from the external repo, but if you try to push changes upstream the deletion will break things.




* Re: git-subtree split misbehaviour with a commit having empty ls-tree for the specified subdir
  2019-12-19  0:57     ` Tom Clarkson
@ 2019-12-20 15:56       ` Ed Maste
  2019-12-22 14:01         ` Tom Clarkson
                           ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Ed Maste @ 2019-12-20 15:56 UTC (permalink / raw)
  To: Tom Clarkson; +Cc: git mailing list

On Wed, 18 Dec 2019 at 19:57, Tom Clarkson <tqclarkson@icloud.com> wrote:
>
> > Overall I think your proposed algorithm is reasonable (even though I
> > think it won't address some of the cases in our repo). Will your
> > algorithm allow us to pass $dir to git rev-list, for the initial
> > split?
>
> Is this just for performance reasons? As I understand it that was left out because it would exclude relevant commits on an existing subtree, but it could make sense as an optimization for the first split of a large repo.

Yes, it's for performance reasons on a first split that I'd like to
see it. On the FreeBSD repo the difference is some 40 minutes vs. a
few seconds.

> So the process becomes something like
>
>  # clear the cache - shouldn't usually be necessary, but it's a universal debugging step.
> git subtree clear-cache --prefix=dir
>
> # ref and all its parents are before subtree add. Treat any children as initial commits.
> git subtree ignore --prefix=dir ref
>
> # ref and all its parents are known subtree commits to be included without transformation.
> git subtree existing --prefix=dir ref
>
> # Override an arbitrary mapping, either for performance or because that commit is problematic
> git subtree map --prefix=dir mainline-ref subtree-ref
>
> # Run the existing algorithm, but skipping anything defined manually
> git subtree split --prefix=dir

This sounds about perfect.

> > For a concrete example (from the repo at
> > https://github.com/freebsd/freebsd), 7f3a50b3b9f8 is a mainline commit
> > that added a new subtree, from 9ee787636908. I think that if I could
> > inform subtree split that 9ee787636908 is the root it would work for
> > me.
>
> Aside from the metadata, that one is a bit different from a standard subtree add in that it copies three folders from the subtree repo rather than the root - so the contents of contrib/elftoolchain will never exactly match the actual elftoolchain repo, and 9ee787636908 is neither mainline nor subtree as subtree split understands it.

Fair enough, and we have lots of examples of slightly strange history
in svn that svn2git represents in interesting ways.

> If you ignore 9ee787636908, the resulting subtree will be fairly clean, but won’t have much of a relationship to the external repo.
>
> If you treat 9ee787636908 as an existing subtree, the second commit on your subtree will be based on 7f3a50b3b9f8, which deletes most of the contents of the subtree. You should still be able to merge in updates from the external repo, but if you try to push changes upstream the deletion will break things.

I think this is fine - our main goal here is to be able to update
contrib/ code within FreeBSD as we do today with svn, and we may well
always have some changes that are never intended to be pushed
upstream.

Continuing the example from our repo, there is more history in the
"subtree" already, with 061ef1f9424f as the head. ca8624403626 is the
merge to mainline.


* Re: git-subtree split misbehaviour with a commit having empty ls-tree for the specified subdir
  2019-12-20 15:56       ` Ed Maste
@ 2019-12-22 14:01         ` Tom Clarkson
  2020-01-21 22:36           ` Ed Maste
       [not found]         ` <DB65AE2F-12DE-43B7-8B20-4E173794CAF2@icloud.com>
  2020-06-17 14:46         ` Ed Maste
  2 siblings, 1 reply; 10+ messages in thread
From: Tom Clarkson @ 2019-12-22 14:01 UTC (permalink / raw)
  To: Ed Maste; +Cc: git mailing list


> On 21 Dec 2019, at 2:56 am, Ed Maste <emaste@freebsd.org> wrote:
> 
> On Wed, 18 Dec 2019 at 19:57, Tom Clarkson <tqclarkson@icloud.com> wrote:
>> 
>>> Overall I think your proposed algorithm is reasonable (even though I
>>> think it won't address some of the cases in our repo). Will your
>>> algorithm allow us to pass $dir to git rev-list, for the initial
>>> split?
>> 
>> Is this just for performance reasons? As I understand it that was left out because it would exclude relevant commits on an existing subtree, but it could make sense as an optimization for the first split of a large repo.
> 
> Yes, it's for performance reasons on a first split that I'd like to
> see it. On the FreeBSD repo the difference is some 40 minutes vs. a
> few seconds.

I tried out the dir filter after getting the full revlist to produce a reasonable result. It is a lot faster, but unfortunately it doesn’t produce the same output. If you use actual parent commits, you have to process the 50k irrelevant ones to keep a valid path. If you let rev-list find what it thinks is the most recent relevant change, the subtree merges resolve to nothing.
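
The difference between the two walks can be seen on a toy repository. This is a rough sketch, not a measurement from the FreeBSD repo: the directory and file names are made up, and it only shows how passing a path to rev-list both shrinks the list and rewrites parents to the previous commit that touched the path.

```shell
#!/bin/sh
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

repo=$(mktemp -d)
cd "$repo"
git init -q .
mkdir dir1 dir2

# Four linear commits; only the first and last touch dir1.
echo a > dir1/f; git add dir1; git commit -q -m 'add dir1'
echo b > dir2/f; git add dir2; git commit -q -m 'add dir2'
echo c > dir2/f; git commit -q -m 'edit dir2' dir2/f
echo d > dir1/f; git commit -q -m 'edit dir1' dir1/f

# Full walk versus path-limited walk.
all=$(git rev-list --count HEAD)
limited=$(git rev-list --count HEAD -- dir1)
echo "all=$all dir1-only=$limited"    # prints "all=4 dir1-only=2"

# The real parent of HEAD is 'edit dir2'; with a path filter, --parents
# reports the rewritten parent, i.e. the previous commit touching dir1.
real_parent=$(git rev-parse HEAD^)
rewritten_parent=$(git rev-list --parents -n 1 HEAD -- dir1 | cut -d' ' -f2)
git log -1 --format='real parent:      %s' "$real_parent"
git log -1 --format='rewritten parent: %s' "$rewritten_parent"
```

The filtered walk is cheaper because it skips the commits in between, but the parents it reports are no longer the actual parents, so a split built on that list does not follow the same path through history.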

>> So the process becomes something like
>> 
>> # clear the cache - shouldn't usually be necessary, but it's a universal debugging step.
>> git subtree clear-cache --prefix=dir
>> 
>> # ref and all its parents are before subtree add. Treat any children as initial commits.
>> git subtree ignore --prefix=dir ref
>> 
>> # ref and all its parents are known subtree commits to be included without transformation.
>> git subtree existing --prefix=dir ref
>> 
>> # Override an arbitrary mapping, either for performance or because that commit is problematic
>> git subtree map --prefix=dir mainline-ref subtree-ref
>> 
>> # Run the existing algorithm, but skipping anything defined manually
>> git subtree split --prefix=dir
> 
> This sounds about perfect.
> 
>>> For a concrete example (from the repo at
>>> https://github.com/freebsd/freebsd), 7f3a50b3b9f8 is a mainline commit
>>> that added a new subtree, from 9ee787636908. I think that if I could
>>> inform subtree split that 9ee787636908 is the root it would work for
>>> me.
>> 
>> Aside from the metadata, that one is a bit different from a standard subtree add in that it copies three folders from the subtree repo rather than the root - so the contents of contrib/elftoolchain will never exactly match the actual elftoolchain repo, and 9ee787636908 is neither mainline nor subtree as subtree split understands it.
> 
> Fair enough, and we have lots of examples of slightly strange history
> in svn that svn2git represents in interesting ways.
> 
>> If you ignore 9ee787636908, the resulting subtree will be fairly clean, but won’t have much of a relationship to the external repo.
>> 
>> If you treat 9ee787636908 as an existing subtree, the second commit on your subtree will be based on 7f3a50b3b9f8, which deletes most of the contents of the subtree. You should still be able to merge in updates from the external repo, but if you try to push changes upstream the deletion will break things.
> 
> I think this is fine - our main goal here is to be able to update
> contrib/ code within FreeBSD as we do today with svn, and we may well
> always have some changes that are never intended to be pushed
> upstream.
> 
> Continuing the example from our repo, there is more history in the
> "subtree" already, with 061ef1f9424f as the head. ca8624403626 is the
> merge to mainline.


If you want to try out my update, it’s at https://github.com/gitgitgadget/git/pull/493. The commands I ended up with were:

git subtree ignore --clear-cache --prefix=contrib/elftoolchain 4d43158
git subtree use --prefix=contrib/elftoolchain 9e78763
git subtree split --prefix=contrib/elftoolchain 53f2672ff78be42389cf41a8258f6e9ce36808fb

On my machine, ignore takes about 2 minutes to flag 200k commits as irrelevant. The split takes around 15 minutes to go through the remaining 50k.


* Re: git-subtree split misbehaviour with a commit having empty ls-tree for the specified subdir
  2019-12-22 14:01         ` Tom Clarkson
@ 2020-01-21 22:36           ` Ed Maste
  0 siblings, 0 replies; 10+ messages in thread
From: Ed Maste @ 2020-01-21 22:36 UTC (permalink / raw)
  To: Tom Clarkson; +Cc: git mailing list

On Sun, 22 Dec 2019 at 09:01, Tom Clarkson <tqclarkson@icloud.com> wrote:
>
> If you want to try out my update, it’s at  https://github.com/gitgitgadget/git/pull/493. The commands I ended up with were
>
> git subtree ignore --clear-cache --prefix=contrib/elftoolchain 4d43158
> git subtree use --prefix=contrib/elftoolchain 9e78763
> git subtree split --prefix=contrib/elftoolchain 53f2672ff78be42389cf41a8258f6e9ce36808fb

Thanks Tom, I was finally able to get back to this, and confirm it
works as I'd expect / desire. I'll continue experimenting with the
rest of the contrib/ software in FreeBSD; please let me know if
there's anything specific you'd like me to test with your patch set.


* Re: git-subtree split misbehaviour with a commit having empty ls-tree for the specified subdir
       [not found]         ` <DB65AE2F-12DE-43B7-8B20-4E173794CAF2@icloud.com>
@ 2020-04-28 18:08           ` Ed Maste
  0 siblings, 0 replies; 10+ messages in thread
From: Ed Maste @ 2020-04-28 18:08 UTC (permalink / raw)
  To: Tom Clarkson; +Cc: git mailing list

On Sun, 22 Dec 2019 at 08:50, Tom Clarkson <tqclarkson@icloud.com> wrote:
>
> If you want to try out my update, it’s at  https://github.com/gitgitgadget/git/pull/493. The commands I ended up with were
>
> git subtree ignore --clear-cache --prefix=contrib/elftoolchain 4d43158
> git subtree use --prefix=contrib/elftoolchain 9e78763
> git subtree split --prefix=contrib/elftoolchain 53f2672ff78be42389cf41a8258f6e9ce36808fb
>
> On my machine, ignore takes about 2 minutes to flag 200k commits as irrelevant. The split takes around 15 to go through the remaining 50k.

What's the next step with this patch set? Is there anything I can do to help?


* Re: git-subtree split misbehaviour with a commit having empty ls-tree for the specified subdir
  2019-12-20 15:56       ` Ed Maste
  2019-12-22 14:01         ` Tom Clarkson
       [not found]         ` <DB65AE2F-12DE-43B7-8B20-4E173794CAF2@icloud.com>
@ 2020-06-17 14:46         ` Ed Maste
  2020-06-18  1:13           ` Tom Clarkson
  2 siblings, 1 reply; 10+ messages in thread
From: Ed Maste @ 2020-06-17 14:46 UTC (permalink / raw)
  To: Tom Clarkson; +Cc: git mailing list

On Fri, 20 Dec 2019 at 10:56, Ed Maste <emaste@freebsd.org> wrote:
>
> On Wed, 18 Dec 2019 at 19:57, Tom Clarkson <tqclarkson@icloud.com> wrote:
> >
> > > Overall I think your proposed algorithm is reasonable (even though I
> > > think it won't address some of the cases in our repo). Will your
> > > algorithm allow us to pass $dir to git rev-list, for the initial
> > > split?
> >
> > Is this just for performance reasons? As I understand it that was left out because it would exclude relevant commits on an existing subtree, but it could make sense as an optimization for the first split of a large repo.
>
> Yes, it's for performance reasons on a first split that I'd like to
> see it. On the FreeBSD repo the difference is some 40 minutes vs. a
> few seconds.

Following up on this old thread, I plan to revisit the optimization,
implementing something on top of your work in
https://github.com/gitgitgadget/git/pull/493. I might look at adding a
--initial flag to subtree split, having it essentially auto-detect a
revision to use as the value for --onto. For the common case of an
initial merge commit with two parents I think we can relatively easily
determine which is the subtree parent. If that's not sufficiently
general (or broadly useful outside of our context) we could just
create a helper script wrapping `subtree split` tailored to the
FreeBSD cases. We have something like 100 projects we're looking to
split, as part of our svn to git migration.


* Re: git-subtree split misbehaviour with a commit having empty ls-tree for the specified subdir
  2020-06-17 14:46         ` Ed Maste
@ 2020-06-18  1:13           ` Tom Clarkson
  0 siblings, 0 replies; 10+ messages in thread
From: Tom Clarkson @ 2020-06-18  1:13 UTC (permalink / raw)
  To: Ed Maste; +Cc: git mailing list


> On 18 Jun 2020, at 12:46 am, Ed Maste <emaste@freebsd.org> wrote:
> 
> On Fri, 20 Dec 2019 at 10:56, Ed Maste <emaste@freebsd.org> wrote:
>> 
>> On Wed, 18 Dec 2019 at 19:57, Tom Clarkson <tqclarkson@icloud.com> wrote:
>>> 
>>>> Overall I think your proposed algorithm is reasonable (even though I
>>>> think it won't address some of the cases in our repo). Will your
>>>> algorithm allow us to pass $dir to git rev-list, for the initial
>>>> split?
>>> 
>>> Is this just for performance reasons? As I understand it that was left out because it would exclude relevant commits on an existing subtree, but it could make sense as an optimization for the first split of a large repo.
>> 
>> Yes, it's for performance reasons on a first split that I'd like to
>> see it. On the FreeBSD repo the difference is some 40 minutes vs. a
>> few seconds.
> 
> Following up on this old thread, I plan to revisit the optimization,
> implementing something on top of your work in
> https://github.com/gitgitgadget/git/pull/493. I might look at adding a
> --initial flag to subtree split, having it essentially auto-detect a
> revision to use as the value for --onto. For the common case of an
> initial merge commit with two parents I think we can relatively easily
> determine which is the subtree parent. If that's not sufficiently
> general (or broadly useful outside of our context) we could just
> create a helper script wrapping `subtree split` tailored to the
> FreeBSD cases. We have something like 100 projects we're looking to
> split, as part of our svn to git migration.

The new use command might be a better fit than onto in this case - it does the same thing as onto, except it also marks the commits as processed and therefore excludes them from the initial rev list.

Actually, on reading the code, I’m not sure onto does quite what the documentation suggests it does - by updating the cache it will shortcut processing of subtree commits that have already been merged into mainline, but has no mechanism for building onto an existing unrelated history.

Reliably differentiating subtree and mainline commits has always been tricky, but should be ok as part of an advanced flag/new command. Perhaps rev-list --merges <path> to find potential unmarked subtree merges, then take the one where the root tree matches the post merge subdir tree. No doubt it won’t catch everything, but I’d say that’s less of a risk than false positives.

In the context of a helper script, a new command or adding a --auto flag to use might be better than adding a flag to split - that way you could easily tell if the expected initial state was found rather than having to wait for the full process to produce something weird. 

That would also let you mark the other side of the merge as ignored mainline history - a significant optimization when you’re excluding 200k commits, but risky to include more generally.
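
The detection idea above (rev-list --merges on the path, then compare a parent's root tree with the post-merge subdir tree) could look roughly like the following. It is a sketch under assumptions, not the proposed implementation: the helper name find_subtree_parent, the vendor/ prefix and the toy history built with plumbing commands are hypothetical, and it simply reports the first match it finds.

```shell
#!/bin/sh
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.invalid
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.invalid

repo=$(mktemp -d)
cd "$repo"
git init -q .

# Hypothetical helper: find_subtree_parent <prefix> <tip>
# Prints "<merge> <subtree-parent>" for the first merge touching <prefix>
# where some parent's root tree equals the merge's tree at <prefix>.
find_subtree_parent() {
    prefix=${1%/}
    tip=$2
    for merge in $(git rev-list --merges "$tip" -- "$prefix"); do
        # Tree of the subdirectory as it exists after the merge.
        subdir_tree=$(git rev-parse --verify --quiet "$merge:$prefix") || continue
        for parent in $(git rev-list --parents -n 1 "$merge" | cut -d' ' -f2-); do
            if [ "$(git rev-parse "$parent^{tree}")" = "$subdir_tree" ]; then
                echo "$merge $parent"
                return 0
            fi
        done
    done
    return 1
}

# Toy history: an unmarked subtree merge that places upstream's root
# tree under vendor/ in mainline.
lib_blob=$(echo lib | git hash-object -w --stdin)
sub_tree=$(printf '100644 blob %s\tlib.c\n' "$lib_blob" | git mktree)
sub_commit=$(git commit-tree "$sub_tree" -m 'upstream import')

main_blob=$(echo main | git hash-object -w --stdin)
main_tree=$(printf '100644 blob %s\tmain.c\n' "$main_blob" | git mktree)
main_commit=$(git commit-tree "$main_tree" -m 'mainline work')

merged_tree=$(printf '100644 blob %s\tmain.c\n040000 tree %s\tvendor\n' \
    "$main_blob" "$sub_tree" | git mktree)
merge=$(git commit-tree "$merged_tree" -p "$main_commit" -p "$sub_commit" \
    -m 'merge vendor code')

find_subtree_parent vendor/ "$merge"
```

An exact tree match will miss merges like the elftoolchain one discussed earlier, where only some folders of the upstream repo were copied, which is in line with preferring missed detections over false positives.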


