Hi Tom,

On Thu, 12 Dec 2019, Tom Clarkson wrote:

>
> > This makes me wonder if the problem is perhaps related to the hardware
> > involved; maybe the algorithm is doing exactly what it should, but the
> > available RAM isn't sufficient. If that's the problem, perhaps we could
> > find a way to perform the recursive work without using actual
> > recursion, reducing the number of instances on the stack.
>
> It’s not so much hardware as OS, I think. After adding the stack depth (the indent parameter on check_parents) to the log, I have been able to get different results with different ulimit settings.

Do you mean to say that the stack overflow is reported as a segmentation
fault? If so, that message was sure a red herring...

Thanks,
Dscho

>
> With the default stack size on macOS of 8MB, it falls over at depth 445. Since that is less than the shortest path to the root commit, it matches my initial count, which was just the number of lines in the log.
>
> Reducing the stack size with ulimit -s 4096 makes it fall over at depth 225.
>
> Increasing to the hard limit of 64MB should allow a depth of around 4000, and as it turns out that did allow the script to complete, reaching a maximum depth of 1148.
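
As a rough sanity check on those numbers, the two failure points imply
about the same stack cost per recursion level. The sketch below is only
illustrative: kb_per_level and max_depth are made-up helper names, and it
assumes that stack use grows linearly with depth and that `ulimit -s`
reports KiB (as bash does).

```shell
# Hypothetical helpers, not part of git-subtree.
# Stack size (KiB) divided by the depth at which the script died
# gives an integer estimate of the stack used per recursion level.
kb_per_level () {
	echo $(( $1 / $2 ))
}

# Estimated depth that fits into a given stack size (KiB)
# at a given cost per level (KiB).
max_depth () {
	echo $(( $1 / $2 ))
}

kb_per_level 8192 445	# default 8MB stack, died at depth 445
kb_per_level 4096 225	# ulimit -s 4096, died at depth 225
max_depth 65536 18	# 64MB hard limit at 18 KiB per level
```

Both measurements come out at 18 KiB per level, which puts the 64MB
limit at roughly 3640 levels: the same ballpark as the estimate of
around 4000, and comfortably above the observed maximum of 1148.
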
>
> I’m not seeing any issues with the hashes being wrong (all show no parents or subtree) but processing all those commits that resolve to nothing does take forever.
>
> The mainline commit test seems to work ok on my repo, but it’s fairly easy to see scenarios where it would break, such as having a subfolder with the same name within the subtree.
>
> So while part of the fix will be a more reliable test, it also needs to work before parent commits are processed to mitigate the recursion issues.
>
> The rules I have come up with so far are below. There are still scenarios where the recursion is unavoidable, such as running an initial split on a large repo, but that should be much less common than using a small subtree with a more complex existing repo.
>
> In the initial setup of cmd_split, collect some extra information:
>
> 	- Add rev-list of all git-subtree-split values to the cache. I’d expect subtrees to usually be smaller than mainline, but since we can do that non-recursively we may as well.
>
> 	- Find the git-subtree-mainline value from subtree add/rejoin. Anything in its rev list should only be reachable by mainline commits. If not (which probably requires doing something convoluted like having subtree include mainline as its own subtree), this is a good place to check that and fall back to the existing behavior.
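
The first bullet could look something like the following. It is an
untested sketch: extract_trailer and list_known_subtree_commits are
hypothetical names, $dir is assumed to hold the subtree prefix as in
cmd_split, and the actual cache update is left out.

```shell
# Print the values of one trailer (e.g. git-subtree-split) found in
# the commit messages read from stdin.
extract_trailer () {
	sed -n "s/^$1: *//p"
}

# Hypothetical: list every commit reachable from any recorded split.
# A single rev-list call walks the history itself, so no shell
# recursion is involved. Extra arguments are passed on to git log.
list_known_subtree_commits () {
	git log --format=%B --grep="^git-subtree-dir: $dir/*\$" "$@" |
	extract_trailer git-subtree-split |
	git rev-list --stdin
}
```

The same extract_trailer call with git-subtree-mainline would give the
starting point for the second bullet.
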
>
>
> When processing each commit:
>
> If no prior splits were found, we only have mainline commits.
>
> 	- If $dir exists, it is a mainline commit needing copy - use existing process.
> 	- If $dir does not exist, it is a mainline commit that will map to nothing - no need to process further.
>
> If we do have some known subtree commits:
>
> 	- If it is in the cache, it is a subtree commit we don’t need to process further.
> 	- If no subtree root is reachable (rev-list or merge-base), it must be a mainline commit from before the subtree add. Map to nothing and skip further processing.
> 	- If any subtree root is reachable, could be either mainline commit with subtree merged in, or subtree commit newer than the last add/squash (subtree pull/merge without squash does not use a custom commit message)
> 		- If $dir does not exist, must be subtree - add to the cache as mapped to self, no need to process parents.
> 		- If $dir does exist, it is either a mainline commit to be processed normally, or a subtree commit that happens to contain a folder with the same name. Check if mainline root is reachable.
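
The rules above can be written down as a decision function. This is a
sketch, not a patch: classify_commit and its yes/no arguments are
invented for illustration, and the outcome of the last check (mainline
root reachable means mainline, otherwise subtree) is an interpretation
of the final bullet rather than something stated there.

```shell
# Hypothetical helper. Arguments are "yes" or "no":
#   $1 prior splits were found
#   $2 commit is already in the cache
#   $3 some subtree root is reachable from the commit
#   $4 $dir exists in the commit
#   $5 mainline root is reachable from the commit
# Prints one of: process, skip, cached, self
classify_commit () {
	if test "$1" = no
	then
		# No prior splits: only mainline commits can exist.
		if test "$4" = yes
		then
			echo process
		else
			echo skip
		fi
	elif test "$2" = yes
	then
		# Known subtree commit, nothing more to do.
		echo cached
	elif test "$3" = no
	then
		# Mainline commit from before the subtree add.
		echo skip
	elif test "$4" = no
	then
		# Subtree commit: map to self, do not visit parents.
		echo self
	elif test "$5" = yes
	then
		# Mainline commit, handled by the existing process.
		echo process
	else
		# Assumed: subtree commit containing a same-named folder.
		echo self
	fi
}

classify_commit yes no yes no no
```

In a real script the reachability inputs might come from something
like `git merge-base --is-ancestor <root> <commit>`, which answers the
question without walking parents in shell.
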
>
>
>
>