git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / mirror / code / Atom feed
* [PATCH 0/7] subtree: Fix handling of complex history
@ 2020-05-11  5:49 Tom Clarkson via GitGitGadget
  2020-05-11  5:49 ` [PATCH 1/7] subtree: handle multiple parents passed to cache_miss Tom Clarkson via GitGitGadget
                   ` (8 more replies)
  0 siblings, 9 replies; 28+ messages in thread
From: Tom Clarkson via GitGitGadget @ 2020-05-11  5:49 UTC (permalink / raw)
  To: git; +Cc: Avery Pennarun, Tom Clarkson

Fixes several issues that could occur when running subtree split on large
repos with more complex history.

 1. A merge commit could bypass the known start point of the subtree, which
    would cause the entire history to be processed recursively, leading to a
    stack overflow / segfault after reading a few hundred commits. Older
    commits are now explicitly recorded as irrelevant so that the recursive
    process can terminate on any mainline commit rather than only on subtree
    joins and initial commits.
    
    
 2. It is possible for a repo to contain subtrees that lack the metadata
    that is usually present in add/join commit messages (git-svn at least
    can produce such a structure). The new use/ignore/map commands allow the
    user to provide that information for any problematic commits.
    
    
 3. A mainline commit that does not contain the subtree folder could be
    erroneously identified as a subtree commit, which would add the entire
    mainline history to the subtree. Commits will now only be used as is if
    all their parents are already identified as subtree commits. While the
    new code can still be tripped up by unusual folder structures, the
    completely unambiguous solution turned out to involve a significant
    performance penalty, and the new ignore / use commands provide a
    workaround for that scenario.

Tom Clarkson (7):
  subtree: handle multiple parents passed to cache_miss
  subtree: exclude commits predating add from recursive processing
  subtree: persist cache between split runs
  subtree: add git subtree map command
  subtree: add git subtree use and ignore commands
  subtree: more robustly distinguish subtree and mainline commits
  subtree: document new subtree commands

 contrib/subtree/git-subtree.sh  | 185 ++++++++++++++++++++++++++------
 contrib/subtree/git-subtree.txt |  24 +++++
 2 files changed, 177 insertions(+), 32 deletions(-)


base-commit: af6b65d45ef179ed52087e80cb089f6b2349f4ec
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-493%2Ftqc%2Ftqc%2Fsubtree-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-493/tqc/tqc/subtree-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/493
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 1/7] subtree: handle multiple parents passed to cache_miss
  2020-05-11  5:49 [PATCH 0/7] subtree: Fix handling of complex history Tom Clarkson via GitGitGadget
@ 2020-05-11  5:49 ` Tom Clarkson via GitGitGadget
  2020-05-11  5:49 ` [PATCH 2/7] subtree: exclude commits predating add from recursive processing Tom Clarkson via GitGitGadget
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 28+ messages in thread
From: Tom Clarkson via GitGitGadget @ 2020-05-11  5:49 UTC (permalink / raw)
  To: git; +Cc: Avery Pennarun, Tom Clarkson, Tom Clarkson

From: Tom Clarkson <tom@tqclarkson.com>

Signed-off-by: Tom Clarkson <tom@tqclarkson.com>
---
 contrib/subtree/git-subtree.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/contrib/subtree/git-subtree.sh b/contrib/subtree/git-subtree.sh
index 868e18b9a1a..9867718503c 100755
--- a/contrib/subtree/git-subtree.sh
+++ b/contrib/subtree/git-subtree.sh
@@ -238,7 +238,7 @@ cache_miss () {
 }
 
 check_parents () {
-	missed=$(cache_miss "$1")
+	missed=$(cache_miss $1)
 	local indent=$(($2 + 1))
 	for miss in $missed
 	do
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 2/7] subtree: exclude commits predating add from recursive processing
  2020-05-11  5:49 [PATCH 0/7] subtree: Fix handling of complex history Tom Clarkson via GitGitGadget
  2020-05-11  5:49 ` [PATCH 1/7] subtree: handle multiple parents passed to cache_miss Tom Clarkson via GitGitGadget
@ 2020-05-11  5:49 ` Tom Clarkson via GitGitGadget
  2020-05-11  5:49 ` [PATCH 3/7] subtree: persist cache between split runs Tom Clarkson via GitGitGadget
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 28+ messages in thread
From: Tom Clarkson via GitGitGadget @ 2020-05-11  5:49 UTC (permalink / raw)
  To: git; +Cc: Avery Pennarun, Tom Clarkson, Tom Clarkson

From: Tom Clarkson <tom@tqclarkson.com>

Include recursion depth in debug logs so we can see when the recursion is
getting out of hand.

Making the cache handle null mappings correctly and adding older commits
to the cache allows the recursive algorithm to terminate at any point on
mainline rather than needing to reach either the add point or the initial
commit.

Signed-off-by: Tom Clarkson <tom@tqclarkson.com>
---
 contrib/subtree/git-subtree.sh | 37 +++++++++++++++++++++++++++++++---
 1 file changed, 34 insertions(+), 3 deletions(-)

diff --git a/contrib/subtree/git-subtree.sh b/contrib/subtree/git-subtree.sh
index 9867718503c..da0eede6979 100755
--- a/contrib/subtree/git-subtree.sh
+++ b/contrib/subtree/git-subtree.sh
@@ -244,7 +244,7 @@ check_parents () {
 	do
 		if ! test -r "$cachedir/notree/$miss"
 		then
-			debug "  incorrect order: $miss"
+			debug "  unprocessed parent commit: $miss ($indent)"
 			process_split_commit "$miss" "" "$indent"
 		fi
 	done
@@ -392,6 +392,26 @@ find_existing_splits () {
 	done
 }
 
+find_mainline_ref () {
+	debug "Looking for first split..."
+	dir="$1"
+	revs="$2"
+	main=
+	sub=
+	local grep_format="^git-subtree-dir: $dir/*\$"
+	git log --reverse --grep="$grep_format" \
+		--no-show-signature --pretty=format:'START %H%n%s%n%n%b%nEND%n' $revs |
+	while read a b junk
+	do
+		case "$a" in
+		git-subtree-mainline:)
+			echo "$b"
+			return
+			;;
+		esac
+	done
+}
+
 copy_commit () {
 	# We're going to set some environment vars here, so
 	# do it in a subshell to get rid of them safely later
@@ -646,9 +666,9 @@ process_split_commit () {
 
 	progress "$revcount/$revmax ($createcount) [$extracount]"
 
-	debug "Processing commit: $rev"
+	debug "Processing commit: $rev ($indent)"
 	exists=$(cache_get "$rev")
-	if test -n "$exists"
+	if test -z "$(cache_miss "$rev")"
 	then
 		debug "  prior: $exists"
 		return
@@ -773,6 +793,17 @@ cmd_split () {
 
 	unrevs="$(find_existing_splits "$dir" "$revs")"
 
+	mainline="$(find_mainline_ref "$dir" "$revs")"
+	if test -n "$mainline"
+	then
+		debug "Mainline $mainline predates subtree add"
+		git rev-list --topo-order --skip=1 $mainline |
+		while read rev
+		do
+			cache_set "$rev" ""
+		done || exit $?
+	fi
+
 	# We can't restrict rev-list to only $dir here, because some of our
 	# parents have the $dir contents the root, and those won't match.
 	# (and rev-list --follow doesn't seem to solve this)
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 3/7] subtree: persist cache between split runs
  2020-05-11  5:49 [PATCH 0/7] subtree: Fix handling of complex history Tom Clarkson via GitGitGadget
  2020-05-11  5:49 ` [PATCH 1/7] subtree: handle multiple parents passed to cache_miss Tom Clarkson via GitGitGadget
  2020-05-11  5:49 ` [PATCH 2/7] subtree: exclude commits predating add from recursive processing Tom Clarkson via GitGitGadget
@ 2020-05-11  5:49 ` Tom Clarkson via GitGitGadget
  2020-05-11  5:49 ` [PATCH 4/7] subtree: add git subtree map command Tom Clarkson via GitGitGadget
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 28+ messages in thread
From: Tom Clarkson via GitGitGadget @ 2020-05-11  5:49 UTC (permalink / raw)
  To: git; +Cc: Avery Pennarun, Tom Clarkson, Tom Clarkson

From: Tom Clarkson <tom@tqclarkson.com>

Provide a mechanism for handling problematic commits. If the algorithm
in process_split_commit is getting something wrong, you can write a
corrected value to the cache before running split.

Signed-off-by: Tom Clarkson <tom@tqclarkson.com>
---
 contrib/subtree/git-subtree.sh | 37 ++++++++++++++++++++++++++--------
 1 file changed, 29 insertions(+), 8 deletions(-)

diff --git a/contrib/subtree/git-subtree.sh b/contrib/subtree/git-subtree.sh
index da0eede6979..90f92f4e949 100755
--- a/contrib/subtree/git-subtree.sh
+++ b/contrib/subtree/git-subtree.sh
@@ -27,6 +27,7 @@ b,branch=     create a new branch from the split subtree
 ignore-joins  ignore prior --rejoin commits
 onto=         try connecting new tree to an existing one
 rejoin        merge the new branch back into HEAD
+clear-cache   reset the subtree mapping cache
  options for 'add', 'merge', and 'pull'
 squash        merge subtree changes as a single commit
 "
@@ -48,6 +49,7 @@ annotate=
 squash=
 message=
 prefix=
+clearcache=
 
 debug () {
 	if test -n "$debug"
@@ -131,6 +133,9 @@ do
 	--no-rejoin)
 		rejoin=
 		;;
+	--clear-cache)
+		clearcache=1
+		;;
 	--ignore-joins)
 		ignore_joins=1
 		;;
@@ -206,9 +211,13 @@ debug "opts: {$*}"
 debug
 
 cache_setup () {
-	cachedir="$GIT_DIR/subtree-cache/$$"
-	rm -rf "$cachedir" ||
-		die "Can't delete old cachedir: $cachedir"
+	cachedir="$GIT_DIR/subtree-cache/$prefix"
+	if test -n "$clearcache"
+	then
+		debug "Clearing cache"
+		rm -rf "$cachedir" ||
+			die "Can't delete old cachedir: $cachedir"
+	fi
 	mkdir -p "$cachedir" ||
 		die "Can't create new cachedir: $cachedir"
 	mkdir -p "$cachedir/notree" ||
@@ -266,6 +275,16 @@ cache_set () {
 	echo "$newrev" >"$cachedir/$oldrev"
 }
 
+cache_set_if_unset () {
+	oldrev="$1"
+	newrev="$2"
+	if test -e "$cachedir/$oldrev"
+	then
+		return
+	fi
+	echo "$newrev" >"$cachedir/$oldrev"
+}
+
 rev_exists () {
 	if git rev-parse "$1" >/dev/null 2>&1
 	then
@@ -375,13 +394,13 @@ find_existing_splits () {
 			then
 				# squash commits refer to a subtree
 				debug "  Squash: $sq from $sub"
-				cache_set "$sq" "$sub"
+				cache_set_if_unset "$sq" "$sub"
 			fi
 			if test -n "$main" -a -n "$sub"
 			then
 				debug "  Prior: $main -> $sub"
-				cache_set $main $sub
-				cache_set $sub $sub
+				cache_set_if_unset $main $sub
+				cache_set_if_unset $sub $sub
 				try_remove_previous "$main"
 				try_remove_previous "$sub"
 			fi
@@ -690,6 +709,8 @@ process_split_commit () {
 		if test -n "$newparents"
 		then
 			cache_set "$rev" "$rev"
+		else
+			cache_set "$rev" ""
 		fi
 		return
 	fi
@@ -787,7 +808,7 @@ cmd_split () {
 			# the 'onto' history is already just the subdir, so
 			# any parent we find there can be used verbatim
 			debug "  cache: $rev"
-			cache_set "$rev" "$rev"
+			cache_set_if_unset "$rev" "$rev"
 		done
 	fi
 
@@ -800,7 +821,7 @@ cmd_split () {
 		git rev-list --topo-order --skip=1 $mainline |
 		while read rev
 		do
-			cache_set "$rev" ""
+			cache_set_if_unset "$rev" ""
 		done || exit $?
 	fi
 
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 4/7] subtree: add git subtree map command
  2020-05-11  5:49 [PATCH 0/7] subtree: Fix handling of complex history Tom Clarkson via GitGitGadget
                   ` (2 preceding siblings ...)
  2020-05-11  5:49 ` [PATCH 3/7] subtree: persist cache between split runs Tom Clarkson via GitGitGadget
@ 2020-05-11  5:49 ` Tom Clarkson via GitGitGadget
  2020-05-11  5:49 ` [PATCH 5/7] subtree: add git subtree use and ignore commands Tom Clarkson via GitGitGadget
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 28+ messages in thread
From: Tom Clarkson via GitGitGadget @ 2020-05-11  5:49 UTC (permalink / raw)
  To: git; +Cc: Avery Pennarun, Tom Clarkson, Tom Clarkson

From: Tom Clarkson <tom@tqclarkson.com>

Adds an entry to the subtree cache so that subsequent split runs can skip
any commits that turn out to be problematic.

Signed-off-by: Tom Clarkson <tom@tqclarkson.com>
---
 contrib/subtree/git-subtree.sh | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/contrib/subtree/git-subtree.sh b/contrib/subtree/git-subtree.sh
index 90f92f4e949..f8ae1283cd3 100755
--- a/contrib/subtree/git-subtree.sh
+++ b/contrib/subtree/git-subtree.sh
@@ -15,6 +15,7 @@ git subtree merge --prefix=<prefix> <commit>
 git subtree pull  --prefix=<prefix> <repository> <ref>
 git subtree push  --prefix=<prefix> <repository> <ref>
 git subtree split --prefix=<prefix> <commit>
+git subtree map   --prefix=<prefix> <mainline> <subtree>
 --
 h,help        show the help
 q             quiet
@@ -161,7 +162,7 @@ command="$1"
 shift
 
 case "$command" in
-add|merge|pull)
+add|merge|pull|map)
 	default=
 	;;
 split|push)
@@ -192,7 +193,8 @@ dir="$(dirname "$prefix/.")"
 
 if test "$command" != "pull" &&
 		test "$command" != "add" &&
-		test "$command" != "push"
+		test "$command" != "push" &&
+		test "$command" != "map"
 then
 	revs=$(git rev-parse $default --revs-only "$@") || exit $?
 	dirs=$(git rev-parse --no-revs --no-flags "$@") || exit $?
@@ -795,6 +797,21 @@ cmd_add_commit () {
 	say "Added dir '$dir'"
 }
 
+cmd_map () {
+	oldrev="$1"
+	newrev="$2"
+
+	if test -z "$oldrev"
+	then
+		die "You must provide a revision to map"
+	fi
+
+	cache_setup || exit $?
+	cache_set "$oldrev" "$newrev"
+
+	say "Mapped $oldrev => $newrev"
+}
+
 cmd_split () {
 	debug "Splitting $dir..."
 	cache_setup || exit $?
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 5/7] subtree: add git subtree use and ignore commands
  2020-05-11  5:49 [PATCH 0/7] subtree: Fix handling of complex history Tom Clarkson via GitGitGadget
                   ` (3 preceding siblings ...)
  2020-05-11  5:49 ` [PATCH 4/7] subtree: add git subtree map command Tom Clarkson via GitGitGadget
@ 2020-05-11  5:49 ` Tom Clarkson via GitGitGadget
  2020-05-11  5:50 ` [PATCH 6/7] subtree: more robustly distinguish subtree and mainline commits Tom Clarkson via GitGitGadget
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 28+ messages in thread
From: Tom Clarkson via GitGitGadget @ 2020-05-11  5:49 UTC (permalink / raw)
  To: git; +Cc: Avery Pennarun, Tom Clarkson, Tom Clarkson

From: Tom Clarkson <tom@tqclarkson.com>

Tell split to use or ignore larger sections of the history. In most cases
split does this automatically based on metadata from subtree add.

Signed-off-by: Tom Clarkson <tom@tqclarkson.com>
---
 contrib/subtree/git-subtree.sh | 78 ++++++++++++++++++++++++++++------
 1 file changed, 66 insertions(+), 12 deletions(-)

diff --git a/contrib/subtree/git-subtree.sh b/contrib/subtree/git-subtree.sh
index f8ae1283cd3..1ca8b2f1101 100755
--- a/contrib/subtree/git-subtree.sh
+++ b/contrib/subtree/git-subtree.sh
@@ -9,13 +9,15 @@ then
 	set -- -h
 fi
 OPTS_SPEC="\
-git subtree add   --prefix=<prefix> <commit>
-git subtree add   --prefix=<prefix> <repository> <ref>
-git subtree merge --prefix=<prefix> <commit>
-git subtree pull  --prefix=<prefix> <repository> <ref>
-git subtree push  --prefix=<prefix> <repository> <ref>
-git subtree split --prefix=<prefix> <commit>
-git subtree map   --prefix=<prefix> <mainline> <subtree>
+git subtree add    --prefix=<prefix> <commit>
+git subtree add    --prefix=<prefix> <repository> <ref>
+git subtree merge  --prefix=<prefix> <commit>
+git subtree pull   --prefix=<prefix> <repository> <ref>
+git subtree push   --prefix=<prefix> <repository> <ref>
+git subtree split  --prefix=<prefix> <commit>
+git subtree map    --prefix=<prefix> <mainline> <subtree>
+git subtree ignore --prefix=<prefix> <commit>
+git subtree use    --prefix=<prefix> <commit>
 --
 h,help        show the help
 q             quiet
@@ -162,7 +164,7 @@ command="$1"
 shift
 
 case "$command" in
-add|merge|pull|map)
+add|merge|pull|map|ignore|use)
 	default=
 	;;
 split|push)
@@ -433,6 +435,18 @@ find_mainline_ref () {
 	done
 }
 
+exclude_processed_refs () {
+		if test -r "$cachedir/processed"
+		then
+			cat "$cachedir/processed" |
+			while read rev
+			do
+				debug "read $rev"
+				echo "^$rev"
+			done
+		fi
+}
+
 copy_commit () {
 	# We're going to set some environment vars here, so
 	# do it in a subshell to get rid of them safely later
@@ -798,20 +812,60 @@ cmd_add_commit () {
 }
 
 cmd_map () {
-	oldrev="$1"
-	newrev="$2"
 
-	if test -z "$oldrev"
+	if test -z "$1"
 	then
 		die "You must provide a revision to map"
 	fi
 
+	oldrev=$(git rev-parse --revs-only "$1") || exit $?
+	newrev=
+
+	if test -n "$2"
+	then
+		newrev=$(git rev-parse --revs-only "$2") || exit $?
+	fi
+
 	cache_setup || exit $?
 	cache_set "$oldrev" "$newrev"
 
 	say "Mapped $oldrev => $newrev"
 }
 
+cmd_ignore () {
+	revs=$(git rev-parse $default --revs-only "$@") || exit $?
+	ensure_single_rev $revs
+
+	say "Ignoring $revs"
+
+	cache_setup || exit $?
+
+	git rev-list $revs |
+	while read rev
+	do
+		cache_set "$rev" ""
+	done
+
+	echo "$revs" >>"$cachedir/processed"
+}
+
+cmd_use () {
+	revs=$(git rev-parse $default --revs-only "$@") || exit $?
+	ensure_single_rev $revs
+
+	say "Using existing subtree $revs"
+
+	cache_setup || exit $?
+
+	git rev-list $revs |
+	while read rev
+	do
+		cache_set "$rev" "$rev"
+	done
+
+	echo "$revs" >>"$cachedir/processed"
+}
+
 cmd_split () {
 	debug "Splitting $dir..."
 	cache_setup || exit $?
@@ -829,7 +883,7 @@ cmd_split () {
 		done
 	fi
 
-	unrevs="$(find_existing_splits "$dir" "$revs")"
+	unrevs="$(find_existing_splits "$dir" "$revs") $(exclude_processed_refs)"
 
 	mainline="$(find_mainline_ref "$dir" "$revs")"
 	if test -n "$mainline"
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 6/7] subtree: more robustly distinguish subtree and mainline commits
  2020-05-11  5:49 [PATCH 0/7] subtree: Fix handling of complex history Tom Clarkson via GitGitGadget
                   ` (4 preceding siblings ...)
  2020-05-11  5:49 ` [PATCH 5/7] subtree: add git subtree use and ignore commands Tom Clarkson via GitGitGadget
@ 2020-05-11  5:50 ` Tom Clarkson via GitGitGadget
  2020-05-11  5:50 ` [PATCH 7/7] subtree: document new subtree commands Tom Clarkson via GitGitGadget
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 28+ messages in thread
From: Tom Clarkson via GitGitGadget @ 2020-05-11  5:50 UTC (permalink / raw)
  To: git; +Cc: Avery Pennarun, Tom Clarkson, Tom Clarkson

From: Tom Clarkson <tom@tqclarkson.com>

Prevent a mainline commit without $dir being treated as a subtree
commit and pulling in the entire mainline history. Any valid subtree
commit will have only valid subtree commits as parents, which will be
unchanged by check_parents.

Signed-off-by: Tom Clarkson <tom@tqclarkson.com>
---
 contrib/subtree/git-subtree.sh | 24 +++++++++++-------------
 1 file changed, 11 insertions(+), 13 deletions(-)

diff --git a/contrib/subtree/git-subtree.sh b/contrib/subtree/git-subtree.sh
index 1ca8b2f1101..320e51b7e1a 100755
--- a/contrib/subtree/git-subtree.sh
+++ b/contrib/subtree/git-subtree.sh
@@ -224,8 +224,6 @@ cache_setup () {
 	fi
 	mkdir -p "$cachedir" ||
 		die "Can't create new cachedir: $cachedir"
-	mkdir -p "$cachedir/notree" ||
-		die "Can't create new cachedir: $cachedir/notree"
 	debug "Using cachedir: $cachedir" >&2
 }
 
@@ -255,18 +253,11 @@ check_parents () {
 	local indent=$(($2 + 1))
 	for miss in $missed
 	do
-		if ! test -r "$cachedir/notree/$miss"
-		then
-			debug "  unprocessed parent commit: $miss ($indent)"
-			process_split_commit "$miss" "" "$indent"
-		fi
+		debug "  unprocessed parent commit: $miss ($indent)"
+		process_split_commit "$miss" "" "$indent"
 	done
 }
 
-set_notree () {
-	echo "1" > "$cachedir/notree/$1"
-}
-
 cache_set () {
 	oldrev="$1"
 	newrev="$2"
@@ -721,11 +712,18 @@ process_split_commit () {
 	# vs. a mainline commit?  Does it matter?
 	if test -z "$tree"
 	then
-		set_notree "$rev"
 		if test -n "$newparents"
 		then
-			cache_set "$rev" "$rev"
+			if test "$newparents" = "$parents"
+			then
+				# if all parents were subtrees, this can be a subtree commit
+				cache_set "$rev" "$rev"
+			else
+				# a mainline commit with tree missing is equivalent to the initial commit
+				cache_set "$rev" ""
+			fi
 		else
+			# no parents with valid subtree mappings means a commit prior to subtree add
 			cache_set "$rev" ""
 		fi
 		return
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 7/7] subtree: document new subtree commands
  2020-05-11  5:49 [PATCH 0/7] subtree: Fix handling of complex history Tom Clarkson via GitGitGadget
                   ` (5 preceding siblings ...)
  2020-05-11  5:50 ` [PATCH 6/7] subtree: more robustly distinguish subtree and mainline commits Tom Clarkson via GitGitGadget
@ 2020-05-11  5:50 ` Tom Clarkson via GitGitGadget
  2020-10-04 17:52 ` [PATCH 0/7] subtree: Fix handling of complex history Ed Maste
  2020-10-06 22:05 ` [PATCH v2 " Tom Clarkson via GitGitGadget
  8 siblings, 0 replies; 28+ messages in thread
From: Tom Clarkson via GitGitGadget @ 2020-05-11  5:50 UTC (permalink / raw)
  To: git; +Cc: Avery Pennarun, Tom Clarkson, Tom Clarkson

From: Tom Clarkson <tom@tqclarkson.com>

Signed-off-by: Tom Clarkson <tom@tqclarkson.com>
---
 contrib/subtree/git-subtree.txt | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/contrib/subtree/git-subtree.txt b/contrib/subtree/git-subtree.txt
index 352deda69dc..a5a76e8ce69 100644
--- a/contrib/subtree/git-subtree.txt
+++ b/contrib/subtree/git-subtree.txt
@@ -52,6 +52,12 @@ useful elsewhere, you can extract its entire history and publish
 that as its own git repository, without accidentally
 intermingling the history of your application project.
 
+Although the relationship between subtree and mainline commits is stored
+in regular git history, it is also cached between subtree runs. In most
+cases this is merely a performance improvement, but for projects with
+large and complex histories the cache can be manipulated directly
+with the use, ignore and map commands.
+
 [TIP]
 In order to keep your commit messages clean, we recommend that
 people split their commits between the subtrees and the main
@@ -120,6 +126,21 @@ and friends will work as expected.
 Note that if you use '--squash' when you merge, you should usually not
 just '--rejoin' when you split.
 
+ignore::
+	Mark a commit and all of its history as irrelevant to subtree split.
+	In most cases this would be handled automatically based on metadata
+	from subtree join commits. Intended for improving performance on
+	extremely large repos and excluding complex history that turns out
+	to be otherwise problematic.
+
+use::
+	Mark a commit and all of its history as part of an existing subtree.
+	In normal circumstances this would be handled based on the metadata
+	from the subtree join commit. Similar to the --onto option of split.
+
+map::
+	Manually override the normal output of split for a particular commit.
+	Extreme flexibility for advanced troubleshooting purposes only.
 
 OPTIONS
 -------
@@ -142,6 +163,9 @@ OPTIONS
 	This option is only valid for add, merge and pull (unsure).
 	Specify <message> as the commit message for the merge commit.
 
+--clear-cache::
+	Reset the subtree cache and recalculate all subtree mappings from the
+	commit history
 
 OPTIONS FOR add, merge, push, pull
 ----------------------------------
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/7] subtree: Fix handling of complex history
  2020-05-11  5:49 [PATCH 0/7] subtree: Fix handling of complex history Tom Clarkson via GitGitGadget
                   ` (6 preceding siblings ...)
  2020-05-11  5:50 ` [PATCH 7/7] subtree: document new subtree commands Tom Clarkson via GitGitGadget
@ 2020-10-04 17:52 ` Ed Maste
  2020-10-04 19:27   ` Johannes Schindelin
  2020-10-06 22:05 ` [PATCH v2 " Tom Clarkson via GitGitGadget
  8 siblings, 1 reply; 28+ messages in thread
From: Ed Maste @ 2020-10-04 17:52 UTC (permalink / raw)
  To: Tom Clarkson via GitGitGadget
  Cc: git mailing list, Avery Pennarun, Tom Clarkson

On Mon, 11 May 2020 at 01:50, Tom Clarkson via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> Fixes several issues that could occur when running subtree split on large
> repos with more complex history.

I've been using this patch set locally and would very much like to see it land.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/7] subtree: Fix handling of complex history
  2020-10-04 17:52 ` [PATCH 0/7] subtree: Fix handling of complex history Ed Maste
@ 2020-10-04 19:27   ` Johannes Schindelin
  2020-10-05 16:47     ` Junio C Hamano
  2020-10-05 21:37     ` Ed Maste
  0 siblings, 2 replies; 28+ messages in thread
From: Johannes Schindelin @ 2020-10-04 19:27 UTC (permalink / raw)
  To: Ed Maste
  Cc: Tom Clarkson via GitGitGadget, git mailing list, Avery Pennarun,
	Tom Clarkson

Hi,

On Sun, 4 Oct 2020, Ed Maste wrote:

> On Mon, 11 May 2020 at 01:50, Tom Clarkson via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
> >
> > Fixes several issues that could occur when running subtree split on large
> > repos with more complex history.
>
> I've been using this patch set locally and would very much like to see it land.

FWIW there have been more comments in favor of this patch in
https://github.com/gitgitgadget/git/pull/493.

I guess what this patch series needs to proceed further is an ACK by
Avery?

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/7] subtree: Fix handling of complex history
  2020-10-04 19:27   ` Johannes Schindelin
@ 2020-10-05 16:47     ` Junio C Hamano
  2020-10-05 21:37     ` Ed Maste
  1 sibling, 0 replies; 28+ messages in thread
From: Junio C Hamano @ 2020-10-05 16:47 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Ed Maste, Tom Clarkson via GitGitGadget, git mailing list,
	Avery Pennarun, Tom Clarkson

Johannes Schindelin <Johannes.Schindelin@gmx.de> writes:

> On Sun, 4 Oct 2020, Ed Maste wrote:
>
>> On Mon, 11 May 2020 at 01:50, Tom Clarkson via GitGitGadget
>> <gitgitgadget@gmail.com> wrote:
>> >
>> > Fixes several issues that could occur when running subtree split on large
>> > repos with more complex history.
>>
>> I've been using this patch set locally and would very much like to see it land.
>
> FWIW there have been more comments in favor of this patch in
> https://github.com/gitgitgadget/git/pull/493.
>
> I guess what this patch series needs to proceed further is an ACK by
> Avery?

Not necessarily by Avery, but by somebody who knows the code better
than us ;-).


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/7] subtree: Fix handling of complex history
  2020-10-04 19:27   ` Johannes Schindelin
  2020-10-05 16:47     ` Junio C Hamano
@ 2020-10-05 21:37     ` Ed Maste
  2020-10-07 16:31       ` Johannes Schindelin
  1 sibling, 1 reply; 28+ messages in thread
From: Ed Maste @ 2020-10-05 21:37 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Tom Clarkson via GitGitGadget, git mailing list, Avery Pennarun,
	Tom Clarkson

On Mon, 5 Oct 2020 at 09:18, Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
>
> FWIW there have been more comments in favor of this patch in
> https://github.com/gitgitgadget/git/pull/493.
>
> I guess what this patch series needs to proceed further is an ACK by
> Avery?

Avery says "if it's good enough for Johannes it's good enough for me"
https://twitter.com/apenwarr/status/1313231132721401861

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v2 0/7] subtree: Fix handling of complex history
  2020-05-11  5:49 [PATCH 0/7] subtree: Fix handling of complex history Tom Clarkson via GitGitGadget
                   ` (7 preceding siblings ...)
  2020-10-04 17:52 ` [PATCH 0/7] subtree: Fix handling of complex history Ed Maste
@ 2020-10-06 22:05 ` Tom Clarkson via GitGitGadget
  2020-10-06 22:05   ` [PATCH v2 1/7] subtree: handle multiple parents passed to cache_miss Tom Clarkson via GitGitGadget
                     ` (7 more replies)
  8 siblings, 8 replies; 28+ messages in thread
From: Tom Clarkson via GitGitGadget @ 2020-10-06 22:05 UTC (permalink / raw)
  To: git; +Cc: Avery Pennarun, Ed Maste, Johannes Schindelin, Tom Clarkson

Fixes several issues that could occur when running subtree split on large
repos with more complex history.

 1. A merge commit could bypass the known start point of the subtree, which
    would cause the entire history to be processed recursively, leading to a
    stack overflow / segfault after reading a few hundred commits. Older
    commits are now explicitly recorded as irrelevant so that the recursive
    process can terminate on any mainline commit rather than only on subtree
    joins and initial commits.
    
    
 2. It is possible for a repo to contain subtrees that lack the metadata
    that is usually present in add/join commit messages (git-svn at least
    can produce such a structure). The new use/ignore/map commands allow the
    user to provide that information for any problematic commits.
    
    
 3. A mainline commit that does not contain the subtree folder could be
    erroneously identified as a subtree commit, which would add the entire
    mainline history to the subtree. Commits will now only be used as is if
    all their parents are already identified as subtree commits. While the
    new code can still be tripped up by unusual folder structures, the
    completely unambiguous solution turned out to involve a significant
    performance penalty, and the new ignore / use commands provide a
    workaround for that scenario.

Tom Clarkson (7):
  subtree: handle multiple parents passed to cache_miss
  subtree: exclude commits predating add from recursive processing
  subtree: persist cache between split runs
  subtree: add git subtree map command
  subtree: add git subtree use and ignore commands
  subtree: more robustly distinguish subtree and mainline commits
  subtree: document new subtree commands

 contrib/subtree/git-subtree.sh  | 183 ++++++++++++++++++++++++++------
 contrib/subtree/git-subtree.txt |  24 +++++
 2 files changed, 175 insertions(+), 32 deletions(-)


base-commit: 47ae905ffb98cc4d4fd90083da6bc8dab55d9ecc
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-493%2Ftqc%2Ftqc%2Fsubtree-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-493/tqc/tqc/subtree-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/493

Range-diff vs v1:

 1:  74fa670490 = 1:  9cff2a0cf6 subtree: handle multiple parents passed to cache_miss
 2:  87af5a316a ! 2:  79b5f4a651 subtree: exclude commits predating add from recursive processing
     @@ contrib/subtree/git-subtree.sh: find_existing_splits () {
      +	debug "Looking for first split..."
      +	dir="$1"
      +	revs="$2"
     -+	main=
     -+	sub=
     -+	local grep_format="^git-subtree-dir: $dir/*\$"
     -+	git log --reverse --grep="$grep_format" \
     ++
     ++	git log --reverse --grep="^git-subtree-dir: $dir/*\$" \
      +		--no-show-signature --pretty=format:'START %H%n%s%n%n%b%nEND%n' $revs |
      +	while read a b junk
      +	do
 3:  c892ee9828 = 3:  8eec18388c subtree: persist cache between split runs
 4:  a67c256a59 = 4:  1490ce1114 subtree: add git subtree map command
 5:  a76a49651b = 5:  2d103292ce subtree: add git subtree use and ignore commands
 6:  27a43ea2c4 = 6:  a7aaedfed3 subtree: more robustly distinguish subtree and mainline commits
 7:  19db9cfb68 = 7:  fe2e4819b8 subtree: document new subtree commands

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v2 1/7] subtree: handle multiple parents passed to cache_miss
  2020-10-06 22:05 ` [PATCH v2 " Tom Clarkson via GitGitGadget
@ 2020-10-06 22:05   ` Tom Clarkson via GitGitGadget
  2020-10-07 13:12     ` Ed Maste
  2020-10-06 22:05   ` [PATCH v2 2/7] subtree: exclude commits predating add from recursive processing Tom Clarkson via GitGitGadget
                     ` (6 subsequent siblings)
  7 siblings, 1 reply; 28+ messages in thread
From: Tom Clarkson via GitGitGadget @ 2020-10-06 22:05 UTC (permalink / raw)
  To: git
  Cc: Avery Pennarun, Ed Maste, Johannes Schindelin, Tom Clarkson,
	Tom Clarkson

From: Tom Clarkson <tom@tqclarkson.com>

Signed-off-by: Tom Clarkson <tom@tqclarkson.com>
---
 contrib/subtree/git-subtree.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/contrib/subtree/git-subtree.sh b/contrib/subtree/git-subtree.sh
index 868e18b9a1..9867718503 100755
--- a/contrib/subtree/git-subtree.sh
+++ b/contrib/subtree/git-subtree.sh
@@ -238,7 +238,7 @@ cache_miss () {
 }
 
 check_parents () {
-	missed=$(cache_miss "$1")
+	missed=$(cache_miss $1)
 	local indent=$(($2 + 1))
 	for miss in $missed
 	do
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v2 2/7] subtree: exclude commits predating add from recursive processing
  2020-10-06 22:05 ` [PATCH v2 " Tom Clarkson via GitGitGadget
  2020-10-06 22:05   ` [PATCH v2 1/7] subtree: handle multiple parents passed to cache_miss Tom Clarkson via GitGitGadget
@ 2020-10-06 22:05   ` Tom Clarkson via GitGitGadget
  2020-10-07 15:36     ` Johannes Schindelin
  2020-10-06 22:05   ` [PATCH v2 3/7] subtree: persist cache between split runs Tom Clarkson via GitGitGadget
                     ` (5 subsequent siblings)
  7 siblings, 1 reply; 28+ messages in thread
From: Tom Clarkson via GitGitGadget @ 2020-10-06 22:05 UTC (permalink / raw)
  To: git
  Cc: Avery Pennarun, Ed Maste, Johannes Schindelin, Tom Clarkson,
	Tom Clarkson

From: Tom Clarkson <tom@tqclarkson.com>

Include recursion depth in debug logs so we can see when the recursion is
getting out of hand.

Making the cache handle null mappings correctly and adding older commits
to the cache allows the recursive algorithm to terminate at any point on
mainline rather than needing to reach either the add point or the initial
commit.

Signed-off-by: Tom Clarkson <tom@tqclarkson.com>
---
 contrib/subtree/git-subtree.sh | 35 +++++++++++++++++++++++++++++++---
 1 file changed, 32 insertions(+), 3 deletions(-)

diff --git a/contrib/subtree/git-subtree.sh b/contrib/subtree/git-subtree.sh
index 9867718503..160bad95c1 100755
--- a/contrib/subtree/git-subtree.sh
+++ b/contrib/subtree/git-subtree.sh
@@ -244,7 +244,7 @@ check_parents () {
 	do
 		if ! test -r "$cachedir/notree/$miss"
 		then
-			debug "  incorrect order: $miss"
+			debug "  unprocessed parent commit: $miss ($indent)"
 			process_split_commit "$miss" "" "$indent"
 		fi
 	done
@@ -392,6 +392,24 @@ find_existing_splits () {
 	done
 }
 
+find_mainline_ref () {
+	debug "Looking for first split..."
+	dir="$1"
+	revs="$2"
+
+	git log --reverse --grep="^git-subtree-dir: $dir/*\$" \
+		--no-show-signature --pretty=format:'START %H%n%s%n%n%b%nEND%n' $revs |
+	while read a b junk
+	do
+		case "$a" in
+		git-subtree-mainline:)
+			echo "$b"
+			return
+			;;
+		esac
+	done
+}
+
 copy_commit () {
 	# We're going to set some environment vars here, so
 	# do it in a subshell to get rid of them safely later
@@ -646,9 +664,9 @@ process_split_commit () {
 
 	progress "$revcount/$revmax ($createcount) [$extracount]"
 
-	debug "Processing commit: $rev"
+	debug "Processing commit: $rev ($indent)"
 	exists=$(cache_get "$rev")
-	if test -n "$exists"
+	if test -z "$(cache_miss "$rev")"
 	then
 		debug "  prior: $exists"
 		return
@@ -773,6 +791,17 @@ cmd_split () {
 
 	unrevs="$(find_existing_splits "$dir" "$revs")"
 
+	mainline="$(find_mainline_ref "$dir" "$revs")"
+	if test -n "$mainline"
+	then
+		debug "Mainline $mainline predates subtree add"
+		git rev-list --topo-order --skip=1 $mainline |
+		while read rev
+		do
+			cache_set "$rev" ""
+		done || exit $?
+	fi
+
 	# We can't restrict rev-list to only $dir here, because some of our
 	# parents have the $dir contents the root, and those won't match.
 	# (and rev-list --follow doesn't seem to solve this)
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v2 3/7] subtree: persist cache between split runs
  2020-10-06 22:05 ` [PATCH v2 " Tom Clarkson via GitGitGadget
  2020-10-06 22:05   ` [PATCH v2 1/7] subtree: handle multiple parents passed to cache_miss Tom Clarkson via GitGitGadget
  2020-10-06 22:05   ` [PATCH v2 2/7] subtree: exclude commits predating add from recursive processing Tom Clarkson via GitGitGadget
@ 2020-10-06 22:05   ` Tom Clarkson via GitGitGadget
  2020-10-07 16:06     ` Johannes Schindelin
  2020-10-06 22:05   ` [PATCH v2 4/7] subtree: add git subtree map command Tom Clarkson via GitGitGadget
                     ` (4 subsequent siblings)
  7 siblings, 1 reply; 28+ messages in thread
From: Tom Clarkson via GitGitGadget @ 2020-10-06 22:05 UTC (permalink / raw)
  To: git
  Cc: Avery Pennarun, Ed Maste, Johannes Schindelin, Tom Clarkson,
	Tom Clarkson

From: Tom Clarkson <tom@tqclarkson.com>

Provide a mechanism for handling problematic commits. If the algorithm
in process_split_commit is getting something wrong, you can write a
corrected value to the cache before running split.

Signed-off-by: Tom Clarkson <tom@tqclarkson.com>
---
 contrib/subtree/git-subtree.sh | 37 ++++++++++++++++++++++++++--------
 1 file changed, 29 insertions(+), 8 deletions(-)

diff --git a/contrib/subtree/git-subtree.sh b/contrib/subtree/git-subtree.sh
index 160bad95c1..c21d620610 100755
--- a/contrib/subtree/git-subtree.sh
+++ b/contrib/subtree/git-subtree.sh
@@ -27,6 +27,7 @@ b,branch=     create a new branch from the split subtree
 ignore-joins  ignore prior --rejoin commits
 onto=         try connecting new tree to an existing one
 rejoin        merge the new branch back into HEAD
+clear-cache   reset the subtree mapping cache
  options for 'add', 'merge', and 'pull'
 squash        merge subtree changes as a single commit
 "
@@ -48,6 +49,7 @@ annotate=
 squash=
 message=
 prefix=
+clearcache=
 
 debug () {
 	if test -n "$debug"
@@ -131,6 +133,9 @@ do
 	--no-rejoin)
 		rejoin=
 		;;
+	--clear-cache)
+		clearcache=1
+		;;
 	--ignore-joins)
 		ignore_joins=1
 		;;
@@ -206,9 +211,13 @@ debug "opts: {$*}"
 debug
 
 cache_setup () {
-	cachedir="$GIT_DIR/subtree-cache/$$"
-	rm -rf "$cachedir" ||
-		die "Can't delete old cachedir: $cachedir"
+	cachedir="$GIT_DIR/subtree-cache/$prefix"
+	if test -n "$clearcache"
+	then
+		debug "Clearing cache"
+		rm -rf "$cachedir" ||
+			die "Can't delete old cachedir: $cachedir"
+	fi
 	mkdir -p "$cachedir" ||
 		die "Can't create new cachedir: $cachedir"
 	mkdir -p "$cachedir/notree" ||
@@ -266,6 +275,16 @@ cache_set () {
 	echo "$newrev" >"$cachedir/$oldrev"
 }
 
+cache_set_if_unset () {
+	oldrev="$1"
+	newrev="$2"
+	if test -e "$cachedir/$oldrev"
+	then
+		return
+	fi
+	echo "$newrev" >"$cachedir/$oldrev"
+}
+
 rev_exists () {
 	if git rev-parse "$1" >/dev/null 2>&1
 	then
@@ -375,13 +394,13 @@ find_existing_splits () {
 			then
 				# squash commits refer to a subtree
 				debug "  Squash: $sq from $sub"
-				cache_set "$sq" "$sub"
+				cache_set_if_unset "$sq" "$sub"
 			fi
 			if test -n "$main" -a -n "$sub"
 			then
 				debug "  Prior: $main -> $sub"
-				cache_set $main $sub
-				cache_set $sub $sub
+				cache_set_if_unset $main $sub
+				cache_set_if_unset $sub $sub
 				try_remove_previous "$main"
 				try_remove_previous "$sub"
 			fi
@@ -688,6 +707,8 @@ process_split_commit () {
 		if test -n "$newparents"
 		then
 			cache_set "$rev" "$rev"
+		else
+			cache_set "$rev" ""
 		fi
 		return
 	fi
@@ -785,7 +806,7 @@ cmd_split () {
 			# the 'onto' history is already just the subdir, so
 			# any parent we find there can be used verbatim
 			debug "  cache: $rev"
-			cache_set "$rev" "$rev"
+			cache_set_if_unset "$rev" "$rev"
 		done
 	fi
 
@@ -798,7 +819,7 @@ cmd_split () {
 		git rev-list --topo-order --skip=1 $mainline |
 		while read rev
 		do
-			cache_set "$rev" ""
+			cache_set_if_unset "$rev" ""
 		done || exit $?
 	fi
 
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v2 4/7] subtree: add git subtree map command
  2020-10-06 22:05 ` [PATCH v2 " Tom Clarkson via GitGitGadget
                     ` (2 preceding siblings ...)
  2020-10-06 22:05   ` [PATCH v2 3/7] subtree: persist cache between split runs Tom Clarkson via GitGitGadget
@ 2020-10-06 22:05   ` Tom Clarkson via GitGitGadget
  2020-10-06 22:05   ` [PATCH v2 5/7] subtree: add git subtree use and ignore commands Tom Clarkson via GitGitGadget
                     ` (3 subsequent siblings)
  7 siblings, 0 replies; 28+ messages in thread
From: Tom Clarkson via GitGitGadget @ 2020-10-06 22:05 UTC (permalink / raw)
  To: git
  Cc: Avery Pennarun, Ed Maste, Johannes Schindelin, Tom Clarkson,
	Tom Clarkson

From: Tom Clarkson <tom@tqclarkson.com>

Adds an entry to the subtree cache so that subsequent split runs can skip
any commits that turn out to be problematic.

Signed-off-by: Tom Clarkson <tom@tqclarkson.com>
---
 contrib/subtree/git-subtree.sh | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/contrib/subtree/git-subtree.sh b/contrib/subtree/git-subtree.sh
index c21d620610..1559100c0e 100755
--- a/contrib/subtree/git-subtree.sh
+++ b/contrib/subtree/git-subtree.sh
@@ -15,6 +15,7 @@ git subtree merge --prefix=<prefix> <commit>
 git subtree pull  --prefix=<prefix> <repository> <ref>
 git subtree push  --prefix=<prefix> <repository> <ref>
 git subtree split --prefix=<prefix> <commit>
+git subtree map   --prefix=<prefix> <mainline> <subtree>
 --
 h,help        show the help
 q             quiet
@@ -161,7 +162,7 @@ command="$1"
 shift
 
 case "$command" in
-add|merge|pull)
+add|merge|pull|map)
 	default=
 	;;
 split|push)
@@ -192,7 +193,8 @@ dir="$(dirname "$prefix/.")"
 
 if test "$command" != "pull" &&
 		test "$command" != "add" &&
-		test "$command" != "push"
+		test "$command" != "push" &&
+		test "$command" != "map"
 then
 	revs=$(git rev-parse $default --revs-only "$@") || exit $?
 	dirs=$(git rev-parse --no-revs --no-flags "$@") || exit $?
@@ -793,6 +795,21 @@ cmd_add_commit () {
 	say "Added dir '$dir'"
 }
 
+cmd_map () {
+	oldrev="$1"
+	newrev="$2"
+
+	if test -z "$oldrev"
+	then
+		die "You must provide a revision to map"
+	fi
+
+	cache_setup || exit $?
+	cache_set "$oldrev" "$newrev"
+
+	say "Mapped $oldrev => $newrev"
+}
+
 cmd_split () {
 	debug "Splitting $dir..."
 	cache_setup || exit $?
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v2 5/7] subtree: add git subtree use and ignore commands
  2020-10-06 22:05 ` [PATCH v2 " Tom Clarkson via GitGitGadget
                     ` (3 preceding siblings ...)
  2020-10-06 22:05   ` [PATCH v2 4/7] subtree: add git subtree map command Tom Clarkson via GitGitGadget
@ 2020-10-06 22:05   ` Tom Clarkson via GitGitGadget
  2020-10-07 16:29     ` Johannes Schindelin
  2020-10-06 22:05   ` [PATCH v2 6/7] subtree: more robustly distinguish subtree and mainline commits Tom Clarkson via GitGitGadget
                     ` (2 subsequent siblings)
  7 siblings, 1 reply; 28+ messages in thread
From: Tom Clarkson via GitGitGadget @ 2020-10-06 22:05 UTC (permalink / raw)
  To: git
  Cc: Avery Pennarun, Ed Maste, Johannes Schindelin, Tom Clarkson,
	Tom Clarkson

From: Tom Clarkson <tom@tqclarkson.com>

Tell split to use or ignore larger sections of the history. In most cases
split does this automatically based on metadata from subtree add.

Signed-off-by: Tom Clarkson <tom@tqclarkson.com>
---
 contrib/subtree/git-subtree.sh | 78 ++++++++++++++++++++++++++++------
 1 file changed, 66 insertions(+), 12 deletions(-)

diff --git a/contrib/subtree/git-subtree.sh b/contrib/subtree/git-subtree.sh
index 1559100c0e..e56621a986 100755
--- a/contrib/subtree/git-subtree.sh
+++ b/contrib/subtree/git-subtree.sh
@@ -9,13 +9,15 @@ then
 	set -- -h
 fi
 OPTS_SPEC="\
-git subtree add   --prefix=<prefix> <commit>
-git subtree add   --prefix=<prefix> <repository> <ref>
-git subtree merge --prefix=<prefix> <commit>
-git subtree pull  --prefix=<prefix> <repository> <ref>
-git subtree push  --prefix=<prefix> <repository> <ref>
-git subtree split --prefix=<prefix> <commit>
-git subtree map   --prefix=<prefix> <mainline> <subtree>
+git subtree add    --prefix=<prefix> <commit>
+git subtree add    --prefix=<prefix> <repository> <ref>
+git subtree merge  --prefix=<prefix> <commit>
+git subtree pull   --prefix=<prefix> <repository> <ref>
+git subtree push   --prefix=<prefix> <repository> <ref>
+git subtree split  --prefix=<prefix> <commit>
+git subtree map    --prefix=<prefix> <mainline> <subtree>
+git subtree ignore --prefix=<prefix> <commit>
+git subtree use    --prefix=<prefix> <commit>
 --
 h,help        show the help
 q             quiet
@@ -162,7 +164,7 @@ command="$1"
 shift
 
 case "$command" in
-add|merge|pull|map)
+add|merge|pull|map|ignore|use)
 	default=
 	;;
 split|push)
@@ -431,6 +433,18 @@ find_mainline_ref () {
 	done
 }
 
+exclude_processed_refs () {
+		if test -r "$cachedir/processed"
+		then
+			cat "$cachedir/processed" |
+			while read rev
+			do
+				debug "read $rev"
+				echo "^$rev"
+			done
+		fi
+}
+
 copy_commit () {
 	# We're going to set some environment vars here, so
 	# do it in a subshell to get rid of them safely later
@@ -796,20 +810,60 @@ cmd_add_commit () {
 }
 
 cmd_map () {
-	oldrev="$1"
-	newrev="$2"
 
-	if test -z "$oldrev"
+	if test -z "$1"
 	then
 		die "You must provide a revision to map"
 	fi
 
+	oldrev=$(git rev-parse --revs-only "$1") || exit $?
+	newrev=
+
+	if test -n "$2"
+	then
+		newrev=$(git rev-parse --revs-only "$2") || exit $?
+	fi
+
 	cache_setup || exit $?
 	cache_set "$oldrev" "$newrev"
 
 	say "Mapped $oldrev => $newrev"
 }
 
+cmd_ignore () {
+	revs=$(git rev-parse $default --revs-only "$@") || exit $?
+	ensure_single_rev $revs
+
+	say "Ignoring $revs"
+
+	cache_setup || exit $?
+
+	git rev-list $revs |
+	while read rev
+	do
+		cache_set "$rev" ""
+	done
+
+	echo "$revs" >>"$cachedir/processed"
+}
+
+cmd_use () {
+	revs=$(git rev-parse $default --revs-only "$@") || exit $?
+	ensure_single_rev $revs
+
+	say "Using existing subtree $revs"
+
+	cache_setup || exit $?
+
+	git rev-list $revs |
+	while read rev
+	do
+		cache_set "$rev" "$rev"
+	done
+
+	echo "$revs" >>"$cachedir/processed"
+}
+
 cmd_split () {
 	debug "Splitting $dir..."
 	cache_setup || exit $?
@@ -827,7 +881,7 @@ cmd_split () {
 		done
 	fi
 
-	unrevs="$(find_existing_splits "$dir" "$revs")"
+	unrevs="$(find_existing_splits "$dir" "$revs") $(exclude_processed_refs)"
 
 	mainline="$(find_mainline_ref "$dir" "$revs")"
 	if test -n "$mainline"
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v2 6/7] subtree: more robustly distinguish subtree and mainline commits
  2020-10-06 22:05 ` [PATCH v2 " Tom Clarkson via GitGitGadget
                     ` (4 preceding siblings ...)
  2020-10-06 22:05   ` [PATCH v2 5/7] subtree: add git subtree use and ignore commands Tom Clarkson via GitGitGadget
@ 2020-10-06 22:05   ` Tom Clarkson via GitGitGadget
  2020-10-07 19:42     ` Johannes Schindelin
  2020-10-06 22:05   ` [PATCH v2 7/7] subtree: document new subtree commands Tom Clarkson via GitGitGadget
  2020-10-07 19:46   ` [PATCH v2 0/7] subtree: Fix handling of complex history Johannes Schindelin
  7 siblings, 1 reply; 28+ messages in thread
From: Tom Clarkson via GitGitGadget @ 2020-10-06 22:05 UTC (permalink / raw)
  To: git
  Cc: Avery Pennarun, Ed Maste, Johannes Schindelin, Tom Clarkson,
	Tom Clarkson

From: Tom Clarkson <tom@tqclarkson.com>

Prevent a mainline commit without $dir being treated as a subtree
commit and pulling in the entire mainline history. Any valid subtree
commit will have only valid subtree commits as parents, which will be
unchanged by check_parents.

Signed-off-by: Tom Clarkson <tom@tqclarkson.com>
---
 contrib/subtree/git-subtree.sh | 24 +++++++++++-------------
 1 file changed, 11 insertions(+), 13 deletions(-)

diff --git a/contrib/subtree/git-subtree.sh b/contrib/subtree/git-subtree.sh
index e56621a986..fa6293b372 100755
--- a/contrib/subtree/git-subtree.sh
+++ b/contrib/subtree/git-subtree.sh
@@ -224,8 +224,6 @@ cache_setup () {
 	fi
 	mkdir -p "$cachedir" ||
 		die "Can't create new cachedir: $cachedir"
-	mkdir -p "$cachedir/notree" ||
-		die "Can't create new cachedir: $cachedir/notree"
 	debug "Using cachedir: $cachedir" >&2
 }
 
@@ -255,18 +253,11 @@ check_parents () {
 	local indent=$(($2 + 1))
 	for miss in $missed
 	do
-		if ! test -r "$cachedir/notree/$miss"
-		then
-			debug "  unprocessed parent commit: $miss ($indent)"
-			process_split_commit "$miss" "" "$indent"
-		fi
+		debug "  unprocessed parent commit: $miss ($indent)"
+		process_split_commit "$miss" "" "$indent"
 	done
 }
 
-set_notree () {
-	echo "1" > "$cachedir/notree/$1"
-}
-
 cache_set () {
 	oldrev="$1"
 	newrev="$2"
@@ -719,11 +710,18 @@ process_split_commit () {
 	# vs. a mainline commit?  Does it matter?
 	if test -z "$tree"
 	then
-		set_notree "$rev"
 		if test -n "$newparents"
 		then
-			cache_set "$rev" "$rev"
+			if test "$newparents" = "$parents"
+			then
+				# if all parents were subtrees, this can be a subtree commit
+				cache_set "$rev" "$rev"
+			else
+				# a mainline commit with tree missing is equivalent to the initial commit
+				cache_set "$rev" ""
+			fi
 		else
+			# no parents with valid subtree mappings means a commit prior to subtree add
 			cache_set "$rev" ""
 		fi
 		return
-- 
gitgitgadget


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v2 7/7] subtree: document new subtree commands
  2020-10-06 22:05 ` [PATCH v2 " Tom Clarkson via GitGitGadget
                     ` (5 preceding siblings ...)
  2020-10-06 22:05   ` [PATCH v2 6/7] subtree: more robustly distinguish subtree and mainline commits Tom Clarkson via GitGitGadget
@ 2020-10-06 22:05   ` Tom Clarkson via GitGitGadget
  2020-10-07 19:43     ` Johannes Schindelin
  2020-10-07 19:46   ` [PATCH v2 0/7] subtree: Fix handling of complex history Johannes Schindelin
  7 siblings, 1 reply; 28+ messages in thread
From: Tom Clarkson via GitGitGadget @ 2020-10-06 22:05 UTC (permalink / raw)
  To: git
  Cc: Avery Pennarun, Ed Maste, Johannes Schindelin, Tom Clarkson,
	Tom Clarkson

From: Tom Clarkson <tom@tqclarkson.com>

Signed-off-by: Tom Clarkson <tom@tqclarkson.com>
---
 contrib/subtree/git-subtree.txt | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/contrib/subtree/git-subtree.txt b/contrib/subtree/git-subtree.txt
index 352deda69d..a5a76e8ce6 100644
--- a/contrib/subtree/git-subtree.txt
+++ b/contrib/subtree/git-subtree.txt
@@ -52,6 +52,12 @@ useful elsewhere, you can extract its entire history and publish
 that as its own git repository, without accidentally
 intermingling the history of your application project.
 
+Although the relationship between subtree and mainline commits is stored
+in regular git history, it is also cached between subtree runs. In most
+cases this is merely a performance improvement, but for projects with
+large and complex histories the cache can be manipulated directly
+with the use, ignore and map commands.
+
 [TIP]
 In order to keep your commit messages clean, we recommend that
 people split their commits between the subtrees and the main
@@ -120,6 +126,21 @@ and friends will work as expected.
 Note that if you use '--squash' when you merge, you should usually not
 just '--rejoin' when you split.
 
+ignore::
+	Mark a commit and all of its history as irrelevant to subtree split.
+	In most cases this would be handled automatically based on metadata
+	from subtree join commits. Intended for improving performance on
+	extremely large repos and excluding complex history that turns out
+	to be otherwise problematic.
+
+use::
+	Mark a commit and all of its history as part of an existing subtree.
+	In normal circumstances this would be handled based on the metadata
+	from the subtree join commit. Similar to the --onto option of split.
+
+map::
+	Manually override the normal output of split for a particular commit.
+	Extreme flexibility for advanced troubleshooting purposes only.
 
 OPTIONS
 -------
@@ -142,6 +163,9 @@ OPTIONS
 	This option is only valid for add, merge and pull (unsure).
 	Specify <message> as the commit message for the merge commit.
 
+--clear-cache::
+	Reset the subtree cache and recalculate all subtree mappings from the
+	commit history
 
 OPTIONS FOR add, merge, push, pull
 ----------------------------------
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 1/7] subtree: handle multiple parents passed to cache_miss
  2020-10-06 22:05   ` [PATCH v2 1/7] subtree: handle multiple parents passed to cache_miss Tom Clarkson via GitGitGadget
@ 2020-10-07 13:12     ` Ed Maste
  0 siblings, 0 replies; 28+ messages in thread
From: Ed Maste @ 2020-10-07 13:12 UTC (permalink / raw)
  To: Tom Clarkson via GitGitGadget
  Cc: git mailing list, Avery Pennarun, Johannes Schindelin, Tom Clarkson

On Tue, 6 Oct 2020 at 18:05, Tom Clarkson via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Tom Clarkson <tom@tqclarkson.com>
>
> Signed-off-by: Tom Clarkson <tom@tqclarkson.com>
Reviewed-by: Ed Maste <emaste@FreeBSD.org>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 2/7] subtree: exclude commits predating add from recursive processing
  2020-10-06 22:05   ` [PATCH v2 2/7] subtree: exclude commits predating add from recursive processing Tom Clarkson via GitGitGadget
@ 2020-10-07 15:36     ` Johannes Schindelin
  0 siblings, 0 replies; 28+ messages in thread
From: Johannes Schindelin @ 2020-10-07 15:36 UTC (permalink / raw)
  To: Tom Clarkson via GitGitGadget
  Cc: git, Avery Pennarun, Ed Maste, Tom Clarkson, Tom Clarkson

Hi Tom,

On Tue, 6 Oct 2020, Tom Clarkson via GitGitGadget wrote:

> From: Tom Clarkson <tom@tqclarkson.com>
>
> Include recursion depth in debug logs so we can see when the recursion is
> getting out of hand.
>
> Making the cache handle null mappings correctly and adding older commits
> to the cache allows the recursive algorithm to terminate at any point on
> mainline rather than needing to reach either the add point or the initial
> commit.

Makes sense.

> diff --git a/contrib/subtree/git-subtree.sh b/contrib/subtree/git-subtree.sh
> index 9867718503..160bad95c1 100755
> --- a/contrib/subtree/git-subtree.sh
> +++ b/contrib/subtree/git-subtree.sh
> @@ -244,7 +244,7 @@ check_parents () {
>  	do
>  		if ! test -r "$cachedir/notree/$miss"
>  		then
> -			debug "  incorrect order: $miss"
> +			debug "  unprocessed parent commit: $miss ($indent)"

Without any context, it is hard to understand what the `$indent` variable
is supposed to mean, so it is unclear why we need to print it here.

I _guess_ it is the degree removed from the first-parent lineage?

In any case, it does not hurt here, so I trust that it is good to include
it in the debug output.

>  			process_split_commit "$miss" "" "$indent"
>  		fi
>  	done
> @@ -392,6 +392,24 @@ find_existing_splits () {
>  	done
>  }
>
> +find_mainline_ref () {
> +	debug "Looking for first split..."
> +	dir="$1"
> +	revs="$2"

The `git-subtree` script seems to rely on the `local` construct, using it
in plenty of other circumstances. How about using it here, too?

> +
> +	git log --reverse --grep="^git-subtree-dir: $dir/*\$" \
> +		--no-show-signature --pretty=format:'START %H%n%s%n%n%b%nEND%n' $revs |

Since all you are interested in is the `git-subtree-mainline:` trailer,
wouldn't a format like `%(trailers:key=git-subtree-mainline)` instead of
`START %H%n%s%n%n%b%nEND%n`?

See
https://git-scm.com/docs/git-log#Documentation/git-log.txt-emtrailersoptionsem
for more information about pretty formats.

BTW I am super unfamiliar with `git subtree`'s inner workings, and
therefore it would help me incredibly if the commit message talked a bit
about the commit message layout (with a particular eye on
`git-subtree-dir` and `git-subtree-mainline` which I guess are trailers
added by `git subtree`?)...

> +	while read a b junk
> +	do
> +		case "$a" in
> +		git-subtree-mainline:)
> +			echo "$b"
> +			return
> +			;;
> +		esac
> +	done
> +}
> +
>  copy_commit () {
>  	# We're going to set some environment vars here, so
>  	# do it in a subshell to get rid of them safely later
> @@ -646,9 +664,9 @@ process_split_commit () {
>
>  	progress "$revcount/$revmax ($createcount) [$extracount]"
>
> -	debug "Processing commit: $rev"
> +	debug "Processing commit: $rev ($indent)"
>  	exists=$(cache_get "$rev")
> -	if test -n "$exists"
> +	if test -z "$(cache_miss "$rev")"
>  	then
>  		debug "  prior: $exists"

I do not see the `exists` variable being used other than for the debug
statement. Maybe better something like this?

	debug "  prior found for $rev"

>  		return
> @@ -773,6 +791,17 @@ cmd_split () {
>
>  	unrevs="$(find_existing_splits "$dir" "$revs")"
>
> +	mainline="$(find_mainline_ref "$dir" "$revs")"
> +	if test -n "$mainline"
> +	then
> +		debug "Mainline $mainline predates subtree add"
> +		git rev-list --topo-order --skip=1 $mainline |
> +		while read rev
> +		do
> +			cache_set "$rev" ""

Ah, so they are not really "null mappings", but mapped to an empty string.
Makes sense. Maybe adjust the commit message?

> +		done || exit $?
> +	fi
> +
>  	# We can't restrict rev-list to only $dir here, because some of our
>  	# parents have the $dir contents the root, and those won't match.
>  	# (and rev-list --follow doesn't seem to solve this)
> --
> gitgitgadget
>
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 3/7] subtree: persist cache between split runs
  2020-10-06 22:05   ` [PATCH v2 3/7] subtree: persist cache between split runs Tom Clarkson via GitGitGadget
@ 2020-10-07 16:06     ` Johannes Schindelin
  0 siblings, 0 replies; 28+ messages in thread
From: Johannes Schindelin @ 2020-10-07 16:06 UTC (permalink / raw)
  To: Tom Clarkson via GitGitGadget
  Cc: git, Avery Pennarun, Ed Maste, Tom Clarkson, Tom Clarkson

Hi Tom,

On Tue, 6 Oct 2020, Tom Clarkson via GitGitGadget wrote:

> @@ -48,6 +49,7 @@ annotate=
>  squash=
>  message=
>  prefix=
> +clearcache=

It might be more consistent to call it `clear_cache` (i.e. with an
underscore), just like `ignore_joins`.

>
>  debug () {
>  	if test -n "$debug"
> @@ -131,6 +133,9 @@ do
>  	--no-rejoin)
>  		rejoin=
>  		;;
> +	--clear-cache)
> +		clearcache=1
> +		;;
>  	--ignore-joins)
>  		ignore_joins=1
>  		;;
> @@ -206,9 +211,13 @@ debug "opts: {$*}"
>  debug
>
>  cache_setup () {
> -	cachedir="$GIT_DIR/subtree-cache/$$"
> -	rm -rf "$cachedir" ||
> -		die "Can't delete old cachedir: $cachedir"
> +	cachedir="$GIT_DIR/subtree-cache/$prefix"

Excellent, the `prefix` should be "unique enough".

> +	if test -n "$clearcache"
> +	then
> +		debug "Clearing cache"
> +		rm -rf "$cachedir" ||
> +			die "Can't delete old cachedir: $cachedir"
> +	fi
>  	mkdir -p "$cachedir" ||
>  		die "Can't create new cachedir: $cachedir"
>  	mkdir -p "$cachedir/notree" ||
> @@ -266,6 +275,16 @@ cache_set () {
>  	echo "$newrev" >"$cachedir/$oldrev"
>  }
>
> +cache_set_if_unset () {
> +	oldrev="$1"
> +	newrev="$2"

`local`? ;-)

> +	if test -e "$cachedir/$oldrev"
> +	then
> +		return
> +	fi
> +	echo "$newrev" >"$cachedir/$oldrev"

So that directory contains commit mappings, a file for each mapped
revision.

Thinking back to patch 2/11, I am now no longer that sure that it makes
sense to fill it up with every commit in that commit range: performance
suffers when directories contain too many files.

For example, I had a case in the past where it took a minute just to
enumerate a directory, and even looking whether a file existed in that
directory was not exactly fun.

In any case, I would write it slightly shorter:

	test -e "$cachedir/$oldrev" ||
	echo "$newrev" >"$cachedir/$oldrev"

> +}
> +
>  rev_exists () {
>  	if git rev-parse "$1" >/dev/null 2>&1
>  	then
> @@ -375,13 +394,13 @@ find_existing_splits () {
>  			then
>  				# squash commits refer to a subtree
>  				debug "  Squash: $sq from $sub"
> -				cache_set "$sq" "$sub"
> +				cache_set_if_unset "$sq" "$sub"
>  			fi
>  			if test -n "$main" -a -n "$sub"
>  			then
>  				debug "  Prior: $main -> $sub"
> -				cache_set $main $sub
> -				cache_set $sub $sub
> +				cache_set_if_unset $main $sub
> +				cache_set_if_unset $sub $sub
>  				try_remove_previous "$main"
>  				try_remove_previous "$sub"
>  			fi
> @@ -688,6 +707,8 @@ process_split_commit () {
>  		if test -n "$newparents"
>  		then
>  			cache_set "$rev" "$rev"
> +		else
> +			cache_set "$rev" ""

Was this hunk intended to be snuck in here? I can understand the
s/cache_set/cache_set_if_unset/ changes, of course, but not this hunk.

>  		fi
>  		return
>  	fi
> @@ -785,7 +806,7 @@ cmd_split () {
>  			# the 'onto' history is already just the subdir, so
>  			# any parent we find there can be used verbatim
>  			debug "  cache: $rev"
> -			cache_set "$rev" "$rev"
> +			cache_set_if_unset "$rev" "$rev"
>  		done
>  	fi
>
> @@ -798,7 +819,7 @@ cmd_split () {
>  		git rev-list --topo-order --skip=1 $mainline |
>  		while read rev
>  		do
> -			cache_set "$rev" ""
> +			cache_set_if_unset "$rev" ""

Okay. A quite interesting question now would be: are there any callers of
`cache_set` left? If so, why?

Thanks,
Dscho

>  		done || exit $?
>  	fi
>
> --
> gitgitgadget
>
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 5/7] subtree: add git subtree use and ignore commands
  2020-10-06 22:05   ` [PATCH v2 5/7] subtree: add git subtree use and ignore commands Tom Clarkson via GitGitGadget
@ 2020-10-07 16:29     ` Johannes Schindelin
  0 siblings, 0 replies; 28+ messages in thread
From: Johannes Schindelin @ 2020-10-07 16:29 UTC (permalink / raw)
  To: Tom Clarkson via GitGitGadget
  Cc: git, Avery Pennarun, Ed Maste, Tom Clarkson, Tom Clarkson

Hi Tom,

On Tue, 6 Oct 2020, Tom Clarkson via GitGitGadget wrote:

> @@ -796,20 +810,60 @@ cmd_add_commit () {
>  }
>
>  cmd_map () {
> -	oldrev="$1"
> -	newrev="$2"
>
> -	if test -z "$oldrev"
> +	if test -z "$1"

I'd like to keep the nice name. Maybe if it is `local`, there is no longer
a need to replace `$oldrev` by `$1`?

>  	then
>  		die "You must provide a revision to map"
>  	fi
>
> +	oldrev=$(git rev-parse --revs-only "$1") || exit $?
> +	newrev=
> +
> +	if test -n "$2"
> +	then
> +		newrev=$(git rev-parse --revs-only "$2") || exit $?
> +	fi
> +

Would it not make more sense to validate the parameters before calling
`cmd_map`?

In any case, this strikes me like a subject for another commit.

Thanks,
Dscho

P.S.: I'll have to stop reviewing here for the moment, not sure whether
I'll come back to it later today or maybe tomorrow.


>  	cache_setup || exit $?
>  	cache_set "$oldrev" "$newrev"
>
>  	say "Mapped $oldrev => $newrev"
>  }
>
> +cmd_ignore () {
> +	revs=$(git rev-parse $default --revs-only "$@") || exit $?
> +	ensure_single_rev $revs
> +
> +	say "Ignoring $revs"
> +
> +	cache_setup || exit $?
> +
> +	git rev-list $revs |
> +	while read rev
> +	do
> +		cache_set "$rev" ""
> +	done
> +
> +	echo "$revs" >>"$cachedir/processed"
> +}
> +
> +cmd_use () {
> +	revs=$(git rev-parse $default --revs-only "$@") || exit $?
> +	ensure_single_rev $revs
> +
> +	say "Using existing subtree $revs"
> +
> +	cache_setup || exit $?
> +
> +	git rev-list $revs |
> +	while read rev
> +	do
> +		cache_set "$rev" "$rev"
> +	done
> +
> +	echo "$revs" >>"$cachedir/processed"
> +}
> +
>  cmd_split () {
>  	debug "Splitting $dir..."
>  	cache_setup || exit $?
> @@ -827,7 +881,7 @@ cmd_split () {
>  		done
>  	fi
>
> -	unrevs="$(find_existing_splits "$dir" "$revs")"
> +	unrevs="$(find_existing_splits "$dir" "$revs") $(exclude_processed_refs)"
>
>  	mainline="$(find_mainline_ref "$dir" "$revs")"
>  	if test -n "$mainline"
> --
> gitgitgadget
>
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/7] subtree: Fix handling of complex history
  2020-10-05 21:37     ` Ed Maste
@ 2020-10-07 16:31       ` Johannes Schindelin
  0 siblings, 0 replies; 28+ messages in thread
From: Johannes Schindelin @ 2020-10-07 16:31 UTC (permalink / raw)
  To: Ed Maste
  Cc: Tom Clarkson via GitGitGadget, git mailing list, Avery Pennarun,
	Tom Clarkson

Hi Ed,

On Mon, 5 Oct 2020, Ed Maste wrote:

> On Mon, 5 Oct 2020 at 09:18, Johannes Schindelin
> <Johannes.Schindelin@gmx.de> wrote:
> >
> > FWIW there have been more comments in favor of this patch in
> > https://github.com/gitgitgadget/git/pull/493.
> >
> > I guess what this patch series needs to proceed further is an ACK by
> > Avery?
>
> Avery says "if it's good enough for Johannes it's good enough for me"
> https://twitter.com/apenwarr/status/1313231132721401861

What Avery does not quite understand is that Johannes has not a lot of
clue about `git subtree`... ;-)

Seriously, I am not a user, unfamiliar with the implementation details
(although I am getting a bit more familiar through the review I started
and plan on finishing later). My biggest connection with `git subtree` is
that there seem to be a couple of Git for Windows users who actively use
that command (which is the reason why we include it in Git for Windows,
unlike many other things from `contrib/`).

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 6/7] subtree: more robustly distinguish subtree and mainline commits
  2020-10-06 22:05   ` [PATCH v2 6/7] subtree: more robustly distinguish subtree and mainline commits Tom Clarkson via GitGitGadget
@ 2020-10-07 19:42     ` Johannes Schindelin
  0 siblings, 0 replies; 28+ messages in thread
From: Johannes Schindelin @ 2020-10-07 19:42 UTC (permalink / raw)
  To: Tom Clarkson via GitGitGadget
  Cc: git, Avery Pennarun, Ed Maste, Tom Clarkson, Tom Clarkson

Hi,

On Tue, 6 Oct 2020, Tom Clarkson via GitGitGadget wrote:

> From: Tom Clarkson <tom@tqclarkson.com>
>
> Prevent a mainline commit without $dir being treated as a subtree
> commit and pulling in the entire mainline history. Any valid subtree
> commit will have only valid subtree commits as parents, which will be
> unchanged by check_parents.

I feel like this is only half the picture because I have a hard time
stitching these two sentences together.

After studying the code and your patch a bit, it appears to me that
`process_split_commit()` calls `check_parents()` first, which will call
`process_split_commit()` for all as yet unmapped parents. So basically, it
recurses until it found a commit all of whose parents are already mapped,
then permeates that information all the way back.

Doesn't this cause serious issues with stack overflows and all for long
commit histories?

> Signed-off-by: Tom Clarkson <tom@tqclarkson.com>
> ---
>  contrib/subtree/git-subtree.sh | 24 +++++++++++-------------
>  1 file changed, 11 insertions(+), 13 deletions(-)
>
> diff --git a/contrib/subtree/git-subtree.sh b/contrib/subtree/git-subtree.sh
> index e56621a986..fa6293b372 100755
> --- a/contrib/subtree/git-subtree.sh
> +++ b/contrib/subtree/git-subtree.sh
> @@ -224,8 +224,6 @@ cache_setup () {
>  	fi
>  	mkdir -p "$cachedir" ||
>  		die "Can't create new cachedir: $cachedir"
> -	mkdir -p "$cachedir/notree" ||
> -		die "Can't create new cachedir: $cachedir/notree"

It might make sense to talk about this a bit in the commit message.
Essentially, you are replacing the `notree/<rev>` files by mapping `<rev>`
to the empty string.

This makes me wonder, again, whether the file system layout of the cache
can hold up to the demands. If a main project were to merge a subtree
with, say, 10 million commits, wouldn't that mean that `git subtree` would
now fill one directory with 10 million files? I cannot imagine that this
performs well, still.

>  	debug "Using cachedir: $cachedir" >&2
>  }
>
> @@ -255,18 +253,11 @@ check_parents () {
>  	local indent=$(($2 + 1))
>  	for miss in $missed
>  	do
> -		if ! test -r "$cachedir/notree/$miss"
> -		then
> -			debug "  unprocessed parent commit: $miss ($indent)"
> -			process_split_commit "$miss" "" "$indent"
> -		fi
> +		debug "  unprocessed parent commit: $miss ($indent)"
> +		process_split_commit "$miss" "" "$indent"

That makes sense to me, as the `missed` variable only contains as yet
unmapped commits, therefore we do not have to have an equivalent `test -r`
check.

Ciao,
Dscho

>  	done
>  }
>
> -set_notree () {
> -	echo "1" > "$cachedir/notree/$1"
> -}
> -
>  cache_set () {
>  	oldrev="$1"
>  	newrev="$2"
> @@ -719,11 +710,18 @@ process_split_commit () {
>  	# vs. a mainline commit?  Does it matter?
>  	if test -z "$tree"
>  	then
> -		set_notree "$rev"
>  		if test -n "$newparents"
>  		then
> -			cache_set "$rev" "$rev"
> +			if test "$newparents" = "$parents"
> +			then
> +				# if all parents were subtrees, this can be a subtree commit
> +				cache_set "$rev" "$rev"
> +			else
> +				# a mainline commit with tree missing is equivalent to the initial commit
> +				cache_set "$rev" ""
> +			fi
>  		else
> +			# no parents with valid subtree mappings means a commit prior to subtree add
>  			cache_set "$rev" ""
>  		fi
>  		return
> --
> gitgitgadget
>
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 7/7] subtree: document new subtree commands
  2020-10-06 22:05   ` [PATCH v2 7/7] subtree: document new subtree commands Tom Clarkson via GitGitGadget
@ 2020-10-07 19:43     ` Johannes Schindelin
  0 siblings, 0 replies; 28+ messages in thread
From: Johannes Schindelin @ 2020-10-07 19:43 UTC (permalink / raw)
  To: Tom Clarkson via GitGitGadget
  Cc: git, Avery Pennarun, Ed Maste, Tom Clarkson, Tom Clarkson

Hi Tom,

On Tue, 6 Oct 2020, Tom Clarkson via GitGitGadget wrote:

> From: Tom Clarkson <tom@tqclarkson.com>
>
> Signed-off-by: Tom Clarkson <tom@tqclarkson.com>
> ---
>  contrib/subtree/git-subtree.txt | 24 ++++++++++++++++++++++++
>  1 file changed, 24 insertions(+)
>
> diff --git a/contrib/subtree/git-subtree.txt b/contrib/subtree/git-subtree.txt
> index 352deda69d..a5a76e8ce6 100644
> --- a/contrib/subtree/git-subtree.txt
> +++ b/contrib/subtree/git-subtree.txt
> @@ -52,6 +52,12 @@ useful elsewhere, you can extract its entire history and publish
>  that as its own git repository, without accidentally
>  intermingling the history of your application project.
>
> +Although the relationship between subtree and mainline commits is stored

As far as I can see, this is the first time the term "mainline commit" is
used in that file, and it has not really be defined what you mean by that.
I *guess* you are referring to commits in the main project that did not
come from any subtree project.

Maybe this can be described without needing a new term?

Ciao,
Dscho

> +in regular git history, it is also cached between subtree runs. In most
> +cases this is merely a performance improvement, but for projects with
> +large and complex histories the cache can be manipulated directly
> +with the use, ignore and map commands.
> +
>  [TIP]
>  In order to keep your commit messages clean, we recommend that
>  people split their commits between the subtrees and the main
> @@ -120,6 +126,21 @@ and friends will work as expected.
>  Note that if you use '--squash' when you merge, you should usually not
>  just '--rejoin' when you split.
>
> +ignore::
> +	Mark a commit and all of its history as irrelevant to subtree split.
> +	In most cases this would be handled automatically based on metadata
> +	from subtree join commits. Intended for improving performance on
> +	extremely large repos and excluding complex history that turns out
> +	to be otherwise problematic.
> +
> +use::
> +	Mark a commit and all of its history as part of an existing subtree.
> +	In normal circumstances this would be handled based on the metadata
> +	from the subtree join commit. Similar to the --onto option of split.
> +
> +map::
> +	Manually override the normal output of split for a particular commit.
> +	Extreme flexibility for advanced troubleshooting purposes only.
>
>  OPTIONS
>  -------
> @@ -142,6 +163,9 @@ OPTIONS
>  	This option is only valid for add, merge and pull (unsure).
>  	Specify <message> as the commit message for the merge commit.
>
> +--clear-cache::
> +	Reset the subtree cache and recalculate all subtree mappings from the
> +	commit history
>
>  OPTIONS FOR add, merge, push, pull
>  ----------------------------------
> --
> gitgitgadget
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v2 0/7] subtree: Fix handling of complex history
  2020-10-06 22:05 ` [PATCH v2 " Tom Clarkson via GitGitGadget
                     ` (6 preceding siblings ...)
  2020-10-06 22:05   ` [PATCH v2 7/7] subtree: document new subtree commands Tom Clarkson via GitGitGadget
@ 2020-10-07 19:46   ` Johannes Schindelin
  7 siblings, 0 replies; 28+ messages in thread
From: Johannes Schindelin @ 2020-10-07 19:46 UTC (permalink / raw)
  To: Tom Clarkson via GitGitGadget; +Cc: git, Avery Pennarun, Ed Maste, Tom Clarkson

Hi Tom,

On Tue, 6 Oct 2020, Tom Clarkson via GitGitGadget wrote:

> Fixes several issues that could occur when running subtree split on large
> repos with more complex history.
>
>  1. A merge commit could bypass the known start point of the subtree, which
>     would cause the entire history to be processed recursively, leading to a
>     stack overflow / segfault after reading a few hundred commits. Older
>     commits are now explicitly recorded as irrelevant so that the recursive
>     process can terminate on any mainline commit rather than only on subtree
>     joins and initial commits.
>
>
>  2. It is possible for a repo to contain subtrees that lack the metadata
>     that is usually present in add/join commit messages (git-svn at least
>     can produce such a structure). The new use/ignore/map commands allow the
>     user to provide that information for any problematic commits.
>
>
>  3. A mainline commit that does not contain the subtree folder could be
>     erroneously identified as a subtree commit, which would add the entire
>     mainline history to the subtree. Commits will now only be used as is if
>     all their parents are already identified as subtree commits. While the
>     new code can still be tripped up by unusual folder structures, the
>     completely unambiguous solution turned out to involve a significant
>     performance penalty, and the new ignore / use commands provide a
>     workaround for that scenario.

I gave this as thorough a review as I can (which is not saying too much,
as I am not exactly familiar with `git subtree`'s inner workings).

Hopefully some of my comments and suggestions are helpful.

At some stage, especially given the problems I pointed out with the
implementation detail that is a flat directory with a potentially insane
number of files in it, I think it would make a lot of sense to go ahead
and turn this into a built-in Git command, implemented in C, and with a
more robust file system layout of its cache.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2020-10-07 19:46 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-11  5:49 [PATCH 0/7] subtree: Fix handling of complex history Tom Clarkson via GitGitGadget
2020-05-11  5:49 ` [PATCH 1/7] subtree: handle multiple parents passed to cache_miss Tom Clarkson via GitGitGadget
2020-05-11  5:49 ` [PATCH 2/7] subtree: exclude commits predating add from recursive processing Tom Clarkson via GitGitGadget
2020-05-11  5:49 ` [PATCH 3/7] subtree: persist cache between split runs Tom Clarkson via GitGitGadget
2020-05-11  5:49 ` [PATCH 4/7] subtree: add git subtree map command Tom Clarkson via GitGitGadget
2020-05-11  5:49 ` [PATCH 5/7] subtree: add git subtree use and ignore commands Tom Clarkson via GitGitGadget
2020-05-11  5:50 ` [PATCH 6/7] subtree: more robustly distinguish subtree and mainline commits Tom Clarkson via GitGitGadget
2020-05-11  5:50 ` [PATCH 7/7] subtree: document new subtree commands Tom Clarkson via GitGitGadget
2020-10-04 17:52 ` [PATCH 0/7] subtree: Fix handling of complex history Ed Maste
2020-10-04 19:27   ` Johannes Schindelin
2020-10-05 16:47     ` Junio C Hamano
2020-10-05 21:37     ` Ed Maste
2020-10-07 16:31       ` Johannes Schindelin
2020-10-06 22:05 ` [PATCH v2 " Tom Clarkson via GitGitGadget
2020-10-06 22:05   ` [PATCH v2 1/7] subtree: handle multiple parents passed to cache_miss Tom Clarkson via GitGitGadget
2020-10-07 13:12     ` Ed Maste
2020-10-06 22:05   ` [PATCH v2 2/7] subtree: exclude commits predating add from recursive processing Tom Clarkson via GitGitGadget
2020-10-07 15:36     ` Johannes Schindelin
2020-10-06 22:05   ` [PATCH v2 3/7] subtree: persist cache between split runs Tom Clarkson via GitGitGadget
2020-10-07 16:06     ` Johannes Schindelin
2020-10-06 22:05   ` [PATCH v2 4/7] subtree: add git subtree map command Tom Clarkson via GitGitGadget
2020-10-06 22:05   ` [PATCH v2 5/7] subtree: add git subtree use and ignore commands Tom Clarkson via GitGitGadget
2020-10-07 16:29     ` Johannes Schindelin
2020-10-06 22:05   ` [PATCH v2 6/7] subtree: more robustly distinguish subtree and mainline commits Tom Clarkson via GitGitGadget
2020-10-07 19:42     ` Johannes Schindelin
2020-10-06 22:05   ` [PATCH v2 7/7] subtree: document new subtree commands Tom Clarkson via GitGitGadget
2020-10-07 19:43     ` Johannes Schindelin
2020-10-07 19:46   ` [PATCH v2 0/7] subtree: Fix handling of complex history Johannes Schindelin

Code repositories for project(s) associated with this inbox:

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).