git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH 0/3] Make add_missing_tags() linear
@ 2018-10-30 14:16 Derrick Stolee via GitGitGadget
  2018-10-30 14:16 ` [PATCH 1/3] commit-reach: implement get_reachable_subset Derrick Stolee via GitGitGadget
                   ` (5 more replies)
  0 siblings, 6 replies; 22+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-10-30 14:16 UTC (permalink / raw)
  To: git; +Cc: peff, newren, Junio C Hamano

As reported earlier [1], the add_missing_tags() method in remote.c has
quadratic performance. Some of that performance is curbed due to the
generation-number cutoff in in_merge_bases_many(). However, that fix doesn't
help users without a commit-graph, and it can still be painful if that
cutoff is sufficiently low compared to the tags we are using for
reachability testing.

Add a new method in commit-reach.c called get_reachable_subset() which does
a many-to-many reachability test. Starting at the 'from' commits, walk until
the generation is below the smallest generation in the 'to' commits, or all
'to' commits have been discovered. This performs only one commit walk for
the entire add_missing_tags() method, giving linear performance in the worst
case.

Tests are added in t6600-test-reach.sh to ensure get_reachable_subset()
works independently of its application in add_missing_tags().

Thanks, -Stolee

[1] 
https://public-inbox.org/git/CABPp-BECpSOxudovjbDG_3W9wus102RW+E+qPmd4g3Qyd-QDKQ@mail.gmail.com/

Derrick Stolee (3):
  commit-reach: implement get_reachable_subset
  test-reach: test get_reachable_subset
  remote: make add_missing_tags() linear

 commit-reach.c        | 70 +++++++++++++++++++++++++++++++++++++++++++
 commit-reach.h        | 10 +++++++
 remote.c              | 34 ++++++++++++++++++++-
 t/helper/test-reach.c | 34 ++++++++++++++++++---
 t/t6600-test-reach.sh | 52 ++++++++++++++++++++++++++++++++
 5 files changed, 195 insertions(+), 5 deletions(-)


base-commit: c670b1f876521c9f7cd40184bf7ed05aad843433
Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-60%2Fderrickstolee%2Fadd-missing-tags-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-60/derrickstolee/add-missing-tags-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/60
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/3] commit-reach: implement get_reachable_subset
  2018-10-30 14:16 [PATCH 0/3] Make add_missing_tags() linear Derrick Stolee via GitGitGadget
@ 2018-10-30 14:16 ` Derrick Stolee via GitGitGadget
  2018-10-31  3:35   ` Junio C Hamano
  2018-10-31  6:07   ` Elijah Newren
  2018-10-30 14:16 ` [PATCH 2/3] test-reach: test get_reachable_subset Derrick Stolee via GitGitGadget
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 22+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-10-30 14:16 UTC (permalink / raw)
  To: git; +Cc: peff, newren, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The existing reachability algorithms in commit-reach.c focus on
finding merge-bases or determining if all commits in a set X can
reach at least one commit in a set Y. However, for two commits sets
X and Y, we may also care about which commits in Y are reachable
from at least one commit in X.

Implement get_reachable_subset() which answers this question. Given
two arrays of commits, 'from' and 'to', return a commit_list with
every commit from the 'to' array that is reachable from at least
one commit in the 'from' array.

The algorithm is a simple walk starting at the 'from' commits, using
the PARENT2 flag to indicate "this commit has already been added to
the walk queue". By marking the 'to' commits with the PARENT1 flag,
we can determine when we see a commit from the 'to' array. We remove
the PARENT1 flag as we add that commit to the result list to avoid
duplicates.

The order of the resulting list is a reverse of the order that the
commits are discovered in the walk.

There are a couple shortcuts to avoid walking more than we need:

1. We determine the minimum generation number of commits in the
   'to' array. We do not walk commits with generation number
   below this minimum.

2. We count how many distinct commits are in the 'to' array, and
   decrement this count when we discover a 'to' commit during the
   walk. If this number reaches zero, then we can terminate the
   walk.

Tests will be added using the 'test-tool reach' helper in a
subsequent commit.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-reach.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++
 commit-reach.h | 10 ++++++++
 2 files changed, 80 insertions(+)

diff --git a/commit-reach.c b/commit-reach.c
index 9f79ce0a2..a98532ecc 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -688,3 +688,73 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to,
 	object_array_clear(&from_objs);
 	return result;
 }
+
+struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
+					 struct commit **to, int nr_to,
+					 int reachable_flag)
+{
+	struct commit **item;
+	struct commit *current;
+	struct commit_list *found_commits = NULL;
+	struct commit **to_last = to + nr_to;
+	struct commit **from_last = from + nr_from;
+	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
+	int num_to_find = 0;
+
+	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
+
+	for (item = to; item < to_last; item++) {
+		struct commit *c = *item;
+		
+		parse_commit(c);
+		if (c->generation < min_generation)
+			min_generation = c->generation;
+
+		if (!(c->object.flags & PARENT1)) {
+			c->object.flags |= PARENT1;
+			num_to_find++;
+		}
+	}
+
+	for (item = from; item < from_last; item++) {
+		struct commit *c = *item;
+		if (!(c->object.flags & PARENT2)) {
+			c->object.flags |= PARENT2;
+			parse_commit(c);
+
+			prio_queue_put(&queue, *item);
+		}
+	}
+
+	while (num_to_find && (current = prio_queue_get(&queue)) != NULL) {
+		struct commit_list *parents;
+
+		if (current->object.flags & PARENT1) {
+			current->object.flags &= ~PARENT1;
+			current->object.flags |= reachable_flag;
+			commit_list_insert(current, &found_commits);
+			num_to_find--;
+		}
+
+		for (parents = current->parents; parents; parents = parents->next) {
+			struct commit *p = parents->item;
+
+			parse_commit(p);
+
+			if (p->generation < min_generation)
+				continue;
+
+			if (p->object.flags & PARENT2)
+				continue;
+
+			p->object.flags |= PARENT2;
+			prio_queue_put(&queue, p);
+		}
+	}
+
+	clear_commit_marks_many(nr_to, to, PARENT1);
+	clear_commit_marks_many(nr_from, from, PARENT2);
+
+	return found_commits;
+}
+
diff --git a/commit-reach.h b/commit-reach.h
index 7d313e297..43bd50a70 100644
--- a/commit-reach.h
+++ b/commit-reach.h
@@ -74,4 +74,14 @@ int can_all_from_reach_with_flag(struct object_array *from,
 int can_all_from_reach(struct commit_list *from, struct commit_list *to,
 		       int commit_date_cutoff);
 
+
+/*
+ * Return a list of commits containing the commits in the 'to' array
+ * that are reachable from at least one commit in the 'from' array.
+ * Also add the given 'flag' to each of the commits in the returned list.
+ */
+struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
+					 struct commit **to, int nr_to,
+					 int reachable_flag);
+
 #endif
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 2/3] test-reach: test get_reachable_subset
  2018-10-30 14:16 [PATCH 0/3] Make add_missing_tags() linear Derrick Stolee via GitGitGadget
  2018-10-30 14:16 ` [PATCH 1/3] commit-reach: implement get_reachable_subset Derrick Stolee via GitGitGadget
@ 2018-10-30 14:16 ` Derrick Stolee via GitGitGadget
  2018-10-30 14:16 ` [PATCH 3/3] remote: make add_missing_tags() linear Derrick Stolee via GitGitGadget
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 22+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-10-30 14:16 UTC (permalink / raw)
  To: git; +Cc: peff, newren, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The get_reachable_subset() method returns the list of commits in
the 'to' array that are reachable from at least one commit in the
'from' array. Add tests that check this method works in a few
cases:

1. All commits in the 'to' list are reachable. This exercises the
   early-termination condition.

2. Some commits in the 'to' list are reachable. This exercises the
   loop-termination condition.

3. No commits in the 'to' list are reachable. This exercises the
   NULL return condition.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/helper/test-reach.c | 34 ++++++++++++++++++++++++----
 t/t6600-test-reach.sh | 52 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 82 insertions(+), 4 deletions(-)

diff --git a/t/helper/test-reach.c b/t/helper/test-reach.c
index 08d2ea68e..a0272178b 100644
--- a/t/helper/test-reach.c
+++ b/t/helper/test-reach.c
@@ -32,8 +32,8 @@ int cmd__reach(int ac, const char **av)
 	struct commit *A, *B;
 	struct commit_list *X, *Y;
 	struct object_array X_obj = OBJECT_ARRAY_INIT;
-	struct commit **X_array;
-	int X_nr, X_alloc;
+	struct commit **X_array, **Y_array;
+	int X_nr, X_alloc, Y_nr, Y_alloc;
 	struct strbuf buf = STRBUF_INIT;
 	struct repository *r = the_repository;
 
@@ -44,9 +44,10 @@ int cmd__reach(int ac, const char **av)
 
 	A = B = NULL;
 	X = Y = NULL;
-	X_nr = 0;
-	X_alloc = 16;
+	X_nr = Y_nr = 0;
+	X_alloc = Y_alloc = 16;
 	ALLOC_ARRAY(X_array, X_alloc);
+	ALLOC_ARRAY(Y_array, Y_alloc);
 
 	while (strbuf_getline(&buf, stdin) != EOF) {
 		struct object_id oid;
@@ -92,6 +93,8 @@ int cmd__reach(int ac, const char **av)
 
 			case 'Y':
 				commit_list_insert(c, &Y);
+				ALLOC_GROW(Y_array, Y_nr + 1, Y_alloc);
+				Y_array[Y_nr++] = c;
 				break;
 
 			default:
@@ -136,6 +139,29 @@ int cmd__reach(int ac, const char **av)
 			filter.with_commit_tag_algo = 0;
 
 		printf("%s(_,A,X,_):%d\n", av[1], commit_contains(&filter, A, X, &cache));
+	} else if (!strcmp(av[1], "get_reachable_subset")) {
+		const int reachable_flag = 1;
+		int i, count = 0;
+		struct commit_list *current;
+		struct commit_list *list = get_reachable_subset(X_array, X_nr,
+								Y_array, Y_nr,
+								reachable_flag);
+		printf("get_reachable_subset(X,Y)\n");
+		for (current = list; current; current = current->next) {
+			if (!(list->item->object.flags & reachable_flag))
+				die(_("commit %s is not marked reachable"),
+				    oid_to_hex(&list->item->object.oid));
+			count++;
+		}
+		for (i = 0; i < Y_nr; i++) {
+			if (Y_array[i]->object.flags & reachable_flag)
+				count--;
+		}
+
+		if (count < 0)
+			die(_("too many commits marked reachable"));
+
+		print_sorted_commit_ids(list);
 	}
 
 	exit(0);
diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index ae94b27f7..a0c64e617 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -265,4 +265,56 @@ test_expect_success 'commit_contains:miss' '
 	test_three_modes commit_contains --tag
 '
 
+test_expect_success 'get_reachable_subset:all' '
+	cat >input <<-\EOF &&
+	X:commit-9-1
+	X:commit-8-3
+	X:commit-7-5
+	X:commit-6-6
+	X:commit-1-7
+	Y:commit-3-3
+	Y:commit-1-7
+	Y:commit-5-6
+	EOF
+	(
+		echo "get_reachable_subset(X,Y)" &&
+		git rev-parse commit-3-3 \
+			      commit-1-7 \
+			      commit-5-6 | sort
+	) >expect &&
+	test_three_modes get_reachable_subset
+'
+
+test_expect_success 'get_reachable_subset:some' '
+	cat >input <<-\EOF &&
+	X:commit-9-1
+	X:commit-8-3
+	X:commit-7-5
+	X:commit-1-7
+	Y:commit-3-3
+	Y:commit-1-7
+	Y:commit-5-6
+	EOF
+	(
+		echo "get_reachable_subset(X,Y)" &&
+		git rev-parse commit-3-3 \
+			      commit-1-7 | sort
+	) >expect &&
+	test_three_modes get_reachable_subset
+'
+
+test_expect_success 'get_reachable_subset:none' '
+	cat >input <<-\EOF &&
+	X:commit-9-1
+	X:commit-8-3
+	X:commit-7-5
+	X:commit-1-7
+	Y:commit-9-3
+	Y:commit-7-6
+	Y:commit-2-8
+	EOF
+	echo "get_reachable_subset(X,Y)" >expect &&
+	test_three_modes get_reachable_subset
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 3/3] remote: make add_missing_tags() linear
  2018-10-30 14:16 [PATCH 0/3] Make add_missing_tags() linear Derrick Stolee via GitGitGadget
  2018-10-30 14:16 ` [PATCH 1/3] commit-reach: implement get_reachable_subset Derrick Stolee via GitGitGadget
  2018-10-30 14:16 ` [PATCH 2/3] test-reach: test get_reachable_subset Derrick Stolee via GitGitGadget
@ 2018-10-30 14:16 ` Derrick Stolee via GitGitGadget
  2018-10-31  3:05 ` [PATCH 0/3] Make " Junio C Hamano
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 22+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-10-30 14:16 UTC (permalink / raw)
  To: git; +Cc: peff, newren, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The add_missing_tags() method currently has quadratic behavior.
This is due to a linear number (based on number of tags T) of
calls to in_merge_bases_many, which has linear performance (based
on number of commits C in the repository).

Replace this O(T * C) algorithm with an O(T + C) algorithm by
using get_reachable_subset(). We ignore the return list and focus
instead on the reachable_flag assigned to the commits we care
about, because we need to interact with the tag ref and not just
the commit object.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 remote.c | 34 +++++++++++++++++++++++++++++++++-
 1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/remote.c b/remote.c
index 81f4f01b0..b850f2feb 100644
--- a/remote.c
+++ b/remote.c
@@ -1205,9 +1205,36 @@ static void add_missing_tags(struct ref *src, struct ref **dst, struct ref ***ds
 	 * sent to the other side.
 	 */
 	if (sent_tips.nr) {
+		const int reachable_flag = 1;
+		struct commit_list *found_commits;
+		struct commit **src_commits;
+		int nr_src_commits = 0, alloc_src_commits = 16;
+		ALLOC_ARRAY(src_commits, alloc_src_commits);
+
 		for_each_string_list_item(item, &src_tag) {
 			struct ref *ref = item->util;
+			struct commit *commit;
+
+			if (is_null_oid(&ref->new_oid))
+				continue;
+			commit = lookup_commit_reference_gently(the_repository,
+								&ref->new_oid,
+								1);
+			if (!commit)
+				/* not pushing a commit, which is not an error */
+				continue;
+
+			ALLOC_GROW(src_commits, nr_src_commits + 1, alloc_src_commits);
+			src_commits[nr_src_commits++] = commit;
+		}
+
+		found_commits = get_reachable_subset(sent_tips.tip, sent_tips.nr,
+						     src_commits, nr_src_commits,
+						     reachable_flag);
+
+		for_each_string_list_item(item, &src_tag) {
 			struct ref *dst_ref;
+			struct ref *ref = item->util;
 			struct commit *commit;
 
 			if (is_null_oid(&ref->new_oid))
@@ -1223,7 +1250,7 @@ static void add_missing_tags(struct ref *src, struct ref **dst, struct ref ***ds
 			 * Is this tag, which they do not have, reachable from
 			 * any of the commits we are sending?
 			 */
-			if (!in_merge_bases_many(commit, sent_tips.nr, sent_tips.tip))
+			if (!(commit->object.flags & reachable_flag))
 				continue;
 
 			/* Add it in */
@@ -1231,7 +1258,12 @@ static void add_missing_tags(struct ref *src, struct ref **dst, struct ref ***ds
 			oidcpy(&dst_ref->new_oid, &ref->new_oid);
 			dst_ref->peer_ref = copy_ref(ref);
 		}
+
+		clear_commit_marks_many(nr_src_commits, src_commits, reachable_flag);
+		free(src_commits);
+		free_commit_list(found_commits);
 	}
+
 	string_list_clear(&src_tag, 0);
 	free(sent_tips.tip);
 }
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/3] Make add_missing_tags() linear
  2018-10-30 14:16 [PATCH 0/3] Make add_missing_tags() linear Derrick Stolee via GitGitGadget
                   ` (2 preceding siblings ...)
  2018-10-30 14:16 ` [PATCH 3/3] remote: make add_missing_tags() linear Derrick Stolee via GitGitGadget
@ 2018-10-31  3:05 ` Junio C Hamano
  2018-10-31  6:04 ` Elijah Newren
  2018-11-02 13:14 ` [PATCH v2 " Derrick Stolee via GitGitGadget
  5 siblings, 0 replies; 22+ messages in thread
From: Junio C Hamano @ 2018-10-31  3:05 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget; +Cc: git, peff, newren

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> Add a new method in commit-reach.c called get_reachable_subset() which does
> a many-to-many reachability test. Starting at the 'from' commits, walk until
> the generation is below the smallest generation in the 'to' commits, or all
> 'to' commits have been discovered. This performs only one commit walk for
> the entire add_missing_tags() method, giving linear performance in the worst
> case.

;-)

I think in_merge_bases_many() was an attempt to do one half of this
(i.e. it checks only one 'to' against main 'from' to see if any of
them reach).  I wonder why we didn't extend it to multiple 'to' back
when we invented it.

In any case, good to see this code optimized.

Thanks.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/3] commit-reach: implement get_reachable_subset
  2018-10-30 14:16 ` [PATCH 1/3] commit-reach: implement get_reachable_subset Derrick Stolee via GitGitGadget
@ 2018-10-31  3:35   ` Junio C Hamano
  2018-10-31 12:01     ` Derrick Stolee
  2018-10-31  6:07   ` Elijah Newren
  1 sibling, 1 reply; 22+ messages in thread
From: Junio C Hamano @ 2018-10-31  3:35 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget; +Cc: git, peff, newren, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
> +					 struct commit **to, int nr_to,
> +					 int reachable_flag)

This is OR'ed into object.flags, and I somoehow expected to see it
as 'unsigned int', not a signed one.

> +{
> +	struct commit **item;
> +	struct commit *current;
> +	struct commit_list *found_commits = NULL;
> +	struct commit **to_last = to + nr_to;
> +	struct commit **from_last = from + nr_from;
> +	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
> +	int num_to_find = 0;
> +
> +	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
> +
> +	for (item = to; item < to_last; item++) {
> +		struct commit *c = *item;
> +		
> +		parse_commit(c);
> +		if (c->generation < min_generation)
> +			min_generation = c->generation;
> +
> +		if (!(c->object.flags & PARENT1)) {
> +			c->object.flags |= PARENT1;
> +			num_to_find++;
> +		}
> +	}
> +
> +	for (item = from; item < from_last; item++) {
> +		struct commit *c = *item;
> +		if (!(c->object.flags & PARENT2)) {
> +			c->object.flags |= PARENT2;
> +			parse_commit(c);
> +
> +			prio_queue_put(&queue, *item);
> +		}
> +	}

OK, we marked "to" with PARENT1 and counted them in num_to_find
without dups.  We also marked "from" with PARENT2 and threw them in
the "queue" without dups.

Mental note: the caller must guarantee that everybody reachable from
"to" and "from" have PARENT1 and PARENT2 clear.  This might deserve
to be in the comment before the function.

> +	while (num_to_find && (current = prio_queue_get(&queue)) != NULL) {
> +		struct commit_list *parents;
> +
> +		if (current->object.flags & PARENT1) {
> +			current->object.flags &= ~PARENT1;
> +			current->object.flags |= reachable_flag;
> +			commit_list_insert(current, &found_commits);
> +			num_to_find--;

What is in "queue" are all reachable from "from" so we found this
object in the "to" set is reachable.  One down.

> +		}
> +
> +		for (parents = current->parents; parents; parents = parents->next) {
> +			struct commit *p = parents->item;
> +
> +			parse_commit(p);
> +
> +			if (p->generation < min_generation)
> +				continue;

Any commit reachable from "from" whose generation is too small
(i.e. old) can never reach any of "to", because all "to" are at
generation "min_generation" or greater.  Makes sense.

> +			if (p->object.flags & PARENT2)
> +				continue;

If the object is painted with PARENT2, we know we have it in queue
and traversing to find more of its ancestors.  Avoid recursing into
it twice.  Makes sense.

> +			p->object.flags |= PARENT2;
> +			prio_queue_put(&queue, p);

Otherwise, explore this parent.

> +		}
> +	}
> +
> +	clear_commit_marks_many(nr_to, to, PARENT1);

Among "to", the ones that were not found still have PARENT1.
Because we do not propagate this flag down the ancestry chain like
merge-base discovery, this only has to clear just one level.

> +	clear_commit_marks_many(nr_from, from, PARENT2);

And then we smudged everything that are reachable from "from"
with this flag.  In the worst case, i.e. when num_to_find is still
positive here, we would have painted all commits down to the root,
and we need to clear all of them.

> +	return found_commits;
> +}

OK, this all makes sense.  Unlike merge-base traversals, this does
not have to traverse from the "to" side at all, which makes it quite
simpler and straight-forward.

I do wonder if we can now reimplement in_merge_bases_many() in terms
of this helper, and if that gives us a better performance.  It asks
"is 'commit', i.e. a single 'to', an ancestor of, i.e. reachable
from, one of the 'references', i.e.  'from'"?

> diff --git a/commit-reach.h b/commit-reach.h
> index 7d313e297..43bd50a70 100644
> --- a/commit-reach.h
> +++ b/commit-reach.h
> @@ -74,4 +74,14 @@ int can_all_from_reach_with_flag(struct object_array *from,
>  int can_all_from_reach(struct commit_list *from, struct commit_list *to,
>  		       int commit_date_cutoff);
>  
> +
> +/*
> + * Return a list of commits containing the commits in the 'to' array
> + * that are reachable from at least one commit in the 'from' array.
> + * Also add the given 'flag' to each of the commits in the returned list.
> + */
> +struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
> +					 struct commit **to, int nr_to,
> +					 int reachable_flag);
> +
>  #endif

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/3] Make add_missing_tags() linear
  2018-10-30 14:16 [PATCH 0/3] Make add_missing_tags() linear Derrick Stolee via GitGitGadget
                   ` (3 preceding siblings ...)
  2018-10-31  3:05 ` [PATCH 0/3] Make " Junio C Hamano
@ 2018-10-31  6:04 ` Elijah Newren
  2018-10-31 12:05   ` Derrick Stolee
  2018-11-02 13:14 ` [PATCH v2 " Derrick Stolee via GitGitGadget
  5 siblings, 1 reply; 22+ messages in thread
From: Elijah Newren @ 2018-10-31  6:04 UTC (permalink / raw)
  To: Johannes Schindelin via GitGitGadget
  Cc: Git Mailing List, Jeff King, Junio C Hamano

On Tue, Oct 30, 2018 at 7:16 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> As reported earlier [1], the add_missing_tags() method in remote.c has
> quadratic performance. Some of that performance is curbed due to the
> generation-number cutoff in in_merge_bases_many(). However, that fix doesn't
> help users without a commit-graph, and it can still be painful if that
> cutoff is sufficiently low compared to the tags we are using for
> reachability testing.
>
> Add a new method in commit-reach.c called get_reachable_subset() which does
> a many-to-many reachability test. Starting at the 'from' commits, walk until
> the generation is below the smallest generation in the 'to' commits, or all
> 'to' commits have been discovered. This performs only one commit walk for
> the entire add_missing_tags() method, giving linear performance in the worst
> case.
>
> Tests are added in t6600-test-reach.sh to ensure get_reachable_subset()
> works independently of its application in add_missing_tags().

On the original repo where the topic was brought up, with commit-graph
NOT turned on and using origin/master, I see:

$ time git push --dry-run --follow-tags /home/newren/repo-mirror
To /home/newren/repo-mirror
 * [new branch]              test5 -> test5

real 1m20.081s
user 1m19.688s
sys 0m0.292s

Merging this series in, I now get:

$ time git push --dry-run --follow-tags /home/newren/repo-mirror
To /home/newren/repo-mirror
 * [new branch]              test5 -> test5

real 0m2.857s
user 0m2.580s
sys 0m0.328s

which provides a very nice speedup.

Oddly enough, if I _also_ do the following:
$ git config core.commitgraph true
$ git config gc.writecommitgraph true
$ git gc

then my timing actually slows down just slightly:
$ time git push --follow-tags --dry-run /home/newren/repo-mirror
To /home/newren/repo-mirror
 * [new branch]              test5 -> test5

real 0m3.027s
user 0m2.696s
sys 0m0.400s

(run-to-run variation seems pretty consistent, < .1s variation, so
this difference is just enough to notice.)  I wouldn't be that
surprised if that means there's some really old tags with very small
generation numbers, meaning it's not gaining anything in this special
case from the commit-graph, but it does pay the cost of loading the
commit-graph.


Anyway, looks good in my testing.  Thanks much for working on this!

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/3] commit-reach: implement get_reachable_subset
  2018-10-30 14:16 ` [PATCH 1/3] commit-reach: implement get_reachable_subset Derrick Stolee via GitGitGadget
  2018-10-31  3:35   ` Junio C Hamano
@ 2018-10-31  6:07   ` Elijah Newren
  2018-10-31 11:54     ` Derrick Stolee
  1 sibling, 1 reply; 22+ messages in thread
From: Elijah Newren @ 2018-10-31  6:07 UTC (permalink / raw)
  To: Johannes Schindelin via GitGitGadget
  Cc: Git Mailing List, Jeff King, Junio C Hamano, Derrick Stolee

On Tue, Oct 30, 2018 at 7:16 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
> --- a/commit-reach.c
> +++ b/commit-reach.c
> @@ -688,3 +688,73 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to,
>         object_array_clear(&from_objs);
>         return result;
>  }
> +
> +struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
> +                                        struct commit **to, int nr_to,
> +                                        int reachable_flag)
> +{
> +       struct commit **item;
> +       struct commit *current;
> +       struct commit_list *found_commits = NULL;
> +       struct commit **to_last = to + nr_to;
> +       struct commit **from_last = from + nr_from;
> +       uint32_t min_generation = GENERATION_NUMBER_INFINITY;
> +       int num_to_find = 0;
> +
> +       struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
> +
> +       for (item = to; item < to_last; item++) {
> +               struct commit *c = *item;
> +
> +               parse_commit(c);
> +               if (c->generation < min_generation)
> +                       min_generation = c->generation;

So, when we don't have a commit-graph, is c->generation just going to
be 0, making min_generation also be 0? (meaning we get no possible
speed benefit from the commit-graph, since we just don't have that
information available)?

> +
> +               if (!(c->object.flags & PARENT1)) {
> +                       c->object.flags |= PARENT1;
> +                       num_to_find++;
> +               }
> +       }
> +
> +       for (item = from; item < from_last; item++) {
> +               struct commit *c = *item;
> +               if (!(c->object.flags & PARENT2)) {
> +                       c->object.flags |= PARENT2;
> +                       parse_commit(c);
> +
> +                       prio_queue_put(&queue, *item);
> +               }
> +       }
> +
> +       while (num_to_find && (current = prio_queue_get(&queue)) != NULL) {
> +               struct commit_list *parents;
> +
> +               if (current->object.flags & PARENT1) {
> +                       current->object.flags &= ~PARENT1;
> +                       current->object.flags |= reachable_flag;
> +                       commit_list_insert(current, &found_commits);
> +                       num_to_find--;
> +               }
> +
> +               for (parents = current->parents; parents; parents = parents->next) {
> +                       struct commit *p = parents->item;
> +
> +                       parse_commit(p);
> +
> +                       if (p->generation < min_generation)
> +                               continue;
> +
> +                       if (p->object.flags & PARENT2)
> +                               continue;
> +
> +                       p->object.flags |= PARENT2;
> +                       prio_queue_put(&queue, p);
> +               }
> +       }
> +
> +       clear_commit_marks_many(nr_to, to, PARENT1);
> +       clear_commit_marks_many(nr_from, from, PARENT2);
> +
> +       return found_commits;
> +}
> +
> diff --git a/commit-reach.h b/commit-reach.h
> index 7d313e297..43bd50a70 100644
> --- a/commit-reach.h
> +++ b/commit-reach.h
> @@ -74,4 +74,14 @@ int can_all_from_reach_with_flag(struct object_array *from,
>  int can_all_from_reach(struct commit_list *from, struct commit_list *to,
>                        int commit_date_cutoff);
>
> +
> +/*
> + * Return a list of commits containing the commits in the 'to' array
> + * that are reachable from at least one commit in the 'from' array.
> + * Also add the given 'flag' to each of the commits in the returned list.
> + */
> +struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
> +                                        struct commit **to, int nr_to,
> +                                        int reachable_flag);
> +
>  #endif
> --
> gitgitgadget
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/3] commit-reach: implement get_reachable_subset
  2018-10-31  6:07   ` Elijah Newren
@ 2018-10-31 11:54     ` Derrick Stolee
  0 siblings, 0 replies; 22+ messages in thread
From: Derrick Stolee @ 2018-10-31 11:54 UTC (permalink / raw)
  To: Elijah Newren, Johannes Schindelin via GitGitGadget
  Cc: Git Mailing List, Jeff King, Junio C Hamano, Derrick Stolee

On 10/31/2018 2:07 AM, Elijah Newren wrote:
> On Tue, Oct 30, 2018 at 7:16 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>> --- a/commit-reach.c
>> +++ b/commit-reach.c
>> @@ -688,3 +688,73 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to,
>>          object_array_clear(&from_objs);
>>          return result;
>>   }
>> +
>> +struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
>> +                                        struct commit **to, int nr_to,
>> +                                        int reachable_flag)
>> +{
>> +       struct commit **item;
>> +       struct commit *current;
>> +       struct commit_list *found_commits = NULL;
>> +       struct commit **to_last = to + nr_to;
>> +       struct commit **from_last = from + nr_from;
>> +       uint32_t min_generation = GENERATION_NUMBER_INFINITY;
>> +       int num_to_find = 0;
>> +
>> +       struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
>> +
>> +       for (item = to; item < to_last; item++) {
>> +               struct commit *c = *item;
>> +
>> +               parse_commit(c);
>> +               if (c->generation < min_generation)
>> +                       min_generation = c->generation;
> So, when we don't have a commit-graph, is c->generation just going to
> be 0, making min_generation also be 0? (meaning we get no possible
> speed benefit from the commit-graph, since we just don't have that
> information available)?

If we don't have a commit-graph, then we can only terminate the loop early
if we discover all of the commits (num_to_find == 0).Otherwise, we need to
walk the entire graph in order to determine non-reachability. This relates
to Junio's point about in_merge_bases_many(), which I'll respond to his
message in more detail about that.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/3] commit-reach: implement get_reachable_subset
  2018-10-31  3:35   ` Junio C Hamano
@ 2018-10-31 12:01     ` Derrick Stolee
  2018-11-02  1:51       ` Junio C Hamano
  0 siblings, 1 reply; 22+ messages in thread
From: Derrick Stolee @ 2018-10-31 12:01 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, peff, newren, Derrick Stolee

On 10/30/2018 11:35 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
>> +struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
>> +					 struct commit **to, int nr_to,
>> +					 int reachable_flag)
> This is OR'ed into object.flags, and I somoehow expected to see it
> as 'unsigned int', not a signed one.

Will do. Thanks.

>
>> +{
>> +	struct commit **item;
>> +	struct commit *current;
>> +	struct commit_list *found_commits = NULL;
>> +	struct commit **to_last = to + nr_to;
>> +	struct commit **from_last = from + nr_from;
>> +	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
>> +	int num_to_find = 0;
>> +
>> +	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
>> +
>> +	for (item = to; item < to_last; item++) {
>> +		struct commit *c = *item;
>> +		
>> +		parse_commit(c);
>> +		if (c->generation < min_generation)
>> +			min_generation = c->generation;
>> +
>> +		if (!(c->object.flags & PARENT1)) {
>> +			c->object.flags |= PARENT1;
>> +			num_to_find++;
>> +		}
>> +	}
>> +
>> +	for (item = from; item < from_last; item++) {
>> +		struct commit *c = *item;
>> +		if (!(c->object.flags & PARENT2)) {
>> +			c->object.flags |= PARENT2;
>> +			parse_commit(c);
>> +
>> +			prio_queue_put(&queue, *item);
>> +		}
>> +	}
> OK, we marked "to" with PARENT1 and counted them in num_to_find
> without dups.  We also marked "from" with PARENT2 and threw them in
> the "queue" without dups.
>
> Mental note: the caller must guarantee that everybody reachable from
> "to" and "from" have PARENT1 and PARENT2 clear.  This might deserve
> to be in the comment before the function.

I'll put that in the header file.

[snip]
> OK, this all makes sense.  Unlike merge-base traversals, this does
> not have to traverse from the "to" side at all, which makes it quite
> simpler and straight-forward.
>
> I do wonder if we can now reimplement in_merge_bases_many() in terms
> of this helper, and if that gives us a better performance.  It asks
> "is 'commit', i.e. a single 'to', an ancestor of, i.e. reachable
> from, one of the 'references', i.e.  'from'"?

We could do this, but it does come with a performance hit when the following
are all true:

1. 'to' is not reachable from any 'from' commits.

2. The 'to' and 'from' commits are close in commit-date.

3. Generation numbers are not available, or the topology is skewed to have
    commits with high commit date and low generation number.

Since in_merge_bases_many() calls paint_down_to_common(), it has the same
issues with the current generation numbers. This can be fixed when we have
the next version of generation numbers available.

I'll make a note to have in_merge_bases_many() call get_reachable_subset()
conditionally (like the generation_numbers_available() trick in the 
--topo-order
series) after the generation numbers are settled and implemented.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/3] Make add_missing_tags() linear
  2018-10-31  6:04 ` Elijah Newren
@ 2018-10-31 12:05   ` Derrick Stolee
  2018-11-01  6:52     ` Elijah Newren
  0 siblings, 1 reply; 22+ messages in thread
From: Derrick Stolee @ 2018-10-31 12:05 UTC (permalink / raw)
  To: git; +Cc: newren, gitster, peff, Derrick Stolee


On 10/31/2018 2:04 AM, Elijah Newren wrote:
> On Tue, Oct 30, 2018 at 7:16 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>>
>> As reported earlier [1], the add_missing_tags() method in remote.c has
>> quadratic performance. Some of that performance is curbed due to the
>> generation-number cutoff in in_merge_bases_many(). However, that fix doesn't
>> help users without a commit-graph, and it can still be painful if that
>> cutoff is sufficiently low compared to the tags we are using for
>> reachability testing.
>>
>> Add a new method in commit-reach.c called get_reachable_subset() which does
>> a many-to-many reachability test. Starting at the 'from' commits, walk until
>> the generation is below the smallest generation in the 'to' commits, or all
>> 'to' commits have been discovered. This performs only one commit walk for
>> the entire add_missing_tags() method, giving linear performance in the worst
>> case.
>>
>> Tests are added in t6600-test-reach.sh to ensure get_reachable_subset()
>> works independently of its application in add_missing_tags().
>
> On the original repo where the topic was brought up, with commit-graph
> NOT turned on and using origin/master, I see:
>
> $ time git push --dry-run --follow-tags /home/newren/repo-mirror
> To /home/newren/repo-mirror
>  * [new branch]       test5 -> test5
>
> real 1m20.081s
> user 1m19.688s
> sys 0m0.292s
>
> Merging this series in, I now get:
>
> $ time git push --dry-run --follow-tags /home/newren/repo-mirror
> To /home/newren/repo-mirror
>  * [new branch]       test5 -> test5
>
> real 0m2.857s
> user 0m2.580s
> sys 0m0.328s
>
> which provides a very nice speedup.
>
> Oddly enough, if I _also_ do the following:
> $ git config core.commitgraph true
> $ git config gc.writecommitgraph true
> $ git gc
>
> then my timing actually slows down just slightly:
> $ time git push --follow-tags --dry-run /home/newren/repo-mirror
> To /home/newren/repo-mirror
>  * [new branch]          test5 -> test5
>
> real 0m3.027s
> user 0m2.696s
> sys 0m0.400s

So you say that the commit-graph is off in the 2.8s case, but not here
in the 3.1s case? I would expect _at minimum_ that the cost of parsing
commits would have a speedup in the commit-graph case.  There may be
something else going on here, since you are timing a `push` event that
is doing more than the current walk.

> (run-to-run variation seems pretty consistent, < .1s variation, so
> this difference is just enough to notice.)  I wouldn't be that
> surprised if that means there's some really old tags with very small
> generation numbers, meaning it's not gaining anything in this special
> case from the commit-graph, but it does pay the cost of loading the
> commit-graph.

While you have this test environment, do you mind applying the diff
below and re-running the tests? It will output a count for how many
commits are walked by the algorithm. This should help us determine if
this is another case where generation numbers are worse than commit-date,
or if there is something else going on. Thanks!

-->8--

From 2115e7dcafb2770455b7b4793f90edc2254bad97 Mon Sep 17 00:00:00 2001
From: Derrick Stolee <dstolee@microsoft.com>
Date: Wed, 31 Oct 2018 11:40:50 +0000
Subject: [PATCH] DO-NOT-MERGE: count commits in get_reachable_subset

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-reach.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/commit-reach.c b/commit-reach.c
index a98532ecc8..b512461cf7 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -700,6 +700,7 @@ struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
 	struct commit **from_last = from + nr_from;
 	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
 	int num_to_find = 0;
+	int count = 0;
 
 	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
 
@@ -729,6 +730,8 @@ struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
 	while (num_to_find && (current = prio_queue_get(&queue)) != NULL) {
 		struct commit_list *parents;
 
+		count++;
+
 		if (current->object.flags & PARENT1) {
 			current->object.flags &= ~PARENT1;
 			current->object.flags |= reachable_flag;
@@ -755,6 +758,8 @@ struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
 	clear_commit_marks_many(nr_to, to, PARENT1);
 	clear_commit_marks_many(nr_from, from, PARENT2);
 
+	fprintf(stderr, "count: %d\n", count);
+
 	return found_commits;
 }
 
-- 
2.19.1.542.gc4df23f792


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/3] Make add_missing_tags() linear
  2018-10-31 12:05   ` Derrick Stolee
@ 2018-11-01  6:52     ` Elijah Newren
  2018-11-01 12:32       ` Derrick Stolee
  0 siblings, 1 reply; 22+ messages in thread
From: Elijah Newren @ 2018-11-01  6:52 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git Mailing List, Junio C Hamano, Jeff King, Derrick Stolee

On Wed, Oct 31, 2018 at 5:05 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 10/31/2018 2:04 AM, Elijah Newren wrote:
> > On Tue, Oct 30, 2018 at 7:16 AM Derrick Stolee via GitGitGadget
> > <gitgitgadget@gmail.com> wrote:
> >>
> >> As reported earlier [1], the add_missing_tags() method in remote.c has
> >> quadratic performance. Some of that performance is curbed due to the
> >> generation-number cutoff in in_merge_bases_many(). However, that fix doesn't
> >> help users without a commit-graph, and it can still be painful if that
> >> cutoff is sufficiently low compared to the tags we are using for
> >> reachability testing.
> >>
> >> Add a new method in commit-reach.c called get_reachable_subset() which does
> >> a many-to-many reachability test. Starting at the 'from' commits, walk until
> >> the generation is below the smallest generation in the 'to' commits, or all
> >> 'to' commits have been discovered. This performs only one commit walk for
> >> the entire add_missing_tags() method, giving linear performance in the worst
> >> case.
> >>
> >> Tests are added in t6600-test-reach.sh to ensure get_reachable_subset()
> >> works independently of its application in add_missing_tags().
> >
> > On the original repo where the topic was brought up, with commit-graph
> > NOT turned on and using origin/master, I see:
> >
> > $ time git push --dry-run --follow-tags /home/newren/repo-mirror
> > To /home/newren/repo-mirror
> >  * [new branch]       test5 -> test5
> >
> > real 1m20.081s
> > user 1m19.688s
> > sys 0m0.292s
> >
> > Merging this series in, I now get:
> >
> > $ time git push --dry-run --follow-tags /home/newren/repo-mirror
> > To /home/newren/repo-mirror
> >  * [new branch]       test5 -> test5
> >
> > real 0m2.857s
> > user 0m2.580s
> > sys 0m0.328s
> >
> > which provides a very nice speedup.
> >
> > Oddly enough, if I _also_ do the following:
> > $ git config core.commitgraph true
> > $ git config gc.writecommitgraph true
> > $ git gc
> >
> > then my timing actually slows down just slightly:
> > $ time git push --follow-tags --dry-run /home/newren/repo-mirror
> > To /home/newren/repo-mirror
> >  * [new branch]          test5 -> test5
> >
> > real 0m3.027s
> > user 0m2.696s
> > sys 0m0.400s
>
> So you say that the commit-graph is off in the 2.8s case, but not here
> in the 3.1s case? I would expect _at minimum_ that the cost of parsing
> commits would have a speedup in the commit-graph case.  There may be
> something else going on here, since you are timing a `push` event that
> is doing more than the current walk.
>
> > (run-to-run variation seems pretty consistent, < .1s variation, so
> > this difference is just enough to notice.)  I wouldn't be that
> > surprised if that means there's some really old tags with very small
> > generation numbers, meaning it's not gaining anything in this special
> > case from the commit-graph, but it does pay the cost of loading the
> > commit-graph.
>
> While you have this test environment, do you mind applying the diff
> below and re-running the tests? It will output a count for how many
> commits are walked by the algorithm. This should help us determine if
> this is another case where generation numbers are worse than commit-date,
> or if there is something else going on. Thanks!

I can do that, but wouldn't you want a similar patch for the old
get_merge_bases_many() in order to compare?  Does an absolute number
help by itself?
It's going to have to be tomorrow, though; not enough time tonight.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/3] Make add_missing_tags() linear
  2018-11-01  6:52     ` Elijah Newren
@ 2018-11-01 12:32       ` Derrick Stolee
  2018-11-01 18:57         ` Elijah Newren
  0 siblings, 1 reply; 22+ messages in thread
From: Derrick Stolee @ 2018-11-01 12:32 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Git Mailing List, Junio C Hamano, Jeff King, Derrick Stolee

On 11/1/2018 2:52 AM, Elijah Newren wrote:
> On Wed, Oct 31, 2018 at 5:05 AM Derrick Stolee <stolee@gmail.com> wrote:
>> On 10/31/2018 2:04 AM, Elijah Newren wrote:
>>>
>>> On the original repo where the topic was brought up, with commit-graph
>>> NOT turned on and using origin/master, I see:
>>>
>>> $ time git push --dry-run --follow-tags /home/newren/repo-mirror
>>> To /home/newren/repo-mirror
>>>   * [new branch]       test5 -> test5
>>>
>>> real 1m20.081s
>>> user 1m19.688s
>>> sys 0m0.292s
>>>
>>> Merging this series in, I now get:
>>>
>>> $ time git push --dry-run --follow-tags /home/newren/repo-mirror
>>> To /home/newren/repo-mirror
>>>   * [new branch]       test5 -> test5
>>>
>>> real 0m2.857s
>>> user 0m2.580s
>>> sys 0m0.328s
>>>
>>> which provides a very nice speedup.
>>>
>>> Oddly enough, if I _also_ do the following:
>>> $ git config core.commitgraph true
>>> $ git config gc.writecommitgraph true
>>> $ git gc
>>>
>>> then my timing actually slows down just slightly:
>>> $ time git push --follow-tags --dry-run /home/newren/repo-mirror
>>> To /home/newren/repo-mirror
>>>   * [new branch]          test5 -> test5
>>>
>>> real 0m3.027s
>>> user 0m2.696s
>>> sys 0m0.400s
>> So you say that the commit-graph is off in the 2.8s case, but not here
>> in the 3.1s case? I would expect _at minimum_ that the cost of parsing
>> commits would have a speedup in the commit-graph case.  There may be
>> something else going on here, since you are timing a `push` event that
>> is doing more than the current walk.
>>
>>> (run-to-run variation seems pretty consistent, < .1s variation, so
>>> this difference is just enough to notice.)  I wouldn't be that
>>> surprised if that means there's some really old tags with very small
>>> generation numbers, meaning it's not gaining anything in this special
>>> case from the commit-graph, but it does pay the cost of loading the
>>> commit-graph.
>> While you have this test environment, do you mind applying the diff
>> below and re-running the tests? It will output a count for how many
>> commits are walked by the algorithm. This should help us determine if
>> this is another case where generation numbers are worse than commit-date,
>> or if there is something else going on. Thanks!
> I can do that, but wouldn't you want a similar patch for the old
> get_merge_bases_many() in order to compare?  Does an absolute number
> help by itself?
> It's going to have to be tomorrow, though; not enough time tonight.

No rush. I'd just like to understand how removing the commit-graph file
can make the new algorithm faster. Putting a similar count in the old
algorithm would involve giving a count for every call to
in_merge_bases_many(), which would be very noisy.

Thanks!
-Stolee

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/3] Make add_missing_tags() linear
  2018-11-01 12:32       ` Derrick Stolee
@ 2018-11-01 18:57         ` Elijah Newren
  2018-11-01 19:02           ` Derrick Stolee
  0 siblings, 1 reply; 22+ messages in thread
From: Elijah Newren @ 2018-11-01 18:57 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git Mailing List, Junio C Hamano, Jeff King, Derrick Stolee

On Thu, Nov 1, 2018 at 5:32 AM Derrick Stolee <stolee@gmail.com> wrote:
> On 11/1/2018 2:52 AM, Elijah Newren wrote:
> > On Wed, Oct 31, 2018 at 5:05 AM Derrick Stolee <stolee@gmail.com> wrote:
> >> On 10/31/2018 2:04 AM, Elijah Newren wrote:
> >>>
> >>> On the original repo where the topic was brought up, with commit-graph
> >>> NOT turned on and using origin/master, I see:
> >>>
> >>> $ time git push --dry-run --follow-tags /home/newren/repo-mirror
> >>> To /home/newren/repo-mirror
> >>>   * [new branch]       test5 -> test5
> >>>
> >>> real 1m20.081s
> >>> user 1m19.688s
> >>> sys 0m0.292s
> >>>
> >>> Merging this series in, I now get:
> >>>
> >>> $ time git push --dry-run --follow-tags /home/newren/repo-mirror
> >>> To /home/newren/repo-mirror
> >>>   * [new branch]       test5 -> test5
> >>>
> >>> real 0m2.857s
> >>> user 0m2.580s
> >>> sys 0m0.328s
> >>>
> >>> which provides a very nice speedup.
> >>>
> >>> Oddly enough, if I _also_ do the following:
> >>> $ git config core.commitgraph true
> >>> $ git config gc.writecommitgraph true
> >>> $ git gc
> >>>
> >>> then my timing actually slows down just slightly:
> >>> $ time git push --follow-tags --dry-run /home/newren/repo-mirror
> >>> To /home/newren/repo-mirror
> >>>   * [new branch]          test5 -> test5
> >>>
> >>> real 0m3.027s
> >>> user 0m2.696s
> >>> sys 0m0.400s
> >> So you say that the commit-graph is off in the 2.8s case, but not here
> >> in the 3.1s case? I would expect _at minimum_ that the cost of parsing
> >> commits would have a speedup in the commit-graph case.  There may be
> >> something else going on here, since you are timing a `push` event that
> >> is doing more than the current walk.
> >>
> >>> (run-to-run variation seems pretty consistent, < .1s variation, so
> >>> this difference is just enough to notice.)  I wouldn't be that
> >>> surprised if that means there's some really old tags with very small
> >>> generation numbers, meaning it's not gaining anything in this special
> >>> case from the commit-graph, but it does pay the cost of loading the
> >>> commit-graph.
> >> While you have this test environment, do you mind applying the diff
> >> below and re-running the tests? It will output a count for how many
> >> commits are walked by the algorithm. This should help us determine if
> >> this is another case where generation numbers are worse than commit-date,
> >> or if there is something else going on. Thanks!
> > I can do that, but wouldn't you want a similar patch for the old
> > get_merge_bases_many() in order to compare?  Does an absolute number
> > help by itself?
> > It's going to have to be tomorrow, though; not enough time tonight.
>
> No rush. I'd just like to understand how removing the commit-graph file
> can make the new algorithm faster. Putting a similar count in the old
> algorithm would involve giving a count for every call to
> in_merge_bases_many(), which would be very noisy.

$ time git push --dry-run --follow-tags /home/newren/repo-mirror
count: 92912
To /home/newren/repo-mirror
 * [new branch]              test5 -> test5

real    0m3.024s
user    0m2.752s
sys    0m0.320s


Also:
$ git rev-list --count HEAD
55764
$ git rev-list --count --all
91820

Seems a little odd to me that count is greater than `git rev-list
--count --all`.  However, the fact that they are close in magnitude
isn't surprising since I went digging for the commit with smallest
generation number not found in the upstream repo, and found:
$ git ls-remote /home/newren/repo-mirror/ | grep refs/tags/v0.2.0; echo $?
1
$ git rev-list --count refs/tags/v0.2.0
4
$ git rev-list --count refs/tags/v0.2.0 ^HEAD
4


So, the commit-graph can only help us avoid parsing 3 or so commits,
but we have to parse the 5M .git/objects/info/commit-graph file, and
then for every parse_commit() call we need to bsearch_graph() for the
commit.    My theory is that parsing the commit-graph file and binary
searching it for commits is relatively fast, but that the time is just
big enough to measure and notice, while avoiding parsing 3 more
commits is a negligible time savings.

To me, I'm thinking this is one of those bizarre corner cases where
the commit-graph is almost imperceptibly slower than without the
commit-graph.  (And it is a very weird repo; someone repeatedly
filter-branched lots of small independent repos into a monorepo, but
didn't always push everything and didn't clean out all old stuff.)
But if you still see weird stuff you want to dig into further (maybe
the 92912 > 91820 bit?), I'm happy to try out other stuff.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/3] Make add_missing_tags() linear
  2018-11-01 18:57         ` Elijah Newren
@ 2018-11-01 19:02           ` Derrick Stolee
  2018-11-02 14:58             ` Elijah Newren
  0 siblings, 1 reply; 22+ messages in thread
From: Derrick Stolee @ 2018-11-01 19:02 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Git Mailing List, Junio C Hamano, Jeff King, Derrick Stolee

On 11/1/2018 2:57 PM, Elijah Newren wrote:
> On Thu, Nov 1, 2018 at 5:32 AM Derrick Stolee <stolee@gmail.com> wrote:
>> No rush. I'd just like to understand how removing the commit-graph file
>> can make the new algorithm faster. Putting a similar count in the old
>> algorithm would involve giving a count for every call to
>> in_merge_bases_many(), which would be very noisy.
> $ time git push --dry-run --follow-tags /home/newren/repo-mirror
> count: 92912
> To /home/newren/repo-mirror
>   * [new branch]              test5 -> test5
>
> real    0m3.024s
> user    0m2.752s
> sys    0m0.320s

Is the above test with or without the commit-graph file? Can you run it 
in the other mode, too? I'd like to see if the "count" value changes 
when the only difference is the presence of a commit-graph file.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/3] commit-reach: implement get_reachable_subset
  2018-10-31 12:01     ` Derrick Stolee
@ 2018-11-02  1:51       ` Junio C Hamano
  0 siblings, 0 replies; 22+ messages in thread
From: Junio C Hamano @ 2018-11-02  1:51 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, peff, newren,
	Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

> We could do this, but it does come with a performance hit when the following
> are all true:
>
> 1. 'to' is not reachable from any 'from' commits.
>
> 2. The 'to' and 'from' commits are close in commit-date.
>
> 3. Generation numbers are not available, or the topology is skewed to have
>    commits with high commit date and low generation number.
>
> Since in_merge_bases_many() calls paint_down_to_common(), it has the same
> issues with the current generation numbers. This can be fixed when we have
> the next version of generation numbers available.
>
> I'll make a note to have in_merge_bases_many() call get_reachable_subset()
> conditionally (like the generation_numbers_available() trick in the
> --topo-order
> series) after the generation numbers are settled and implemented.

Sounds good.

I wondered how this compares with in_merge_bases_many() primarily
because how performance characteristics between two approaches trade
off.  After all, what it needs to compute is a specialized case of
get_reachable_subset() where "to" happens to be a set with only
single element.  If the latter with a single element "to" has the
above performance "issues" compared to paint-down-to-common based
approach, it could be possible that, for small enough N>1, running N
in_merge_bases_many() traversals is more performant than a single
get_reachable_subset() call that works on N-element "to".

I am hoping that an update to better generation numbers to help
get_reachable_subset() would help paint-down-to-common the same way,
and would eventually allow us to use a single traversal that is best
for N==1 and N>1.

Thanks.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v2 0/3] Make add_missing_tags() linear
  2018-10-30 14:16 [PATCH 0/3] Make add_missing_tags() linear Derrick Stolee via GitGitGadget
                   ` (4 preceding siblings ...)
  2018-10-31  6:04 ` Elijah Newren
@ 2018-11-02 13:14 ` Derrick Stolee via GitGitGadget
  2018-11-02 13:14   ` [PATCH v2 1/3] commit-reach: implement get_reachable_subset Derrick Stolee via GitGitGadget
                     ` (2 more replies)
  5 siblings, 3 replies; 22+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-11-02 13:14 UTC (permalink / raw)
  To: git; +Cc: peff, newren, Junio C Hamano

As reported earlier [1], the add_missing_tags() method in remote.c has
quadratic performance. Some of that performance is curbed due to the
generation-number cutoff in in_merge_bases_many(). However, that fix doesn't
help users without a commit-graph, and it can still be painful if that
cutoff is sufficiently low compared to the tags we are using for
reachability testing.

Add a new method in commit-reach.c called get_reachable_subset() which does
a many-to-many reachability test. Starting at the 'from' commits, walk until
the generation is below the smallest generation in the 'to' commits, or all
'to' commits have been discovered. This performs only one commit walk for
the entire add_missing_tags() method, giving linear performance in the worst
case.

Tests are added in t6600-test-reach.sh to ensure get_reachable_subset()
works independently of its application in add_missing_tags().

Thanks, -Stolee

[1] 
https://public-inbox.org/git/CABPp-BECpSOxudovjbDG_3W9wus102RW+E+qPmd4g3Qyd-QDKQ@mail.gmail.com/

Derrick Stolee (3):
  commit-reach: implement get_reachable_subset
  test-reach: test get_reachable_subset
  remote: make add_missing_tags() linear

 commit-reach.c        | 70 +++++++++++++++++++++++++++++++++++++++++++
 commit-reach.h        | 13 ++++++++
 remote.c              | 34 ++++++++++++++++++++-
 t/helper/test-reach.c | 34 ++++++++++++++++++---
 t/t6600-test-reach.sh | 52 ++++++++++++++++++++++++++++++++
 5 files changed, 198 insertions(+), 5 deletions(-)


base-commit: c670b1f876521c9f7cd40184bf7ed05aad843433
Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-60%2Fderrickstolee%2Fadd-missing-tags-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-60/derrickstolee/add-missing-tags-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/60

Range-diff vs v1:

 1:  4c0c5c9143 ! 1:  9e570603bd commit-reach: implement get_reachable_subset
     @@ -49,7 +49,7 @@
      +
      +struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
      +					 struct commit **to, int nr_to,
     -+					 int reachable_flag)
     ++					 unsigned int reachable_flag)
      +{
      +	struct commit **item;
      +	struct commit *current;
     @@ -129,9 +129,12 @@
      + * Return a list of commits containing the commits in the 'to' array
      + * that are reachable from at least one commit in the 'from' array.
      + * Also add the given 'flag' to each of the commits in the returned list.
     ++ *
     ++ * This method uses the PARENT1 and PARENT2 flags during its operation,
     ++ * so be sure these flags are not set before calling the method.
      + */
      +struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
      +					 struct commit **to, int nr_to,
     -+					 int reachable_flag);
     ++					 unsigned int reachable_flag);
      +
       #endif
 2:  382f4f4a5b = 2:  52e847b928 test-reach: test get_reachable_subset
 3:  ecbed3de5c = 3:  dfaceb162f remote: make add_missing_tags() linear

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v2 1/3] commit-reach: implement get_reachable_subset
  2018-11-02 13:14 ` [PATCH v2 " Derrick Stolee via GitGitGadget
@ 2018-11-02 13:14   ` Derrick Stolee via GitGitGadget
  2018-11-02 13:14   ` [PATCH v2 2/3] test-reach: test get_reachable_subset Derrick Stolee via GitGitGadget
  2018-11-02 13:14   ` [PATCH v2 3/3] remote: make add_missing_tags() linear Derrick Stolee via GitGitGadget
  2 siblings, 0 replies; 22+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-11-02 13:14 UTC (permalink / raw)
  To: git; +Cc: peff, newren, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The existing reachability algorithms in commit-reach.c focus on
finding merge-bases or determining if all commits in a set X can
reach at least one commit in a set Y. However, for two commits sets
X and Y, we may also care about which commits in Y are reachable
from at least one commit in X.

Implement get_reachable_subset() which answers this question. Given
two arrays of commits, 'from' and 'to', return a commit_list with
every commit from the 'to' array that is reachable from at least
one commit in the 'from' array.

The algorithm is a simple walk starting at the 'from' commits, using
the PARENT2 flag to indicate "this commit has already been added to
the walk queue". By marking the 'to' commits with the PARENT1 flag,
we can determine when we see a commit from the 'to' array. We remove
the PARENT1 flag as we add that commit to the result list to avoid
duplicates.

The order of the resulting list is a reverse of the order that the
commits are discovered in the walk.

There are a couple shortcuts to avoid walking more than we need:

1. We determine the minimum generation number of commits in the
   'to' array. We do not walk commits with generation number
   below this minimum.

2. We count how many distinct commits are in the 'to' array, and
   decrement this count when we discover a 'to' commit during the
   walk. If this number reaches zero, then we can terminate the
   walk.

Tests will be added using the 'test-tool reach' helper in a
subsequent commit.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-reach.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++
 commit-reach.h | 13 ++++++++++
 2 files changed, 83 insertions(+)

diff --git a/commit-reach.c b/commit-reach.c
index 9f79ce0a22..8ad5352752 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -688,3 +688,73 @@ int can_all_from_reach(struct commit_list *from, struct commit_list *to,
 	object_array_clear(&from_objs);
 	return result;
 }
+
+struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
+					 struct commit **to, int nr_to,
+					 unsigned int reachable_flag)
+{
+	struct commit **item;
+	struct commit *current;
+	struct commit_list *found_commits = NULL;
+	struct commit **to_last = to + nr_to;
+	struct commit **from_last = from + nr_from;
+	uint32_t min_generation = GENERATION_NUMBER_INFINITY;
+	int num_to_find = 0;
+
+	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
+
+	for (item = to; item < to_last; item++) {
+		struct commit *c = *item;
+		
+		parse_commit(c);
+		if (c->generation < min_generation)
+			min_generation = c->generation;
+
+		if (!(c->object.flags & PARENT1)) {
+			c->object.flags |= PARENT1;
+			num_to_find++;
+		}
+	}
+
+	for (item = from; item < from_last; item++) {
+		struct commit *c = *item;
+		if (!(c->object.flags & PARENT2)) {
+			c->object.flags |= PARENT2;
+			parse_commit(c);
+
+			prio_queue_put(&queue, *item);
+		}
+	}
+
+	while (num_to_find && (current = prio_queue_get(&queue)) != NULL) {
+		struct commit_list *parents;
+
+		if (current->object.flags & PARENT1) {
+			current->object.flags &= ~PARENT1;
+			current->object.flags |= reachable_flag;
+			commit_list_insert(current, &found_commits);
+			num_to_find--;
+		}
+
+		for (parents = current->parents; parents; parents = parents->next) {
+			struct commit *p = parents->item;
+
+			parse_commit(p);
+
+			if (p->generation < min_generation)
+				continue;
+
+			if (p->object.flags & PARENT2)
+				continue;
+
+			p->object.flags |= PARENT2;
+			prio_queue_put(&queue, p);
+		}
+	}
+
+	clear_commit_marks_many(nr_to, to, PARENT1);
+	clear_commit_marks_many(nr_from, from, PARENT2);
+
+	return found_commits;
+}
+
diff --git a/commit-reach.h b/commit-reach.h
index 7d313e2975..bb34af0269 100644
--- a/commit-reach.h
+++ b/commit-reach.h
@@ -74,4 +74,17 @@ int can_all_from_reach_with_flag(struct object_array *from,
 int can_all_from_reach(struct commit_list *from, struct commit_list *to,
 		       int commit_date_cutoff);
 
+
+/*
+ * Return a list of commits containing the commits in the 'to' array
+ * that are reachable from at least one commit in the 'from' array.
+ * Also add the given 'flag' to each of the commits in the returned list.
+ *
+ * This method uses the PARENT1 and PARENT2 flags during its operation,
+ * so be sure these flags are not set before calling the method.
+ */
+struct commit_list *get_reachable_subset(struct commit **from, int nr_from,
+					 struct commit **to, int nr_to,
+					 unsigned int reachable_flag);
+
 #endif
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v2 2/3] test-reach: test get_reachable_subset
  2018-11-02 13:14 ` [PATCH v2 " Derrick Stolee via GitGitGadget
  2018-11-02 13:14   ` [PATCH v2 1/3] commit-reach: implement get_reachable_subset Derrick Stolee via GitGitGadget
@ 2018-11-02 13:14   ` Derrick Stolee via GitGitGadget
  2018-11-02 13:14   ` [PATCH v2 3/3] remote: make add_missing_tags() linear Derrick Stolee via GitGitGadget
  2 siblings, 0 replies; 22+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-11-02 13:14 UTC (permalink / raw)
  To: git; +Cc: peff, newren, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The get_reachable_subset() method returns the list of commits in
the 'to' array that are reachable from at least one commit in the
'from' array. Add tests that check this method works in a few
cases:

1. All commits in the 'to' list are reachable. This exercises the
   early-termination condition.

2. Some commits in the 'to' list are reachable. This exercises the
   loop-termination condition.

3. No commits in the 'to' list are reachable. This exercises the
   NULL return condition.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 t/helper/test-reach.c | 34 ++++++++++++++++++++++++----
 t/t6600-test-reach.sh | 52 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 82 insertions(+), 4 deletions(-)

diff --git a/t/helper/test-reach.c b/t/helper/test-reach.c
index 08d2ea68e8..a0272178b7 100644
--- a/t/helper/test-reach.c
+++ b/t/helper/test-reach.c
@@ -32,8 +32,8 @@ int cmd__reach(int ac, const char **av)
 	struct commit *A, *B;
 	struct commit_list *X, *Y;
 	struct object_array X_obj = OBJECT_ARRAY_INIT;
-	struct commit **X_array;
-	int X_nr, X_alloc;
+	struct commit **X_array, **Y_array;
+	int X_nr, X_alloc, Y_nr, Y_alloc;
 	struct strbuf buf = STRBUF_INIT;
 	struct repository *r = the_repository;
 
@@ -44,9 +44,10 @@ int cmd__reach(int ac, const char **av)
 
 	A = B = NULL;
 	X = Y = NULL;
-	X_nr = 0;
-	X_alloc = 16;
+	X_nr = Y_nr = 0;
+	X_alloc = Y_alloc = 16;
 	ALLOC_ARRAY(X_array, X_alloc);
+	ALLOC_ARRAY(Y_array, Y_alloc);
 
 	while (strbuf_getline(&buf, stdin) != EOF) {
 		struct object_id oid;
@@ -92,6 +93,8 @@ int cmd__reach(int ac, const char **av)
 
 			case 'Y':
 				commit_list_insert(c, &Y);
+				ALLOC_GROW(Y_array, Y_nr + 1, Y_alloc);
+				Y_array[Y_nr++] = c;
 				break;
 
 			default:
@@ -136,6 +139,29 @@ int cmd__reach(int ac, const char **av)
 			filter.with_commit_tag_algo = 0;
 
 		printf("%s(_,A,X,_):%d\n", av[1], commit_contains(&filter, A, X, &cache));
+	} else if (!strcmp(av[1], "get_reachable_subset")) {
+		const int reachable_flag = 1;
+		int i, count = 0;
+		struct commit_list *current;
+		struct commit_list *list = get_reachable_subset(X_array, X_nr,
+								Y_array, Y_nr,
+								reachable_flag);
+		printf("get_reachable_subset(X,Y)\n");
+		for (current = list; current; current = current->next) {
+			if (!(list->item->object.flags & reachable_flag))
+				die(_("commit %s is not marked reachable"),
+				    oid_to_hex(&list->item->object.oid));
+			count++;
+		}
+		for (i = 0; i < Y_nr; i++) {
+			if (Y_array[i]->object.flags & reachable_flag)
+				count--;
+		}
+
+		if (count < 0)
+			die(_("too many commits marked reachable"));
+
+		print_sorted_commit_ids(list);
 	}
 
 	exit(0);
diff --git a/t/t6600-test-reach.sh b/t/t6600-test-reach.sh
index ae94b27f70..a0c64e617a 100755
--- a/t/t6600-test-reach.sh
+++ b/t/t6600-test-reach.sh
@@ -265,4 +265,56 @@ test_expect_success 'commit_contains:miss' '
 	test_three_modes commit_contains --tag
 '
 
+test_expect_success 'get_reachable_subset:all' '
+	cat >input <<-\EOF &&
+	X:commit-9-1
+	X:commit-8-3
+	X:commit-7-5
+	X:commit-6-6
+	X:commit-1-7
+	Y:commit-3-3
+	Y:commit-1-7
+	Y:commit-5-6
+	EOF
+	(
+		echo "get_reachable_subset(X,Y)" &&
+		git rev-parse commit-3-3 \
+			      commit-1-7 \
+			      commit-5-6 | sort
+	) >expect &&
+	test_three_modes get_reachable_subset
+'
+
+test_expect_success 'get_reachable_subset:some' '
+	cat >input <<-\EOF &&
+	X:commit-9-1
+	X:commit-8-3
+	X:commit-7-5
+	X:commit-1-7
+	Y:commit-3-3
+	Y:commit-1-7
+	Y:commit-5-6
+	EOF
+	(
+		echo "get_reachable_subset(X,Y)" &&
+		git rev-parse commit-3-3 \
+			      commit-1-7 | sort
+	) >expect &&
+	test_three_modes get_reachable_subset
+'
+
+test_expect_success 'get_reachable_subset:none' '
+	cat >input <<-\EOF &&
+	X:commit-9-1
+	X:commit-8-3
+	X:commit-7-5
+	X:commit-1-7
+	Y:commit-9-3
+	Y:commit-7-6
+	Y:commit-2-8
+	EOF
+	echo "get_reachable_subset(X,Y)" >expect &&
+	test_three_modes get_reachable_subset
+'
+
 test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v2 3/3] remote: make add_missing_tags() linear
  2018-11-02 13:14 ` [PATCH v2 " Derrick Stolee via GitGitGadget
  2018-11-02 13:14   ` [PATCH v2 1/3] commit-reach: implement get_reachable_subset Derrick Stolee via GitGitGadget
  2018-11-02 13:14   ` [PATCH v2 2/3] test-reach: test get_reachable_subset Derrick Stolee via GitGitGadget
@ 2018-11-02 13:14   ` Derrick Stolee via GitGitGadget
  2 siblings, 0 replies; 22+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-11-02 13:14 UTC (permalink / raw)
  To: git; +Cc: peff, newren, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The add_missing_tags() method currently has quadratic behavior.
This is due to a linear number (based on number of tags T) of
calls to in_merge_bases_many, which has linear performance (based
on number of commits C in the repository).

Replace this O(T * C) algorithm with an O(T + C) algorithm by
using get_reachable_subset(). We ignore the return list and focus
instead on the reachable_flag assigned to the commits we care
about, because we need to interact with the tag ref and not just
the commit object.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 remote.c | 34 +++++++++++++++++++++++++++++++++-
 1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/remote.c b/remote.c
index 81f4f01b00..b850f2feb3 100644
--- a/remote.c
+++ b/remote.c
@@ -1205,9 +1205,36 @@ static void add_missing_tags(struct ref *src, struct ref **dst, struct ref ***ds
 	 * sent to the other side.
 	 */
 	if (sent_tips.nr) {
+		const int reachable_flag = 1;
+		struct commit_list *found_commits;
+		struct commit **src_commits;
+		int nr_src_commits = 0, alloc_src_commits = 16;
+		ALLOC_ARRAY(src_commits, alloc_src_commits);
+
 		for_each_string_list_item(item, &src_tag) {
 			struct ref *ref = item->util;
+			struct commit *commit;
+
+			if (is_null_oid(&ref->new_oid))
+				continue;
+			commit = lookup_commit_reference_gently(the_repository,
+								&ref->new_oid,
+								1);
+			if (!commit)
+				/* not pushing a commit, which is not an error */
+				continue;
+
+			ALLOC_GROW(src_commits, nr_src_commits + 1, alloc_src_commits);
+			src_commits[nr_src_commits++] = commit;
+		}
+
+		found_commits = get_reachable_subset(sent_tips.tip, sent_tips.nr,
+						     src_commits, nr_src_commits,
+						     reachable_flag);
+
+		for_each_string_list_item(item, &src_tag) {
 			struct ref *dst_ref;
+			struct ref *ref = item->util;
 			struct commit *commit;
 
 			if (is_null_oid(&ref->new_oid))
@@ -1223,7 +1250,7 @@ static void add_missing_tags(struct ref *src, struct ref **dst, struct ref ***ds
 			 * Is this tag, which they do not have, reachable from
 			 * any of the commits we are sending?
 			 */
-			if (!in_merge_bases_many(commit, sent_tips.nr, sent_tips.tip))
+			if (!(commit->object.flags & reachable_flag))
 				continue;
 
 			/* Add it in */
@@ -1231,7 +1258,12 @@ static void add_missing_tags(struct ref *src, struct ref **dst, struct ref ***ds
 			oidcpy(&dst_ref->new_oid, &ref->new_oid);
 			dst_ref->peer_ref = copy_ref(ref);
 		}
+
+		clear_commit_marks_many(nr_src_commits, src_commits, reachable_flag);
+		free(src_commits);
+		free_commit_list(found_commits);
 	}
+
 	string_list_clear(&src_tag, 0);
 	free(sent_tips.tip);
 }
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/3] Make add_missing_tags() linear
  2018-11-01 19:02           ` Derrick Stolee
@ 2018-11-02 14:58             ` Elijah Newren
  2018-11-02 15:38               ` Derrick Stolee
  0 siblings, 1 reply; 22+ messages in thread
From: Elijah Newren @ 2018-11-02 14:58 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Git Mailing List, Junio C Hamano, Jeff King, Derrick Stolee

On Thu, Nov 1, 2018 at 12:02 PM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 11/1/2018 2:57 PM, Elijah Newren wrote:
> > On Thu, Nov 1, 2018 at 5:32 AM Derrick Stolee <stolee@gmail.com> wrote:
> >> No rush. I'd just like to understand how removing the commit-graph file
> >> can make the new algorithm faster. Putting a similar count in the old
> >> algorithm would involve giving a count for every call to
> >> in_merge_bases_many(), which would be very noisy.
> > $ time git push --dry-run --follow-tags /home/newren/repo-mirror
> > count: 92912
> > To /home/newren/repo-mirror
> >   * [new branch]              test5 -> test5
> >
> > real    0m3.024s
> > user    0m2.752s
> > sys    0m0.320s
>
> Is the above test with or without the commit-graph file? Can you run it
> in the other mode, too? I'd like to see if the "count" value changes
> when the only difference is the presence of a commit-graph file.

I apologize for providing misleading information earlier; this was an
apples to oranges comparison.  Here's what I did:

<build a version of git with your fixes>
git clone coworker.bundle coworker-repo
cd coworker-repo
time git push --dry-run --follow-tags /home/newren/repo-mirror
git config core.commitgraph true
git config gc.writecommitgraph true
git gc
time git push --dry-run --follow-tags /home/newren/nucleus-mirror


I figured I had just done a fresh clone, so surely the gc wouldn't do
anything other than write the .git/objects/info/commit-graph file.
However, the original bundle contained many references outside of
refs/heads/ and refs/tags/:

$ git bundle list-heads ../coworker.bundle | grep -v -e refs/heads/ -e
refs/tags/ -e HEAD | wc -l
2396

These other refs apparently referred to objects not otherwise
referenced in refs/heads/ and refs/tags/, and caused the gc to explode
lots of loose objects:
$ git count-objects -v
count: 147604
size: 856416
in-pack: 1180692
packs: 1
size-pack: 796143
prune-packable: 0
garbage: 0
size-garbage: 0


The slowdown with commit-graph was entirely due to there being lots of
loose objects (147K of them).  If I add a git-prune before doing the
timing with commit-graph, then the timing with commit-graph is faster
than the run without a commit-graph.

Sorry for the wild goose chase.

And thanks for the fixes; get_reachable_subset() makes things much
faster even without a commit-graph, and the commit-graph just improves
it more.  :-)

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 0/3] Make add_missing_tags() linear
  2018-11-02 14:58             ` Elijah Newren
@ 2018-11-02 15:38               ` Derrick Stolee
  0 siblings, 0 replies; 22+ messages in thread
From: Derrick Stolee @ 2018-11-02 15:38 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Git Mailing List, Junio C Hamano, Jeff King, Derrick Stolee

On 11/2/2018 10:58 AM, Elijah Newren wrote:
> On Thu, Nov 1, 2018 at 12:02 PM Derrick Stolee <stolee@gmail.com> wrote:
>> On 11/1/2018 2:57 PM, Elijah Newren wrote:
>>> On Thu, Nov 1, 2018 at 5:32 AM Derrick Stolee <stolee@gmail.com> wrote:
>>>> No rush. I'd just like to understand how removing the commit-graph file
>>>> can make the new algorithm faster. Putting a similar count in the old
>>>> algorithm would involve giving a count for every call to
>>>> in_merge_bases_many(), which would be very noisy.
>>> $ time git push --dry-run --follow-tags /home/newren/repo-mirror
>>> count: 92912
>>> To /home/newren/repo-mirror
>>>    * [new branch]              test5 -> test5
>>>
>>> real    0m3.024s
>>> user    0m2.752s
>>> sys    0m0.320s
>> Is the above test with or without the commit-graph file? Can you run it
>> in the other mode, too? I'd like to see if the "count" value changes
>> when the only difference is the presence of a commit-graph file.
> I apologize for providing misleading information earlier; this was an
> apples to oranges comparison.  Here's what I did:
>
> <build a version of git with your fixes>
> git clone coworker.bundle coworker-repo
> cd coworker-repo
> time git push --dry-run --follow-tags /home/newren/repo-mirror
> git config core.commitgraph true
> git config gc.writecommitgraph true
> git gc
> time git push --dry-run --follow-tags /home/newren/nucleus-mirror
>
>
> I figured I had just done a fresh clone, so surely the gc wouldn't do
> anything other than write the .git/objects/info/commit-graph file.
> However, the original bundle contained many references outside of
> refs/heads/ and refs/tags/:
>
> $ git bundle list-heads ../coworker.bundle | grep -v -e refs/heads/ -e
> refs/tags/ -e HEAD | wc -l
> 2396
>
> These other refs apparently referred to objects not otherwise
> referenced in refs/heads/ and refs/tags/, and caused the gc to explode
> lots of loose objects:
> $ git count-objects -v
> count: 147604
> size: 856416
> in-pack: 1180692
> packs: 1
> size-pack: 796143
> prune-packable: 0
> garbage: 0
> size-garbage: 0
>
>
> The slowdown with commit-graph was entirely due to there being lots of
> loose objects (147K of them).  If I add a git-prune before doing the
> timing with commit-graph, then the timing with commit-graph is faster
> than the run without a commit-graph.
>
> Sorry for the wild goose chase.
>
> And thanks for the fixes; get_reachable_subset() makes things much
> faster even without a commit-graph, and the commit-graph just improves
> it more.  :-)


Thanks for double-checking! It's good to have confidence that this is a good

algorithm to use.


Thanks,

-Stolee


^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2018-11-02 15:39 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-30 14:16 [PATCH 0/3] Make add_missing_tags() linear Derrick Stolee via GitGitGadget
2018-10-30 14:16 ` [PATCH 1/3] commit-reach: implement get_reachable_subset Derrick Stolee via GitGitGadget
2018-10-31  3:35   ` Junio C Hamano
2018-10-31 12:01     ` Derrick Stolee
2018-11-02  1:51       ` Junio C Hamano
2018-10-31  6:07   ` Elijah Newren
2018-10-31 11:54     ` Derrick Stolee
2018-10-30 14:16 ` [PATCH 2/3] test-reach: test get_reachable_subset Derrick Stolee via GitGitGadget
2018-10-30 14:16 ` [PATCH 3/3] remote: make add_missing_tags() linear Derrick Stolee via GitGitGadget
2018-10-31  3:05 ` [PATCH 0/3] Make " Junio C Hamano
2018-10-31  6:04 ` Elijah Newren
2018-10-31 12:05   ` Derrick Stolee
2018-11-01  6:52     ` Elijah Newren
2018-11-01 12:32       ` Derrick Stolee
2018-11-01 18:57         ` Elijah Newren
2018-11-01 19:02           ` Derrick Stolee
2018-11-02 14:58             ` Elijah Newren
2018-11-02 15:38               ` Derrick Stolee
2018-11-02 13:14 ` [PATCH v2 " Derrick Stolee via GitGitGadget
2018-11-02 13:14   ` [PATCH v2 1/3] commit-reach: implement get_reachable_subset Derrick Stolee via GitGitGadget
2018-11-02 13:14   ` [PATCH v2 2/3] test-reach: test get_reachable_subset Derrick Stolee via GitGitGadget
2018-11-02 13:14   ` [PATCH v2 3/3] remote: make add_missing_tags() linear Derrick Stolee via GitGitGadget

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).