git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH 0/3] Speed up remove_redundant()
@ 2021-01-28 16:24 Derrick Stolee via GitGitGadget
  2021-01-28 16:24 ` [PATCH 1/3] commit-reach: use one walk in remove_redundant() Derrick Stolee via GitGitGadget
                   ` (4 more replies)
  0 siblings, 5 replies; 36+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-28 16:24 UTC (permalink / raw)
  To: git; +Cc: Michael Haggerty, me, peff, gitster, Derrick Stolee

Michael Haggerty pointed out that git merge-base --independent is quite slow
when many commits are provided in the input. This boils down to some
quadratic behavior in remove_redundant() because it calls
paint_down_to_common() in a loop.

This series avoids that by using a single commit walk, pushing a flag that
means "this commit is reachable from a parent of a commit in the list." At
the end of the walk, the input commits without that flag are independent.

There is an extra performance enhancement using the first-parent history as
a heuristic. This helps only when there is a single independent result, as
we can short circuit the walk when all other inputs are found to be
reachable from the top one.

My tests were on the Linux kernel repository, testing this command:

  git merge-base --independent $(git for-each-ref refs/tags --format="%(refname)")

       Before: 16.4s
After Patch 1:  1.1s
After Patch 2:  0.1s

This does not change the correctness of the function. It is tested carefully
in t6600-test-reach.c, among existing merge-base tests.

Thanks, -Stolee

Derrick Stolee (3):
  commit-reach: use one walk in remove_redundant()
  commit-reach: move compare_commits_by_gen
  commit-reach: use heuristic in remove_redundant()

 commit-reach.c | 178 +++++++++++++++++++++++++++++++++----------------
 1 file changed, 120 insertions(+), 58 deletions(-)


base-commit: 5a3b130cad0d5c770f766e3af6d32b41766374c0
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-852%2Fderrickstolee%2Fmerge-independent-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-852/derrickstolee/merge-independent-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/852
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH 1/3] commit-reach: use one walk in remove_redundant()
  2021-01-28 16:24 [PATCH 0/3] Speed up remove_redundant() Derrick Stolee via GitGitGadget
@ 2021-01-28 16:24 ` Derrick Stolee via GitGitGadget
  2021-01-28 20:51   ` Junio C Hamano
  2021-01-29 17:10   ` René Scharfe
  2021-01-28 16:24 ` [PATCH 2/3] commit-reach: move compare_commits_by_gen Derrick Stolee via GitGitGadget
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 36+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-28 16:24 UTC (permalink / raw)
  To: git; +Cc: Michael Haggerty, me, peff, gitster, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The current implementation of remove_redundant() uses several calls to
paint_down_to_common() to determine that commits are independent of each
other. This leads to quadratic behavior when many inputs are passed to
commands such as 'git merge-base'.

For example, in the Linux kernel repository, I tested the performance
by passing all tags:

 git merge-base --independent $(git for-each-ref refs/tags --format="$(refname)")

(Note: I had to delete the tags v2.6.11-tree and v2.6.11 as they do
not point to commits.)

Here is the performance improvement introduced by this change:

 Before: 16.4s
  After:  1.1s

The basic approach is to do one commit walk instead of many. First, scan
all commits in the list and mark their _parents_ with the STALE flag.
This flag will indicate commits that are reachable from one of the
inputs, except not including themselves. Then, walk commits until
covering all commits up to the minimum generation number pushing the
STALE flag throughout.

At the end of the walk, commits in the input list that have the STALE
flag are reachable from a _different_ commit in the list. These should
be moved to the end of the array while the others are shifted to the
front.

This logic is covered by tests in t6600-test-reach.sh, so the behavior
does not change.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-reach.c | 108 +++++++++++++++++++++++++++++--------------------
 1 file changed, 65 insertions(+), 43 deletions(-)

diff --git a/commit-reach.c b/commit-reach.c
index e38771ca5a1..677f6f7c3f3 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -164,58 +164,80 @@ static int remove_redundant(struct repository *r, struct commit **array, int cnt
 	 * the array, and return the number of commits that
 	 * are independent from each other.
 	 */
-	struct commit **work;
-	unsigned char *redundant;
-	int *filled_index;
-	int i, j, filled;
+	int i, count_non_stale = 0;
+	timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
+	struct commit **dup;
+	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
 
-	work = xcalloc(cnt, sizeof(*work));
-	redundant = xcalloc(cnt, 1);
-	ALLOC_ARRAY(filled_index, cnt - 1);
+	/* Mark all parents of the input as STALE */
+	for (i = 0; i < cnt; i++) {
+		struct commit_list *parents;
+		timestamp_t generation;
 
-	for (i = 0; i < cnt; i++)
 		repo_parse_commit(r, array[i]);
-	for (i = 0; i < cnt; i++) {
-		struct commit_list *common;
-		timestamp_t min_generation = commit_graph_generation(array[i]);
+		parents = array[i]->parents;
+
+		while (parents) {
+			repo_parse_commit(r, parents->item);
+			if (!(parents->item->object.flags & STALE)) {
+				parents->item->object.flags |= STALE;
+				prio_queue_put(&queue, parents->item);
+			}
+			parents = parents->next;
+		}
+
+		generation = commit_graph_generation(array[i]);
+
+		if (generation < min_generation)
+			min_generation = generation;
+	}
+
+	/* push the STALE bits up to min generation */
+	while (queue.nr) {
+		struct commit_list *parents;
+		struct commit *c = prio_queue_get(&queue);
+
+		repo_parse_commit(r, c);
 
-		if (redundant[i])
+		if (commit_graph_generation(c) < min_generation)
 			continue;
-		for (j = filled = 0; j < cnt; j++) {
-			timestamp_t curr_generation;
-			if (i == j || redundant[j])
-				continue;
-			filled_index[filled] = j;
-			work[filled++] = array[j];
 
-			curr_generation = commit_graph_generation(array[j]);
-			if (curr_generation < min_generation)
-				min_generation = curr_generation;
+		parents = c->parents;
+		while (parents) {
+			if (!(parents->item->object.flags & STALE)) {
+				parents->item->object.flags |= STALE;
+				prio_queue_put(&queue, parents->item);
+			}
+			parents = parents->next;
+		}
+	}
+
+	/* rearrange array */
+	dup = xcalloc(cnt, sizeof(struct commit *));
+	COPY_ARRAY(dup, array, cnt);
+	for (i = 0; i < cnt; i++) {
+		if (dup[i]->object.flags & STALE) {
+			int insert = cnt - 1 - (i - count_non_stale);
+			array[insert] = dup[i];
+		} else {
+			array[count_non_stale] = dup[i];
+			count_non_stale++;
+		}
+	}
+	free(dup);
+
+	/* clear marks */
+	for (i = 0; i < cnt; i++) {
+		struct commit_list *parents;
+		parents = array[i]->parents;
+
+		while (parents) {
+			clear_commit_marks(parents->item, STALE);
+			parents = parents->next;
 		}
-		common = paint_down_to_common(r, array[i], filled,
-					      work, min_generation);
-		if (array[i]->object.flags & PARENT2)
-			redundant[i] = 1;
-		for (j = 0; j < filled; j++)
-			if (work[j]->object.flags & PARENT1)
-				redundant[filled_index[j]] = 1;
-		clear_commit_marks(array[i], all_flags);
-		clear_commit_marks_many(filled, work, all_flags);
-		free_commit_list(common);
 	}
 
-	/* Now collect the result */
-	COPY_ARRAY(work, array, cnt);
-	for (i = filled = 0; i < cnt; i++)
-		if (!redundant[i])
-			array[filled++] = work[i];
-	for (j = filled, i = 0; i < cnt; i++)
-		if (redundant[i])
-			array[j++] = work[i];
-	free(work);
-	free(redundant);
-	free(filled_index);
-	return filled;
+	return count_non_stale;
 }
 
 static struct commit_list *get_merge_bases_many_0(struct repository *r,
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 2/3] commit-reach: move compare_commits_by_gen
  2021-01-28 16:24 [PATCH 0/3] Speed up remove_redundant() Derrick Stolee via GitGitGadget
  2021-01-28 16:24 ` [PATCH 1/3] commit-reach: use one walk in remove_redundant() Derrick Stolee via GitGitGadget
@ 2021-01-28 16:24 ` Derrick Stolee via GitGitGadget
  2021-01-28 16:24 ` [PATCH 3/3] commit-reach: use heuristic in remove_redundant() Derrick Stolee via GitGitGadget
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 36+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-28 16:24 UTC (permalink / raw)
  To: git; +Cc: Michael Haggerty, me, peff, gitster, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Move this earlier in the file so it can be used by more methods.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-reach.c | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/commit-reach.c b/commit-reach.c
index 677f6f7c3f3..783c604a405 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -17,6 +17,21 @@
 
 static const unsigned all_flags = (PARENT1 | PARENT2 | STALE | RESULT);
 
+static int compare_commits_by_gen(const void *_a, const void *_b)
+{
+	const struct commit *a = *(const struct commit * const *)_a;
+	const struct commit *b = *(const struct commit * const *)_b;
+
+	timestamp_t generation_a = commit_graph_generation(a);
+	timestamp_t generation_b = commit_graph_generation(b);
+
+	if (generation_a < generation_b)
+		return -1;
+	if (generation_a > generation_b)
+		return 1;
+	return 0;
+}
+
 static int queue_has_nonstale(struct prio_queue *queue)
 {
 	int i;
@@ -583,21 +598,6 @@ int commit_contains(struct ref_filter *filter, struct commit *commit,
 	return repo_is_descendant_of(the_repository, commit, list);
 }
 
-static int compare_commits_by_gen(const void *_a, const void *_b)
-{
-	const struct commit *a = *(const struct commit * const *)_a;
-	const struct commit *b = *(const struct commit * const *)_b;
-
-	timestamp_t generation_a = commit_graph_generation(a);
-	timestamp_t generation_b = commit_graph_generation(b);
-
-	if (generation_a < generation_b)
-		return -1;
-	if (generation_a > generation_b)
-		return 1;
-	return 0;
-}
-
 int can_all_from_reach_with_flag(struct object_array *from,
 				 unsigned int with_flag,
 				 unsigned int assign_flag,
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 3/3] commit-reach: use heuristic in remove_redundant()
  2021-01-28 16:24 [PATCH 0/3] Speed up remove_redundant() Derrick Stolee via GitGitGadget
  2021-01-28 16:24 ` [PATCH 1/3] commit-reach: use one walk in remove_redundant() Derrick Stolee via GitGitGadget
  2021-01-28 16:24 ` [PATCH 2/3] commit-reach: move compare_commits_by_gen Derrick Stolee via GitGitGadget
@ 2021-01-28 16:24 ` Derrick Stolee via GitGitGadget
  2021-01-28 20:20 ` [PATCH 0/3] Speed up remove_redundant() Junio C Hamano
  2021-02-01 12:47 ` [PATCH v2 0/5] " Derrick Stolee via GitGitGadget
  4 siblings, 0 replies; 36+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-01-28 16:24 UTC (permalink / raw)
  To: git; +Cc: Michael Haggerty, me, peff, gitster, Derrick Stolee,
	Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Reachability algorithms in commit-reach.c frequently benefit from using
the first-parent history as a heuristic for satisfying reachability
queries. The most obvious example was implemented in 4fbcca4e
(commit-reach: make can_all_from_reach... linear, 2018-07-20).

Update the walk in remove_redundant() to use this same heuristic. Here,
we are walking starting at the parents of the input commits. Sort those
parents and walk from the highest generation to lower. Each time, use
the heuristic of searching the first parent history before continuing to
expand the walk.

The important piece is to ensure we short-circuit the walk when we find
that there is a single non-redundant commit. This happens frequently
when looking for merge-bases or comparing several tags with 'git
merge-base --independent'. Use a new count 'count_still_independent' and
if that hits 1 we can stop walking.

To update 'count_still_independent' properly, we add use of the RESULT
flag on the input commits. Then we can detect when we reach one of these
commits and decrease the count. We need to remove the RESULT flag at
that moment because we might re-visit that commit when popping the
stack.

We use the STALE flag to mark parents that have been added to the new
walk_start list, but we need to clear that flag before we start walking
so those flags don't halt our depth-first-search walk.

On my copy of the Linux kernel repository, the performance of 'git
merge-base --independent <all-tags>' goes from 1.1 seconds to 0.11
seconds.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-reach.c | 72 +++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 56 insertions(+), 16 deletions(-)

diff --git a/commit-reach.c b/commit-reach.c
index 783c604a405..6032624282e 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -179,10 +179,13 @@ static int remove_redundant(struct repository *r, struct commit **array, int cnt
 	 * the array, and return the number of commits that
 	 * are independent from each other.
 	 */
-	int i, count_non_stale = 0;
+	int i, count_non_stale = 0, count_still_independent = cnt;
 	timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
 	struct commit **dup;
-	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
+	struct commit **walk_start;
+	size_t walk_start_nr = 0, walk_start_alloc = cnt;
+
+	ALLOC_ARRAY(walk_start, walk_start_alloc);
 
 	/* Mark all parents of the input as STALE */
 	for (i = 0; i < cnt; i++) {
@@ -190,13 +193,15 @@ static int remove_redundant(struct repository *r, struct commit **array, int cnt
 		timestamp_t generation;
 
 		repo_parse_commit(r, array[i]);
+		array[i]->object.flags |= RESULT;
 		parents = array[i]->parents;
 
 		while (parents) {
 			repo_parse_commit(r, parents->item);
 			if (!(parents->item->object.flags & STALE)) {
 				parents->item->object.flags |= STALE;
-				prio_queue_put(&queue, parents->item);
+				ALLOC_GROW(walk_start, walk_start_nr + 1, walk_start_alloc);
+				walk_start[walk_start_nr++] = parents->item;
 			}
 			parents = parents->next;
 		}
@@ -207,30 +212,65 @@ static int remove_redundant(struct repository *r, struct commit **array, int cnt
 			min_generation = generation;
 	}
 
-	/* push the STALE bits up to min generation */
-	while (queue.nr) {
-		struct commit_list *parents;
-		struct commit *c = prio_queue_get(&queue);
+	QSORT(walk_start, walk_start_nr, compare_commits_by_gen);
 
-		repo_parse_commit(r, c);
+	/* remove STALE bit for now to allow walking through parents */
+	for (i = 0; i < walk_start_nr; i++)
+		walk_start[i]->object.flags &= ~STALE;
 
-		if (commit_graph_generation(c) < min_generation)
-			continue;
+	/*
+	 * Start walking from the highest generation. Hopefully, it will
+	 * find all other items during the first-parent walk, and we can
+	 * terminate early. Otherwise, we will do the same amount of work
+	 * as before.
+	 */
+	for (i = walk_start_nr - 1; i >= 0 && count_still_independent > 1; i--) {
+		/* push the STALE bits up to min generation */
+		struct commit_list *stack = NULL;
 
-		parents = c->parents;
-		while (parents) {
-			if (!(parents->item->object.flags & STALE)) {
-				parents->item->object.flags |= STALE;
-				prio_queue_put(&queue, parents->item);
+		commit_list_insert(walk_start[i], &stack);
+		walk_start[i]->object.flags |= STALE;
+
+		while (stack) {
+			struct commit_list *parents;
+			struct commit *c = stack->item;
+
+			repo_parse_commit(r, c);
+
+			if (c->object.flags & RESULT) {
+				c->object.flags &= ~RESULT;
+				if (--count_still_independent <= 1)
+					break;
 			}
-			parents = parents->next;
+
+			if (commit_graph_generation(c) < min_generation) {
+				pop_commit(&stack);
+				continue;
+			}
+
+			parents = c->parents;
+			while (parents) {
+				if (!(parents->item->object.flags & STALE)) {
+					parents->item->object.flags |= STALE;
+					commit_list_insert(parents->item, &stack);
+					break;
+				}
+				parents = parents->next;
+			}
+
+			/* pop if all parents have been visited already */
+			if (!parents)
+				pop_commit(&stack);
 		}
+		free_commit_list(stack);
 	}
+	free(walk_start);
 
 	/* rearrange array */
 	dup = xcalloc(cnt, sizeof(struct commit *));
 	COPY_ARRAY(dup, array, cnt);
 	for (i = 0; i < cnt; i++) {
+		dup[i]->object.flags &= ~RESULT;
 		if (dup[i]->object.flags & STALE) {
 			int insert = cnt - 1 - (i - count_non_stale);
 			array[insert] = dup[i];
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/3] Speed up remove_redundant()
  2021-01-28 16:24 [PATCH 0/3] Speed up remove_redundant() Derrick Stolee via GitGitGadget
                   ` (2 preceding siblings ...)
  2021-01-28 16:24 ` [PATCH 3/3] commit-reach: use heuristic in remove_redundant() Derrick Stolee via GitGitGadget
@ 2021-01-28 20:20 ` Junio C Hamano
  2021-02-01 12:47 ` [PATCH v2 0/5] " Derrick Stolee via GitGitGadget
  4 siblings, 0 replies; 36+ messages in thread
From: Junio C Hamano @ 2021-01-28 20:20 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, Michael Haggerty, me, peff, gitster, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> Michael Haggerty pointed out that git merge-base --independent is quite slow
> when many commits are provided in the input. This boils down to some
> quadratic behavior in remove_redundant() because it calls
> paint_down_to_common() in a loop.
>
> This series avoids that by using a single commit walk, pushing a flag that
> means "this commit is reachable from a parent of a commit in the list." At
> the end of the walk, the input commits without that flag are independent.

Yay.  I've been quite unhappy with the loop ever since I wrote it
back in 2012.

Thanks for cleaning it up.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/3] commit-reach: use one walk in remove_redundant()
  2021-01-28 16:24 ` [PATCH 1/3] commit-reach: use one walk in remove_redundant() Derrick Stolee via GitGitGadget
@ 2021-01-28 20:51   ` Junio C Hamano
  2021-01-29 17:11     ` René Scharfe
  2021-01-31  3:59     ` Derrick Stolee
  2021-01-29 17:10   ` René Scharfe
  1 sibling, 2 replies; 36+ messages in thread
From: Junio C Hamano @ 2021-01-28 20:51 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, Michael Haggerty, me, peff, Derrick Stolee, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +	/* Mark all parents of the input as STALE */
> +	for (i = 0; i < cnt; i++) {
> +		struct commit_list *parents;
> +		timestamp_t generation;
>  
>  		repo_parse_commit(r, array[i]);
> +		parents = array[i]->parents;
> +
> +		while (parents) {
> +			repo_parse_commit(r, parents->item);
> +			if (!(parents->item->object.flags & STALE)) {
> +				parents->item->object.flags |= STALE;
> +				prio_queue_put(&queue, parents->item);
> +			}
> +			parents = parents->next;
> +		}
> +
> +		generation = commit_graph_generation(array[i]);
> +
> +		if (generation < min_generation)
> +			min_generation = generation;
> +	}
> +
> +	/* push the STALE bits up to min generation */
> +	while (queue.nr) {
> +		struct commit_list *parents;
> +		struct commit *c = prio_queue_get(&queue);
> +
> +		repo_parse_commit(r, c);
>  
> +		if (commit_graph_generation(c) < min_generation)
>  			continue;
>  
> +		parents = c->parents;
> +		while (parents) {
> +			if (!(parents->item->object.flags & STALE)) {
> +				parents->item->object.flags |= STALE;
> +				prio_queue_put(&queue, parents->item);
> +			}
> +			parents = parents->next;
> +		}
> +	}

So, the inner loop makes sure we won't revisit STALE parent, but
keep digging parents we haven't seen, and stop when the generation
is old enough.  What happens when there is no generation number
computed yet, I wonder...  We'll keep getting infinity and dig all
the way down to root?

> +	/* rearrange array */
> +	dup = xcalloc(cnt, sizeof(struct commit *));
> +	COPY_ARRAY(dup, array, cnt);
> +	for (i = 0; i < cnt; i++) {
> +		if (dup[i]->object.flags & STALE) {
> +			int insert = cnt - 1 - (i - count_non_stale);
> +			array[insert] = dup[i];
> +		} else {
> +			array[count_non_stale] = dup[i];
> +			count_non_stale++;
> +		}
> +	}
> +	free(dup);

The "fill stale ones from the end, non-stale ones from the
beginning" in the loop looks unnecessarily complex to me.  I wonder
if we can do only the "fill non-stale ones from the beginning" half,
i.e.

	for (i = count_non_stale = 0; i < cnt; i++) {
		if (dup[i] is not stale)
			array[count_non_stale++] = dup[i];
	}

without the "keep the stale one at the end of array[]", and clear
marks using what is in dup[] as starting points before discarding
dup[]?

Or do the callers still look at the entries beyond count_non_stale?

Other than that, nicely done.

> +	/* clear marks */
> +	for (i = 0; i < cnt; i++) {
> +		struct commit_list *parents;
> +		parents = array[i]->parents;
> +
> +		while (parents) {
> +			clear_commit_marks(parents->item, STALE);
> +			parents = parents->next;
>  		}
> -		common = paint_down_to_common(r, array[i], filled,
> -					      work, min_generation);
> -		if (array[i]->object.flags & PARENT2)
> -			redundant[i] = 1;
> -		for (j = 0; j < filled; j++)
> -			if (work[j]->object.flags & PARENT1)
> -				redundant[filled_index[j]] = 1;
> -		clear_commit_marks(array[i], all_flags);
> -		clear_commit_marks_many(filled, work, all_flags);
> -		free_commit_list(common);
>  	}
>  
> -	/* Now collect the result */
> -	COPY_ARRAY(work, array, cnt);
> -	for (i = filled = 0; i < cnt; i++)
> -		if (!redundant[i])
> -			array[filled++] = work[i];
> -	for (j = filled, i = 0; i < cnt; i++)
> -		if (redundant[i])
> -			array[j++] = work[i];
> -	free(work);
> -	free(redundant);
> -	free(filled_index);
> -	return filled;
> +	return count_non_stale;
>  }
>  
>  static struct commit_list *get_merge_bases_many_0(struct repository *r,

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/3] commit-reach: use one walk in remove_redundant()
  2021-01-28 16:24 ` [PATCH 1/3] commit-reach: use one walk in remove_redundant() Derrick Stolee via GitGitGadget
  2021-01-28 20:51   ` Junio C Hamano
@ 2021-01-29 17:10   ` René Scharfe
  1 sibling, 0 replies; 36+ messages in thread
From: René Scharfe @ 2021-01-29 17:10 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget, git
  Cc: Michael Haggerty, me, peff, gitster, Derrick Stolee,
	Derrick Stolee

Am 28.01.21 um 17:24 schrieb Derrick Stolee via GitGitGadget:
> +	/* rearrange array */
> +	dup = xcalloc(cnt, sizeof(struct commit *));

You could use CALLOC_ARRAY instead here, which is shorter and uses the
correct type automatically.  Or -- seeing that the next line overwrites
all items anyway --  ALLOC_ARRAY.

> +	COPY_ARRAY(dup, array, cnt);

René

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/3] commit-reach: use one walk in remove_redundant()
  2021-01-28 20:51   ` Junio C Hamano
@ 2021-01-29 17:11     ` René Scharfe
  2021-01-31  3:52       ` Derrick Stolee
  2021-01-31  3:59     ` Derrick Stolee
  1 sibling, 1 reply; 36+ messages in thread
From: René Scharfe @ 2021-01-29 17:11 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, Michael Haggerty, me, peff, Derrick Stolee, Derrick Stolee

Am 28.01.21 um 21:51 schrieb Junio C Hamano:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>> +	/* rearrange array */
>> +	dup = xcalloc(cnt, sizeof(struct commit *));
>> +	COPY_ARRAY(dup, array, cnt);
>> +	for (i = 0; i < cnt; i++) {
>> +		if (dup[i]->object.flags & STALE) {
>> +			int insert = cnt - 1 - (i - count_non_stale);
>> +			array[insert] = dup[i];
>> +		} else {
>> +			array[count_non_stale] = dup[i];
>> +			count_non_stale++;
>> +		}
>> +	}
>> +	free(dup);
>
> The "fill stale ones from the end, non-stale ones from the
> beginning" in the loop looks unnecessarily complex to me.  I wonder
> if we can do only the "fill non-stale ones from the beginning" half,
> i.e.
>
> 	for (i = count_non_stale = 0; i < cnt; i++) {
> 		if (dup[i] is not stale)
> 			array[count_non_stale++] = dup[i];
> 	}
>
> without the "keep the stale one at the end of array[]", and clear
> marks using what is in dup[] as starting points before discarding
> dup[]?
>
> Or do the callers still look at the entries beyond count_non_stale?

Had the same reaction.  Both callers ignore the stale entries.

> Other than that, nicely done.
>
>> +	/* clear marks */
>> +	for (i = 0; i < cnt; i++) {
>> +		struct commit_list *parents;
>> +		parents = array[i]->parents;
>> +
>> +		while (parents) {
>> +			clear_commit_marks(parents->item, STALE);
>> +			parents = parents->next;
>>  		}

This loop clears STALE from the parents of both the non-stale and
stale entries.  OK.  Should it also clear it from the stale entries
themselves?

>> -		common = paint_down_to_common(r, array[i], filled,
>> -					      work, min_generation);
>> -		if (array[i]->object.flags & PARENT2)
>> -			redundant[i] = 1;
>> -		for (j = 0; j < filled; j++)
>> -			if (work[j]->object.flags & PARENT1)
>> -				redundant[filled_index[j]] = 1;
>> -		clear_commit_marks(array[i], all_flags);
>> -		clear_commit_marks_many(filled, work, all_flags);
>> -		free_commit_list(common);
>>  	}
>>
>> -	/* Now collect the result */
>> -	COPY_ARRAY(work, array, cnt);
>> -	for (i = filled = 0; i < cnt; i++)
>> -		if (!redundant[i])
>> -			array[filled++] = work[i];
>> -	for (j = filled, i = 0; i < cnt; i++)
>> -		if (redundant[i])
>> -			array[j++] = work[i];
>> -	free(work);
>> -	free(redundant);
>> -	free(filled_index);
>> -	return filled;
>> +	return count_non_stale;
>>  }
>>
>>  static struct commit_list *get_merge_bases_many_0(struct repository *r,


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/3] commit-reach: use one walk in remove_redundant()
  2021-01-29 17:11     ` René Scharfe
@ 2021-01-31  3:52       ` Derrick Stolee
  2021-01-31 10:20         ` René Scharfe
  0 siblings, 1 reply; 36+ messages in thread
From: Derrick Stolee @ 2021-01-31  3:52 UTC (permalink / raw)
  To: René Scharfe, Junio C Hamano,
	Derrick Stolee via GitGitGadget
  Cc: git, Michael Haggerty, me, peff, Derrick Stolee, Derrick Stolee

On 1/29/2021 12:11 PM, René Scharfe wrote:
> Am 28.01.21 um 21:51 schrieb Junio C Hamano:
>> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>> +	/* rearrange array */
>>> +	dup = xcalloc(cnt, sizeof(struct commit *));
>>> +	COPY_ARRAY(dup, array, cnt);
>>> +	for (i = 0; i < cnt; i++) {
>>> +		if (dup[i]->object.flags & STALE) {
>>> +			int insert = cnt - 1 - (i - count_non_stale);
>>> +			array[insert] = dup[i];
>>> +		} else {
>>> +			array[count_non_stale] = dup[i];
>>> +			count_non_stale++;
>>> +		}
>>> +	}
>>> +	free(dup);
>>
>> The "fill stale ones from the end, non-stale ones from the
>> beginning" in the loop looks unnecessarily complex to me.  I wonder
>> if we can do only the "fill non-stale ones from the beginning" half,
>> i.e.
>>
>> 	for (i = count_non_stale = 0; i < cnt; i++) {
>> 		if (dup[i] is not stale)
>> 			array[count_non_stale++] = dup[i];
>> 	}
>>
>> without the "keep the stale one at the end of array[]", and clear
>> marks using what is in dup[] as starting points before discarding
>> dup[]?
>>
>> Or do the callers still look at the entries beyond count_non_stale?
> 
> Had the same reaction.  Both callers ignore the stale entries.

Ok, I can update that logic accordingly. I wanted to keep consistent
with the comment at the start of the method:

	/*
	 * Some commit in the array may be an ancestor of
	 * another commit.  Move such commit to the end of
	 * the array, and return the number of commits that
	 * are independent from each other.
	 */

but if no caller actually needs that, then I can remove this
behavior. Anyone mind if it is a follow-up patch to change this
part of the behavior?

>> Other than that, nicely done.
>>
>>> +	/* clear marks */
>>> +	for (i = 0; i < cnt; i++) {
>>> +		struct commit_list *parents;
>>> +		parents = array[i]->parents;
>>> +
>>> +		while (parents) {
>>> +			clear_commit_marks(parents->item, STALE);
>>> +			parents = parents->next;
>>>  		}
> 
> This loop clears STALE from the parents of both the non-stale and
> stale entries.  OK.  Should it also clear it from the stale entries
> themselves?

clear_commit_marks() walks commits starting from the input commit
(parents->item in this case) and clears the STALE bit as long as
it is present. This way, the accumulated clear_commit_marks() will
walk each commit only once _and_ will visit any of the commits from
'array' that received the STALE bit during the above walk.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/3] commit-reach: use one walk in remove_redundant()
  2021-01-28 20:51   ` Junio C Hamano
  2021-01-29 17:11     ` René Scharfe
@ 2021-01-31  3:59     ` Derrick Stolee
  2021-01-31 20:13       ` Derrick Stolee
  2021-01-31 20:25       ` Junio C Hamano
  1 sibling, 2 replies; 36+ messages in thread
From: Derrick Stolee @ 2021-01-31  3:59 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, Michael Haggerty, me, peff, Derrick Stolee, Derrick Stolee

On 1/28/2021 3:51 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>> +		parents = c->parents;
>> +		while (parents) {
>> +			if (!(parents->item->object.flags & STALE)) {
>> +				parents->item->object.flags |= STALE;
>> +				prio_queue_put(&queue, parents->item);
>> +			}
>> +			parents = parents->next;
>> +		}
>> +	}
> 
> So, the inner loop makes sure we won't revisit STALE parent, but
> keep digging parents we haven't seen, and stop when the generation
> is old enough.  What happens when there is no generation number
> computed yet, I wonder...  We'll keep getting infinity and dig all
> the way down to root?

If we are on commits that have no generation number yet, then we
will walk until reaching commits in the commit-graph file that have
a computed generation (or in the heuristic case, when we have reached
all but one of the commits).

In the case of the commit-graph, all commits will have generation
number "infinity". In such a case, perhaps the old algorithm _is_
the best we can do, at least for now.

The trade-off is that we might walk more commits in unusual cases
with few input commits. That quadratic behavior will take over as
the input size grows, no matter if there is a commit-graph or not.

I can do a big more digging into these unusual cases, especially
when we cannot rely on commit-graphs being present.

One way to ensure we do not regress from the current behavior
would be to condition the new algorithm with

	if (generation_numbers_enabled(the_repository))
		new_algorithm();
	else
		old_algorithm();

much like in repo_is_descendant_of().

Is that a good plan?

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/3] commit-reach: use one walk in remove_redundant()
  2021-01-31  3:52       ` Derrick Stolee
@ 2021-01-31 10:20         ` René Scharfe
  0 siblings, 0 replies; 36+ messages in thread
From: René Scharfe @ 2021-01-31 10:20 UTC (permalink / raw)
  To: Derrick Stolee, Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, Michael Haggerty, me, peff, Derrick Stolee, Derrick Stolee

Am 31.01.21 um 04:52 schrieb Derrick Stolee:
> On 1/29/2021 12:11 PM, René Scharfe wrote:
>> Am 28.01.21 um 21:51 schrieb Junio C Hamano:
>>> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>>> +	/* clear marks */
>>>> +	for (i = 0; i < cnt; i++) {
>>>> +		struct commit_list *parents;
>>>> +		parents = array[i]->parents;
>>>> +
>>>> +		while (parents) {
>>>> +			clear_commit_marks(parents->item, STALE);
>>>> +			parents = parents->next;
>>>>  		}
>>
>> This loop clears STALE from the parents of both the non-stale and
>> stale entries.  OK.  Should it also clear it from the stale entries
>> themselves?
>
> clear_commit_marks() walks commits starting from the input commit
> (parents->item in this case) and clears the STALE bit as long as
> it is present. This way, the accumulated clear_commit_marks() will
> walk each commit only once _and_ will visit any of the commits from
> 'array' that received the STALE bit during the above walk.

OK, makes sense -- stale items are ancestors of non-stale items.  That
means you don't need to keep them around even for this cleanup loop and
clearing STALE from the parents of the first count_non_stale items
suffices, right?  In that case you don't need to duplicate the array.

René

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/3] commit-reach: use one walk in remove_redundant()
  2021-01-31  3:59     ` Derrick Stolee
@ 2021-01-31 20:13       ` Derrick Stolee
  2021-01-31 20:25       ` Junio C Hamano
  1 sibling, 0 replies; 36+ messages in thread
From: Derrick Stolee @ 2021-01-31 20:13 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, Michael Haggerty, me, peff, Derrick Stolee, Derrick Stolee

On 1/30/2021 10:59 PM, Derrick Stolee wrote:
> On 1/28/2021 3:51 PM, Junio C Hamano wrote:
>> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>> +		parents = c->parents;
>>> +		while (parents) {
>>> +			if (!(parents->item->object.flags & STALE)) {
>>> +				parents->item->object.flags |= STALE;
>>> +				prio_queue_put(&queue, parents->item);
>>> +			}
>>> +			parents = parents->next;
>>> +		}
>>> +	}
>>
>> So, the inner loop makes sure we won't revisit STALE parent, but
>> keep digging parents we haven't seen, and stop when the generation
>> is old enough.  What happens when there is no generation number
>> computed yet, I wonder...  We'll keep getting infinity and dig all
>> the way down to root?
> 
> If we are on commits that have no generation number yet, then we
> will walk until reaching commits in the commit-graph file that have
> a computed generation (or in the heuristic case, when we have reached
> all but one of the commits).
> 
> In the case of the commit-graph, all commits will have generation
> number "infinity". In such a case, perhaps the old algorithm _is_
> the best we can do, at least for now.
> 
> The trade-off is that we might walk more commits in unusual cases
> with few input commits. That quadratic behavior will take over as
> the input size grows, no matter if there is a commit-graph or not.
> 
> I can do a big more digging into these unusual cases, especially
> when we cannot rely on commit-graphs being present.

Indeed, the old algorithm is better for cases where generation
numbers do not help and there are multiple independent commits.

For example, this command in the Linux kernel repo changes from
0.047s in the old algorithm to 6.6s with the new algorithm:

git -c core.commitGraph=false merge-base --independent 2ecedd756908 d2360a398f0b >/dev/null

> One way to ensure we do not regress from the current behavior
> would be to condition the new algorithm with
> 
> 	if (generation_numbers_enabled(the_repository))
> 		new_algorithm();
> 	else
> 		old_algorithm();
> 
> much like in repo_is_descendant_of().

I will include this in a v2.

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/3] commit-reach: use one walk in remove_redundant()
  2021-01-31  3:59     ` Derrick Stolee
  2021-01-31 20:13       ` Derrick Stolee
@ 2021-01-31 20:25       ` Junio C Hamano
  2021-02-01  3:55         ` Derrick Stolee
  1 sibling, 1 reply; 36+ messages in thread
From: Junio C Hamano @ 2021-01-31 20:25 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, Michael Haggerty, me, peff,
	Derrick Stolee, Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

>> So, the inner loop makes sure we won't revisit STALE parent, but
>> keep digging parents we haven't seen, and stop when the generation
>> is old enough.  What happens when there is no generation number
>> computed yet, I wonder...  We'll keep getting infinity and dig all
>> the way down to root?
>
> If we are on commits that have no generation number yet, then we
> will walk until reaching commits in the commit-graph file that have
> a computed generation (or in the heuristic case, when we have reached
> all but one of the commits).
>
> In the case of the commit-graph, all commits will have generation
> number "infinity". In such a case, perhaps the old algorithm _is_
> the best we can do, at least for now.

Hmph, I am afraid that such is life X-<.

> One way to ensure we do not regress from the current behavior
> would be to condition the new algorithm with
>
> 	if (generation_numbers_enabled(the_repository))
> 		new_algorithm();
> 	else
> 		old_algorithm();
>
> much like in repo_is_descendant_of().
>
> Is that a good plan?

It would certainly avoid one particular form of regression, so it is
better than nothing.

But at the same time, we'd probably want to encourage people to
enable and maintain generation numbers for the majority of commits
in their repository, but unless you "gc" twice a day or something,
you'd inevitably have a mixture, say, all commits that are more than
two weeks old are covered by commit-graph, but more recent ones are
not yet enumerated, and you have to traverse at runtime.

And the performance characteristics we would care the most in the
longer term is to make sure that we perform well in such a mixed
environment for the parts of the history that are not old enough.

Many things can be sped up by precomputing and storing the result in
the commit-graph file and that is not all that interesting or
surprising part of the story, I would think.  Rather, we want to
ensure that we do not perform on the youngest part of the history
any worse---that way, people will have strong incentive to enable
commit-graph, as things will work superbly for older parts of the
history, while not doing any worse than the original system for the
newest parts of the history.

There was a side thread where somebody wished if they can remove
support for all the codepaths that do not use commit-graph, but
would this be an example of how such a wish is impractical, I have
to wonder?

Thanks.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/3] commit-reach: use one walk in remove_redundant()
  2021-01-31 20:25       ` Junio C Hamano
@ 2021-02-01  3:55         ` Derrick Stolee
  0 siblings, 0 replies; 36+ messages in thread
From: Derrick Stolee @ 2021-02-01  3:55 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Derrick Stolee via GitGitGadget, git, Michael Haggerty, me, peff,
	Derrick Stolee, Derrick Stolee

On 1/31/2021 3:25 PM, Junio C Hamano wrote:
> Derrick Stolee <stolee@gmail.com> writes:
> 
>>> So, the inner loop makes sure we won't revisit STALE parent, but
>>> keep digging parents we haven't seen, and stop when the generation
>>> is old enough.  What happens when there is no generation number
>>> computed yet, I wonder...  We'll keep getting infinity and dig all
>>> the way down to root?
>>
>> If we are on commits that have no generation number yet, then we
>> will walk until reaching commits in the commit-graph file that have
>> a computed generation (or in the heuristic case, when we have reached
>> all but one of the commits).
>>
>> In the case of the commit-graph, all commits will have generation
>> number "infinity". In such a case, perhaps the old algorithm _is_
>> the best we can do, at least for now.
> 
> Hmph, I am afraid that such is life X-<.
> 
>> One way to ensure we do not regress from the current behavior
>> would be to condition the new algorithm with
>>
>> 	if (generation_numbers_enabled(the_repository))
>> 		new_algorithm();
>> 	else
>> 		old_algorithm();
>>
>> much like in repo_is_descendant_of().
>>
>> Is that a good plan?
> 
> It would certainly avoid one particular form of regression, so it is
> better than nothing.
> 
> But at the same time, we'd probably want to encourage people to
> enable and maintain generation numbers for the majority of commits
> in their repository, but unless you "gc" twice a day or something,
> you'd inevitably have a mixture, say, all commits that are more than
> two weeks old are covered by commit-graph, but more recent ones are
> not yet enumerated, and you have to traverse at runtime.
> 
> And the performance characteristics we would care the most in the
> longer term is to make sure that we perform well in such a mixed
> environment for the parts of the history that are not old enough.

You're right, that in the mixed case of "all of these input commits
have infinite generation number" is the worst possible case for
this algorithm. However, if even a single commit has a finite
generation number, then this new algorithm should out-perform the
one using paint_down_to_common() in all cases.

I tested the painful case of

	git merge-base --independent 2ecedd756908 d2360a398f0b

in the Linux kernel repository with different commit-graph files.
I found that the new algorithm performed worse than the old
algorithm until there were fewer than 2,500 commits reachable
from these two but not in the commit-graph file. That's probably
too small of a gap to expect of a typical user.

I will modify my check

	if (generation_numbers_enabled(r)) {
		int i;

		/*
		 * If we have a single commit with finite generation
		 * number, then the _with_gen algorithm is preferred.
		 */
		for (i = 0; i < cnt; i++) {
			if (commit_graph_generation(array[i]) < GENERATION_NUMBER_INFINITY)
				return remove_redundant_with_gen(r, array, cnt);
		}
	}

	return remove_redundant_no_gen(r, array, cnt);

> Many things can be sped up by precomputing and storing the result in
> the commit-graph file and that is not all that interesting or
> surprising part of the story, I would think.  Rather, we want to
> ensure that we do not perform on the youngest part of the history
> any worse---that way, people will have strong incentive to enable
> commit-graph, as things will work superbly for older parts of the
> history, while not doing any worse than the original system for the
> newest parts of the history.

I'm definitely biased to a case where there is an up-to-date
commit-graph file, since I care most about server-side performance
or users with background maintenance enabled (and computing the
commit-graph frequently). You are right to call out that that is
not always the case.
 
> There was a side thread where somebody wished if they can remove
> support for all the codepaths that do not use commit-graph, but
> would this be an example of how such a wish is impractical, I have
> to wonder?

Some cases might allow de-duplication, such as
sort_in_topological_order(), that would have roughly the same
performance, give or take some overhead. It takes time to check
and see if that is the case, however.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH v2 0/5] Speed up remove_redundant()
  2021-01-28 16:24 [PATCH 0/3] Speed up remove_redundant() Derrick Stolee via GitGitGadget
                   ` (3 preceding siblings ...)
  2021-01-28 20:20 ` [PATCH 0/3] Speed up remove_redundant() Junio C Hamano
@ 2021-02-01 12:47 ` Derrick Stolee via GitGitGadget
  2021-02-01 12:47   ` [PATCH v2 1/5] commit-reach: reduce requirements for remove_redundant() Derrick Stolee via GitGitGadget
                     ` (6 more replies)
  4 siblings, 7 replies; 36+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-01 12:47 UTC (permalink / raw)
  To: git
  Cc: Michael Haggerty, me, peff, gitster, René Scharfe,
	Derrick Stolee, Derrick Stolee

Michael Haggerty pointed out that git merge-base --independent is quite slow
when many commits are provided in the input. This boils down to some
quadratic behavior in remove_redundant() because it calls
paint_down_to_common() in a loop.

This series avoids that by using a single commit walk, pushing a flag that
means "this commit is reachable from a parent of a commit in the list." At
the end of the walk, the input commits without that flag are independent.

There is an extra performance enhancement using the first-parent history as
a heuristic. This helps only when there is a single independent result, as
we can short circuit the walk when all other inputs are found to be
reachable from the top one. It also gets faster when we discover the commit
with lowest generation and can restrict our walk further to the next-lowest
generation number.

My tests were on the Linux kernel repository, testing this command:

  git merge-base --independent $(git for-each-ref refs/tags --format="%(refname)")

       Before: 16.4s
After Patch 1:  1.1s
After Patch 2:  0.1s

This does not change the correctness of the function. It is tested carefully
in t6600-test-reach.c, among existing merge-base tests. There are also more
tests in scripts like t6012-rev-list-simplify.sh that trigger the new logic
under GIT_TEST_COMMIT_GRAPH=1.


Updates in V2
=============

 * The old algorithm is not entirely removed. It is still faster than the
   new algorithm if all input commits are not in the commit-graph file
   (unless the commit-graph file is really close to the input commits, but
   that is an unreasonable expectation). Now, the new logic only happens if
   there is an input commit with finite generation number.

 * There was a copy of 'array' being used for the final reorder which is now
   removed. We still need to be careful about the RESULT bit.

 * The final patch is new, adding yet another speedup in the form of
   increasing the min_generation cutoff when we find the input commit of
   smallest generation. This can reduce the time significantly when given
   many inputs. This does require copying the 'array' again, so we can sort
   by generation number while preserving the original order of 'array' which
   is required by some callers.

Thanks, -Stolee

Derrick Stolee (5):
  commit-reach: reduce requirements for remove_redundant()
  commit-reach: use one walk in remove_redundant()
  commit-reach: move compare_commits_by_gen
  commit-reach: use heuristic in remove_redundant()
  commit-reach: stale commits may prune generation further

 commit-reach.c | 188 +++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 166 insertions(+), 22 deletions(-)


base-commit: 5a3b130cad0d5c770f766e3af6d32b41766374c0
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-852%2Fderrickstolee%2Fmerge-independent-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-852/derrickstolee/merge-independent-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/852

Range-diff vs v1:

 -:  ----------- > 1:  649f6799e6b commit-reach: reduce requirements for remove_redundant()
 1:  3fe74e339fc ! 2:  2f80ae5fcb0 commit-reach: use one walk in remove_redundant()
     @@ Commit message
           Before: 16.4s
            After:  1.1s
      
     +    This performance improvement requires the commit-graph file to be
     +    present. We keep the old algorithm around as remove_redundant_no_gen()
     +    and use it when generation_numbers_enabled() is false. This is similar
     +    to other algorithms within commit-reach.c. The new algorithm is
     +    implemented in remove_redundant_with_gen().
     +
          The basic approach is to do one commit walk instead of many. First, scan
          all commits in the list and mark their _parents_ with the STALE flag.
          This flag will indicate commits that are reachable from one of the
     @@ Commit message
          covering all commits up to the minimum generation number pushing the
          STALE flag throughout.
      
     -    At the end of the walk, commits in the input list that have the STALE
     -    flag are reachable from a _different_ commit in the list. These should
     -    be moved to the end of the array while the others are shifted to the
     -    front.
     +    At the end, we need to clear the STALE bit from all of the commits
     +    we walked. We move the non-stale commits in 'array' to the beginning of
     +    the list, and this might overwrite stale commits. However, we store an
     +    array of commits that started the walk, and use clear_commit_marks() on
     +    each of those starting commits. That method will walk the reachable
     +    commits with the STALE bit and clear them all. This makes the algorithm
     +    safe for re-entry or for other uses of those commits after this walk.
      
          This logic is covered by tests in t6600-test-reach.sh, so the behavior
     -    does not change.
     +    does not change. This is tested both in the case with a commit-graph and
     +    without.
      
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
       ## commit-reach.c ##
     +@@ commit-reach.c: struct commit_list *get_octopus_merge_bases(struct commit_list *in)
     + 	return ret;
     + }
     + 
     +-static int remove_redundant(struct repository *r, struct commit **array, int cnt)
     ++static int remove_redundant_no_gen(struct repository *r,
     ++				   struct commit **array, int cnt)
     + {
     +-	/*
     +-	 * Some commit in the array may be an ancestor of
     +-	 * another commit.  Move the independent commits to the
     +-	 * beginning of 'array' and return their number. Callers
     +-	 * should not rely upon the contents of 'array' after
     +-	 * that number.
     +-	 */
     + 	struct commit **work;
     + 	unsigned char *redundant;
     + 	int *filled_index;
      @@ commit-reach.c: static int remove_redundant(struct repository *r, struct commit **array, int cnt
     - 	 * the array, and return the number of commits that
     - 	 * are independent from each other.
     - 	 */
     --	struct commit **work;
     --	unsigned char *redundant;
     --	int *filled_index;
     --	int i, j, filled;
     + 	for (i = filled = 0; i < cnt; i++)
     + 		if (!redundant[i])
     + 			array[filled++] = work[i];
     ++	for (j = filled, i = 0; i < cnt; i++)
     ++		if (redundant[i])
     ++			array[j++] = work[i];
     + 	free(work);
     + 	free(redundant);
     + 	free(filled_index);
     + 	return filled;
     + }
     + 
     ++static int remove_redundant_with_gen(struct repository *r,
     ++				     struct commit **array, int cnt)
     ++{
      +	int i, count_non_stale = 0;
      +	timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
     -+	struct commit **dup;
     ++	struct commit **walk_start;
     ++	size_t walk_start_nr = 0, walk_start_alloc = cnt;
      +	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
     - 
     --	work = xcalloc(cnt, sizeof(*work));
     --	redundant = xcalloc(cnt, 1);
     --	ALLOC_ARRAY(filled_index, cnt - 1);
     ++
     ++	ALLOC_ARRAY(walk_start, walk_start_alloc);
     ++
      +	/* Mark all parents of the input as STALE */
      +	for (i = 0; i < cnt; i++) {
      +		struct commit_list *parents;
      +		timestamp_t generation;
     - 
     --	for (i = 0; i < cnt; i++)
     - 		repo_parse_commit(r, array[i]);
     --	for (i = 0; i < cnt; i++) {
     --		struct commit_list *common;
     --		timestamp_t min_generation = commit_graph_generation(array[i]);
     ++
     ++		repo_parse_commit(r, array[i]);
      +		parents = array[i]->parents;
      +
      +		while (parents) {
      +			repo_parse_commit(r, parents->item);
      +			if (!(parents->item->object.flags & STALE)) {
      +				parents->item->object.flags |= STALE;
     ++				ALLOC_GROW(walk_start, walk_start_nr + 1, walk_start_alloc);
     ++				walk_start[walk_start_nr++] = parents->item;
      +				prio_queue_put(&queue, parents->item);
      +			}
      +			parents = parents->next;
     @@ commit-reach.c: static int remove_redundant(struct repository *r, struct commit
      +		struct commit *c = prio_queue_get(&queue);
      +
      +		repo_parse_commit(r, c);
     - 
     --		if (redundant[i])
     ++
      +		if (commit_graph_generation(c) < min_generation)
     - 			continue;
     --		for (j = filled = 0; j < cnt; j++) {
     --			timestamp_t curr_generation;
     --			if (i == j || redundant[j])
     --				continue;
     --			filled_index[filled] = j;
     --			work[filled++] = array[j];
     - 
     --			curr_generation = commit_graph_generation(array[j]);
     --			if (curr_generation < min_generation)
     --				min_generation = curr_generation;
     ++			continue;
     ++
      +		parents = c->parents;
      +		while (parents) {
      +			if (!(parents->item->object.flags & STALE)) {
     @@ commit-reach.c: static int remove_redundant(struct repository *r, struct commit
      +	}
      +
      +	/* rearrange array */
     -+	dup = xcalloc(cnt, sizeof(struct commit *));
     -+	COPY_ARRAY(dup, array, cnt);
     -+	for (i = 0; i < cnt; i++) {
     -+		if (dup[i]->object.flags & STALE) {
     -+			int insert = cnt - 1 - (i - count_non_stale);
     -+			array[insert] = dup[i];
     -+		} else {
     -+			array[count_non_stale] = dup[i];
     -+			count_non_stale++;
     -+		}
     ++	for (i = count_non_stale = 0; i < cnt; i++) {
     ++		if (!(array[i]->object.flags & STALE))
     ++			array[count_non_stale++] = array[i];
      +	}
     -+	free(dup);
      +
      +	/* clear marks */
     -+	for (i = 0; i < cnt; i++) {
     -+		struct commit_list *parents;
     -+		parents = array[i]->parents;
     ++	for (i = 0; i < walk_start_nr; i++)
     ++		clear_commit_marks(walk_start[i], STALE);
     ++	free(walk_start);
      +
     -+		while (parents) {
     -+			clear_commit_marks(parents->item, STALE);
     -+			parents = parents->next;
     - 		}
     --		common = paint_down_to_common(r, array[i], filled,
     --					      work, min_generation);
     --		if (array[i]->object.flags & PARENT2)
     --			redundant[i] = 1;
     --		for (j = 0; j < filled; j++)
     --			if (work[j]->object.flags & PARENT1)
     --				redundant[filled_index[j]] = 1;
     --		clear_commit_marks(array[i], all_flags);
     --		clear_commit_marks_many(filled, work, all_flags);
     --		free_commit_list(common);
     - 	}
     - 
     --	/* Now collect the result */
     --	COPY_ARRAY(work, array, cnt);
     --	for (i = filled = 0; i < cnt; i++)
     --		if (!redundant[i])
     --			array[filled++] = work[i];
     --	for (j = filled, i = 0; i < cnt; i++)
     --		if (redundant[i])
     --			array[j++] = work[i];
     --	free(work);
     --	free(redundant);
     --	free(filled_index);
     --	return filled;
      +	return count_non_stale;
     - }
     - 
     ++}
     ++
     ++static int remove_redundant(struct repository *r, struct commit **array, int cnt)
     ++{
     ++	/*
     ++	 * Some commit in the array may be an ancestor of
     ++	 * another commit.  Move the independent commits to the
     ++	 * beginning of 'array' and return their number. Callers
     ++	 * should not rely upon the contents of 'array' after
     ++	 * that number.
     ++	 */
     ++	if (generation_numbers_enabled(r)) {
     ++		int i;
     ++
     ++		/*
     ++		 * If we have a single commit with finite generation
     ++		 * number, then the _with_gen algorithm is preferred.
     ++		 */
     ++		for (i = 0; i < cnt; i++) {
     ++			if (commit_graph_generation(array[i]) < GENERATION_NUMBER_INFINITY)
     ++				return remove_redundant_with_gen(r, array, cnt);
     ++		}
     ++	}
     ++
     ++	return remove_redundant_no_gen(r, array, cnt);
     ++}
     ++
       static struct commit_list *get_merge_bases_many_0(struct repository *r,
     + 						  struct commit *one,
     + 						  int n,
 2:  4c58877a709 = 3:  009f64b53c9 commit-reach: move compare_commits_by_gen
 3:  fbe7bdc1ec2 ! 4:  83feabeebb5 commit-reach: use heuristic in remove_redundant()
     @@ Commit message
          the heuristic of searching the first parent history before continuing to
          expand the walk.
      
     +    The order in which we explore the commits matters, so update
     +    compare_commits_by_gen to break generation number ties with commit date.
     +    This has no effect when the commits are in a commit-graph file with
     +    corrected commit dates computed, but it will assist when the commits are
     +    in the region "above" the commit-graph with "infinite" generation
     +    number. Note that we cannot shift to use
     +    compare_commits_by_gen_then_commit_date as the method prototype is
     +    different. We use compare_commits_by_gen for QSORT() as opposed to as a
     +    priority function.
     +
          The important piece is to ensure we short-circuit the walk when we find
          that there is a single non-redundant commit. This happens frequently
          when looking for merge-bases or comparing several tags with 'git
     @@ Commit message
          Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
      
       ## commit-reach.c ##
     -@@ commit-reach.c: static int remove_redundant(struct repository *r, struct commit **array, int cnt
     - 	 * the array, and return the number of commits that
     - 	 * are independent from each other.
     - 	 */
     +@@ commit-reach.c: static int compare_commits_by_gen(const void *_a, const void *_b)
     + 		return -1;
     + 	if (generation_a > generation_b)
     + 		return 1;
     ++	if (a->date < b->date)
     ++		return -1;
     ++	if (a->date > b->date)
     ++		return 1;
     + 	return 0;
     + }
     + 
     +@@ commit-reach.c: static int remove_redundant_no_gen(struct repository *r,
     + static int remove_redundant_with_gen(struct repository *r,
     + 				     struct commit **array, int cnt)
     + {
      -	int i, count_non_stale = 0;
      +	int i, count_non_stale = 0, count_still_independent = cnt;
       	timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
     - 	struct commit **dup;
     + 	struct commit **walk_start;
     + 	size_t walk_start_nr = 0, walk_start_alloc = cnt;
      -	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
     -+	struct commit **walk_start;
     -+	size_t walk_start_nr = 0, walk_start_alloc = cnt;
     -+
     -+	ALLOC_ARRAY(walk_start, walk_start_alloc);
       
     - 	/* Mark all parents of the input as STALE */
     - 	for (i = 0; i < cnt; i++) {
     -@@ commit-reach.c: static int remove_redundant(struct repository *r, struct commit **array, int cnt
     + 	ALLOC_ARRAY(walk_start, walk_start_alloc);
     + 
     +@@ commit-reach.c: static int remove_redundant_with_gen(struct repository *r,
       		timestamp_t generation;
       
       		repo_parse_commit(r, array[i]);
     @@ commit-reach.c: static int remove_redundant(struct repository *r, struct commit
       		parents = array[i]->parents;
       
       		while (parents) {
     - 			repo_parse_commit(r, parents->item);
     - 			if (!(parents->item->object.flags & STALE)) {
     +@@ commit-reach.c: static int remove_redundant_with_gen(struct repository *r,
       				parents->item->object.flags |= STALE;
     + 				ALLOC_GROW(walk_start, walk_start_nr + 1, walk_start_alloc);
     + 				walk_start[walk_start_nr++] = parents->item;
      -				prio_queue_put(&queue, parents->item);
     -+				ALLOC_GROW(walk_start, walk_start_nr + 1, walk_start_alloc);
     -+				walk_start[walk_start_nr++] = parents->item;
       			}
       			parents = parents->next;
       		}
     -@@ commit-reach.c: static int remove_redundant(struct repository *r, struct commit **array, int cnt
     +@@ commit-reach.c: static int remove_redundant_with_gen(struct repository *r,
       			min_generation = generation;
       	}
       
     @@ commit-reach.c: static int remove_redundant(struct repository *r, struct commit
       		}
      +		free_commit_list(stack);
       	}
     -+	free(walk_start);
       
     ++	/* clear result */
     ++	for (i = 0; i < cnt; i++)
     ++		array[i]->object.flags &= ~RESULT;
     ++
       	/* rearrange array */
     - 	dup = xcalloc(cnt, sizeof(struct commit *));
     - 	COPY_ARRAY(dup, array, cnt);
     - 	for (i = 0; i < cnt; i++) {
     -+		dup[i]->object.flags &= ~RESULT;
     - 		if (dup[i]->object.flags & STALE) {
     - 			int insert = cnt - 1 - (i - count_non_stale);
     - 			array[insert] = dup[i];
     + 	for (i = count_non_stale = 0; i < cnt; i++) {
     + 		if (!(array[i]->object.flags & STALE))
 -:  ----------- > 5:  14f0974c987 commit-reach: stale commits may prune generation further

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH v2 1/5] commit-reach: reduce requirements for remove_redundant()
  2021-02-01 12:47 ` [PATCH v2 0/5] " Derrick Stolee via GitGitGadget
@ 2021-02-01 12:47   ` Derrick Stolee via GitGitGadget
  2021-02-01 19:51     ` Junio C Hamano
  2021-02-01 12:47   ` [PATCH v2 2/5] commit-reach: use one walk in remove_redundant() Derrick Stolee via GitGitGadget
                     ` (5 subsequent siblings)
  6 siblings, 1 reply; 36+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-01 12:47 UTC (permalink / raw)
  To: git
  Cc: Michael Haggerty, me, peff, gitster, René Scharfe,
	Derrick Stolee, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Remove a comment at the beggining of remove_redundant() that mentions a
reordering of the input array to have the initial segment be the
independent commits and the final segment be the redundant commits.
While this behavior is followed in remove_redundant(), no callers rely
on that behavior.

Remove the final loop that copies this final segment and update the
comment to match the new behavior.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-reach.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/commit-reach.c b/commit-reach.c
index e38771ca5a1..9af51fe7e07 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -160,9 +160,10 @@ static int remove_redundant(struct repository *r, struct commit **array, int cnt
 {
 	/*
 	 * Some commit in the array may be an ancestor of
-	 * another commit.  Move such commit to the end of
-	 * the array, and return the number of commits that
-	 * are independent from each other.
+	 * another commit.  Move the independent commits to the
+	 * beginning of 'array' and return their number. Callers
+	 * should not rely upon the contents of 'array' after
+	 * that number.
 	 */
 	struct commit **work;
 	unsigned char *redundant;
@@ -209,9 +210,6 @@ static int remove_redundant(struct repository *r, struct commit **array, int cnt
 	for (i = filled = 0; i < cnt; i++)
 		if (!redundant[i])
 			array[filled++] = work[i];
-	for (j = filled, i = 0; i < cnt; i++)
-		if (redundant[i])
-			array[j++] = work[i];
 	free(work);
 	free(redundant);
 	free(filled_index);
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v2 2/5] commit-reach: use one walk in remove_redundant()
  2021-02-01 12:47 ` [PATCH v2 0/5] " Derrick Stolee via GitGitGadget
  2021-02-01 12:47   ` [PATCH v2 1/5] commit-reach: reduce requirements for remove_redundant() Derrick Stolee via GitGitGadget
@ 2021-02-01 12:47   ` Derrick Stolee via GitGitGadget
  2021-02-01 16:12     ` René Scharfe.
  2021-02-01 12:47   ` [PATCH v2 3/5] commit-reach: move compare_commits_by_gen Derrick Stolee via GitGitGadget
                     ` (4 subsequent siblings)
  6 siblings, 1 reply; 36+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-01 12:47 UTC (permalink / raw)
  To: git
  Cc: Michael Haggerty, me, peff, gitster, René Scharfe,
	Derrick Stolee, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The current implementation of remove_redundant() uses several calls to
paint_down_to_common() to determine that commits are independent of each
other. This leads to quadratic behavior when many inputs are passed to
commands such as 'git merge-base'.

For example, in the Linux kernel repository, I tested the performance
by passing all tags:

 git merge-base --independent $(git for-each-ref refs/tags --format="$(refname)")

(Note: I had to delete the tags v2.6.11-tree and v2.6.11 as they do
not point to commits.)

Here is the performance improvement introduced by this change:

 Before: 16.4s
  After:  1.1s

This performance improvement requires the commit-graph file to be
present. We keep the old algorithm around as remove_redundant_no_gen()
and use it when generation_numbers_enabled() is false. This is similar
to other algorithms within commit-reach.c. The new algorithm is
implemented in remove_redundant_with_gen().

The basic approach is to do one commit walk instead of many. First, scan
all commits in the list and mark their _parents_ with the STALE flag.
This flag will indicate commits that are reachable from one of the
inputs, except not including themselves. Then, walk commits until
covering all commits up to the minimum generation number pushing the
STALE flag throughout.

At the end, we need to clear the STALE bit from all of the commits
we walked. We move the non-stale commits in 'array' to the beginning of
the list, and this might overwrite stale commits. However, we store an
array of commits that started the walk, and use clear_commit_marks() on
each of those starting commits. That method will walk the reachable
commits with the STALE bit and clear them all. This makes the algorithm
safe for re-entry or for other uses of those commits after this walk.

This logic is covered by tests in t6600-test-reach.sh, so the behavior
does not change. This is tested both in the case with a commit-graph and
without.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-reach.c | 108 +++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 100 insertions(+), 8 deletions(-)

diff --git a/commit-reach.c b/commit-reach.c
index 9af51fe7e07..053676e51d0 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -156,15 +156,9 @@ struct commit_list *get_octopus_merge_bases(struct commit_list *in)
 	return ret;
 }
 
-static int remove_redundant(struct repository *r, struct commit **array, int cnt)
+static int remove_redundant_no_gen(struct repository *r,
+				   struct commit **array, int cnt)
 {
-	/*
-	 * Some commit in the array may be an ancestor of
-	 * another commit.  Move the independent commits to the
-	 * beginning of 'array' and return their number. Callers
-	 * should not rely upon the contents of 'array' after
-	 * that number.
-	 */
 	struct commit **work;
 	unsigned char *redundant;
 	int *filled_index;
@@ -210,12 +204,110 @@ static int remove_redundant(struct repository *r, struct commit **array, int cnt
 	for (i = filled = 0; i < cnt; i++)
 		if (!redundant[i])
 			array[filled++] = work[i];
+	for (j = filled, i = 0; i < cnt; i++)
+		if (redundant[i])
+			array[j++] = work[i];
 	free(work);
 	free(redundant);
 	free(filled_index);
 	return filled;
 }
 
+static int remove_redundant_with_gen(struct repository *r,
+				     struct commit **array, int cnt)
+{
+	int i, count_non_stale = 0;
+	timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
+	struct commit **walk_start;
+	size_t walk_start_nr = 0, walk_start_alloc = cnt;
+	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
+
+	ALLOC_ARRAY(walk_start, walk_start_alloc);
+
+	/* Mark all parents of the input as STALE */
+	for (i = 0; i < cnt; i++) {
+		struct commit_list *parents;
+		timestamp_t generation;
+
+		repo_parse_commit(r, array[i]);
+		parents = array[i]->parents;
+
+		while (parents) {
+			repo_parse_commit(r, parents->item);
+			if (!(parents->item->object.flags & STALE)) {
+				parents->item->object.flags |= STALE;
+				ALLOC_GROW(walk_start, walk_start_nr + 1, walk_start_alloc);
+				walk_start[walk_start_nr++] = parents->item;
+				prio_queue_put(&queue, parents->item);
+			}
+			parents = parents->next;
+		}
+
+		generation = commit_graph_generation(array[i]);
+
+		if (generation < min_generation)
+			min_generation = generation;
+	}
+
+	/* push the STALE bits up to min generation */
+	while (queue.nr) {
+		struct commit_list *parents;
+		struct commit *c = prio_queue_get(&queue);
+
+		repo_parse_commit(r, c);
+
+		if (commit_graph_generation(c) < min_generation)
+			continue;
+
+		parents = c->parents;
+		while (parents) {
+			if (!(parents->item->object.flags & STALE)) {
+				parents->item->object.flags |= STALE;
+				prio_queue_put(&queue, parents->item);
+			}
+			parents = parents->next;
+		}
+	}
+
+	/* rearrange array */
+	for (i = count_non_stale = 0; i < cnt; i++) {
+		if (!(array[i]->object.flags & STALE))
+			array[count_non_stale++] = array[i];
+	}
+
+	/* clear marks */
+	for (i = 0; i < walk_start_nr; i++)
+		clear_commit_marks(walk_start[i], STALE);
+	free(walk_start);
+
+	return count_non_stale;
+}
+
+static int remove_redundant(struct repository *r, struct commit **array, int cnt)
+{
+	/*
+	 * Some commit in the array may be an ancestor of
+	 * another commit.  Move the independent commits to the
+	 * beginning of 'array' and return their number. Callers
+	 * should not rely upon the contents of 'array' after
+	 * that number.
+	 */
+	if (generation_numbers_enabled(r)) {
+		int i;
+
+		/*
+		 * If we have a single commit with finite generation
+		 * number, then the _with_gen algorithm is preferred.
+		 */
+		for (i = 0; i < cnt; i++) {
+			if (commit_graph_generation(array[i]) < GENERATION_NUMBER_INFINITY)
+				return remove_redundant_with_gen(r, array, cnt);
+		}
+	}
+
+	return remove_redundant_no_gen(r, array, cnt);
+}
+
 static struct commit_list *get_merge_bases_many_0(struct repository *r,
 						  struct commit *one,
 						  int n,
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v2 3/5] commit-reach: move compare_commits_by_gen
  2021-02-01 12:47 ` [PATCH v2 0/5] " Derrick Stolee via GitGitGadget
  2021-02-01 12:47   ` [PATCH v2 1/5] commit-reach: reduce requirements for remove_redundant() Derrick Stolee via GitGitGadget
  2021-02-01 12:47   ` [PATCH v2 2/5] commit-reach: use one walk in remove_redundant() Derrick Stolee via GitGitGadget
@ 2021-02-01 12:47   ` Derrick Stolee via GitGitGadget
  2021-02-01 12:47   ` [PATCH v2 4/5] commit-reach: use heuristic in remove_redundant() Derrick Stolee via GitGitGadget
                     ` (3 subsequent siblings)
  6 siblings, 0 replies; 36+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-01 12:47 UTC (permalink / raw)
  To: git
  Cc: Michael Haggerty, me, peff, gitster, René Scharfe,
	Derrick Stolee, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Move this earlier in the file so it can be used by more methods.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-reach.c | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/commit-reach.c b/commit-reach.c
index 053676e51d0..7bf52e94429 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -17,6 +17,21 @@
 
 static const unsigned all_flags = (PARENT1 | PARENT2 | STALE | RESULT);
 
+static int compare_commits_by_gen(const void *_a, const void *_b)
+{
+	const struct commit *a = *(const struct commit * const *)_a;
+	const struct commit *b = *(const struct commit * const *)_b;
+
+	timestamp_t generation_a = commit_graph_generation(a);
+	timestamp_t generation_b = commit_graph_generation(b);
+
+	if (generation_a < generation_b)
+		return -1;
+	if (generation_a > generation_b)
+		return 1;
+	return 0;
+}
+
 static int queue_has_nonstale(struct prio_queue *queue)
 {
 	int i;
@@ -651,21 +666,6 @@ int commit_contains(struct ref_filter *filter, struct commit *commit,
 	return repo_is_descendant_of(the_repository, commit, list);
 }
 
-static int compare_commits_by_gen(const void *_a, const void *_b)
-{
-	const struct commit *a = *(const struct commit * const *)_a;
-	const struct commit *b = *(const struct commit * const *)_b;
-
-	timestamp_t generation_a = commit_graph_generation(a);
-	timestamp_t generation_b = commit_graph_generation(b);
-
-	if (generation_a < generation_b)
-		return -1;
-	if (generation_a > generation_b)
-		return 1;
-	return 0;
-}
-
 int can_all_from_reach_with_flag(struct object_array *from,
 				 unsigned int with_flag,
 				 unsigned int assign_flag,
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v2 4/5] commit-reach: use heuristic in remove_redundant()
  2021-02-01 12:47 ` [PATCH v2 0/5] " Derrick Stolee via GitGitGadget
                     ` (2 preceding siblings ...)
  2021-02-01 12:47   ` [PATCH v2 3/5] commit-reach: move compare_commits_by_gen Derrick Stolee via GitGitGadget
@ 2021-02-01 12:47   ` Derrick Stolee via GitGitGadget
  2021-02-01 20:05     ` Junio C Hamano
  2021-02-01 12:47   ` [PATCH v2 5/5] commit-reach: stale commits may prune generation further Derrick Stolee via GitGitGadget
                     ` (2 subsequent siblings)
  6 siblings, 1 reply; 36+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-01 12:47 UTC (permalink / raw)
  To: git
  Cc: Michael Haggerty, me, peff, gitster, René Scharfe,
	Derrick Stolee, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Reachability algorithms in commit-reach.c frequently benefit from using
the first-parent history as a heuristic for satisfying reachability
queries. The most obvious example was implemented in 4fbcca4e
(commit-reach: make can_all_from_reach... linear, 2018-07-20).

Update the walk in remove_redundant() to use this same heuristic. Here,
we are walking starting at the parents of the input commits. Sort those
parents and walk from the highest generation to lower. Each time, use
the heuristic of searching the first parent history before continuing to
expand the walk.

The order in which we explore the commits matters, so update
compare_commits_by_gen to break generation number ties with commit date.
This has no effect when the commits are in a commit-graph file with
corrected commit dates computed, but it will assist when the commits are
in the region "above" the commit-graph with "infinite" generation
number. Note that we cannot shift to use
compare_commits_by_gen_then_commit_date as the method prototype is
different. We use compare_commits_by_gen for QSORT() as opposed to as a
priority function.

The important piece is to ensure we short-circuit the walk when we find
that there is a single non-redundant commit. This happens frequently
when looking for merge-bases or comparing several tags with 'git
merge-base --independent'. Use a new count 'count_still_independent' and
if that hits 1 we can stop walking.

To update 'count_still_independent' properly, we add use of the RESULT
flag on the input commits. Then we can detect when we reach one of these
commits and decrease the count. We need to remove the RESULT flag at
that moment because we might re-visit that commit when popping the
stack.

We use the STALE flag to mark parents that have been added to the new
walk_start list, but we need to clear that flag before we start walking
so those flags don't halt our depth-first-search walk.

On my copy of the Linux kernel repository, the performance of 'git
merge-base --independent <all-tags>' goes from 1.1 seconds to 0.11
seconds.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-reach.c | 72 +++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 56 insertions(+), 16 deletions(-)

diff --git a/commit-reach.c b/commit-reach.c
index 7bf52e94429..d3a6e2bdd04 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -29,6 +29,10 @@ static int compare_commits_by_gen(const void *_a, const void *_b)
 		return -1;
 	if (generation_a > generation_b)
 		return 1;
+	if (a->date < b->date)
+		return -1;
+	if (a->date > b->date)
+		return 1;
 	return 0;
 }
 
@@ -231,11 +235,10 @@ static int remove_redundant_no_gen(struct repository *r,
 static int remove_redundant_with_gen(struct repository *r,
 				     struct commit **array, int cnt)
 {
-	int i, count_non_stale = 0;
+	int i, count_non_stale = 0, count_still_independent = cnt;
 	timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
 	struct commit **walk_start;
 	size_t walk_start_nr = 0, walk_start_alloc = cnt;
-	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
 
 	ALLOC_ARRAY(walk_start, walk_start_alloc);
 
@@ -245,6 +248,7 @@ static int remove_redundant_with_gen(struct repository *r,
 		timestamp_t generation;
 
 		repo_parse_commit(r, array[i]);
+		array[i]->object.flags |= RESULT;
 		parents = array[i]->parents;
 
 		while (parents) {
@@ -253,7 +257,6 @@ static int remove_redundant_with_gen(struct repository *r,
 				parents->item->object.flags |= STALE;
 				ALLOC_GROW(walk_start, walk_start_nr + 1, walk_start_alloc);
 				walk_start[walk_start_nr++] = parents->item;
-				prio_queue_put(&queue, parents->item);
 			}
 			parents = parents->next;
 		}
@@ -264,26 +267,63 @@ static int remove_redundant_with_gen(struct repository *r,
 			min_generation = generation;
 	}
 
-	/* push the STALE bits up to min generation */
-	while (queue.nr) {
-		struct commit_list *parents;
-		struct commit *c = prio_queue_get(&queue);
+	QSORT(walk_start, walk_start_nr, compare_commits_by_gen);
 
-		repo_parse_commit(r, c);
+	/* remove STALE bit for now to allow walking through parents */
+	for (i = 0; i < walk_start_nr; i++)
+		walk_start[i]->object.flags &= ~STALE;
 
-		if (commit_graph_generation(c) < min_generation)
-			continue;
+	/*
+	 * Start walking from the highest generation. Hopefully, it will
+	 * find all other items during the first-parent walk, and we can
+	 * terminate early. Otherwise, we will do the same amount of work
+	 * as before.
+	 */
+	for (i = walk_start_nr - 1; i >= 0 && count_still_independent > 1; i--) {
+		/* push the STALE bits up to min generation */
+		struct commit_list *stack = NULL;
 
-		parents = c->parents;
-		while (parents) {
-			if (!(parents->item->object.flags & STALE)) {
-				parents->item->object.flags |= STALE;
-				prio_queue_put(&queue, parents->item);
+		commit_list_insert(walk_start[i], &stack);
+		walk_start[i]->object.flags |= STALE;
+
+		while (stack) {
+			struct commit_list *parents;
+			struct commit *c = stack->item;
+
+			repo_parse_commit(r, c);
+
+			if (c->object.flags & RESULT) {
+				c->object.flags &= ~RESULT;
+				if (--count_still_independent <= 1)
+					break;
 			}
-			parents = parents->next;
+
+			if (commit_graph_generation(c) < min_generation) {
+				pop_commit(&stack);
+				continue;
+			}
+
+			parents = c->parents;
+			while (parents) {
+				if (!(parents->item->object.flags & STALE)) {
+					parents->item->object.flags |= STALE;
+					commit_list_insert(parents->item, &stack);
+					break;
+				}
+				parents = parents->next;
+			}
+
+			/* pop if all parents have been visited already */
+			if (!parents)
+				pop_commit(&stack);
 		}
+		free_commit_list(stack);
 	}
 
+	/* clear result */
+	for (i = 0; i < cnt; i++)
+		array[i]->object.flags &= ~RESULT;
+
 	/* rearrange array */
 	for (i = count_non_stale = 0; i < cnt; i++) {
 		if (!(array[i]->object.flags & STALE))
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v2 5/5] commit-reach: stale commits may prune generation further
  2021-02-01 12:47 ` [PATCH v2 0/5] " Derrick Stolee via GitGitGadget
                     ` (3 preceding siblings ...)
  2021-02-01 12:47   ` [PATCH v2 4/5] commit-reach: use heuristic in remove_redundant() Derrick Stolee via GitGitGadget
@ 2021-02-01 12:47   ` Derrick Stolee via GitGitGadget
  2021-02-03 15:59     ` Taylor Blau
  2021-02-01 15:48   ` [PATCH v2 0/5] Speed up remove_redundant() Derrick Stolee
  2021-02-19 12:34   ` [PATCH v3 " Derrick Stolee via GitGitGadget
  6 siblings, 1 reply; 36+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-01 12:47 UTC (permalink / raw)
  To: git
  Cc: Michael Haggerty, me, peff, gitster, René Scharfe,
	Derrick Stolee, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The remove_redundant_with_gen() algorithm performs a depth-first-search
to find commits in the 'array' list, starting at the parents of each
commit in 'array'. The result is that commits in 'array' are marked
STALE when they are reachable from another commit in 'array'.

This depth-first-search is fast when commits lie on or near the
first-parent history of the higher commits. The search terminates early
if all but one commit becomes marked STALE.

However, it is possible that there are two independent commits with high
generation number. In that case, the depth-first-search might languish
by searching in lower generations due to the fixed min_generation used
throughout the method.

With the expectation that commits with lower generation are expected to
become STALE more often, we can optimize further by increasing that
min_generation boundary upon discovery of the commit with minimum
generation.

We must first sort the commits in 'array' by generation. We cannot sort
'array' itself since it must preserve relative order among the returned
results (see revision.c:mark_redundant_parents() for an example).

This simplifies the initialization of min_generation, but it also allows
us to increase the new min_generation when we find the commit with
smallest generation remaining.

This requires more than two commits in order to test, so I used the
Linux kernel repository with a few commits that are slightly off of the
first-parent history. I timed the following command:

  git merge-base --independent 2ecedd756908 d2360a398f0b \
	1253935ad801 160bab43419e 0e2209629fec 1d0e16ac1a9e

The first two commits have similar generation and are near the v5.10
tag. Commit 160bab43419e is off of the first-parent history behind v5.5,
while the others are scattered somewhere reachable from v5.9. This is
designed to demonstrate the optimization, as that commit within v5.5
would normally cause a lot of extra commit walking.

Since remove_redundant_with_alg() is called only when at least one of
the input commits has a finite generation number, this algorithm is
tested with a commit-graph generated starting at a number of different
tags, the earliest being v5.5.

commit-graph at v5.5:

 | Method                | Time  |
 |-----------------------+-------|
 | *_no_gen()            | 864ms |
 | *_with_gen() (before) | 858ms |
 | *_with_gen() (after)  | 810ms |

commit-graph at v5.7:

 | Method                | Time  |
 |-----------------------+-------|
 | *_no_gen()            | 625ms |
 | *_with_gen() (before) | 572ms |
 | *_with_gen() (after)  | 517ms |

commit-graph at v5.9:

 | Method                | Time  |
 |-----------------------+-------|
 | *_no_gen()            | 268ms |
 | *_with_gen() (before) | 224ms |
 | *_with_gen() (after)  | 202ms |

commit-graph at v5.10:

 | Method                | Time  |
 |-----------------------+-------|
 | *_no_gen()            |  72ms |
 | *_with_gen() (before) |  37ms |
 | *_with_gen() (after)  |   9ms |

Note that these are only modest improvements for the case where the two
independent commits are not in the commit-graph (not until v5.10). All
algorithms get faster as more commits are indexed, which is not a
surprise. However, the cost of walking extra commits is more and more
prevalent in relative terms as more commits are indexed. Finally, the
last case allows us to jump to the minimum generation between the last
two commits (that are actually independent) so we greatly reduce the
cost in that case.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-reach.c | 28 +++++++++++++++++++++-------
 1 file changed, 21 insertions(+), 7 deletions(-)

diff --git a/commit-reach.c b/commit-reach.c
index d3a6e2bdd04..c2e0747fea4 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -237,15 +237,27 @@ static int remove_redundant_with_gen(struct repository *r,
 {
 	int i, count_non_stale = 0, count_still_independent = cnt;
 	timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
-	struct commit **walk_start;
+	struct commit **walk_start, **sorted;
 	size_t walk_start_nr = 0, walk_start_alloc = cnt;
+	int min_gen_pos = 0;
+
+	/*
+	 * Sort the input by generation number, ascending. This allows
+	 * us to increase the "min_generation" limit when we discover
+	 * the commit with lowest generation is STALE. The index
+	 * min_gen_pos points to the current position within 'array'
+	 * that is not yet known to be STALE.
+	 */
+	ALLOC_ARRAY(sorted, cnt);
+	COPY_ARRAY(sorted, array, cnt);
+	QSORT(sorted, cnt, compare_commits_by_gen);
+	min_generation = commit_graph_generation(sorted[0]);
 
 	ALLOC_ARRAY(walk_start, walk_start_alloc);
 
 	/* Mark all parents of the input as STALE */
 	for (i = 0; i < cnt; i++) {
 		struct commit_list *parents;
-		timestamp_t generation;
 
 		repo_parse_commit(r, array[i]);
 		array[i]->object.flags |= RESULT;
@@ -260,11 +272,6 @@ static int remove_redundant_with_gen(struct repository *r,
 			}
 			parents = parents->next;
 		}
-
-		generation = commit_graph_generation(array[i]);
-
-		if (generation < min_generation)
-			min_generation = generation;
 	}
 
 	QSORT(walk_start, walk_start_nr, compare_commits_by_gen);
@@ -296,6 +303,12 @@ static int remove_redundant_with_gen(struct repository *r,
 				c->object.flags &= ~RESULT;
 				if (--count_still_independent <= 1)
 					break;
+				if (oideq(&c->object.oid, &sorted[min_gen_pos]->object.oid)) {
+					while (min_gen_pos < cnt - 1 &&
+					       (sorted[min_gen_pos]->object.flags & STALE))
+						min_gen_pos++;
+					min_generation = commit_graph_generation(sorted[min_gen_pos]);
+				}
 			}
 
 			if (commit_graph_generation(c) < min_generation) {
@@ -319,6 +332,7 @@ static int remove_redundant_with_gen(struct repository *r,
 		}
 		free_commit_list(stack);
 	}
+	free(sorted);
 
 	/* clear result */
 	for (i = 0; i < cnt; i++)
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 0/5] Speed up remove_redundant()
  2021-02-01 12:47 ` [PATCH v2 0/5] " Derrick Stolee via GitGitGadget
                     ` (4 preceding siblings ...)
  2021-02-01 12:47   ` [PATCH v2 5/5] commit-reach: stale commits may prune generation further Derrick Stolee via GitGitGadget
@ 2021-02-01 15:48   ` Derrick Stolee
  2021-02-18 23:25     ` Junio C Hamano
  2021-02-19 12:34   ` [PATCH v3 " Derrick Stolee via GitGitGadget
  6 siblings, 1 reply; 36+ messages in thread
From: Derrick Stolee @ 2021-02-01 15:48 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget, git
  Cc: Michael Haggerty, me, peff, gitster, René Scharfe,
	Derrick Stolee

On 2/1/2021 7:47 AM, Derrick Stolee via GitGitGadget wrote:
> My tests were on the Linux kernel repository, testing this command:
> 
>   git merge-base --independent $(git for-each-ref refs/tags --format="%(refname)")
> 
>        Before: 16.4s
> After Patch 1:  1.1s
> After Patch 2:  0.1s

During my testing of v2, I discovered how remove_redundant()
is being used as part of 'git log --simplify-merges'. So, here
are some performance numbers for the Linux kernel repository
with a full commit-graph and changed-path Bloom filters running

  git log --oneline -n10 --simplify-merges -- kernel/

Benchmark #1: old
  Time (mean ± σ):      5.770 s ±  0.030 s    [User: 5.551 s, System: 0.219 s]
  Range (min … max):    5.728 s …  5.827 s    10 runs
 
Benchmark #2: new
  Time (mean ± σ):      5.030 s ±  0.017 s    [User: 4.805 s, System: 0.224 s]
  Range (min … max):    4.999 s …  5.054 s    10 runs
 
Summary
  'new' ran
    1.15 ± 0.01 times faster than 'old'

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 2/5] commit-reach: use one walk in remove_redundant()
  2021-02-01 12:47   ` [PATCH v2 2/5] commit-reach: use one walk in remove_redundant() Derrick Stolee via GitGitGadget
@ 2021-02-01 16:12     ` René Scharfe.
  2021-02-01 16:31       ` Derrick Stolee
  0 siblings, 1 reply; 36+ messages in thread
From: René Scharfe. @ 2021-02-01 16:12 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget, git
  Cc: Michael Haggerty, me, peff, gitster, Derrick Stolee,
	Derrick Stolee, Derrick Stolee

Am 01.02.21 um 13:47 schrieb Derrick Stolee via GitGitGadget:
> @@ -210,12 +204,110 @@ static int remove_redundant(struct repository *r, struct commit **array, int cnt
>  	for (i = filled = 0; i < cnt; i++)
>  		if (!redundant[i])
>  			array[filled++] = work[i];
> +	for (j = filled, i = 0; i < cnt; i++)
> +		if (redundant[i])
> +			array[j++] = work[i];

This puts the loop back in that you removed in the previous commit.
Intentionally?

>  	free(work);
>  	free(redundant);
>  	free(filled_index);
>  	return filled;
>  }
>
> +static int remove_redundant_with_gen(struct repository *r,
> +				     struct commit **array, int cnt)
> +{
> +	int i, count_non_stale = 0;
> +	timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
> +	struct commit **walk_start;
> +	size_t walk_start_nr = 0, walk_start_alloc = cnt;
> +	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
> +
> +	ALLOC_ARRAY(walk_start, walk_start_alloc);
> +
> +	/* Mark all parents of the input as STALE */
> +	for (i = 0; i < cnt; i++) {
> +		struct commit_list *parents;
> +		timestamp_t generation;
> +
> +		repo_parse_commit(r, array[i]);
> +		parents = array[i]->parents;
> +
> +		while (parents) {
> +			repo_parse_commit(r, parents->item);
> +			if (!(parents->item->object.flags & STALE)) {
> +				parents->item->object.flags |= STALE;
> +				ALLOC_GROW(walk_start, walk_start_nr + 1, walk_start_alloc);
> +				walk_start[walk_start_nr++] = parents->item;
> +				prio_queue_put(&queue, parents->item);
> +			}
> +			parents = parents->next;
> +		}
> +
> +		generation = commit_graph_generation(array[i]);
> +
> +		if (generation < min_generation)
> +			min_generation = generation;
> +	}
> +
> +	/* push the STALE bits up to min generation */
> +	while (queue.nr) {
> +		struct commit_list *parents;
> +		struct commit *c = prio_queue_get(&queue);
> +
> +		repo_parse_commit(r, c);
> +
> +		if (commit_graph_generation(c) < min_generation)
> +			continue;
> +
> +		parents = c->parents;
> +		while (parents) {
> +			if (!(parents->item->object.flags & STALE)) {
> +				parents->item->object.flags |= STALE;
> +				prio_queue_put(&queue, parents->item);
> +			}
> +			parents = parents->next;
> +		}
> +	}
> +
> +	/* rearrange array */
> +	for (i = count_non_stale = 0; i < cnt; i++) {
> +		if (!(array[i]->object.flags & STALE))

Here I would have added another condition, count_non_stale != i, to
avoid self-assignment (array[x] = array[x]).  The code works without
it, though.  Not sure if there is a performance benefit to be had --
branch vs. pointer copy.  Probably not worth it..

> +			array[count_non_stale++] = array[i];
> +	}
> +
> +	/* clear marks */
> +	for (i = 0; i < walk_start_nr; i++)
> +		clear_commit_marks(walk_start[i], STALE);

You can replace this loop with a call to clear_commit_marks_many().

> +	free(walk_start);
> +
> +	return count_non_stale;
> +}
> +
> +static int remove_redundant(struct repository *r, struct commit **array, int cnt)
> +{
> +	/*
> +	 * Some commit in the array may be an ancestor of
> +	 * another commit.  Move the independent commits to the
> +	 * beginning of 'array' and return their number. Callers
> +	 * should not rely upon the contents of 'array' after
> +	 * that number.
> +	 */
> +	if (generation_numbers_enabled(r)) {
> +		int i;
> +
> +		/*
> +		 * If we have a single commit with finite generation
> +		 * number, then the _with_gen algorithm is preferred.
> +		 */
> +		for (i = 0; i < cnt; i++) {
> +			if (commit_graph_generation(array[i]) < GENERATION_NUMBER_INFINITY)
> +				return remove_redundant_with_gen(r, array, cnt);
> +		}
> +	}
> +
> +	return remove_redundant_no_gen(r, array, cnt);
> +}
> +
>  static struct commit_list *get_merge_bases_many_0(struct repository *r,
>  						  struct commit *one,
>  						  int n,
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 2/5] commit-reach: use one walk in remove_redundant()
  2021-02-01 16:12     ` René Scharfe.
@ 2021-02-01 16:31       ` Derrick Stolee
  0 siblings, 0 replies; 36+ messages in thread
From: Derrick Stolee @ 2021-02-01 16:31 UTC (permalink / raw)
  To: René Scharfe., Derrick Stolee via GitGitGadget, git
  Cc: Michael Haggerty, me, peff, gitster, Derrick Stolee,
	Derrick Stolee

On 2/1/2021 11:12 AM, René Scharfe. wrote:
> Am 01.02.21 um 13:47 schrieb Derrick Stolee via GitGitGadget:
>> @@ -210,12 +204,110 @@ static int remove_redundant(struct repository *r, struct commit **array, int cnt
>>  	for (i = filled = 0; i < cnt; i++)
>>  		if (!redundant[i])
>>  			array[filled++] = work[i];
>> +	for (j = filled, i = 0; i < cnt; i++)
>> +		if (redundant[i])
>> +			array[j++] = work[i];
> 
> This puts the loop back in that you removed in the previous commit.
> Intentionally?

Not intentional. Thanks for noticing.

>> +	/* rearrange array */
>> +	for (i = count_non_stale = 0; i < cnt; i++) {
>> +		if (!(array[i]->object.flags & STALE))
> 
> Here I would have added another condition, count_non_stale != i, to
> avoid self-assignment (array[x] = array[x]).  The code works without
> it, though.  Not sure if there is a performance benefit to be had --> branch vs. pointer copy.  Probably not worth it..

You are correct, but I'm going to go on the side of not worth it.

>> +			array[count_non_stale++] = array[i];
>> +	}
>> +
>> +	/* clear marks */
>> +	for (i = 0; i < walk_start_nr; i++)
>> +		clear_commit_marks(walk_start[i], STALE);
> 
> You can replace this loop with a call to clear_commit_marks_many().

Right! Earlier I was using a 'struct commit_list *' which would not
work, but this 'struct commit ** walk_start' does work. Thanks.

-Stolee

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 1/5] commit-reach: reduce requirements for remove_redundant()
  2021-02-01 12:47   ` [PATCH v2 1/5] commit-reach: reduce requirements for remove_redundant() Derrick Stolee via GitGitGadget
@ 2021-02-01 19:51     ` Junio C Hamano
  0 siblings, 0 replies; 36+ messages in thread
From: Junio C Hamano @ 2021-02-01 19:51 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, Michael Haggerty, me, peff, René Scharfe,
	Derrick Stolee, Derrick Stolee, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <dstolee@microsoft.com>
>
> Remove a comment at the beggining of remove_redundant() that mentions a
> reordering of the input array to have the initial segment be the
> independent commits and the final segment be the redundant commits.
> While this behavior is followed in remove_redundant(), no callers rely
> on that behavior.
>
> Remove the final loop that copies this final segment and update the
> comment to match the new behavior.
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---

Makes sense, especially since this is a file-local/static helper
function, we have reasonably tight control over its callers.

Will queue.  Thanks.

>  commit-reach.c | 10 ++++------
>  1 file changed, 4 insertions(+), 6 deletions(-)
>
> diff --git a/commit-reach.c b/commit-reach.c
> index e38771ca5a1..9af51fe7e07 100644
> --- a/commit-reach.c
> +++ b/commit-reach.c
> @@ -160,9 +160,10 @@ static int remove_redundant(struct repository *r, struct commit **array, int cnt
>  {
>  	/*
>  	 * Some commit in the array may be an ancestor of
> -	 * another commit.  Move such commit to the end of
> -	 * the array, and return the number of commits that
> -	 * are independent from each other.
> +	 * another commit.  Move the independent commits to the
> +	 * beginning of 'array' and return their number. Callers
> +	 * should not rely upon the contents of 'array' after
> +	 * that number.
>  	 */
>  	struct commit **work;
>  	unsigned char *redundant;
> @@ -209,9 +210,6 @@ static int remove_redundant(struct repository *r, struct commit **array, int cnt
>  	for (i = filled = 0; i < cnt; i++)
>  		if (!redundant[i])
>  			array[filled++] = work[i];
> -	for (j = filled, i = 0; i < cnt; i++)
> -		if (redundant[i])
> -			array[j++] = work[i];
>  	free(work);
>  	free(redundant);
>  	free(filled_index);

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 4/5] commit-reach: use heuristic in remove_redundant()
  2021-02-01 12:47   ` [PATCH v2 4/5] commit-reach: use heuristic in remove_redundant() Derrick Stolee via GitGitGadget
@ 2021-02-01 20:05     ` Junio C Hamano
  2021-02-01 21:02       ` Derrick Stolee
  0 siblings, 1 reply; 36+ messages in thread
From: Junio C Hamano @ 2021-02-01 20:05 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, Michael Haggerty, me, peff, René Scharfe,
	Derrick Stolee, Derrick Stolee, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> The important piece is to ensure we short-circuit the walk when we find
> that there is a single non-redundant commit. This happens frequently
> when looking for merge-bases or comparing several tags with 'git
> merge-base --independent'. Use a new count 'count_still_independent' and
> if that hits 1 we can stop walking.

That is because when you are left one single thing, it may be able
to reach many other things, but the fact that it by itself won't be
reachable by remaining independent things will not change (because,
that sole remaining independent thing is itself)?

> To update 'count_still_independent' properly, we add use of the RESULT
> flag on the input commits. Then we can detect when we reach one of these
> commits and decrease the count. We need to remove the RESULT flag at
> that moment because we might re-visit that commit when popping the
> stack.
>
> We use the STALE flag to mark parents that have been added to the new
> walk_start list, but we need to clear that flag before we start walking
> so those flags don't halt our depth-first-search walk.
>
> On my copy of the Linux kernel repository, the performance of 'git
> merge-base --independent <all-tags>' goes from 1.1 seconds to 0.11
> seconds.

These two numbers are with commit-graph fully populated with usable
generation numbers, I presume, and it is quite impressive.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 4/5] commit-reach: use heuristic in remove_redundant()
  2021-02-01 20:05     ` Junio C Hamano
@ 2021-02-01 21:02       ` Derrick Stolee
  0 siblings, 0 replies; 36+ messages in thread
From: Derrick Stolee @ 2021-02-01 21:02 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, Michael Haggerty, me, peff, René Scharfe,
	Derrick Stolee, Derrick Stolee

On 2/1/2021 3:05 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> The important piece is to ensure we short-circuit the walk when we find
>> that there is a single non-redundant commit. This happens frequently
>> when looking for merge-bases or comparing several tags with 'git
>> merge-base --independent'. Use a new count 'count_still_independent' and
>> if that hits 1 we can stop walking.
> 
> That is because when you are left one single thing, it may be able
> to reach many other things, but the fact that it by itself won't be
> reachable by remaining independent things will not change (because,
> that sole remaining independent thing is itself)?

Right. If there is only one non-STALE input commit left, then it will
be the only returned result. We will never find that all commits are
redundant because the commit graph is acyclic. The performance
improvement comes from halting the DFS walk: there might be more
commits to walk but they won't change the result.

>> To update 'count_still_independent' properly, we add use of the RESULT
>> flag on the input commits. Then we can detect when we reach one of these
>> commits and decrease the count. We need to remove the RESULT flag at
>> that moment because we might re-visit that commit when popping the
>> stack.
>>
>> We use the STALE flag to mark parents that have been added to the new
>> walk_start list, but we need to clear that flag before we start walking
>> so those flags don't halt our depth-first-search walk.
>>
>> On my copy of the Linux kernel repository, the performance of 'git
>> merge-base --independent <all-tags>' goes from 1.1 seconds to 0.11
>> seconds.
> 
> These two numbers are with commit-graph fully populated with usable
> generation numbers, I presume, and it is quite impressive.
 
Yes, these numbers are with a full commit-graph.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 5/5] commit-reach: stale commits may prune generation further
  2021-02-01 12:47   ` [PATCH v2 5/5] commit-reach: stale commits may prune generation further Derrick Stolee via GitGitGadget
@ 2021-02-03 15:59     ` Taylor Blau
  0 siblings, 0 replies; 36+ messages in thread
From: Taylor Blau @ 2021-02-03 15:59 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, Michael Haggerty, me, peff, gitster, René Scharfe,
	Derrick Stolee, Derrick Stolee, Derrick Stolee

On Mon, Feb 01, 2021 at 12:47:27PM +0000, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
> Note that these are only modest improvements for the case where the two
> independent commits are not in the commit-graph (not until v5.10). All
> algorithms get faster as more commits are indexed, which is not a
> surprise. However, the cost of walking extra commits is more and more
> prevalent in relative terms as more commits are indexed. Finally, the
> last case allows us to jump to the minimum generation between the last
> two commits (that are actually independent) so we greatly reduce the
> cost in that case.

Very nice. The explanation and implementation makes sense to me.

> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit-reach.c | 28 +++++++++++++++++++++-------
>  1 file changed, 21 insertions(+), 7 deletions(-)
>
> diff --git a/commit-reach.c b/commit-reach.c
> index d3a6e2bdd04..c2e0747fea4 100644
> --- a/commit-reach.c
> +++ b/commit-reach.c
> @@ -237,15 +237,27 @@ static int remove_redundant_with_gen(struct repository *r,
>  {
>  	int i, count_non_stale = 0, count_still_independent = cnt;
>  	timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
> -	struct commit **walk_start;
> +	struct commit **walk_start, **sorted;
>  	size_t walk_start_nr = 0, walk_start_alloc = cnt;
> +	int min_gen_pos = 0;
> +
> +	/*
> +	 * Sort the input by generation number, ascending. This allows
> +	 * us to increase the "min_generation" limit when we discover
> +	 * the commit with lowest generation is STALE. The index
> +	 * min_gen_pos points to the current position within 'array'
> +	 * that is not yet known to be STALE.
> +	 */
> +	ALLOC_ARRAY(sorted, cnt);
> +	COPY_ARRAY(sorted, array, cnt);
> +	QSORT(sorted, cnt, compare_commits_by_gen);
> +	min_generation = commit_graph_generation(sorted[0]);

This line caught my eye, but we return early in
commit-reach.c:reduce_heads() before even calling remove_redundant()
(which itself calls remove_redundant_with_gen()), so it's always OK to
assume that we have at least one element in 'array'.

This patch and the others in v2 all look good to me.

Thanks,
Taylor

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 0/5] Speed up remove_redundant()
  2021-02-01 15:48   ` [PATCH v2 0/5] Speed up remove_redundant() Derrick Stolee
@ 2021-02-18 23:25     ` Junio C Hamano
  2021-02-19 12:17       ` Derrick Stolee
  0 siblings, 1 reply; 36+ messages in thread
From: Junio C Hamano @ 2021-02-18 23:25 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, Michael Haggerty, me, peff,
	René Scharfe, Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

> On 2/1/2021 7:47 AM, Derrick Stolee via GitGitGadget wrote:
>> My tests were on the Linux kernel repository, testing this command:
>> 
>>   git merge-base --independent $(git for-each-ref refs/tags --format="%(refname)")
>> 
>>        Before: 16.4s
>> After Patch 1:  1.1s
>> After Patch 2:  0.1s
>
> During my testing of v2, I discovered how remove_redundant()
> is being used as part of 'git log --simplify-merges'. So, here
> are some performance numbers for the Linux kernel repository
> with a full commit-graph and changed-path Bloom filters running
>
>   git log --oneline -n10 --simplify-merges -- kernel/
>
> Benchmark #1: old
>   Time (mean ± σ):      5.770 s ±  0.030 s    [User: 5.551 s, System: 0.219 s]
>   Range (min … max):    5.728 s …  5.827 s    10 runs
>  
> Benchmark #2: new
>   Time (mean ± σ):      5.030 s ±  0.017 s    [User: 4.805 s, System: 0.224 s]
>   Range (min … max):    4.999 s …  5.054 s    10 runs
>  
> Summary
>   'new' ran
>     1.15 ± 0.01 times faster than 'old'


This topic seems to have been forgotten?  Is the only thing missing
is an updated 2/5 or is there anything more that is needed?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 0/5] Speed up remove_redundant()
  2021-02-18 23:25     ` Junio C Hamano
@ 2021-02-19 12:17       ` Derrick Stolee
  2021-02-20  3:32         ` Junio C Hamano
  0 siblings, 1 reply; 36+ messages in thread
From: Derrick Stolee @ 2021-02-19 12:17 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Derrick Stolee via GitGitGadget, git, Michael Haggerty, me, peff,
	René Scharfe, Derrick Stolee

On 2/18/2021 6:25 PM, Junio C Hamano wrote:
> This topic seems to have been forgotten?  Is the only thing missing
> is an updated 2/5 or is there anything more that is needed?
 
I had forgotten about the comments on patch 2/5, sorry. That explains
why it hasn't merged yet...

I'll get right on it.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH v3 0/5] Speed up remove_redundant()
  2021-02-01 12:47 ` [PATCH v2 0/5] " Derrick Stolee via GitGitGadget
                     ` (5 preceding siblings ...)
  2021-02-01 15:48   ` [PATCH v2 0/5] Speed up remove_redundant() Derrick Stolee
@ 2021-02-19 12:34   ` Derrick Stolee via GitGitGadget
  2021-02-19 12:34     ` [PATCH v3 1/5] commit-reach: reduce requirements for remove_redundant() Derrick Stolee via GitGitGadget
                       ` (4 more replies)
  6 siblings, 5 replies; 36+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-19 12:34 UTC (permalink / raw)
  To: git
  Cc: Michael Haggerty, me, peff, gitster, René Scharfe,
	Derrick Stolee, Derrick Stolee

Michael Haggerty pointed out that git merge-base --independent is quite slow
when many commits are provided in the input. This boils down to some
quadratic behavior in remove_redundant() because it calls
paint_down_to_common() in a loop.

This series avoids that by using a single commit walk, pushing a flag that
means "this commit is reachable from a parent of a commit in the list." At
the end of the walk, the input commits without that flag are independent.

There is an extra performance enhancement using the first-parent history as
a heuristic. This helps only when there is a single independent result, as
we can short circuit the walk when all other inputs are found to be
reachable from the top one. It also gets faster when we discover the commit
with lowest generation and can restrict our walk further to the next-lowest
generation number.

My tests were on the Linux kernel repository, testing this command:

  git merge-base --independent $(git for-each-ref refs/tags --format="%(refname)")

       Before: 16.4s
After Patch 1:  1.1s
After Patch 2:  0.1s


This does not change the correctness of the function. It is tested carefully
in t6600-test-reach.c, among existing merge-base tests. There are also more
tests in scripts like t6012-rev-list-simplify.sh that trigger the new logic
under GIT_TEST_COMMIT_GRAPH=1.


Updates in V3
=============

 * Patch 2 is updated to René's comments. Sorry I dropped the ball on this.
   I had the changes locally, but never pushed them until now.


Updates in V2
=============

 * The old algorithm is not entirely removed. It is still faster than the
   new algorithm if all input commits are not in the commit-graph file
   (unless the commit-graph file is really close to the input commits, but
   that is an unreasonable expectation). Now, the new logic only happens if
   there is an input commit with finite generation number.

 * There was a copy of 'array' being used for the final reorder which is now
   removed. We still need to be careful about the RESULT bit.

 * The final patch is new, adding yet another speedup in the form of
   increasing the min_generation cutoff when we find the input commit of
   smallest generation. This can reduce the time significantly when given
   many inputs. This does require copying the 'array' again, so we can sort
   by generation number while preserving the original order of 'array' which
   is required by some callers.

Thanks, -Stolee

Derrick Stolee (5):
  commit-reach: reduce requirements for remove_redundant()
  commit-reach: use one walk in remove_redundant()
  commit-reach: move compare_commits_by_gen
  commit-reach: use heuristic in remove_redundant()
  commit-reach: stale commits may prune generation further

 commit-reach.c | 190 ++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 165 insertions(+), 25 deletions(-)


base-commit: 5a3b130cad0d5c770f766e3af6d32b41766374c0
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-852%2Fderrickstolee%2Fmerge-independent-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-852/derrickstolee/merge-independent-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/852

Range-diff vs v2:

 1:  649f6799e6bf = 1:  649f6799e6bf commit-reach: reduce requirements for remove_redundant()
 2:  2f80ae5fcb00 ! 2:  c2c08989ce60 commit-reach: use one walk in remove_redundant()
     @@ commit-reach.c: struct commit_list *get_octopus_merge_bases(struct commit_list *
       	unsigned char *redundant;
       	int *filled_index;
      @@ commit-reach.c: static int remove_redundant(struct repository *r, struct commit **array, int cnt
     - 	for (i = filled = 0; i < cnt; i++)
     - 		if (!redundant[i])
     - 			array[filled++] = work[i];
     -+	for (j = filled, i = 0; i < cnt; i++)
     -+		if (redundant[i])
     -+			array[j++] = work[i];
     - 	free(work);
     - 	free(redundant);
     - 	free(filled_index);
       	return filled;
       }
       
     @@ commit-reach.c: static int remove_redundant(struct repository *r, struct commit
      +	}
      +
      +	/* clear marks */
     -+	for (i = 0; i < walk_start_nr; i++)
     -+		clear_commit_marks(walk_start[i], STALE);
     ++	clear_commit_marks_many(walk_start_nr, walk_start, STALE);
      +	free(walk_start);
      +
      +	return count_non_stale;
 3:  009f64b53c95 = 3:  1d3c5ebbb632 commit-reach: move compare_commits_by_gen
 4:  83feabeebb5f = 4:  c1fc174d42aa commit-reach: use heuristic in remove_redundant()
 5:  14f0974c9872 = 5:  539db83bd735 commit-reach: stale commits may prune generation further

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH v3 1/5] commit-reach: reduce requirements for remove_redundant()
  2021-02-19 12:34   ` [PATCH v3 " Derrick Stolee via GitGitGadget
@ 2021-02-19 12:34     ` Derrick Stolee via GitGitGadget
  2021-02-19 12:34     ` [PATCH v3 2/5] commit-reach: use one walk in remove_redundant() Derrick Stolee via GitGitGadget
                       ` (3 subsequent siblings)
  4 siblings, 0 replies; 36+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-19 12:34 UTC (permalink / raw)
  To: git
  Cc: Michael Haggerty, me, peff, gitster, René Scharfe,
	Derrick Stolee, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Remove a comment at the beggining of remove_redundant() that mentions a
reordering of the input array to have the initial segment be the
independent commits and the final segment be the redundant commits.
While this behavior is followed in remove_redundant(), no callers rely
on that behavior.

Remove the final loop that copies this final segment and update the
comment to match the new behavior.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-reach.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/commit-reach.c b/commit-reach.c
index e38771ca5a1f..9af51fe7e078 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -160,9 +160,10 @@ static int remove_redundant(struct repository *r, struct commit **array, int cnt
 {
 	/*
 	 * Some commit in the array may be an ancestor of
-	 * another commit.  Move such commit to the end of
-	 * the array, and return the number of commits that
-	 * are independent from each other.
+	 * another commit.  Move the independent commits to the
+	 * beginning of 'array' and return their number. Callers
+	 * should not rely upon the contents of 'array' after
+	 * that number.
 	 */
 	struct commit **work;
 	unsigned char *redundant;
@@ -209,9 +210,6 @@ static int remove_redundant(struct repository *r, struct commit **array, int cnt
 	for (i = filled = 0; i < cnt; i++)
 		if (!redundant[i])
 			array[filled++] = work[i];
-	for (j = filled, i = 0; i < cnt; i++)
-		if (redundant[i])
-			array[j++] = work[i];
 	free(work);
 	free(redundant);
 	free(filled_index);
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v3 2/5] commit-reach: use one walk in remove_redundant()
  2021-02-19 12:34   ` [PATCH v3 " Derrick Stolee via GitGitGadget
  2021-02-19 12:34     ` [PATCH v3 1/5] commit-reach: reduce requirements for remove_redundant() Derrick Stolee via GitGitGadget
@ 2021-02-19 12:34     ` Derrick Stolee via GitGitGadget
  2021-02-19 12:34     ` [PATCH v3 3/5] commit-reach: move compare_commits_by_gen Derrick Stolee via GitGitGadget
                       ` (2 subsequent siblings)
  4 siblings, 0 replies; 36+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-19 12:34 UTC (permalink / raw)
  To: git
  Cc: Michael Haggerty, me, peff, gitster, René Scharfe,
	Derrick Stolee, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The current implementation of remove_redundant() uses several calls to
paint_down_to_common() to determine that commits are independent of each
other. This leads to quadratic behavior when many inputs are passed to
commands such as 'git merge-base'.

For example, in the Linux kernel repository, I tested the performance
by passing all tags:

 git merge-base --independent $(git for-each-ref refs/tags --format="$(refname)")

(Note: I had to delete the tags v2.6.11-tree and v2.6.11 as they do
not point to commits.)

Here is the performance improvement introduced by this change:

 Before: 16.4s
  After:  1.1s

This performance improvement requires the commit-graph file to be
present. We keep the old algorithm around as remove_redundant_no_gen()
and use it when generation_numbers_enabled() is false. This is similar
to other algorithms within commit-reach.c. The new algorithm is
implemented in remove_redundant_with_gen().

The basic approach is to do one commit walk instead of many. First, scan
all commits in the list and mark their _parents_ with the STALE flag.
This flag will indicate commits that are reachable from one of the
inputs, except not including themselves. Then, walk commits until
covering all commits up to the minimum generation number pushing the
STALE flag throughout.

At the end, we need to clear the STALE bit from all of the commits
we walked. We move the non-stale commits in 'array' to the beginning of
the list, and this might overwrite stale commits. However, we store an
array of commits that started the walk, and use clear_commit_marks() on
each of those starting commits. That method will walk the reachable
commits with the STALE bit and clear them all. This makes the algorithm
safe for re-entry or for other uses of those commits after this walk.

This logic is covered by tests in t6600-test-reach.sh, so the behavior
does not change. This is tested both in the case with a commit-graph and
without.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-reach.c | 104 +++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 96 insertions(+), 8 deletions(-)

diff --git a/commit-reach.c b/commit-reach.c
index 9af51fe7e078..7a3a1eb1a26e 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -156,15 +156,9 @@ struct commit_list *get_octopus_merge_bases(struct commit_list *in)
 	return ret;
 }
 
-static int remove_redundant(struct repository *r, struct commit **array, int cnt)
+static int remove_redundant_no_gen(struct repository *r,
+				   struct commit **array, int cnt)
 {
-	/*
-	 * Some commit in the array may be an ancestor of
-	 * another commit.  Move the independent commits to the
-	 * beginning of 'array' and return their number. Callers
-	 * should not rely upon the contents of 'array' after
-	 * that number.
-	 */
 	struct commit **work;
 	unsigned char *redundant;
 	int *filled_index;
@@ -216,6 +210,100 @@ static int remove_redundant(struct repository *r, struct commit **array, int cnt
 	return filled;
 }
 
+static int remove_redundant_with_gen(struct repository *r,
+				     struct commit **array, int cnt)
+{
+	int i, count_non_stale = 0;
+	timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
+	struct commit **walk_start;
+	size_t walk_start_nr = 0, walk_start_alloc = cnt;
+	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
+
+	ALLOC_ARRAY(walk_start, walk_start_alloc);
+
+	/* Mark all parents of the input as STALE */
+	for (i = 0; i < cnt; i++) {
+		struct commit_list *parents;
+		timestamp_t generation;
+
+		repo_parse_commit(r, array[i]);
+		parents = array[i]->parents;
+
+		while (parents) {
+			repo_parse_commit(r, parents->item);
+			if (!(parents->item->object.flags & STALE)) {
+				parents->item->object.flags |= STALE;
+				ALLOC_GROW(walk_start, walk_start_nr + 1, walk_start_alloc);
+				walk_start[walk_start_nr++] = parents->item;
+				prio_queue_put(&queue, parents->item);
+			}
+			parents = parents->next;
+		}
+
+		generation = commit_graph_generation(array[i]);
+
+		if (generation < min_generation)
+			min_generation = generation;
+	}
+
+	/* push the STALE bits up to min generation */
+	while (queue.nr) {
+		struct commit_list *parents;
+		struct commit *c = prio_queue_get(&queue);
+
+		repo_parse_commit(r, c);
+
+		if (commit_graph_generation(c) < min_generation)
+			continue;
+
+		parents = c->parents;
+		while (parents) {
+			if (!(parents->item->object.flags & STALE)) {
+				parents->item->object.flags |= STALE;
+				prio_queue_put(&queue, parents->item);
+			}
+			parents = parents->next;
+		}
+	}
+
+	/* rearrange array */
+	for (i = count_non_stale = 0; i < cnt; i++) {
+		if (!(array[i]->object.flags & STALE))
+			array[count_non_stale++] = array[i];
+	}
+
+	/* clear marks */
+	clear_commit_marks_many(walk_start_nr, walk_start, STALE);
+	free(walk_start);
+
+	return count_non_stale;
+}
+
+static int remove_redundant(struct repository *r, struct commit **array, int cnt)
+{
+	/*
+	 * Some commit in the array may be an ancestor of
+	 * another commit.  Move the independent commits to the
+	 * beginning of 'array' and return their number. Callers
+	 * should not rely upon the contents of 'array' after
+	 * that number.
+	 */
+	if (generation_numbers_enabled(r)) {
+		int i;
+
+		/*
+		 * If we have a single commit with finite generation
+		 * number, then the _with_gen algorithm is preferred.
+		 */
+		for (i = 0; i < cnt; i++) {
+			if (commit_graph_generation(array[i]) < GENERATION_NUMBER_INFINITY)
+				return remove_redundant_with_gen(r, array, cnt);
+		}
+	}
+
+	return remove_redundant_no_gen(r, array, cnt);
+}
+
 static struct commit_list *get_merge_bases_many_0(struct repository *r,
 						  struct commit *one,
 						  int n,
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v3 3/5] commit-reach: move compare_commits_by_gen
  2021-02-19 12:34   ` [PATCH v3 " Derrick Stolee via GitGitGadget
  2021-02-19 12:34     ` [PATCH v3 1/5] commit-reach: reduce requirements for remove_redundant() Derrick Stolee via GitGitGadget
  2021-02-19 12:34     ` [PATCH v3 2/5] commit-reach: use one walk in remove_redundant() Derrick Stolee via GitGitGadget
@ 2021-02-19 12:34     ` Derrick Stolee via GitGitGadget
  2021-02-19 12:34     ` [PATCH v3 4/5] commit-reach: use heuristic in remove_redundant() Derrick Stolee via GitGitGadget
  2021-02-19 12:34     ` [PATCH v3 5/5] commit-reach: stale commits may prune generation further Derrick Stolee via GitGitGadget
  4 siblings, 0 replies; 36+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-19 12:34 UTC (permalink / raw)
  To: git
  Cc: Michael Haggerty, me, peff, gitster, René Scharfe,
	Derrick Stolee, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Move this earlier in the file so it can be used by more methods.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-reach.c | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/commit-reach.c b/commit-reach.c
index 7a3a1eb1a26e..a34c5ba09e81 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -17,6 +17,21 @@
 
 static const unsigned all_flags = (PARENT1 | PARENT2 | STALE | RESULT);
 
+static int compare_commits_by_gen(const void *_a, const void *_b)
+{
+	const struct commit *a = *(const struct commit * const *)_a;
+	const struct commit *b = *(const struct commit * const *)_b;
+
+	timestamp_t generation_a = commit_graph_generation(a);
+	timestamp_t generation_b = commit_graph_generation(b);
+
+	if (generation_a < generation_b)
+		return -1;
+	if (generation_a > generation_b)
+		return 1;
+	return 0;
+}
+
 static int queue_has_nonstale(struct prio_queue *queue)
 {
 	int i;
@@ -647,21 +662,6 @@ int commit_contains(struct ref_filter *filter, struct commit *commit,
 	return repo_is_descendant_of(the_repository, commit, list);
 }
 
-static int compare_commits_by_gen(const void *_a, const void *_b)
-{
-	const struct commit *a = *(const struct commit * const *)_a;
-	const struct commit *b = *(const struct commit * const *)_b;
-
-	timestamp_t generation_a = commit_graph_generation(a);
-	timestamp_t generation_b = commit_graph_generation(b);
-
-	if (generation_a < generation_b)
-		return -1;
-	if (generation_a > generation_b)
-		return 1;
-	return 0;
-}
-
 int can_all_from_reach_with_flag(struct object_array *from,
 				 unsigned int with_flag,
 				 unsigned int assign_flag,
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v3 4/5] commit-reach: use heuristic in remove_redundant()
  2021-02-19 12:34   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (2 preceding siblings ...)
  2021-02-19 12:34     ` [PATCH v3 3/5] commit-reach: move compare_commits_by_gen Derrick Stolee via GitGitGadget
@ 2021-02-19 12:34     ` Derrick Stolee via GitGitGadget
  2021-02-19 12:34     ` [PATCH v3 5/5] commit-reach: stale commits may prune generation further Derrick Stolee via GitGitGadget
  4 siblings, 0 replies; 36+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-19 12:34 UTC (permalink / raw)
  To: git
  Cc: Michael Haggerty, me, peff, gitster, René Scharfe,
	Derrick Stolee, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

Reachability algorithms in commit-reach.c frequently benefit from using
the first-parent history as a heuristic for satisfying reachability
queries. The most obvious example was implemented in 4fbcca4e
(commit-reach: make can_all_from_reach... linear, 2018-07-20).

Update the walk in remove_redundant() to use this same heuristic. Here,
we are walking starting at the parents of the input commits. Sort those
parents and walk from the highest generation to lower. Each time, use
the heuristic of searching the first parent history before continuing to
expand the walk.

The order in which we explore the commits matters, so update
compare_commits_by_gen to break generation number ties with commit date.
This has no effect when the commits are in a commit-graph file with
corrected commit dates computed, but it will assist when the commits are
in the region "above" the commit-graph with "infinite" generation
number. Note that we cannot shift to use
compare_commits_by_gen_then_commit_date as the method prototype is
different. We use compare_commits_by_gen for QSORT() as opposed to as a
priority function.

The important piece is to ensure we short-circuit the walk when we find
that there is a single non-redundant commit. This happens frequently
when looking for merge-bases or comparing several tags with 'git
merge-base --independent'. Use a new count 'count_still_independent' and
if that hits 1 we can stop walking.

To update 'count_still_independent' properly, we add use of the RESULT
flag on the input commits. Then we can detect when we reach one of these
commits and decrease the count. We need to remove the RESULT flag at
that moment because we might re-visit that commit when popping the
stack.

We use the STALE flag to mark parents that have been added to the new
walk_start list, but we need to clear that flag before we start walking
so those flags don't halt our depth-first-search walk.

On my copy of the Linux kernel repository, the performance of 'git
merge-base --independent <all-tags>' goes from 1.1 seconds to 0.11
seconds.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-reach.c | 72 +++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 56 insertions(+), 16 deletions(-)

diff --git a/commit-reach.c b/commit-reach.c
index a34c5ba09e81..399f2a8673e0 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -29,6 +29,10 @@ static int compare_commits_by_gen(const void *_a, const void *_b)
 		return -1;
 	if (generation_a > generation_b)
 		return 1;
+	if (a->date < b->date)
+		return -1;
+	if (a->date > b->date)
+		return 1;
 	return 0;
 }
 
@@ -228,11 +232,10 @@ static int remove_redundant_no_gen(struct repository *r,
 static int remove_redundant_with_gen(struct repository *r,
 				     struct commit **array, int cnt)
 {
-	int i, count_non_stale = 0;
+	int i, count_non_stale = 0, count_still_independent = cnt;
 	timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
 	struct commit **walk_start;
 	size_t walk_start_nr = 0, walk_start_alloc = cnt;
-	struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
 
 	ALLOC_ARRAY(walk_start, walk_start_alloc);
 
@@ -242,6 +245,7 @@ static int remove_redundant_with_gen(struct repository *r,
 		timestamp_t generation;
 
 		repo_parse_commit(r, array[i]);
+		array[i]->object.flags |= RESULT;
 		parents = array[i]->parents;
 
 		while (parents) {
@@ -250,7 +254,6 @@ static int remove_redundant_with_gen(struct repository *r,
 				parents->item->object.flags |= STALE;
 				ALLOC_GROW(walk_start, walk_start_nr + 1, walk_start_alloc);
 				walk_start[walk_start_nr++] = parents->item;
-				prio_queue_put(&queue, parents->item);
 			}
 			parents = parents->next;
 		}
@@ -261,26 +264,63 @@ static int remove_redundant_with_gen(struct repository *r,
 			min_generation = generation;
 	}
 
-	/* push the STALE bits up to min generation */
-	while (queue.nr) {
-		struct commit_list *parents;
-		struct commit *c = prio_queue_get(&queue);
+	QSORT(walk_start, walk_start_nr, compare_commits_by_gen);
 
-		repo_parse_commit(r, c);
+	/* remove STALE bit for now to allow walking through parents */
+	for (i = 0; i < walk_start_nr; i++)
+		walk_start[i]->object.flags &= ~STALE;
 
-		if (commit_graph_generation(c) < min_generation)
-			continue;
+	/*
+	 * Start walking from the highest generation. Hopefully, it will
+	 * find all other items during the first-parent walk, and we can
+	 * terminate early. Otherwise, we will do the same amount of work
+	 * as before.
+	 */
+	for (i = walk_start_nr - 1; i >= 0 && count_still_independent > 1; i--) {
+		/* push the STALE bits up to min generation */
+		struct commit_list *stack = NULL;
 
-		parents = c->parents;
-		while (parents) {
-			if (!(parents->item->object.flags & STALE)) {
-				parents->item->object.flags |= STALE;
-				prio_queue_put(&queue, parents->item);
+		commit_list_insert(walk_start[i], &stack);
+		walk_start[i]->object.flags |= STALE;
+
+		while (stack) {
+			struct commit_list *parents;
+			struct commit *c = stack->item;
+
+			repo_parse_commit(r, c);
+
+			if (c->object.flags & RESULT) {
+				c->object.flags &= ~RESULT;
+				if (--count_still_independent <= 1)
+					break;
 			}
-			parents = parents->next;
+
+			if (commit_graph_generation(c) < min_generation) {
+				pop_commit(&stack);
+				continue;
+			}
+
+			parents = c->parents;
+			while (parents) {
+				if (!(parents->item->object.flags & STALE)) {
+					parents->item->object.flags |= STALE;
+					commit_list_insert(parents->item, &stack);
+					break;
+				}
+				parents = parents->next;
+			}
+
+			/* pop if all parents have been visited already */
+			if (!parents)
+				pop_commit(&stack);
 		}
+		free_commit_list(stack);
 	}
 
+	/* clear result */
+	for (i = 0; i < cnt; i++)
+		array[i]->object.flags &= ~RESULT;
+
 	/* rearrange array */
 	for (i = count_non_stale = 0; i < cnt; i++) {
 		if (!(array[i]->object.flags & STALE))
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH v3 5/5] commit-reach: stale commits may prune generation further
  2021-02-19 12:34   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (3 preceding siblings ...)
  2021-02-19 12:34     ` [PATCH v3 4/5] commit-reach: use heuristic in remove_redundant() Derrick Stolee via GitGitGadget
@ 2021-02-19 12:34     ` Derrick Stolee via GitGitGadget
  4 siblings, 0 replies; 36+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2021-02-19 12:34 UTC (permalink / raw)
  To: git
  Cc: Michael Haggerty, me, peff, gitster, René Scharfe,
	Derrick Stolee, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The remove_redundant_with_gen() algorithm performs a depth-first-search
to find commits in the 'array' list, starting at the parents of each
commit in 'array'. The result is that commits in 'array' are marked
STALE when they are reachable from another commit in 'array'.

This depth-first-search is fast when commits lie on or near the
first-parent history of the higher commits. The search terminates early
if all but one commit becomes marked STALE.

However, it is possible that there are two independent commits with high
generation number. In that case, the depth-first-search might languish
by searching in lower generations due to the fixed min_generation used
throughout the method.

With the expectation that commits with lower generation are expected to
become STALE more often, we can optimize further by increasing that
min_generation boundary upon discovery of the commit with minimum
generation.

We must first sort the commits in 'array' by generation. We cannot sort
'array' itself since it must preserve relative order among the returned
results (see revision.c:mark_redundant_parents() for an example).

This simplifies the initialization of min_generation, but it also allows
us to increase the new min_generation when we find the commit with
smallest generation remaining.

This requires more than two commits in order to test, so I used the
Linux kernel repository with a few commits that are slightly off of the
first-parent history. I timed the following command:

  git merge-base --independent 2ecedd756908 d2360a398f0b \
	1253935ad801 160bab43419e 0e2209629fec 1d0e16ac1a9e

The first two commits have similar generation and are near the v5.10
tag. Commit 160bab43419e is off of the first-parent history behind v5.5,
while the others are scattered somewhere reachable from v5.9. This is
designed to demonstrate the optimization, as that commit within v5.5
would normally cause a lot of extra commit walking.

Since remove_redundant_with_alg() is called only when at least one of
the input commits has a finite generation number, this algorithm is
tested with a commit-graph generated starting at a number of different
tags, the earliest being v5.5.

commit-graph at v5.5:

 | Method                | Time  |
 |-----------------------+-------|
 | *_no_gen()            | 864ms |
 | *_with_gen() (before) | 858ms |
 | *_with_gen() (after)  | 810ms |

commit-graph at v5.7:

 | Method                | Time  |
 |-----------------------+-------|
 | *_no_gen()            | 625ms |
 | *_with_gen() (before) | 572ms |
 | *_with_gen() (after)  | 517ms |

commit-graph at v5.9:

 | Method                | Time  |
 |-----------------------+-------|
 | *_no_gen()            | 268ms |
 | *_with_gen() (before) | 224ms |
 | *_with_gen() (after)  | 202ms |

commit-graph at v5.10:

 | Method                | Time  |
 |-----------------------+-------|
 | *_no_gen()            |  72ms |
 | *_with_gen() (before) |  37ms |
 | *_with_gen() (after)  |   9ms |

Note that these are only modest improvements for the case where the two
independent commits are not in the commit-graph (not until v5.10). All
algorithms get faster as more commits are indexed, which is not a
surprise. However, the cost of walking extra commits is more and more
prevalent in relative terms as more commits are indexed. Finally, the
last case allows us to jump to the minimum generation between the last
two commits (that are actually independent) so we greatly reduce the
cost in that case.

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit-reach.c | 28 +++++++++++++++++++++-------
 1 file changed, 21 insertions(+), 7 deletions(-)

diff --git a/commit-reach.c b/commit-reach.c
index 399f2a8673e0..2ea84d3dc074 100644
--- a/commit-reach.c
+++ b/commit-reach.c
@@ -234,15 +234,27 @@ static int remove_redundant_with_gen(struct repository *r,
 {
 	int i, count_non_stale = 0, count_still_independent = cnt;
 	timestamp_t min_generation = GENERATION_NUMBER_INFINITY;
-	struct commit **walk_start;
+	struct commit **walk_start, **sorted;
 	size_t walk_start_nr = 0, walk_start_alloc = cnt;
+	int min_gen_pos = 0;
+
+	/*
+	 * Sort the input by generation number, ascending. This allows
+	 * us to increase the "min_generation" limit when we discover
+	 * the commit with lowest generation is STALE. The index
+	 * min_gen_pos points to the current position within 'array'
+	 * that is not yet known to be STALE.
+	 */
+	ALLOC_ARRAY(sorted, cnt);
+	COPY_ARRAY(sorted, array, cnt);
+	QSORT(sorted, cnt, compare_commits_by_gen);
+	min_generation = commit_graph_generation(sorted[0]);
 
 	ALLOC_ARRAY(walk_start, walk_start_alloc);
 
 	/* Mark all parents of the input as STALE */
 	for (i = 0; i < cnt; i++) {
 		struct commit_list *parents;
-		timestamp_t generation;
 
 		repo_parse_commit(r, array[i]);
 		array[i]->object.flags |= RESULT;
@@ -257,11 +269,6 @@ static int remove_redundant_with_gen(struct repository *r,
 			}
 			parents = parents->next;
 		}
-
-		generation = commit_graph_generation(array[i]);
-
-		if (generation < min_generation)
-			min_generation = generation;
 	}
 
 	QSORT(walk_start, walk_start_nr, compare_commits_by_gen);
@@ -293,6 +300,12 @@ static int remove_redundant_with_gen(struct repository *r,
 				c->object.flags &= ~RESULT;
 				if (--count_still_independent <= 1)
 					break;
+				if (oideq(&c->object.oid, &sorted[min_gen_pos]->object.oid)) {
+					while (min_gen_pos < cnt - 1 &&
+					       (sorted[min_gen_pos]->object.flags & STALE))
+						min_gen_pos++;
+					min_generation = commit_graph_generation(sorted[min_gen_pos]);
+				}
 			}
 
 			if (commit_graph_generation(c) < min_generation) {
@@ -316,6 +329,7 @@ static int remove_redundant_with_gen(struct repository *r,
 		}
 		free_commit_list(stack);
 	}
+	free(sorted);
 
 	/* clear result */
 	for (i = 0; i < cnt; i++)
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH v2 0/5] Speed up remove_redundant()
  2021-02-19 12:17       ` Derrick Stolee
@ 2021-02-20  3:32         ` Junio C Hamano
  0 siblings, 0 replies; 36+ messages in thread
From: Junio C Hamano @ 2021-02-20  3:32 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, Michael Haggerty, me, peff,
	René Scharfe, Derrick Stolee

Derrick Stolee <stolee@gmail.com> writes:

> On 2/18/2021 6:25 PM, Junio C Hamano wrote:
>> This topic seems to have been forgotten?  Is the only thing missing
>> is an updated 2/5 or is there anything more that is needed?
>  
> I had forgotten about the comments on patch 2/5, sorry. That explains
> why it hasn't merged yet...
>
> I'll get right on it.

Thanks.

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2021-02-20  3:32 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-28 16:24 [PATCH 0/3] Speed up remove_redundant() Derrick Stolee via GitGitGadget
2021-01-28 16:24 ` [PATCH 1/3] commit-reach: use one walk in remove_redundant() Derrick Stolee via GitGitGadget
2021-01-28 20:51   ` Junio C Hamano
2021-01-29 17:11     ` René Scharfe
2021-01-31  3:52       ` Derrick Stolee
2021-01-31 10:20         ` René Scharfe
2021-01-31  3:59     ` Derrick Stolee
2021-01-31 20:13       ` Derrick Stolee
2021-01-31 20:25       ` Junio C Hamano
2021-02-01  3:55         ` Derrick Stolee
2021-01-29 17:10   ` René Scharfe
2021-01-28 16:24 ` [PATCH 2/3] commit-reach: move compare_commits_by_gen Derrick Stolee via GitGitGadget
2021-01-28 16:24 ` [PATCH 3/3] commit-reach: use heuristic in remove_redundant() Derrick Stolee via GitGitGadget
2021-01-28 20:20 ` [PATCH 0/3] Speed up remove_redundant() Junio C Hamano
2021-02-01 12:47 ` [PATCH v2 0/5] " Derrick Stolee via GitGitGadget
2021-02-01 12:47   ` [PATCH v2 1/5] commit-reach: reduce requirements for remove_redundant() Derrick Stolee via GitGitGadget
2021-02-01 19:51     ` Junio C Hamano
2021-02-01 12:47   ` [PATCH v2 2/5] commit-reach: use one walk in remove_redundant() Derrick Stolee via GitGitGadget
2021-02-01 16:12     ` René Scharfe.
2021-02-01 16:31       ` Derrick Stolee
2021-02-01 12:47   ` [PATCH v2 3/5] commit-reach: move compare_commits_by_gen Derrick Stolee via GitGitGadget
2021-02-01 12:47   ` [PATCH v2 4/5] commit-reach: use heuristic in remove_redundant() Derrick Stolee via GitGitGadget
2021-02-01 20:05     ` Junio C Hamano
2021-02-01 21:02       ` Derrick Stolee
2021-02-01 12:47   ` [PATCH v2 5/5] commit-reach: stale commits may prune generation further Derrick Stolee via GitGitGadget
2021-02-03 15:59     ` Taylor Blau
2021-02-01 15:48   ` [PATCH v2 0/5] Speed up remove_redundant() Derrick Stolee
2021-02-18 23:25     ` Junio C Hamano
2021-02-19 12:17       ` Derrick Stolee
2021-02-20  3:32         ` Junio C Hamano
2021-02-19 12:34   ` [PATCH v3 " Derrick Stolee via GitGitGadget
2021-02-19 12:34     ` [PATCH v3 1/5] commit-reach: reduce requirements for remove_redundant() Derrick Stolee via GitGitGadget
2021-02-19 12:34     ` [PATCH v3 2/5] commit-reach: use one walk in remove_redundant() Derrick Stolee via GitGitGadget
2021-02-19 12:34     ` [PATCH v3 3/5] commit-reach: move compare_commits_by_gen Derrick Stolee via GitGitGadget
2021-02-19 12:34     ` [PATCH v3 4/5] commit-reach: use heuristic in remove_redundant() Derrick Stolee via GitGitGadget
2021-02-19 12:34     ` [PATCH v3 5/5] commit-reach: stale commits may prune generation further Derrick Stolee via GitGitGadget

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).