git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH 0/1] v2.19.0-rc1 Performance Regression in 'git merge-base'
@ 2018-08-30 12:58 Derrick Stolee via GitGitGadget
  2018-08-30 12:58 ` [PATCH 1/1] commit: don't use generation numbers if not needed Derrick Stolee via GitGitGadget
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-08-30 12:58 UTC (permalink / raw)
  To: git; +Cc: jnareb, Junio C Hamano

As I was testing the release candidate, I stumbled across a regression in
'git merge-base' as a result of the switch to generation numbers. The commit
message in [PATCH 1/1] describes the topology involved, but you can test it
yourself by comparing 'git merge-base v4.8 v4.9' in the Linux kernel. The
regression does not show up when running merge-base for tags at least v4.9,
which is why I didn't see it when I was testing earlier.

The solution is simple, but also will conflict with ds/reachable in next. I
can send a similar patch that applies the same diff into commit-reach.c.

With the integration of generation numbers into most commit walks coming to
a close [1], it will be time to re-investigate other options for
reachability indexes [2]. As I was digging into the issue with this
regression, I discovered a way we can modify our generation numbers and pair
them with commit dates to give us a simple-to-compute, immutable
two-dimensional reachability index that would be immune to this regression.
I will investigate that more and report back, but it is more important to
fix this regression now.

Thanks, -Stolee

[1] https://public-inbox.org/git/pull.25.git.gitgitgadget@gmail.com/[PATCH
0/6] Use generation numbers for --topo-order

[2] https://public-inbox.org/git/86muxcuyod.fsf@gmail.com/[RFC] Other chunks
for commit-graph, part 2 - reachability indexes

Cc: gitster@pobox.comCc: peff@peff.net

Derrick Stolee (1):
  commit: don't use generation numbers if not needed

 commit.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)


base-commit: 2f743933341f276111103550fbf383a34dfcfd38
Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-28%2Fderrickstolee%2Fmerge-base-regression-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-28/derrickstolee/merge-base-regression-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/28
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH 1/1] commit: don't use generation numbers if not needed
  2018-08-30 12:58 [PATCH 0/1] v2.19.0-rc1 Performance Regression in 'git merge-base' Derrick Stolee via GitGitGadget
@ 2018-08-30 12:58 ` Derrick Stolee via GitGitGadget
  2018-10-06 16:54   ` Jakub Narebski
  2018-08-30 15:26 ` [PATCH 0/1] v2.19.0-rc1 Performance Regression in 'git merge-base' Junio C Hamano
  2018-10-06 16:09 ` Jakub Narebski
  2 siblings, 1 reply; 5+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2018-08-30 12:58 UTC (permalink / raw)
  To: git; +Cc: jnareb, Junio C Hamano, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

In 3afc679b "commit: use generations in paint_down_to_common()",
the queue in paint_down_to_common() was changed to use a priority
order based on generation number before commit date. This served
two purposes:

 1. When generation numbers are present, the walk guarantees
    correct topological relationships, regardless of clock skew in
    commit dates.

 2. It enables short-circuiting the walk when the min_generation
    parameter is added in d7c1ec3e "commit: add short-circuit to
    paint_down_to_common()". This short-circuit helps commands
    like 'git branch --contains' from needing to walk to a merge
    base when we know the result is false.

The commit message for 3afc679b includes the following sentence:

    This change does not affect the number of commits that are
    walked during the execution of paint_down_to_common(), only
    the order that those commits are inspected.

This statement is incorrect. Because it changes the order in which
the commits are inspected, it changes the order they are added to
the queue, and hence can change the number of loops before the
queue_has_nonstale() method returns true.

This change makes a concrete difference depending on the topology
of the commit graph. For instance, computing the merge-base between
consecutive versions of the Linux kernel has no effect for versions
after v4.9, but 'git merge-base v4.8 v4.9' presents a performance
regression:

    v2.18.0: 0.122s
v2.19.0-rc1: 0.547s
       HEAD: 0.127s

To determine that this was simply an ordering issue, I inserted
a counter within the while loop of paint_down_to_common() and
found that the loop runs 167,468 times in v2.18.0 and 635,579
times in v2.19.0-rc1.

The topology of this case can be described in a simplified way
here:

  v4.9
   |  \
   |   \
  v4.8  \
   | \   \
   |  \   |
  ...  A  B
   |  /  /
   | /  /
   |/__/
   C

Here, the "..." means "a very long line of commits". By generation
number, A and B have generation one more than C. However, A and B
have commit date higher than most of the commits reachable from
v4.8. When the walk reaches v4.8, we realize that it has PARENT1
and PARENT2 flags, so everything it can reach is marked as STALE,
including A. B has only the PARENT1 flag, so is not STALE.

When paint_down_to_common() is run using
compare_commits_by_commit_date, A and B are removed from the queue
early and C is inserted into the queue. At this point, C and the
rest of the queue entries are marked as STALE. The loop then
terminates.

When paint_down_to_common() is run using
compare_commits_by_gen_then_commit_date, B is removed from the
queue only after the many commits reachable from v4.8 are explored.
This causes the loop to run longer. The reason for this regression
is simple: the queue order is intended to not explore a commit
until everything that _could_ reach that commit is explored. From
the information gathered by the original ordering, we have no
guarantee that there is not a commit D reachable from v4.8 that
can also reach B. We gained absolute correctness in exchange for
a performance regression.

The performance regression is probably the worse option, since
these incorrect results in paint_down_to_common() are rare. The
topology required for the performance regression are less rare,
but still require multiple merge commits where the parents differ
greatly in generation number. In our example above, the commit A
is as important as the commit B to demonstrate the problem, since
otherwise the commit C will sit in the queue as non-stale just as
long in both orders.

The solution provided uses the min_generation parameter to decide
if we should use generation numbers in our ordering. When
min_generation is equal to zero, it means that the caller has no
known cutoff for the walk, so we should rely on our commit-date
heuristic as before; this is the case with merge_bases_many().
When min_generation is non-zero, then the caller knows a valuable
cutoff for the short-circuit mechanism; this is the case with
remove_redundant() and in_merge_bases_many().

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 commit.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index 1a6e632185..449c1f4920 100644
--- a/commit.c
+++ b/commit.c
@@ -874,6 +874,9 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n,
 	int i;
 	uint32_t last_gen = GENERATION_NUMBER_INFINITY;
 
+	if (!min_generation)
+		queue.compare = compare_commits_by_commit_date;
+
 	one->object.flags |= PARENT1;
 	if (!n) {
 		commit_list_append(one, &result);
@@ -891,7 +894,7 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n,
 		struct commit_list *parents;
 		int flags;
 
-		if (commit->generation > last_gen)
+		if (min_generation && commit->generation > last_gen)
 			BUG("bad generation skip %8x > %8x at %s",
 			    commit->generation, last_gen,
 			    oid_to_hex(&commit->object.oid));
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH 0/1] v2.19.0-rc1 Performance Regression in 'git merge-base'
  2018-08-30 12:58 [PATCH 0/1] v2.19.0-rc1 Performance Regression in 'git merge-base' Derrick Stolee via GitGitGadget
  2018-08-30 12:58 ` [PATCH 1/1] commit: don't use generation numbers if not needed Derrick Stolee via GitGitGadget
@ 2018-08-30 15:26 ` Junio C Hamano
  2018-10-06 16:09 ` Jakub Narebski
  2 siblings, 0 replies; 5+ messages in thread
From: Junio C Hamano @ 2018-08-30 15:26 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget; +Cc: git, jnareb

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> As I was testing the release candidate, I stumbled across a regression in
> 'git merge-base' as a result of the switch to generation numbers. The commit
> message in [PATCH 1/1] describes the topology involved,...

I do not recall having seen this kind of updates during the
pre-release freeze of previous development cycles, and you are
starting a new trend with two findings so far in this cycle.
Hopefully this can become a good tradition.

Thanks.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH 0/1] v2.19.0-rc1 Performance Regression in 'git merge-base'
  2018-08-30 12:58 [PATCH 0/1] v2.19.0-rc1 Performance Regression in 'git merge-base' Derrick Stolee via GitGitGadget
  2018-08-30 12:58 ` [PATCH 1/1] commit: don't use generation numbers if not needed Derrick Stolee via GitGitGadget
  2018-08-30 15:26 ` [PATCH 0/1] v2.19.0-rc1 Performance Regression in 'git merge-base' Junio C Hamano
@ 2018-10-06 16:09 ` Jakub Narebski
  2 siblings, 0 replies; 5+ messages in thread
From: Jakub Narebski @ 2018-10-06 16:09 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget; +Cc: git, Junio C Hamano, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> As I was testing the release candidate, I stumbled across a regression in
> 'git merge-base' as a result of the switch to generation numbers. The commit
> message in [PATCH 1/1] describes the topology involved, but you can test it
> yourself by comparing 'git merge-base v4.8 v4.9' in the Linux kernel. The
> regression does not show up when running merge-base for tags at least v4.9,
> which is why I didn't see it when I was testing earlier.

Strange that it happens; I'll take a look.

> The solution is simple, but also will conflict with ds/reachable in next. I
> can send a similar patch that applies the same diff into commit-reach.c.
>
> With the integration of generation numbers into most commit walks coming to
> a close [1], it will be time to re-investigate other options for
> reachability indexes [2]. As I was digging into the issue with this
> regression, I discovered a way we can modify our generation numbers and pair
> them with commit dates to give us a simple-to-compute, immutable
> two-dimensional reachability index that would be immune to this regression.
> I will investigate that more and report back, but it is more important to
> fix this regression now.

I am interested in what you have created, especially because commit
creation dates are imperfect indicators because of clock skew etc.

>
> Thanks, -Stolee
>
> [1] https://public-inbox.org/git/pull.25.git.gitgitgadget@gmail.com/[PATCH
> 0/6] Use generation numbers for --topo-order
>
> [2] https://public-inbox.org/git/86muxcuyod.fsf@gmail.com/[RFC] Other chunks
> for commit-graph, part 2 - reachability indexes

Since then I have found few more possible approaches:
- working with repository metagraph[1], where all chains of commits were
  replaced by edge, perhaps together with DAG Reduction[2] boosting
  framework on this metagraph
- using FELINE-like index (the Graph+Label approch, also known as online
  search), and for those where this index have false results, use Labels
  only approach[3]

[1] Xian Tang et. al., "An Optimized Labeling Scheme for Reachability
    Queries", CMC, vol.55, no.2, pp.267-283, 2018
[2] Marco Biazzini, Martin Monperrus, Benoit Baudry "On Analyzing the
    Topology of Commit Histories in Decentralized Version Control
    Systems", ICSME 2014 (conference paper)
[3] Junfeng Zhou et. al., "Accelerating reachability query processing
    based on DAG reduction", The VLDB Journal (2018) 27: 271

--
Jakub Narębski

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH 1/1] commit: don't use generation numbers if not needed
  2018-08-30 12:58 ` [PATCH 1/1] commit: don't use generation numbers if not needed Derrick Stolee via GitGitGadget
@ 2018-10-06 16:54   ` Jakub Narebski
  0 siblings, 0 replies; 5+ messages in thread
From: Jakub Narebski @ 2018-10-06 16:54 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget; +Cc: git, Junio C Hamano, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <dstolee@microsoft.com>
>
> In 3afc679b "commit: use generations in paint_down_to_common()",
> the queue in paint_down_to_common() was changed to use a priority
> order based on generation number before commit date. This served
> two purposes:
>
>  1. When generation numbers are present, the walk guarantees
>     correct topological relationships, regardless of clock skew in
>     commit dates.

Which means that we would walk no more commits that there are on the
_longest_ path...

>  2. It enables short-circuiting the walk when the min_generation
>     parameter is added in d7c1ec3e "commit: add short-circuit to
>     paint_down_to_common()". This short-circuit helps commands
>     like 'git branch --contains' from needing to walk to a merge
>     base when we know the result is false.
>
> The commit message for 3afc679b includes the following sentence:
>
>     This change does not affect the number of commits that are
>     walked during the execution of paint_down_to_common(), only
>     the order that those commits are inspected.
>
> This statement is incorrect. Because it changes the order in which
> the commits are inspected, it changes the order they are added to
> the queue, and hence can change the number of loops before the
> queue_has_nonstale() method returns true.

...which does not mean that it would walk no more commits than on
shortest path; thus it can give performance regression if it chooses to
walk longer path first, compared to date-based heuristic if it chooses
shorter one (but can walk few commits more that strictly necessary on
chosen path).

O.K., I can understand that.

>
> This change makes a concrete difference depending on the topology
> of the commit graph. For instance, computing the merge-base between
> consecutive versions of the Linux kernel has no effect for versions
> after v4.9, but 'git merge-base v4.8 v4.9' presents a performance
> regression:
>
>     v2.18.0: 0.122s
> v2.19.0-rc1: 0.547s
>        HEAD: 0.127s

Now I will wonder if the 0.005s difference between v2.18.0 and HEAD
version is within the noise (within the standard deviation), or
not... (just kidding).

>
> To determine that this was simply an ordering issue, I inserted
> a counter within the while loop of paint_down_to_common() and
> found that the loop runs 167,468 times in v2.18.0 and 635,579
> times in v2.19.0-rc1.

Nice analysis.

>
> The topology of this case can be described in a simplified way
> here:
>
>   v4.9
>    |  \
>    |   \
>   v4.8  \
>    | \   \
>    |  \   |
>   ...  A  B
>    |  /  /
>    | /  /
>    |/__/
>    C
>
> Here, the "..." means "a very long line of commits". By generation
> number, A and B have generation one more than C. However, A and B
> have commit date higher than most of the commits reachable from
> v4.8. When the walk reaches v4.8, we realize that it has PARENT1
> and PARENT2 flags, so everything it can reach is marked as STALE,
> including A. B has only the PARENT1 flag, so is not STALE.
>
> When paint_down_to_common() is run using
> compare_commits_by_commit_date, A and B are removed from the queue
> early and C is inserted into the queue. At this point, C and the
> rest of the queue entries are marked as STALE. The loop then
> terminates.
>
> When paint_down_to_common() is run using
> compare_commits_by_gen_then_commit_date, B is removed from the
> queue only after the many commits reachable from v4.8 are explored.
> This causes the loop to run longer. The reason for this regression
> is simple: the queue order is intended to not explore a commit
> until everything that _could_ reach that commit is explored. From
> the information gathered by the original ordering, we have no
> guarantee that there is not a commit D reachable from v4.8 that
> can also reach B.

So the problem is with shortcuts, i.e. merges of short-length branches
(like e.g. fixup topic branches) based on an older commits (like best
practices recommend: base off oldest applicable commit).  Due to the
bottom-up nature of generation numbers examining such shortcut is
postponed, compared to "heuristic" ordering by the creation date.

By problem I mean that generation-number based code gives worse
performance than commitdate-based code.

>                  We gained absolute correctness in exchange for
> a performance regression.

Or rather we gained better worst case performance in exchange for worse
average case performance (thus a performance regression).

> The performance regression is probably the worse option, since
> these incorrect results in paint_down_to_common() are rare. The
> topology required for the performance regression are less rare,
> but still require multiple merge commits where the parents differ
> greatly in generation number. In our example above, the commit A
> is as important as the commit B to demonstrate the problem, since
> otherwise the commit C will sit in the queue as non-stale just as
> long in both orders.

All right, that is good explanation of why the change.  Worst case
performance is rare, performance in case of "shortcuts" is more
important: they are more often (I guess -- are there any tests for
this?).

> The solution provided uses the min_generation parameter to decide
> if we should use generation numbers in our ordering. When
> min_generation is equal to zero, it means that the caller has no
> known cutoff for the walk, so we should rely on our commit-date
> heuristic as before; this is the case with merge_bases_many().
> When min_generation is non-zero, then the caller knows a valuable
> cutoff for the short-circuit mechanism; this is the case with
> remove_redundant() and in_merge_bases_many().

TLDR; use compare_commits_by_commit_date() if there is no min generation
cutoff, compare_commits_by_gen_then_commit_date() otherwise, right?

> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/commit.c b/commit.c
> index 1a6e632185..449c1f4920 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -874,6 +874,9 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n,
>  	int i;
>  	uint32_t last_gen = GENERATION_NUMBER_INFINITY;
>  
> +	if (!min_generation)
> +		queue.compare = compare_commits_by_commit_date;
> +
>  	one->object.flags |= PARENT1;
>  	if (!n) {
>  		commit_list_append(one, &result);
> @@ -891,7 +894,7 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n,
>  		struct commit_list *parents;
>  		int flags;
>  
> -		if (commit->generation > last_gen)
> +		if (min_generation && commit->generation > last_gen)
>  			BUG("bad generation skip %8x > %8x at %s",
>  			    commit->generation, last_gen,
>  			    oid_to_hex(&commit->object.oid));

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2018-10-06 16:54 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-08-30 12:58 [PATCH 0/1] v2.19.0-rc1 Performance Regression in 'git merge-base' Derrick Stolee via GitGitGadget
2018-08-30 12:58 ` [PATCH 1/1] commit: don't use generation numbers if not needed Derrick Stolee via GitGitGadget
2018-10-06 16:54   ` Jakub Narebski
2018-08-30 15:26 ` [PATCH 0/1] v2.19.0-rc1 Performance Regression in 'git merge-base' Junio C Hamano
2018-10-06 16:09 ` Jakub Narebski

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).