* [PATCH 0/1] v2.19.0-rc1 Performance Regression in 'git merge-base' @ 2018-08-30 12:58 Derrick Stolee via GitGitGadget 2018-08-30 12:58 ` [PATCH 1/1] commit: don't use generation numbers if not needed Derrick Stolee via GitGitGadget ` (2 more replies) 0 siblings, 3 replies; 5+ messages in thread From: Derrick Stolee via GitGitGadget @ 2018-08-30 12:58 UTC (permalink / raw) To: git; +Cc: jnareb, Junio C Hamano As I was testing the release candidate, I stumbled across a regression in 'git merge-base' as a result of the switch to generation numbers. The commit message in [PATCH 1/1] describes the topology involved, but you can test it yourself by comparing 'git merge-base v4.8 v4.9' in the Linux kernel. The regression does not show up when running merge-base for tags at least v4.9, which is why I didn't see it when I was testing earlier. The solution is simple, but also will conflict with ds/reachable in next. I can send a similar patch that applies the same diff into commit-reach.c. With the integration of generation numbers into most commit walks coming to a close [1], it will be time to re-investigate other options for reachability indexes [2]. As I was digging into the issue with this regression, I discovered a way we can modify our generation numbers and pair them with commit dates to give us a simple-to-compute, immutable two-dimensional reachability index that would be immune to this regression. I will investigate that more and report back, but it is more important to fix this regression now. Thanks, -Stolee [1] https://public-inbox.org/git/pull.25.git.gitgitgadget@gmail.com/[PATCH 0/6] Use generation numbers for --topo-order [2] https://public-inbox.org/git/86muxcuyod.fsf@gmail.com/[RFC] Other chunks for commit-graph, part 2 - reachability indexes Cc: gitster@pobox.comCc: peff@peff.net Derrick Stolee (1): commit: don't use generation numbers if not needed commit.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) base-commit: 2f743933341f276111103550fbf383a34dfcfd38 Published-As: https://github.com/gitgitgadget/git/releases/tags/pr-28%2Fderrickstolee%2Fmerge-base-regression-v1 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-28/derrickstolee/merge-base-regression-v1 Pull-Request: https://github.com/gitgitgadget/git/pull/28 -- gitgitgadget ^ permalink raw reply [flat|nested] 5+ messages in thread
* [PATCH 1/1] commit: don't use generation numbers if not needed 2018-08-30 12:58 [PATCH 0/1] v2.19.0-rc1 Performance Regression in 'git merge-base' Derrick Stolee via GitGitGadget @ 2018-08-30 12:58 ` Derrick Stolee via GitGitGadget 2018-10-06 16:54 ` Jakub Narebski 2018-08-30 15:26 ` [PATCH 0/1] v2.19.0-rc1 Performance Regression in 'git merge-base' Junio C Hamano 2018-10-06 16:09 ` Jakub Narebski 2 siblings, 1 reply; 5+ messages in thread From: Derrick Stolee via GitGitGadget @ 2018-08-30 12:58 UTC (permalink / raw) To: git; +Cc: jnareb, Junio C Hamano, Derrick Stolee From: Derrick Stolee <dstolee@microsoft.com> In 3afc679b "commit: use generations in paint_down_to_common()", the queue in paint_down_to_common() was changed to use a priority order based on generation number before commit date. This served two purposes: 1. When generation numbers are present, the walk guarantees correct topological relationships, regardless of clock skew in commit dates. 2. It enables short-circuiting the walk when the min_generation parameter is added in d7c1ec3e "commit: add short-circuit to paint_down_to_common()". This short-circuit helps commands like 'git branch --contains' from needing to walk to a merge base when we know the result is false. The commit message for 3afc679b includes the following sentence: This change does not affect the number of commits that are walked during the execution of paint_down_to_common(), only the order that those commits are inspected. This statement is incorrect. Because it changes the order in which the commits are inspected, it changes the order they are added to the queue, and hence can change the number of loops before the queue_has_nonstale() method returns true. This change makes a concrete difference depending on the topology of the commit graph. For instance, computing the merge-base between consecutive versions of the Linux kernel has no effect for versions after v4.9, but 'git merge-base v4.8 v4.9' presents a performance regression: v2.18.0: 0.122s v2.19.0-rc1: 0.547s HEAD: 0.127s To determine that this was simply an ordering issue, I inserted a counter within the while loop of paint_down_to_common() and found that the loop runs 167,468 times in v2.18.0 and 635,579 times in v2.19.0-rc1. The topology of this case can be described in a simplified way here: v4.9 | \ | \ v4.8 \ | \ \ | \ | ... A B | / / | / / |/__/ C Here, the "..." means "a very long line of commits". By generation number, A and B have generation one more than C. However, A and B have commit date higher than most of the commits reachable from v4.8. When the walk reaches v4.8, we realize that it has PARENT1 and PARENT2 flags, so everything it can reach is marked as STALE, including A. B has only the PARENT1 flag, so is not STALE. When paint_down_to_common() is run using compare_commits_by_commit_date, A and B are removed from the queue early and C is inserted into the queue. At this point, C and the rest of the queue entries are marked as STALE. The loop then terminates. When paint_down_to_common() is run using compare_commits_by_gen_then_commit_date, B is removed from the queue only after the many commits reachable from v4.8 are explored. This causes the loop to run longer. The reason for this regression is simple: the queue order is intended to not explore a commit until everything that _could_ reach that commit is explored. From the information gathered by the original ordering, we have no guarantee that there is not a commit D reachable from v4.8 that can also reach B. We gained absolute correctness in exchange for a performance regression. The performance regression is probably the worse option, since these incorrect results in paint_down_to_common() are rare. The topology required for the performance regression are less rare, but still require multiple merge commits where the parents differ greatly in generation number. In our example above, the commit A is as important as the commit B to demonstrate the problem, since otherwise the commit C will sit in the queue as non-stale just as long in both orders. The solution provided uses the min_generation parameter to decide if we should use generation numbers in our ordering. When min_generation is equal to zero, it means that the caller has no known cutoff for the walk, so we should rely on our commit-date heuristic as before; this is the case with merge_bases_many(). When min_generation is non-zero, then the caller knows a valuable cutoff for the short-circuit mechanism; this is the case with remove_redundant() and in_merge_bases_many(). Signed-off-by: Derrick Stolee <dstolee@microsoft.com> --- commit.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/commit.c b/commit.c index 1a6e632185..449c1f4920 100644 --- a/commit.c +++ b/commit.c @@ -874,6 +874,9 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, int i; uint32_t last_gen = GENERATION_NUMBER_INFINITY; + if (!min_generation) + queue.compare = compare_commits_by_commit_date; + one->object.flags |= PARENT1; if (!n) { commit_list_append(one, &result); @@ -891,7 +894,7 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, struct commit_list *parents; int flags; - if (commit->generation > last_gen) + if (min_generation && commit->generation > last_gen) BUG("bad generation skip %8x > %8x at %s", commit->generation, last_gen, oid_to_hex(&commit->object.oid)); -- gitgitgadget ^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [PATCH 1/1] commit: don't use generation numbers if not needed 2018-08-30 12:58 ` [PATCH 1/1] commit: don't use generation numbers if not needed Derrick Stolee via GitGitGadget @ 2018-10-06 16:54 ` Jakub Narebski 0 siblings, 0 replies; 5+ messages in thread From: Jakub Narebski @ 2018-10-06 16:54 UTC (permalink / raw) To: Derrick Stolee via GitGitGadget; +Cc: git, Junio C Hamano, Derrick Stolee "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > From: Derrick Stolee <dstolee@microsoft.com> > > In 3afc679b "commit: use generations in paint_down_to_common()", > the queue in paint_down_to_common() was changed to use a priority > order based on generation number before commit date. This served > two purposes: > > 1. When generation numbers are present, the walk guarantees > correct topological relationships, regardless of clock skew in > commit dates. Which means that we would walk no more commits that there are on the _longest_ path... > 2. It enables short-circuiting the walk when the min_generation > parameter is added in d7c1ec3e "commit: add short-circuit to > paint_down_to_common()". This short-circuit helps commands > like 'git branch --contains' from needing to walk to a merge > base when we know the result is false. > > The commit message for 3afc679b includes the following sentence: > > This change does not affect the number of commits that are > walked during the execution of paint_down_to_common(), only > the order that those commits are inspected. > > This statement is incorrect. Because it changes the order in which > the commits are inspected, it changes the order they are added to > the queue, and hence can change the number of loops before the > queue_has_nonstale() method returns true. ...which does not mean that it would walk no more commits than on shortest path; thus it can give performance regression if it chooses to walk longer path first, compared to date-based heuristic if it chooses shorter one (but can walk few commits more that strictly necessary on chosen path). O.K., I can understand that. > > This change makes a concrete difference depending on the topology > of the commit graph. For instance, computing the merge-base between > consecutive versions of the Linux kernel has no effect for versions > after v4.9, but 'git merge-base v4.8 v4.9' presents a performance > regression: > > v2.18.0: 0.122s > v2.19.0-rc1: 0.547s > HEAD: 0.127s Now I will wonder if the 0.005s difference between v2.18.0 and HEAD version is within the noise (within the standard deviation), or not... (just kidding). > > To determine that this was simply an ordering issue, I inserted > a counter within the while loop of paint_down_to_common() and > found that the loop runs 167,468 times in v2.18.0 and 635,579 > times in v2.19.0-rc1. Nice analysis. > > The topology of this case can be described in a simplified way > here: > > v4.9 > | \ > | \ > v4.8 \ > | \ \ > | \ | > ... A B > | / / > | / / > |/__/ > C > > Here, the "..." means "a very long line of commits". By generation > number, A and B have generation one more than C. However, A and B > have commit date higher than most of the commits reachable from > v4.8. When the walk reaches v4.8, we realize that it has PARENT1 > and PARENT2 flags, so everything it can reach is marked as STALE, > including A. B has only the PARENT1 flag, so is not STALE. > > When paint_down_to_common() is run using > compare_commits_by_commit_date, A and B are removed from the queue > early and C is inserted into the queue. At this point, C and the > rest of the queue entries are marked as STALE. The loop then > terminates. > > When paint_down_to_common() is run using > compare_commits_by_gen_then_commit_date, B is removed from the > queue only after the many commits reachable from v4.8 are explored. > This causes the loop to run longer. The reason for this regression > is simple: the queue order is intended to not explore a commit > until everything that _could_ reach that commit is explored. From > the information gathered by the original ordering, we have no > guarantee that there is not a commit D reachable from v4.8 that > can also reach B. So the problem is with shortcuts, i.e. merges of short-length branches (like e.g. fixup topic branches) based on an older commits (like best practices recommend: base off oldest applicable commit). Due to the bottom-up nature of generation numbers examining such shortcut is postponed, compared to "heuristic" ordering by the creation date. By problem I mean that generation-number based code gives worse performance than commitdate-based code. > We gained absolute correctness in exchange for > a performance regression. Or rather we gained better worst case performance in exchange for worse average case performance (thus a performance regression). > The performance regression is probably the worse option, since > these incorrect results in paint_down_to_common() are rare. The > topology required for the performance regression are less rare, > but still require multiple merge commits where the parents differ > greatly in generation number. In our example above, the commit A > is as important as the commit B to demonstrate the problem, since > otherwise the commit C will sit in the queue as non-stale just as > long in both orders. All right, that is good explanation of why the change. Worst case performance is rare, performance in case of "shortcuts" is more important: they are more often (I guess -- are there any tests for this?). > The solution provided uses the min_generation parameter to decide > if we should use generation numbers in our ordering. When > min_generation is equal to zero, it means that the caller has no > known cutoff for the walk, so we should rely on our commit-date > heuristic as before; this is the case with merge_bases_many(). > When min_generation is non-zero, then the caller knows a valuable > cutoff for the short-circuit mechanism; this is the case with > remove_redundant() and in_merge_bases_many(). TLDR; use compare_commits_by_commit_date() if there is no min generation cutoff, compare_commits_by_gen_then_commit_date() otherwise, right? > Signed-off-by: Derrick Stolee <dstolee@microsoft.com> > --- > commit.c | 5 ++++- > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/commit.c b/commit.c > index 1a6e632185..449c1f4920 100644 > --- a/commit.c > +++ b/commit.c > @@ -874,6 +874,9 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, > int i; > uint32_t last_gen = GENERATION_NUMBER_INFINITY; > > + if (!min_generation) > + queue.compare = compare_commits_by_commit_date; > + > one->object.flags |= PARENT1; > if (!n) { > commit_list_append(one, &result); > @@ -891,7 +894,7 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n, > struct commit_list *parents; > int flags; > > - if (commit->generation > last_gen) > + if (min_generation && commit->generation > last_gen) > BUG("bad generation skip %8x > %8x at %s", > commit->generation, last_gen, > oid_to_hex(&commit->object.oid)); ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH 0/1] v2.19.0-rc1 Performance Regression in 'git merge-base' 2018-08-30 12:58 [PATCH 0/1] v2.19.0-rc1 Performance Regression in 'git merge-base' Derrick Stolee via GitGitGadget 2018-08-30 12:58 ` [PATCH 1/1] commit: don't use generation numbers if not needed Derrick Stolee via GitGitGadget @ 2018-08-30 15:26 ` Junio C Hamano 2018-10-06 16:09 ` Jakub Narebski 2 siblings, 0 replies; 5+ messages in thread From: Junio C Hamano @ 2018-08-30 15:26 UTC (permalink / raw) To: Derrick Stolee via GitGitGadget; +Cc: git, jnareb "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > As I was testing the release candidate, I stumbled across a regression in > 'git merge-base' as a result of the switch to generation numbers. The commit > message in [PATCH 1/1] describes the topology involved,... I do not recall having seen this kind of updates during the pre-release freeze of previous development cycles, and you are starting a new trend with two findings so far in this cycle. Hopefully this can become a good tradition. Thanks. ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH 0/1] v2.19.0-rc1 Performance Regression in 'git merge-base' 2018-08-30 12:58 [PATCH 0/1] v2.19.0-rc1 Performance Regression in 'git merge-base' Derrick Stolee via GitGitGadget 2018-08-30 12:58 ` [PATCH 1/1] commit: don't use generation numbers if not needed Derrick Stolee via GitGitGadget 2018-08-30 15:26 ` [PATCH 0/1] v2.19.0-rc1 Performance Regression in 'git merge-base' Junio C Hamano @ 2018-10-06 16:09 ` Jakub Narebski 2 siblings, 0 replies; 5+ messages in thread From: Jakub Narebski @ 2018-10-06 16:09 UTC (permalink / raw) To: Derrick Stolee via GitGitGadget; +Cc: git, Junio C Hamano, Derrick Stolee "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > As I was testing the release candidate, I stumbled across a regression in > 'git merge-base' as a result of the switch to generation numbers. The commit > message in [PATCH 1/1] describes the topology involved, but you can test it > yourself by comparing 'git merge-base v4.8 v4.9' in the Linux kernel. The > regression does not show up when running merge-base for tags at least v4.9, > which is why I didn't see it when I was testing earlier. Strange that it happens; I'll take a look. > The solution is simple, but also will conflict with ds/reachable in next. I > can send a similar patch that applies the same diff into commit-reach.c. > > With the integration of generation numbers into most commit walks coming to > a close [1], it will be time to re-investigate other options for > reachability indexes [2]. As I was digging into the issue with this > regression, I discovered a way we can modify our generation numbers and pair > them with commit dates to give us a simple-to-compute, immutable > two-dimensional reachability index that would be immune to this regression. > I will investigate that more and report back, but it is more important to > fix this regression now. I am interested in what you have created, especially because commit creation dates are imperfect indicators because of clock skew etc. > > Thanks, -Stolee > > [1] https://public-inbox.org/git/pull.25.git.gitgitgadget@gmail.com/[PATCH > 0/6] Use generation numbers for --topo-order > > [2] https://public-inbox.org/git/86muxcuyod.fsf@gmail.com/[RFC] Other chunks > for commit-graph, part 2 - reachability indexes Since then I have found few more possible approaches: - working with repository metagraph[1], where all chains of commits were replaced by edge, perhaps together with DAG Reduction[2] boosting framework on this metagraph - using FELINE-like index (the Graph+Label approch, also known as online search), and for those where this index have false results, use Labels only approach[3] [1] Xian Tang et. al., "An Optimized Labeling Scheme for Reachability Queries", CMC, vol.55, no.2, pp.267-283, 2018 [2] Marco Biazzini, Martin Monperrus, Benoit Baudry "On Analyzing the Topology of Commit Histories in Decentralized Version Control Systems", ICSME 2014 (conference paper) [3] Junfeng Zhou et. al., "Accelerating reachability query processing based on DAG reduction", The VLDB Journal (2018) 27: 271 -- Jakub Narębski ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2018-10-06 16:54 UTC | newest] Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2018-08-30 12:58 [PATCH 0/1] v2.19.0-rc1 Performance Regression in 'git merge-base' Derrick Stolee via GitGitGadget 2018-08-30 12:58 ` [PATCH 1/1] commit: don't use generation numbers if not needed Derrick Stolee via GitGitGadget 2018-10-06 16:54 ` Jakub Narebski 2018-08-30 15:26 ` [PATCH 0/1] v2.19.0-rc1 Performance Regression in 'git merge-base' Junio C Hamano 2018-10-06 16:09 ` Jakub Narebski
Code repositories for project(s) associated with this public inbox https://80x24.org/mirrors/git.git This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).