git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jakub Narebski <jnareb@gmail.com>
To: "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org, Junio C Hamano <gitster@pobox.com>,
	Derrick Stolee <dstolee@microsoft.com>
Subject: Re: [PATCH 1/1] commit: don't use generation numbers if not needed
Date: Sat, 06 Oct 2018 18:54:01 +0200	[thread overview]
Message-ID: <86woqvxbh2.fsf@gmail.com> (raw)
In-Reply-To: <efa3720fb40638e5d61c6130b55e3348d8e4339e.1535633886.git.gitgitgadget@gmail.com> (Derrick Stolee via GitGitGadget's message of "Thu, 30 Aug 2018 05:58:09 -0700 (PDT)")

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <dstolee@microsoft.com>
>
> In 3afc679b "commit: use generations in paint_down_to_common()",
> the queue in paint_down_to_common() was changed to use a priority
> order based on generation number before commit date. This served
> two purposes:
>
>  1. When generation numbers are present, the walk guarantees
>     correct topological relationships, regardless of clock skew in
>     commit dates.

Which means that we would walk no more commits that there are on the
_longest_ path...

>  2. It enables short-circuiting the walk when the min_generation
>     parameter is added in d7c1ec3e "commit: add short-circuit to
>     paint_down_to_common()". This short-circuit helps commands
>     like 'git branch --contains' from needing to walk to a merge
>     base when we know the result is false.
>
> The commit message for 3afc679b includes the following sentence:
>
>     This change does not affect the number of commits that are
>     walked during the execution of paint_down_to_common(), only
>     the order that those commits are inspected.
>
> This statement is incorrect. Because it changes the order in which
> the commits are inspected, it changes the order they are added to
> the queue, and hence can change the number of loops before the
> queue_has_nonstale() method returns true.

...which does not mean that it would walk no more commits than on
shortest path; thus it can give performance regression if it chooses to
walk longer path first, compared to date-based heuristic if it chooses
shorter one (but can walk few commits more that strictly necessary on
chosen path).

O.K., I can understand that.

>
> This change makes a concrete difference depending on the topology
> of the commit graph. For instance, computing the merge-base between
> consecutive versions of the Linux kernel has no effect for versions
> after v4.9, but 'git merge-base v4.8 v4.9' presents a performance
> regression:
>
>     v2.18.0: 0.122s
> v2.19.0-rc1: 0.547s
>        HEAD: 0.127s

Now I will wonder if the 0.005s difference between v2.18.0 and HEAD
version is within the noise (within the standard deviation), or
not... (just kidding).

>
> To determine that this was simply an ordering issue, I inserted
> a counter within the while loop of paint_down_to_common() and
> found that the loop runs 167,468 times in v2.18.0 and 635,579
> times in v2.19.0-rc1.

Nice analysis.

>
> The topology of this case can be described in a simplified way
> here:
>
>   v4.9
>    |  \
>    |   \
>   v4.8  \
>    | \   \
>    |  \   |
>   ...  A  B
>    |  /  /
>    | /  /
>    |/__/
>    C
>
> Here, the "..." means "a very long line of commits". By generation
> number, A and B have generation one more than C. However, A and B
> have commit date higher than most of the commits reachable from
> v4.8. When the walk reaches v4.8, we realize that it has PARENT1
> and PARENT2 flags, so everything it can reach is marked as STALE,
> including A. B has only the PARENT1 flag, so is not STALE.
>
> When paint_down_to_common() is run using
> compare_commits_by_commit_date, A and B are removed from the queue
> early and C is inserted into the queue. At this point, C and the
> rest of the queue entries are marked as STALE. The loop then
> terminates.
>
> When paint_down_to_common() is run using
> compare_commits_by_gen_then_commit_date, B is removed from the
> queue only after the many commits reachable from v4.8 are explored.
> This causes the loop to run longer. The reason for this regression
> is simple: the queue order is intended to not explore a commit
> until everything that _could_ reach that commit is explored. From
> the information gathered by the original ordering, we have no
> guarantee that there is not a commit D reachable from v4.8 that
> can also reach B.

So the problem is with shortcuts, i.e. merges of short-length branches
(like e.g. fixup topic branches) based on an older commits (like best
practices recommend: base off oldest applicable commit).  Due to the
bottom-up nature of generation numbers examining such shortcut is
postponed, compared to "heuristic" ordering by the creation date.

By problem I mean that generation-number based code gives worse
performance than commitdate-based code.

>                  We gained absolute correctness in exchange for
> a performance regression.

Or rather we gained better worst case performance in exchange for worse
average case performance (thus a performance regression).

> The performance regression is probably the worse option, since
> these incorrect results in paint_down_to_common() are rare. The
> topology required for the performance regression are less rare,
> but still require multiple merge commits where the parents differ
> greatly in generation number. In our example above, the commit A
> is as important as the commit B to demonstrate the problem, since
> otherwise the commit C will sit in the queue as non-stale just as
> long in both orders.

All right, that is good explanation of why the change.  Worst case
performance is rare, performance in case of "shortcuts" is more
important: they are more often (I guess -- are there any tests for
this?).

> The solution provided uses the min_generation parameter to decide
> if we should use generation numbers in our ordering. When
> min_generation is equal to zero, it means that the caller has no
> known cutoff for the walk, so we should rely on our commit-date
> heuristic as before; this is the case with merge_bases_many().
> When min_generation is non-zero, then the caller knows a valuable
> cutoff for the short-circuit mechanism; this is the case with
> remove_redundant() and in_merge_bases_many().

TLDR; use compare_commits_by_commit_date() if there is no min generation
cutoff, compare_commits_by_gen_then_commit_date() otherwise, right?

> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  commit.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/commit.c b/commit.c
> index 1a6e632185..449c1f4920 100644
> --- a/commit.c
> +++ b/commit.c
> @@ -874,6 +874,9 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n,
>  	int i;
>  	uint32_t last_gen = GENERATION_NUMBER_INFINITY;
>  
> +	if (!min_generation)
> +		queue.compare = compare_commits_by_commit_date;
> +
>  	one->object.flags |= PARENT1;
>  	if (!n) {
>  		commit_list_append(one, &result);
> @@ -891,7 +894,7 @@ static struct commit_list *paint_down_to_common(struct commit *one, int n,
>  		struct commit_list *parents;
>  		int flags;
>  
> -		if (commit->generation > last_gen)
> +		if (min_generation && commit->generation > last_gen)
>  			BUG("bad generation skip %8x > %8x at %s",
>  			    commit->generation, last_gen,
>  			    oid_to_hex(&commit->object.oid));

  reply	other threads:[~2018-10-06 16:54 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-08-30 12:58 [PATCH 0/1] v2.19.0-rc1 Performance Regression in 'git merge-base' Derrick Stolee via GitGitGadget
2018-08-30 12:58 ` [PATCH 1/1] commit: don't use generation numbers if not needed Derrick Stolee via GitGitGadget
2018-10-06 16:54   ` Jakub Narebski [this message]
2018-08-30 15:26 ` [PATCH 0/1] v2.19.0-rc1 Performance Regression in 'git merge-base' Junio C Hamano
2018-10-06 16:09 ` Jakub Narebski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=86woqvxbh2.fsf@gmail.com \
    --to=jnareb@gmail.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=gitster@pobox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).