Re: [PATCH] bisect: loosen halfway() check for a large number of commits

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

From: "SZEDER Gábor" <szeder.dev@gmail.com>
To: git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>
Subject: Re: [PATCH] bisect: loosen halfway() check for a large number of commits
Date: Thu, 22 Oct 2020 12:40:07 +0200	[thread overview]
Message-ID: <20201022104007.GE24813@szeder.dev> (raw)
In-Reply-To: <20201022103806.26680-1-szeder.dev@gmail.com>

This is only RFC, but forgot to mark it as such in Subject: line,
sorry.


On Thu, Oct 22, 2020 at 12:38:06PM +0200, SZEDER Gábor wrote:
> 'git bisect start ...' and subsequent 'git bisect (good|bad)' commands
> can take quite a while when the given/remaining revision range between
> good and bad commits is big and contains a lot of merge commits, e.g.
> in git.git:
> 
>   $ git rev-list --count v1.6.0..v2.28.0
>   44284
>   $ time git bisect start v2.28.0 v1.6.0
>   Bisecting: 22141 revisions left to test after this (roughly 15 steps)
>   [e197c21807dacadc8305250baa0b9228819189d4] unable_to_lock_die(): rename function from unable_to_lock_index_die()
> 
>   real    0m15.472s
>   user    0m15.220s
>   sys     0m0.255s
> 
> The majority of the runtime is spent in do_find_bisection(), where we
> try to find a commit as close as possible to the halfway point between
> the bad and good revisions, i.e. a commit from which the number of
> reachable commits that are in the good-bad range is half the total
> number of commits in that range.  So we count how many commits are
> reachable in the good-bad range for each commit in that range, which
> is quick and easy for a linear history, even over 300k commits in a
> linear range are handled in ~0.3s on my machine.  Alas, handling merge
> commits is non-trivial and quite expensive as the algorithm used seems
> to be quadratic, causing the long runtime shown above.
> 
> Interestingly, look at what a big difference one additional commit
> can make:
> 
>   $ git rev-list --count v1.6.0^..v2.28.0
>   44285
>   $ time git bisect start v2.28.0 v1.6.0^
>   Bisecting: 22142 revisions left to test after this (roughly 15 steps)
>   [565301e41670825ceedf75220f2918ae76831240] Sync with 2.1.2
> 
>   real  0m5.848s
>   user  0m5.600s
>   sys   0m0.252s
> 
> The difference is caused by one of the optimizations attempting to cut
> down the runtime added in 1c4fea3a40 (git-rev-list --bisect:
> optimization, 2007-03-21):
> 
>     Another small optimization is whenever we find a half-way commit
>     (that is, a commit that can reach exactly half of the commits),
>     we stop giving counts to remaining commits, as we will not find
>     any better commit than we just found.
> 
> In this second 'git bisect start' command we happen to find a commit
> exactly at the halfway point and can return early, but in the first
> case there is no such commit, so we can't return early and end up
> counting the number of reachable commits from all commits in the
> good-bad range.
> 
> However, when we have thousands of commits it's not all that important
> to find the _exact_ halfway point, a few commits more or less doesn't
> make any real difference for the bisection.
> 
> So let's loosen the halfway check to consider commits within about
> 0.1% of the exact halfway point as halfway as well.  This will allow
> us to return early on a bigger good-bad range, even when there is no
> commit exactly at the halfway point, thereby reducing the runtime of
> the first command above considerably, from ~15s to 4.901s.
> Furthermore, even if there is a commit exactly at the halfway point,
> we might still stumble upon a commit within that 0.1% range before
> finding the exact halfway point, allowing us to return a bit earlier,
> slightly reducing the runtime of the second command from 5.848s to
> 5.058s.  Note that this change doesn't affect good-bad ranges
> containing ~2000 commits or less, because that 0.1% tolerance becomes
> zero due to integer arithmetic; however, if the range is that small
> then counting the reachable commits for all commits is already fast
> enough anyway.
> 
> Naturally, this will likely change which commits get picked at each
> bisection step, and, in turn, might change how many bisection steps
> are necessary to find the first bad commit.  If the number of
> necessary bisection steps were to increase often, then this change
> could backfire, because building and testing at each step might take
> much longer than the time spared.  OTOH, if the number of steps were
> to decrease, then it would be a double win.
> 
> So I ran some tests to see how often that happens: picked random good
> and bad starting revisions at least 50k commits apart and a random
> first bad commit in between in git.git, and used 'git bisect run git
> merge-base --is-ancestor HEAD $first_bad_commit' to check the number
> of necessary bisection steps.  After repeating all this 1000 times
> both with and without this patch I found that:
> 
>   - 146 cases needed one more bisection step than before, 149 cases
>     needed one less step, while in the remaining 705 cases the number
>     of steps didn't change.  So the number of bisection steps does
>     indeed change in a non-negligible number of cases, but it seems
>     that the average number of steps doesn't change in the long run.
> 
>   - The first 'git bisect start' command got over 3x faster in 456
>     cases, so this "no commit at the exact halfway point" case seems
>     to be common enough to care about.
> 
> [TODO:
>   - Update comments at callsites mentioning "exact halfway".
>   - Rename function to approx_halfway(), perhaps?]
> ---
>  bisect.c | 15 +++++++++++++--
>  1 file changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/bisect.c b/bisect.c
> index f5b1368128..1857ce4c75 100644
> --- a/bisect.c
> +++ b/bisect.c
> @@ -105,6 +105,8 @@ static int count_interesting_parents(struct commit *commit, unsigned bisect_flag
>  
>  static inline int halfway(struct commit_list *p, int nr)
>  {
> +	int diff;
> +
>  	/*
>  	 * Don't short-cut something we are not going to return!
>  	 */
> @@ -113,13 +115,22 @@ static inline int halfway(struct commit_list *p, int nr)
>  	if (DEBUG_BISECT)
>  		return 0;
>  	/*
> -	 * 2 and 3 are halfway of 5.
> +	 * For small number of commits 2 and 3 are halfway of 5, and
>  	 * 3 is halfway of 6 but 2 and 4 are not.
>  	 */
> -	switch (2 * weight(p) - nr) {
> +	diff = 2 * weight(p) - nr;
> +	switch (diff) {
>  	case -1: case 0: case 1:
>  		return 1;
>  	default:
> +		/*
> +		 * For large number of commits we are not so strict, it's
> +		 * good enough if it's within ~0.1% of the halfway point,
> +		 * e.g. 5000 is exactly halfway of 10000, but we consider
> +		 * the values [4996, 5004] as halfway as well.
> +		 */
> +		if (abs(diff) < nr / 1024)
> +			return 1;
>  		return 0;
>  	}
>  }
> -- 
> 2.29.0.470.g6462f21d4e
>

next prev parent reply	other threads:[~2020-10-22 10:40 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-10-22 10:38 [PATCH] bisect: loosen halfway() check for a large number of commits SZEDER Gábor
2020-10-22 10:40 ` SZEDER Gábor [this message]
2020-10-22 17:18 ` Junio C Hamano
2020-10-24  7:41   ` Christian Couder
2020-10-25 18:01     ` SZEDER Gábor
2020-11-12 16:19 ` [PATCH v2] " SZEDER Gábor
2020-11-12 18:23   ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20201022104007.GE24813@szeder.dev \
    --to=szeder.dev@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).