Re: [PATCH 0/4] Speed up git tag --contains

From: Derrick Stolee <stolee@gmail.com>
To: Jeff King <peff@peff.net>, csilvers <csilvers@cs.stanford.edu>
Cc: avarab@gmail.com, jrnieder@gmail.com, drizzd@aon.at,
	git@vger.kernel.org, gitster@pobox.com
Subject: Re: [PATCH 0/4] Speed up git tag --contains
Date: Mon, 12 Mar 2018 09:45:27 -0400
Message-ID: <63e9c6a8-4efc-6f86-f355-1ec40dd674e4@gmail.com>
In-Reply-To: <20180303051516.GE27689@sigill.intra.peff.net>

On 3/3/2018 12:15 AM, Jeff King wrote:
> On Fri, Jan 12, 2018 at 10:56:00AM -0800, csilvers wrote:
>
>>> This is a resubmission of Jeff King's patch series to speed up git tag
>>> --contains with some changes. It's been cooking for a while as:
>> Replying to this 6-year-old thread:
>>
>> Is there any chance this could be resurrected?  We are using
>> phabricator, which uses `git branch --contains` as part of its
>> workflow.  Our repo has ~1000 branches on it, and the contains
>> operation is eating up all our CPU (and time).  It would be very
>> helpful to us to make this faster!
>>
>> (The original thread is at
>> https://public-inbox.org/git/E1OU82h-0001xY-3b@closure.thunk.org/ )
> Sorry, this got thrown on my "to respond" pile and languished.

Thanks for adding me to the thread. It's good to know the pain points
people are having with commit graph walks.

> There are actually three things that make "git branch --contains" slow.
>
> First, if you're filtering 1000 branches, we'll run 1000 merge-base
> traversals, which may walk over the same commits multiple times.
>
> These days "tag --contains" uses a different algorithm that can look at
> all heads in a single traversal. But the downside is that it's
> depth-first, so it tends to walk down to the roots. That's generally OK
> for tags, since you often have ancient tags that mean getting close to
> the roots anyway.
>
> But for branches, they're more likely to be recent, and you can get away
> without going very deep into the history.
>
> So it's a tradeoff. There's no run-time switch to flip between them, but
> a patch like this:
>
> diff --git a/builtin/branch.c b/builtin/branch.c
> index 8dcc2ed058..4d674e86d5 100644
> --- a/builtin/branch.c
> +++ b/builtin/branch.c
> @@ -404,6 +404,7 @@ static void print_ref_list(struct ref_filter *filter, struct ref_sorting *sortin
>   
>   	memset(&array, 0, sizeof(array));
>   
> +	filter->with_commit_tag_algo = 1;
>   	filter_refs(&array, filter, filter->kind | FILTER_REFS_INCLUDE_BROKEN);
>   
>   	if (filter->verbose)
>
> drops my run of "git branch -a --contains HEAD~100" from 8.6s to
> 0.4s on a repo with ~1800 branches. That sounds good, but on a repo with
> a smaller number of branches, we may actually end up slower (because we
> dig further down in history, and don't benefit from the multiple-branch
> speedup).

It's good to know that we already have an algorithm for the multi-head 
approach. Things like `git branch -vv` are harder to tease out because 
the graph walk is called by the line-format code.
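
To make the multi-head idea concrete, here is a minimal sketch in C of a
depth-first "contains" check that caches its answer on each commit, so
history shared between branch tips is walked once rather than once per
tip. It is an illustration only, not the code git actually uses:
`struct toy_commit`, `toy_contains()` and the `toy_walked` counter are
hypothetical names, and recursion is used for brevity where a real
implementation would want an explicit stack to cope with deep history.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy types for illustration; not git's struct commit. */
enum toy_state { TOY_UNKNOWN = 0, TOY_NO, TOY_YES };

struct toy_commit {
	struct toy_commit **parents;	/* array of parent pointers */
	size_t nr_parents;
	enum toy_state state;		/* cached answer for one fixed target */
};

/* Counts commits whose parents were actually examined (cache misses). */
size_t toy_walked;

/*
 * Return 1 if `target` is `c` itself or one of its ancestors.
 * The answer is cached on every commit visited, so a later call from
 * another tip stops as soon as it reaches already-classified history.
 * The cache is only valid for a single target; reset `state` before
 * asking about a different one.
 */
int toy_contains(struct toy_commit *c, const struct toy_commit *target)
{
	size_t i;
	int found = 0;

	assert(c != NULL && target != NULL);

	if (c == target)
		return 1;
	if (c->state != TOY_UNKNOWN)
		return c->state == TOY_YES;

	toy_walked++;
	for (i = 0; i < c->nr_parents && !found; i++)
		found = toy_contains(c->parents[i], target);

	c->state = found ? TOY_YES : TOY_NO;
	return found;
}
```

Calling `toy_contains()` once per branch tip with the same target gives
the single-traversal behavior: the second and later tips only pay for
the commits that no earlier tip has reached. Nothing here stops the
walk from running down to the roots when the answer is "no", which is
the downside Peff describes for recent branches.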

> I tried to do a "best of both" algorithm in:
>
>   https://public-inbox.org/git/20140625233429.GA20457@sigill.intra.peff.net/
>
> which finds arbitrary numbers of merge bases in a single traversal.  It
> did seem to work, but I felt uneasy about some of the corner cases.
> I've been meaning to revisit it, but obviously have never gotten around
> to it.
>
> The second slow thing is that during the traversal we load each commit
> object from disk. The solution there is to keep the parent information
> in a faster cache. I had a few proposals over the years, but I won't
> even bother to dig them up, because there's quite recent and promising
> work in this area from Derrick Stolee:
>
>    https://public-inbox.org/git/1519698787-190494-1-git-send-email-dstolee@microsoft.com/
>
> And finally, the thing that the patches you linked are referencing is
> about using commit timestamps as a proxy for generation numbers. And
> Stolee's patches actually leave room for real, trustable generation
> numbers.
>
> Once we have the serialized commit graph and generation numbers, I think
> the final step would just be to teach the "tag --contains" algorithm to
> stop walking down unproductive lines of history. And in fact, I think we
> can forget about the best-of-both multi-tip merge-base idea entirely.
> Because if you can use the generation numbers to avoid going too deep,
> then a depth-first approach is fine. And we'd just want to flip
> git-branch over to using that algorithm by default.

I'll keep this in mind as a target for performance measurements in the 
serialized commit graph patch and the following generation number patch.
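
As a rough sketch of why generation numbers make the depth-first walk
cheap, the C fragment below prunes any commit whose generation number
is not above the target's. It assumes the usual definition (1 for a
root commit, otherwise one more than the largest parent generation),
under which every ancestor has a strictly smaller number than its
descendants. `struct gen_commit`, `gen_contains()` and `gen_walked` are
hypothetical names for illustration, and the recursion stands in for
the explicit stack a real implementation would want.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy commit carrying a precomputed generation number. */
struct gen_commit {
	struct gen_commit **parents;	/* array of parent pointers */
	size_t nr_parents;
	unsigned generation;		/* assumed: 1 for roots, else 1 + max(parents) */
	int state;			/* 0 = unknown, 1 = no, 2 = yes (one fixed target) */
};

/* Counts commits whose parents were actually examined. */
size_t gen_walked;

/*
 * Return 1 if `target` is `c` itself or one of its ancestors.
 * A commit at or below the target's generation cannot have the target
 * as an ancestor, so the walk never descends past that level.
 */
int gen_contains(struct gen_commit *c, const struct gen_commit *target)
{
	size_t i;
	int found = 0;

	assert(c != NULL && target != NULL);

	if (c == target)
		return 1;
	if (c->generation <= target->generation)
		return 0;	/* unproductive line of history: stop here */
	if (c->state)
		return c->state == 2;

	gen_walked++;
	for (i = 0; i < c->nr_parents && !found; i++)
		found = gen_contains(c->parents[i], target);

	c->state = found ? 2 : 1;
	return found;
}
```

With this cutoff a "no" answer costs at most the commits between the
tip and the target's generation, instead of everything down to the
roots, which is why the depth-first approach stops being a problem for
recent branches.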

Thanks,
-Stolee

Thread overview (28+ messages):
2011-06-11 19:04 [PATCH 0/4] Speed up git tag --contains Ævar Arnfjörð Bjarmason
2011-06-11 19:04 ` [PATCH 1/4] tag: speed up --contains calculation Ævar Arnfjörð Bjarmason
2011-06-11 19:04 ` [PATCH 2/4] limit "contains" traversals based on commit timestamp Ævar Arnfjörð Bjarmason
2011-06-11 19:04 ` [PATCH 3/4] default core.clockskew variable to one day Ævar Arnfjörð Bjarmason
2011-06-11 19:04 ` [PATCH 4/4] Why is "git tag --contains" so slow? Ævar Arnfjörð Bjarmason
2011-07-06  6:40 ` [PATCH 0/4] Speed up git tag --contains Jeff King
2011-07-06  6:54   ` Jeff King
2011-07-06 19:06     ` Clemens Buchacher
2011-07-06  6:56   ` Jonathan Nieder
2011-07-06  7:03     ` Jeff King
2011-07-06 14:26       ` generation numbers (was: [PATCH 0/4] Speed up git tag --contains) Jakub Narebski
2011-07-06 15:01         ` Ted Ts'o
2011-07-06 18:12           ` Jeff King
2011-07-06 18:46             ` Jakub Narebski
2011-07-07 18:59               ` Jeff King
2011-07-07 19:34                 ` generation numbers Junio C Hamano
2011-07-07 20:31                   ` Jakub Narebski
2011-07-07 20:52                     ` A Large Angry SCM
2011-07-08  0:29                       ` Junio C Hamano
2011-07-08 22:57                   ` Jeff King
2011-07-06 23:22             ` Junio C Hamano
2011-07-07 19:08               ` Jeff King
2011-07-07 20:10                 ` Jakub Narebski
2018-01-12 18:56   ` [PATCH 0/4] Speed up git tag --contains csilvers
2018-03-03  5:15     ` Jeff King
2018-03-08 23:05       ` csilvers
2018-03-12 13:45       ` Derrick Stolee [this message]
2018-03-12 23:59         ` Jeff King
