git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jonathan Nieder <jrnieder@gmail.com>
To: Jeff King <peff@peff.net>
Cc: tytso@mit.edu, Avery Pennarun <apenwarr@gmail.com>,
	git@vger.kernel.org, Junio C Hamano <gitster@pobox.com>
Subject: Re: [RFC/PATCH 1/4] tag: speed up --contains calculation
Date: Wed, 13 Oct 2010 17:07:53 -0500	[thread overview]
Message-ID: <20101013220753.GA8674@burratino> (raw)
In-Reply-To: <20100705123335.GA25699@sigill.intra.peff.net>

Hi,

Jeff King wrote:

> When we want to know if commit A contains commit B (or any
> one of a set of commits, B through Z),

Let's revive this old thread.  Sorry for the long silence (the last
time this topic was visited was a couple months ago [1] afaict).

> Instead, let's leverage the fact that we are going to use
> the same --contains list for each tag, and mark areas of the
> commit graph is definitely containing those commits, or
> definitely not containing those commits. Later tags can then
> stop traversing as soon as they see a previously calculated
> answer.

Makes sense.  And the resulting timings are exciting.

> ---
> Note that this is a depth first search, whereas we generally traverse in
> a more breadth-first way. So it can actually make things slightly slower
> than the current merge-base code, if:
> 
>   1. You don't have any merge base calculations that involve going very
>      far back in history.
> 
>   2. We depth-first down the wrong side of a merge.
> 
> However, in the usual cases, I think it will perform much better.

Nit: Wouldn't this (or a summary of it) belong in the log message, too,
to avoid headscratching when a person tries to bisect a performance
regression down the line?

>   1. This can probably be a one-liner change to use in "git branch
>      --contains", as well. I didn't measure it, as that already tends to
>      be reasonably fast.

These days "branch -r --contains" has been _sloow_ for me.  So once
this is tuned, that wouldn't be a bad idea, methinks.

>   2. It uses commit marks, which means it doesn't behave well with other
>      traversals. I have no idea if I should be using alternate marks or
>      not.

This is what Junio was referring to with "the use of TMP_MARK smelled
somewhat bad to me", right?

>   3. We never clear the marks, or check them against the "want" list.

This too, I suppose.

> So
>      we just assume that repeated calls to contains will have the same
>      "want" list.

As long as this is advertised, I think it is fine.  If someone needs
to use --contains for multiple purposes in a builtin, that can be
dealt with then (maybe with a separate function to reset the marks).

Maybe it would make sense to keep this static in builtin/tag.c for
now, and defer cleaning up the interface to a separate patch.

> --- a/commit.c
> +++ b/commit.c

Now the meat of the patch.  git has a few algorithms for checking if
one commit is an ancestor of another.

is_descendant_of::

	1. find merge bases.  This is breadth-first search in
	   date order, but it walks through all ancestors as far
	   as I can tell.
	2. make them independent (seems pointless...).
	3. check if ancestor is one of them.

get_revision after setting UNINTERESTING (i.e., rev-list -1 ^tag cmit)::

	1. find interesting and uninteresting commits.  This is
	   breadth-first search in date order.
	2. when the first interesting commit is found, emit it
	   and stop.

Neither is optimized to check a batch of commits to find which contain
the commit of interest.

join_revs of show-branch::

	1. color the tag and cmit.
	2. let colors propagate back to all their ancestors, stopping
	   propagation along a given path when a commit has all colors.
	   This is breadth-first traversal in date order.
	3. for each commit with all colors, propagate that knowledge
	   to all parents we've encountered before.
	4. check what colors cmit has.

limit_to_ancestry::

	1. compute rev-list ^cmit tag in the usual way.
	2. reverse the revision list.
	3. color cmit with TMP_MARK and let that color permeate
	   up the backwards revision list.
	4. repeat! until no progress is made.
	4. check what color tag has.

None of these seem to tolerate timestamp skew. :(

Am I understanding correctly?

If so, just making is_descendant_of less stupid (e.g., not traversing
all history) would presumably help a lot.  Another approach would be
to try the show-branch tactic repeatedly with small enough batches of
tags to fit in the flag bitfield.  Yet another approach would be to
take the limit_to_ancestry one, but if the revision list is not
topologically sorted, it could be slow.

Hopefully your approach will be even better...

> @@ -845,3 +845,45 @@ int commit_tree(const char *msg, unsigned char *tree,
[...]
> +static int contains_recurse(struct commit *candidate,
> +			    const struct commit_list *want)
> +{
> +	struct commit_list *p;
> +
> +	/* was it previously marked as containing a want commit? */
> +	if (candidate->object.flags & TMP_MARK)
> +		return 1;
> +	/* or marked as not possibly containing a want commit? */
> +	if (candidate->object.flags & UNINTERESTING)
> +		return 0;
> +	/* or are we it? */
> +	if (in_commit_list(want, candidate))
> +		return 1;
> +
> +	if (parse_commit(candidate) < 0)
> +		return 0;
> +
> +	/* Otherwise recurse and mark ourselves for future traversals. */
> +	for (p = candidate->parents; p; p = p->next) {
> +		if (contains_recurse(p->item, want)) {
> +			candidate->object.flags |= TMP_MARK;
> +			return 1;
> +		}
> +	}
> +	candidate->object.flags |= UNINTERESTING;

Nice.  We cache the "is wanted" result using TMP_MARK and
UNINTERESTING.  It cannot be slower than is_descendant_of if I
understood correctly because the latter is a little crazy.

Possible unprincipled approach:

 1. Mark the wanted commits with TMP_MARK.
 2. Perform a breadth-first, date-order search starting at
    candidate to find a commit with TMP_MARK.
    a. If not found, mark the commits involved as UNINTERESTING
    b. If found, let the TMP_MARK permeate up using the
       limit_to_ancestry algorithm.
 3. Repeat with each other candidate.
 4. (If needed) walk starting at each candidate to clear the
    temporary flags.

I am not sure if that would be any faster, though.  What do you think?

[...]
> +int contains(struct commit *, const struct commit_list *);

Thanks for a pleasant read.

[1] http://thread.gmane.org/gmane.comp.version-control.git/152607/focus=152701

  reply	other threads:[~2010-10-13 22:11 UTC|newest]

Thread overview: 46+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-07-01  0:54 Why is "git tag --contains" so slow? Theodore Ts'o
2010-07-01  0:58 ` Shawn O. Pearce
2010-07-03 23:27   ` Sam Vilain
2010-07-01  1:00 ` Avery Pennarun
2010-07-01 12:17   ` tytso
2010-07-01 15:03     ` Jeff King
2010-07-01 15:38       ` Jeff King
2010-07-02 19:26         ` tytso
2010-07-03  8:06           ` Jeff King
2010-07-04  0:55             ` tytso
2010-07-05 12:27               ` Jeff King
2010-07-05 12:33                 ` [RFC/PATCH 1/4] tag: speed up --contains calculation Jeff King
2010-10-13 22:07                   ` Jonathan Nieder [this message]
2010-10-13 22:56                   ` Clemens Buchacher
2011-02-23 15:51                   ` Ævar Arnfjörð Bjarmason
2011-02-23 16:39                     ` Jeff King
2010-07-05 12:34                 ` [RFC/PATCH 2/4] limit "contains" traversals based on commit timestamp Jeff King
2010-10-13 23:21                   ` Jonathan Nieder
2010-07-05 12:35                 ` [RFC/PATCH 3/4] default core.clockskew variable to one day Jeff King
2010-07-05 12:36                 ` [RFC/PATCH 4/4] name-rev: respect core.clockskew Jeff King
2010-07-05 12:39                 ` Why is "git tag --contains" so slow? Jeff King
2010-10-14 18:59                   ` Jonathan Nieder
2010-10-16 14:32                     ` Clemens Buchacher
2010-10-27 17:11                       ` Jeff King
2010-10-28  8:07                         ` Clemens Buchacher
2010-07-05 14:10                 ` tytso
2010-07-06 11:58                   ` Jeff King
2010-07-06 15:31                     ` Will Palmer
2010-07-06 16:53                       ` tytso
2010-07-08 11:28                         ` Jeff King
2010-07-08 13:21                           ` Will Palmer
2010-07-08 13:54                             ` tytso
2010-07-07 17:45                       ` Jeff King
2010-07-08 10:29                         ` Theodore Tso
2010-07-08 11:12                           ` Jakub Narebski
2010-07-08 19:29                             ` Nicolas Pitre
2010-07-08 19:39                               ` Avery Pennarun
2010-07-08 20:13                                 ` Nicolas Pitre
2010-07-08 21:20                                   ` Jakub Narebski
2010-07-08 21:30                                     ` Sverre Rabbelier
2010-07-08 23:10                                       ` Nicolas Pitre
2010-07-08 23:15                                     ` Nicolas Pitre
2010-07-08 11:31                           ` Jeff King
2010-07-08 14:35                           ` Johan Herland
2010-07-08 19:06                           ` Nicolas Pitre
2010-07-07 17:50                       ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20101013220753.GA8674@burratino \
    --to=jrnieder@gmail.com \
    --cc=apenwarr@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=peff@peff.net \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).