git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jeff King <peff@peff.net>
To: tytso@mit.edu
Cc: Avery Pennarun <apenwarr@gmail.com>, git@vger.kernel.org
Subject: Re: Why is "git tag --contains" so slow?
Date: Sat, 3 Jul 2010 04:06:19 -0400	[thread overview]
Message-ID: <20100703080618.GA10483@sigill.intra.peff.net> (raw)
In-Reply-To: <20100702192612.GM1333@thunk.org>

On Fri, Jul 02, 2010 at 03:26:12PM -0400, tytso@mit.edu wrote:

> I just tried your patch, and with a large number of tags (198 tags,
> from v2.6.11 to v2.6.34 with all of the -rc releases of the linux
> kernel), it is indeed faster: 8.5 seconds without the patch versus 2.3
> seconds with the patch.
> 
> However, if I remove a large number of tags (since I know this is
> something that was introduced since 2.6.33, so I made a shared clone
> of the repository but then I removed all of the tags from 2.6.11
> through 2.6.33, so there was only 19 tags in play), the time to
> execute the git tag --contains became 1.3 seconds without the patch,
> versus 2.9 seconds without the patch.
> 
> So with the oldest tags removed, your patch actually made things run
> *slower* (2.3 vs 2.9 seconds, which was counter-intuitive to me), and
> fastest way to speed things up was to restrict the tags that would be
> searched.

I'm guessing that it is caused by the depth-first search that my patch
does. If we follow the "wrong" parent of a merge (i.e., the one that the
commit in question is not on), then we will end up hitting all commits
down to the roots before backtracking and looking down the right
parent.

I noticed that my improved time for "git tag --contains" was similar to
the total time for "git rev-list --all >/dev/null". Can you try timing
that? My suspicion is that it is going to be about 2.9 seconds for you.

So we could potentially improve my patch by doing a breadth-first
search, although that is a bit trickier (since the point is to mark and
prune whole subgraphs of history). But I'm not sure it is worth it in
practice. It will make some lookups quicker, but in most cases you will
end up going to the roots anyway for a negative lookup (in your case it
was only faster because you got rid of all the old tags). A better
strategy is to prune based on commit date so we _never_ go to the roots,
even for a negative hit. That should give similar speedups as a
breadth-first search would to your case, but also make the normal case
much faster by quickly discarding nonsensical paths (e.g., there is not
much point following a commit made 3 years ago to see if it contains a
commit made yesterday).

Try the patch below, which adds a date cutoff similar to the one used in
name-rev. It's much faster in my tests:

  # stock "git tag --contains HEAD~200"
  real    0m2.179s
  user    0m2.156s
  sys     0m0.020s

  # my first patch
  real    0m0.621s
  user    0m0.588s
  sys     0m0.032s

  # this patch
  real    0m0.041s
  user    0m0.036s
  sys     0m0.004s

For non-pathological cases, I expect it to perform equal to stock git at
worst, and in practice much better (for pathological cases, its worst
case is equivalent to "git rev-list --all", which is still not that
bad). Let me know how it does on your case.

The obvious downside is that it stops looking down a path in the face of
extreme clock skew. We could perhaps allow a "--contains=thorough" to
spend a little more time to achieve a better answer (i.e., ignore the
cutoff date).

diff --git a/builtin/tag.c b/builtin/tag.c
index d311491..5768a44 100644
--- a/builtin/tag.c
+++ b/builtin/tag.c
@@ -10,6 +10,8 @@
 #include "builtin.h"
 #include "refs.h"
 #include "tag.h"
+#include "diff.h"
+#include "revision.h"
 #include "run-command.h"
 #include "parse-options.h"
 
@@ -31,6 +33,61 @@ struct tag_filter {
 
 #define PGP_SIGNATURE "-----BEGIN PGP SIGNATURE-----"
 
+static int in_commit_list(const struct commit_list *want, struct commit *c)
+{
+	for (; want; want = want->next)
+		if (!hashcmp(want->item->object.sha1, c->object.sha1))
+			return 1;
+	return 0;
+}
+
+static int contains_recurse(const struct commit_list *want,
+			    struct commit *candidate,
+			    unsigned long cutoff)
+{
+	struct commit_list *p;
+
+	/* was it previously marked as containing a want commit? */
+	if (candidate->object.flags & TMP_MARK)
+		return 1;
+	/* or marked as not possibly containing a want commit? */
+	if (candidate->object.flags & UNINTERESTING)
+		return 0;
+	/* or are we it? */
+	if (in_commit_list(want, candidate))
+		return 1;
+
+	/* stop searching if we go too far back in time */
+	parse_commit(candidate);
+	if (candidate->date < cutoff)
+		return 0;
+
+	/* If not, then try parents, and be sure to mark ourselves
+	 * for future traversals. */
+	for (p = candidate->parents; p; p = p->next) {
+		if (contains_recurse(want, p->item, cutoff)) {
+			candidate->object.flags |= TMP_MARK;
+			return 1;
+		}
+	}
+
+	candidate->object.flags |= UNINTERESTING;
+	return 0;
+}
+
+static int contains(const struct commit_list *want, struct commit *candidate)
+{
+	const struct commit_list *c;
+	unsigned long min_date = ULONG_MAX;
+	for (c = want; c; c = c->next) {
+		parse_commit(c->item);
+		if (c->item->date < min_date)
+			min_date = c->item->date;
+	}
+	/* give a full day of clock skew slop */
+	return contains_recurse(want, candidate, min_date - 86400);
+}
+
 static int show_reference(const char *refname, const unsigned char *sha1,
 			  int flag, void *cb_data)
 {
@@ -49,7 +106,7 @@ static int show_reference(const char *refname, const unsigned char *sha1,
 			commit = lookup_commit_reference_gently(sha1, 1);
 			if (!commit)
 				return 0;
-			if (!is_descendant_of(commit, filter->with_commit))
+			if (!contains(filter->with_commit, commit))
 				return 0;
 		}
 

  reply	other threads:[~2010-07-03  8:06 UTC|newest]

Thread overview: 46+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2010-07-01  0:54 Why is "git tag --contains" so slow? Theodore Ts'o
2010-07-01  0:58 ` Shawn O. Pearce
2010-07-03 23:27   ` Sam Vilain
2010-07-01  1:00 ` Avery Pennarun
2010-07-01 12:17   ` tytso
2010-07-01 15:03     ` Jeff King
2010-07-01 15:38       ` Jeff King
2010-07-02 19:26         ` tytso
2010-07-03  8:06           ` Jeff King [this message]
2010-07-04  0:55             ` tytso
2010-07-05 12:27               ` Jeff King
2010-07-05 12:33                 ` [RFC/PATCH 1/4] tag: speed up --contains calculation Jeff King
2010-10-13 22:07                   ` Jonathan Nieder
2010-10-13 22:56                   ` Clemens Buchacher
2011-02-23 15:51                   ` Ævar Arnfjörð Bjarmason
2011-02-23 16:39                     ` Jeff King
2010-07-05 12:34                 ` [RFC/PATCH 2/4] limit "contains" traversals based on commit timestamp Jeff King
2010-10-13 23:21                   ` Jonathan Nieder
2010-07-05 12:35                 ` [RFC/PATCH 3/4] default core.clockskew variable to one day Jeff King
2010-07-05 12:36                 ` [RFC/PATCH 4/4] name-rev: respect core.clockskew Jeff King
2010-07-05 12:39                 ` Why is "git tag --contains" so slow? Jeff King
2010-10-14 18:59                   ` Jonathan Nieder
2010-10-16 14:32                     ` Clemens Buchacher
2010-10-27 17:11                       ` Jeff King
2010-10-28  8:07                         ` Clemens Buchacher
2010-07-05 14:10                 ` tytso
2010-07-06 11:58                   ` Jeff King
2010-07-06 15:31                     ` Will Palmer
2010-07-06 16:53                       ` tytso
2010-07-08 11:28                         ` Jeff King
2010-07-08 13:21                           ` Will Palmer
2010-07-08 13:54                             ` tytso
2010-07-07 17:45                       ` Jeff King
2010-07-08 10:29                         ` Theodore Tso
2010-07-08 11:12                           ` Jakub Narebski
2010-07-08 19:29                             ` Nicolas Pitre
2010-07-08 19:39                               ` Avery Pennarun
2010-07-08 20:13                                 ` Nicolas Pitre
2010-07-08 21:20                                   ` Jakub Narebski
2010-07-08 21:30                                     ` Sverre Rabbelier
2010-07-08 23:10                                       ` Nicolas Pitre
2010-07-08 23:15                                     ` Nicolas Pitre
2010-07-08 11:31                           ` Jeff King
2010-07-08 14:35                           ` Johan Herland
2010-07-08 19:06                           ` Nicolas Pitre
2010-07-07 17:50                       ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20100703080618.GA10483@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=apenwarr@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).