git@vger.kernel.org mailing list mirror (one of many)
 help / Atom feed
* [PATCH 0/4] Speed up git tag --contains
@ 2011-06-11 19:04 Ævar Arnfjörð Bjarmason
  2011-06-11 19:04 ` [PATCH 1/4] tag: speed up --contains calculation Ævar Arnfjörð Bjarmason
                   ` (4 more replies)
  0 siblings, 5 replies; 28+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2011-06-11 19:04 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King, Ævar Arnfjörð Bjarmason

This is a resubmission of Jeff King's patch series to speed up git tag
--contains with some changes. It's been cooking for a while as:

    * jk/tag-contains (2010-07-05) 4 commits
     - Why is "git tag --contains" so slow?
     - default core.clockskew variable to one day
     - limit "contains" traversals based on commit timestamp
     - tag: speed up --contains calculation
    
    The idea of the bottom one is probably Ok, except that the use of object
    flags needs to be rethought, or at least the helper needs to be moved to
    builtin/tag.c to make it clear that it should not be used outside the
    current usage context.

I've moved the relevant code from commit.[ch] to builtin/tag.c as
Junio's comment suggested. So IMO the "tag: speed up --contains
calculation" patch is ready to be applied.

The next two patches look OK to me, but they need some documentation
for the core.clockskew variable, which perhaps should be renamed to
tag.clockskew, or was the plan to use it for other things in the
future?

Is the "Why is "git tag --contains" so slow?" utility something we
want? We'd need some documentation for it, which I could
write. However I couldn't find the magic that turns --all into a
traversal of all revisions, and how that would work with supporting
another --verbose command-line option, to print out the revisions that
have high clock skew. I monkeypatched that in locally and found it
very useful to find the worst-case revisions, which in my case were on
topic branches that could simply be deleted.

In any case I've been running git with this series for a while, and
it's really helpful for a repository I work on with ~10k tags. I'm
willing to help get it accepted into the core.

Jeff King (4):
  tag: speed up --contains calculation
  limit "contains" traversals based on commit timestamp
  default core.clockskew variable to one day
  Why is "git tag --contains" so slow?

 .gitignore     |    1 +
 Makefile       |    1 +
 builtin.h      |    1 +
 builtin/skew.c |   50 ++++++++++++++++++++++++++++++++++++
 builtin/tag.c  |   76 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 git.c          |    1 +
 6 files changed, 129 insertions(+), 1 deletions(-)
 create mode 100644 builtin/skew.c

-- 
1.7.5.3

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 1/4] tag: speed up --contains calculation
  2011-06-11 19:04 [PATCH 0/4] Speed up git tag --contains Ævar Arnfjörð Bjarmason
@ 2011-06-11 19:04 ` Ævar Arnfjörð Bjarmason
  2011-06-11 19:04 ` [PATCH 2/4] limit "contains" traversals based on commit timestamp Ævar Arnfjörð Bjarmason
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 28+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2011-06-11 19:04 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King, Ævar Arnfjörð Bjarmason

From: Jeff King <peff@peff.net>

When we want to know if commit A contains commit B (or any
one of a set of commits, B through Z), we generally
calculate the merge bases and see if B is a merge base of A
(or for a set, if any of the commits B through Z have that
property).

When we are going to check a series of commits A1 through An
to see whether each contains B (e.g., because we are
deciding which tags to show with "git tag --contains"), we
do a series of merge base calculations. This can be very
expensive, as we repeat a lot of traversal work.

Instead, let's leverage the fact that we are going to use
the same --contains list for each tag, and mark areas of the
commit graph is definitely containing those commits, or
definitely not containing those commits. Later tags can then
stop traversing as soon as they see a previously calculated
answer.

This sped up "git tag --contains HEAD~200" in the linux-2.6
repository from:

  real    0m15.417s
  user    0m15.197s
  sys     0m0.220s

to:

  real    0m5.329s
  user    0m5.144s
  sys     0m0.184s

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 builtin/tag.c |   46 +++++++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 45 insertions(+), 1 deletions(-)

diff --git a/builtin/tag.c b/builtin/tag.c
index ec926fc..575a03c 100644
--- a/builtin/tag.c
+++ b/builtin/tag.c
@@ -12,6 +12,8 @@
 #include "tag.h"
 #include "run-command.h"
 #include "parse-options.h"
+#include "diff.h"
+#include "revision.h"
 
 static const char * const git_tag_usage[] = {
 	"git tag [-a|-s|-u <key-id>] [-f] [-m <msg>|-F <file>] <tagname> [<head>]",
@@ -29,6 +31,48 @@ struct tag_filter {
 	struct commit_list *with_commit;
 };
 
+static int in_commit_list(const struct commit_list *want, struct commit *c)
+{
+	for (; want; want = want->next)
+		if (!hashcmp(want->item->object.sha1, c->object.sha1))
+			return 1;
+	return 0;
+}
+
+static int contains_recurse(struct commit *candidate,
+			    const struct commit_list *want)
+{
+	struct commit_list *p;
+
+	/* was it previously marked as containing a want commit? */
+	if (candidate->object.flags & TMP_MARK)
+		return 1;
+	/* or marked as not possibly containing a want commit? */
+	if (candidate->object.flags & UNINTERESTING)
+		return 0;
+	/* or are we it? */
+	if (in_commit_list(want, candidate))
+		return 1;
+
+	if (parse_commit(candidate) < 0)
+		return 0;
+
+	/* Otherwise recurse and mark ourselves for future traversals. */
+	for (p = candidate->parents; p; p = p->next) {
+		if (contains_recurse(p->item, want)) {
+			candidate->object.flags |= TMP_MARK;
+			return 1;
+		}
+	}
+	candidate->object.flags |= UNINTERESTING;
+	return 0;
+}
+
+int contains(struct commit *candidate, const struct commit_list *want)
+{
+	return contains_recurse(candidate, want);
+}
+
 static int show_reference(const char *refname, const unsigned char *sha1,
 			  int flag, void *cb_data)
 {
@@ -47,7 +91,7 @@ static int show_reference(const char *refname, const unsigned char *sha1,
 			commit = lookup_commit_reference_gently(sha1, 1);
 			if (!commit)
 				return 0;
-			if (!is_descendant_of(commit, filter->with_commit))
+			if (!contains(commit, filter->with_commit))
 				return 0;
 		}
 
-- 
1.7.5.3

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 2/4] limit "contains" traversals based on commit timestamp
  2011-06-11 19:04 [PATCH 0/4] Speed up git tag --contains Ævar Arnfjörð Bjarmason
  2011-06-11 19:04 ` [PATCH 1/4] tag: speed up --contains calculation Ævar Arnfjörð Bjarmason
@ 2011-06-11 19:04 ` Ævar Arnfjörð Bjarmason
  2011-06-11 19:04 ` [PATCH 3/4] default core.clockskew variable to one day Ævar Arnfjörð Bjarmason
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 28+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2011-06-11 19:04 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King, Ævar Arnfjörð Bjarmason

From: Jeff King <peff@peff.net>

When looking for commits that contain other commits (e.g.,
via "git tag --contains"), we can end up traversing useless
portions of the graph. For example, if I am looking for a
tag that contains a commit made last week, there is not much
point in traversing portions of the history graph made five
years ago.

This optimization can provide massive speedups. For example,
doing "git tag --contains HEAD~200" in the linux-2.6
repository goes from:

  real    0m5.302s
  user    0m5.116s
  sys     0m0.184s

to:

  real    0m0.030s
  user    0m0.020s
  sys     0m0.008s

The downside is that we will no longer find some answers in
the face of extreme clock skew, as we will stop the
traversal early when seeing commits skewed too far into the
past.

Name-rev already implements a similar optimization, using a
"slop" of one day to allow for a certain amount of clock
skew in commit timestamps. This patch introduces a
"core.clockskew" variable, which allows specifying the
allowable amount of clock skew in seconds.  For safety, it
defaults to "none", causing a full traversal (i.e., no
change in behavior from previous versions).

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 builtin/tag.c |   36 +++++++++++++++++++++++++++++++++---
 1 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/builtin/tag.c b/builtin/tag.c
index 575a03c..0f0d784 100644
--- a/builtin/tag.c
+++ b/builtin/tag.c
@@ -25,6 +25,8 @@ static const char * const git_tag_usage[] = {
 
 static char signingkey[1000];
 
+static int core_clock_skew = -1;
+
 struct tag_filter {
 	const char *pattern;
 	int lines;
@@ -40,7 +42,8 @@ static int in_commit_list(const struct commit_list *want, struct commit *c)
 }
 
 static int contains_recurse(struct commit *candidate,
-			    const struct commit_list *want)
+			    const struct commit_list *want,
+			    unsigned long cutoff)
 {
 	struct commit_list *p;
 
@@ -57,9 +60,13 @@ static int contains_recurse(struct commit *candidate,
 	if (parse_commit(candidate) < 0)
 		return 0;
 
+	/* stop searching if we go too far back in time */
+	if (candidate->date < cutoff)
+		return 0;
+
 	/* Otherwise recurse and mark ourselves for future traversals. */
 	for (p = candidate->parents; p; p = p->next) {
-		if (contains_recurse(p->item, want)) {
+		if (contains_recurse(p->item, want, cutoff)) {
 			candidate->object.flags |= TMP_MARK;
 			return 1;
 		}
@@ -70,7 +77,22 @@ static int contains_recurse(struct commit *candidate,
 
 int contains(struct commit *candidate, const struct commit_list *want)
 {
-	return contains_recurse(candidate, want);
+	unsigned long cutoff = 0;
+
+	if (core_clock_skew >= 0) {
+		const struct commit_list *c;
+		unsigned long min_date = ULONG_MAX;
+		for (c = want; c; c = c->next) {
+			if (parse_commit(c->item) < 0)
+				continue;
+			if (c->item->date < min_date)
+				min_date = c->item->date;
+		}
+		if (min_date > core_clock_skew)
+			cutoff = min_date - core_clock_skew;
+	}
+
+	return contains_recurse(candidate, want, cutoff);
 }
 
 static int show_reference(const char *refname, const unsigned char *sha1,
@@ -277,6 +299,14 @@ static int git_tag_config(const char *var, const char *value, void *cb)
 		return 0;
 	}
 
+	if (!strcmp(var, "core.clockskew")) {
+		if (!value || !strcmp(value, "none"))
+			core_clock_skew = -1;
+		else
+			core_clock_skew = git_config_int(var, value);
+		return 0;
+	}
+
 	return git_default_config(var, value, cb);
 }
 
-- 
1.7.5.3

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 3/4] default core.clockskew variable to one day
  2011-06-11 19:04 [PATCH 0/4] Speed up git tag --contains Ævar Arnfjörð Bjarmason
  2011-06-11 19:04 ` [PATCH 1/4] tag: speed up --contains calculation Ævar Arnfjörð Bjarmason
  2011-06-11 19:04 ` [PATCH 2/4] limit "contains" traversals based on commit timestamp Ævar Arnfjörð Bjarmason
@ 2011-06-11 19:04 ` Ævar Arnfjörð Bjarmason
  2011-06-11 19:04 ` [PATCH 4/4] Why is "git tag --contains" so slow? Ævar Arnfjörð Bjarmason
  2011-07-06  6:40 ` [PATCH 0/4] Speed up git tag --contains Jeff King
  4 siblings, 0 replies; 28+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2011-06-11 19:04 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King, Ævar Arnfjörð Bjarmason

From: Jeff King <peff@peff.net>

This is the slop value used by name-rev, so presumably is a
reasonable default.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 builtin/tag.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/builtin/tag.c b/builtin/tag.c
index 0f0d784..1468813 100644
--- a/builtin/tag.c
+++ b/builtin/tag.c
@@ -25,7 +25,7 @@ static const char * const git_tag_usage[] = {
 
 static char signingkey[1000];
 
-static int core_clock_skew = -1;
+static int core_clock_skew = 86400;
 
 struct tag_filter {
 	const char *pattern;
-- 
1.7.5.3

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 4/4] Why is "git tag --contains" so slow?
  2011-06-11 19:04 [PATCH 0/4] Speed up git tag --contains Ævar Arnfjörð Bjarmason
                   ` (2 preceding siblings ...)
  2011-06-11 19:04 ` [PATCH 3/4] default core.clockskew variable to one day Ævar Arnfjörð Bjarmason
@ 2011-06-11 19:04 ` Ævar Arnfjörð Bjarmason
  2011-07-06  6:40 ` [PATCH 0/4] Speed up git tag --contains Jeff King
  4 siblings, 0 replies; 28+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2011-06-11 19:04 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Jeff King, Ævar Arnfjörð Bjarmason

From: Jeff King <peff@peff.net>

On Mon, Jul 05, 2010 at 08:27:23AM -0400, Jeff King wrote:

> As you probably guessed from the specificity of the number, I wrote a
> short program to actually traverse and find the worst skew. It takes
> about 5 seconds to run (unsurprisingly, since it is doing the same full
> traversal that we end up doing in the above numbers). So we could
> "autoskew" by setting up the configuration on clone, and then
> periodically updating it as part of "git gc".

This patch doesn't implement auto-detection of skew, but is the program
I used to calculate, and would provide the basis for such
auto-detection. It would be interesting to see average skew numbers for
popular repositories. You can run it as "git skew --all".

Signed-off-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com>
---
 .gitignore     |    1 +
 Makefile       |    1 +
 builtin.h      |    1 +
 builtin/skew.c |   50 ++++++++++++++++++++++++++++++++++++++++++++++++++
 git.c          |    1 +
 5 files changed, 54 insertions(+), 0 deletions(-)
 create mode 100644 builtin/skew.c

diff --git a/.gitignore b/.gitignore
index acffdfa..503ef8b 100644
--- a/.gitignore
+++ b/.gitignore
@@ -137,6 +137,7 @@
 /git-show-branch
 /git-show-index
 /git-show-ref
+/git-skew
 /git-stage
 /git-stash
 /git-status
diff --git a/Makefile b/Makefile
index e40ac0c..4ba5542 100644
--- a/Makefile
+++ b/Makefile
@@ -763,6 +763,7 @@ BUILTIN_OBJS += builtin/send-pack.o
 BUILTIN_OBJS += builtin/shortlog.o
 BUILTIN_OBJS += builtin/show-branch.o
 BUILTIN_OBJS += builtin/show-ref.o
+BUILTIN_OBJS += builtin/skew.o
 BUILTIN_OBJS += builtin/stripspace.o
 BUILTIN_OBJS += builtin/symbolic-ref.o
 BUILTIN_OBJS += builtin/tag.o
diff --git a/builtin.h b/builtin.h
index 0e9da90..0be47ca 100644
--- a/builtin.h
+++ b/builtin.h
@@ -143,5 +143,6 @@ extern int cmd_verify_pack(int argc, const char **argv, const char *prefix);
 extern int cmd_show_ref(int argc, const char **argv, const char *prefix);
 extern int cmd_pack_refs(int argc, const char **argv, const char *prefix);
 extern int cmd_replace(int argc, const char **argv, const char *prefix);
+extern int cmd_skew(int argc, const char **argv, const char *prefix);
 
 #endif
diff --git a/builtin/skew.c b/builtin/skew.c
new file mode 100644
index 0000000..1046f5f
--- /dev/null
+++ b/builtin/skew.c
@@ -0,0 +1,50 @@
+#include "cache.h"
+#include "commit.h"
+#include "diff.h"
+#include "revision.h"
+
+unsigned long worst_skew = 0;
+
+static void check_skew_recurse(struct commit *c, unsigned long when)
+{
+	struct commit_list *p;
+
+	if (c->object.flags & SEEN)
+		return;
+	c->object.flags |= SEEN;
+
+	if (parse_commit(c) < 0)
+		return;
+
+	if (c->date > when) {
+		unsigned long skew = c->date - when;
+		if (skew > worst_skew)
+			worst_skew = skew;
+	}
+
+	for (p = c->parents; p; p = p->next)
+		check_skew_recurse(p->item, c->date < when ? c->date : when);
+}
+
+static void check_skew(struct commit *c)
+{
+	check_skew_recurse(c, time(NULL));
+}
+
+int cmd_skew(int argc, const char **argv, const char *prefix) {
+	struct rev_info revs;
+	int i;
+
+	git_config(git_default_config, NULL);
+	init_revisions(&revs, prefix);
+	argc = setup_revisions(argc, argv, &revs, NULL);
+
+	for (i = 0; i < revs.pending.nr; i++) {
+		struct object *o = revs.pending.objects[i].item;
+		if (o->type == OBJ_COMMIT)
+			check_skew((struct commit *)o);
+	}
+
+	printf("%lu\n", worst_skew);
+	return 0;
+}
diff --git a/git.c b/git.c
index 89721d4..2404bf3 100644
--- a/git.c
+++ b/git.c
@@ -434,6 +434,7 @@ static void handle_internal_command(int argc, const char **argv)
 		{ "version", cmd_version },
 		{ "whatchanged", cmd_whatchanged, RUN_SETUP },
 		{ "write-tree", cmd_write_tree, RUN_SETUP },
+		{ "skew", cmd_skew, RUN_SETUP },
 	};
 	int i;
 	static const char ext[] = STRIP_EXTENSION;
-- 
1.7.5.3

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] Speed up git tag --contains
  2011-06-11 19:04 [PATCH 0/4] Speed up git tag --contains Ævar Arnfjörð Bjarmason
                   ` (3 preceding siblings ...)
  2011-06-11 19:04 ` [PATCH 4/4] Why is "git tag --contains" so slow? Ævar Arnfjörð Bjarmason
@ 2011-07-06  6:40 ` Jeff King
  2011-07-06  6:54   ` Jeff King
                     ` (2 more replies)
  4 siblings, 3 replies; 28+ messages in thread
From: Jeff King @ 2011-07-06  6:40 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Jonathan Nieder, Clemens Buchacher, git, Junio C Hamano

[+cc people who were interested in earlier iterations of this topic]

On Sat, Jun 11, 2011 at 07:04:07PM +0000, Ævar Arnfjörð Bjarmason wrote:

> This is a resubmission of Jeff King's patch series to speed up git tag
> --contains with some changes. It's been cooking for a while as:

Thanks for resurrecting this. I've been meaning to look at it again, and
somehow an entire year has passed. I've tried to refresh my memory on
the issues, so hopefully I can make coherent comments.

There have been a few responses in the meantime, and one I want to
address is Junio's:

  http://article.gmane.org/gmane.comp.version-control.git/152765

The major points in it are (I'm paraphrasing for brevity, but please
correct me if I'm misrepresenting):

  1. A depth-first algorithm has the problem of going down to the roots
     unnecessarily.

Yes, and that's why with my initial patch, "tag --contains" is on the
same order of time as "git rev-list --tags >/dev/null" (in the worst
case). But with the early return based on commit timestamp, we stop
looking down uninteresting paths early. So the downside isn't speed, but
trusting commit timestamps.

  2. One solution is to use merge-bases to find earlier cutoff points.

It's possible. But doesn't the merge bases algorithm, like all of the
commit walking, rely somewhat on commit timestamps, too? For example,
this thread shows issues with revision limiting:

  http://thread.gmane.org/gmane.comp.version-control.git/72274

I'm not sure about the merge bases algorithm, though. In the face of
skew, I think it can go down a non-optimal path (e.g., going all the way
to the root because one branch has commits skewed to look much older
than they really are, which pushes them to the back of the commit_list
priority queue). But I think it will still find the correct answer.

Two problems with doing a merge-base solution are:

  a. It can still end up hitting the roots, or close to them. The
     merge-base of something recent and a tag from years ago is going
     to have to go through years of history. So using a timestamp cutoff
     is really nice to know that there's no point in digging (on the
     other hand, searching for something from years ago with respect to
     recent tags will always have to dig through all of that history;
     however, I think this is less common than the other way around).

  b. If you are doing the merge-base over many tags at once, it's hard
     to figure out which source tag is actually responsible for hitting
     the merge base.

Which leads us to Junio's final point:

  3. You can do something like show-branch does, and smudge each commit
     with a bitfield that has one bit per tag (e.g., using the object
     flags).

My problem with this is that it doesn't scale algorithmically with many
tags. If we have a constant number of bits, then that reduces the number
of merge-base traversals we have to do by a constant number. Our
constant using the flags field would be 27. And reducing the time by a
factor of 27 is nice, but I suspect something like the 10K-tags example
is still going to be painful, if even one out of the 27 in each
traversal has dig far into history.

Another option is to trade space for time. Do one traversal, but
actually keep a large enough bitfield. For 10K tags, that's about 1K per
commit. So for git.git, that's 30M. For linux-2.6, it's 250M. Which is
getting pretty big. But remember that's an insane number of tags, and we
can also move the slider between time and space (e.g., 5 traversals of
50M each).

> I've moved the relevant code from commit.[ch] to builtin/tag.c as
> Junio's comment suggested. So IMO the "tag: speed up --contains
> calculation" patch is ready to be applied.

The only downside to that is that the code is harder to reuse in "branch
--contains", which could also benefit. I think the multiple merge-base
traversals tend not to be as bad, because branch tips tend to stay
recent, and you tend to ask for recent commits. So even though we dig
through the same commits multiple times, it all stays in recent history.
Whereas tags tend to point to very old things.

Still, that is dependent on your repo setup, including numbers of
branches and how stale they tend to be. It would be nice if we could
always be fast.

> The next two patches look OK to me, but they need some documentation
> for the core.clockskew variable, which perhaps should be renamed to
> tag.clockskew, or was the plan to use it for other things in the
> future?

It was intended to be used elsewhere. I have a patch to use it in
name-rev, which currently just has a hard-coded skew.

The problem with a skew variable like this is that you really don't want
to set it higher than a day or so. Because it affects all parts of the
traversal, not just the parts near the skewed commits. In linux-2.6, for
example, the worst skew is about 100 days. Here are timings for "git tag
--contains HEAD~200" with various core.clockskew values:

  - no clock skew tolerated: .035s
  - 1 day: .034s
  - 100 days of clock: .252s
  - infinite: 5.373s

So we are almost an order of magnitude slower by having set an
appropriate clockskew value. And that's only for 100 days. Some of the
projects have skew on the order of years.

If we can assume that the skewed commits are relatively rare[1], we
might do better to mark individual skewed commits via notes or the
replace mechanism. A simple test shows that doing notes lookups is not
too expensive:

  # pretend we have some fake timestamps
  for i in 20 40 60; do
    git notes add -m "fake timestamp" HEAD~$i
  done

  (best of 5)
  $ time git log --pretty=raw --no-notes >/dev/null
  real    0m3.868s
  user    0m3.796s
  sys     0m0.060s

  (best of 5)
  $ time git log --pretty=raw --show-notes >/dev/null
  real    0m3.878s
  user    0m3.812s
  sys     0m0.052s

And then any code wanting to limit traversal would have to check the
notes to see if the timestamp was valid (in fact, we would do even fewer
lookups, since we only need to check for a bogus timestamp at the edges
of our traversal).

The replace mechanism could be used instead; it has the advantage that
we wouldn't even need to change the traversal code; it would just see
the corrected objects with the right timestamp, and has similar
performance characteristics.

[1] The numbers from Jonathan and Clemens show that in most repos, the
numbers of skewed commits tend to be small (single-digits usually, or
even in the dozens; but much fewer than the total number of commits).

> Is the "Why is "git tag --contains" so slow?" utility something we
> want?

As it is now, I don't think so. Tweaking core.clockskew is slow, as
shown above. And it's not something people should have to do manually. I
have a version, which I'll post in a minute, which actually fills in a
notes tree with the sha1 of commits with bogus timestamps. And then
that tree can be consulted accurately and automatically.

However, if we're going to have a look-aside cache of metadata on each
commit, maybe it is really time to stop thinking about commit timestamps
and start thinking about "generation numbers". This concept has been
brought up before on the list; it's basically:

  1. Root commits have generation = 0.

  2. Other commits have generation = 1 + max(generations of parents).

So it's a strictly increasing number, and you know that, given X > Y, X
cannot possibly be an ancestor of Y.

If this were stored in the commit object, we could use it for all
traversals instead of the commit timestamp, and it would presumably be
more reliable (you could still have a bogus repo, of course, but it
would come from a bug in git, not from importing old history or having
your clock set wrong).

The problem is that existing objects don't have this generation number.
It's easy to calculate, though, and we could in theory use a notes-cache
to store it externally. Obviously the complexity and performance aren't
going to be as good as if it were just in the commit object, but we're
sadly 6 years too late to make that decision.

-Peff

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] Speed up git tag --contains
  2011-07-06  6:40 ` [PATCH 0/4] Speed up git tag --contains Jeff King
@ 2011-07-06  6:54   ` Jeff King
  2011-07-06 19:06     ` Clemens Buchacher
  2011-07-06  6:56   ` Jonathan Nieder
  2018-01-12 18:56   ` [PATCH 0/4] Speed up git tag --contains csilvers
  2 siblings, 1 reply; 28+ messages in thread
From: Jeff King @ 2011-07-06  6:54 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Jonathan Nieder, Clemens Buchacher, git, Junio C Hamano

On Wed, Jul 06, 2011 at 02:40:12AM -0400, Jeff King wrote:

> As it is now, I don't think so. Tweaking core.clockskew is slow, as
> shown above. And it's not something people should have to do manually. I
> have a version, which I'll post in a minute, which actually fills in a
> notes tree with the sha1 of commits with bogus timestamps. And then
> that tree can be consulted accurately and automatically.

Here's that patch. I think I did this after our discussion at
GitTogether 2010, and haven't looked at it since. So beware.

Clemens mentioned elsewhere that my skew-detection programs only find
skew in one direction. And I think that may be a problem here. We are
basically finding commits which have a timestamp in the past from their
most recent parent. So that means in a history like this:

  A---B---C--D--E--F

  timestamp(A) = 1
  timestamp(B) = 2
  timestamp(C) = 1
  timestamp(D) = 4
  timestamp(E) = 5

We will see that commit C is bogus, since it is in the past from its
parent. But if a commit skews to the future:

  timestamp(A) = 1
  timestamp(B) = 2
  timestamp(C) = 6
  timestamp(D) = 4
  timestamp(E) = 5

then everything _after_ it will look bogus (D and E, in this case).

>From what we've seen, it seems like skewing into the past is more
common. It seems to come from importing old commits and using their
timestamps as the commit timestamps. It would be nice to find a more
accurate set (I _think_ with future skew like the second example above,
the patch below will not give wrong answers; it will just be overly
pessimal and traverse more commits than it needs to).

---
 .gitignore     |    1 +
 Makefile       |    1 +
 builtin.h      |    1 +
 builtin/skew.c |   56 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 commit.c       |   20 +++++++++++++++++---
 git.c          |    1 +
 6 files changed, 77 insertions(+), 3 deletions(-)
 create mode 100644 builtin/skew.c

diff --git a/.gitignore b/.gitignore
index acffdfa..503ef8b 100644
--- a/.gitignore
+++ b/.gitignore
@@ -137,6 +137,7 @@
 /git-show-branch
 /git-show-index
 /git-show-ref
+/git-skew
 /git-stage
 /git-stash
 /git-status
diff --git a/Makefile b/Makefile
index f8c72e1..f6ecc27 100644
--- a/Makefile
+++ b/Makefile
@@ -765,6 +765,7 @@ BUILTIN_OBJS += builtin/send-pack.o
 BUILTIN_OBJS += builtin/shortlog.o
 BUILTIN_OBJS += builtin/show-branch.o
 BUILTIN_OBJS += builtin/show-ref.o
+BUILTIN_OBJS += builtin/skew.o
 BUILTIN_OBJS += builtin/stripspace.o
 BUILTIN_OBJS += builtin/symbolic-ref.o
 BUILTIN_OBJS += builtin/tag.o
diff --git a/builtin.h b/builtin.h
index 0e9da90..0be47ca 100644
--- a/builtin.h
+++ b/builtin.h
@@ -143,5 +143,6 @@ extern int cmd_verify_pack(int argc, const char **argv, const char *prefix);
 extern int cmd_show_ref(int argc, const char **argv, const char *prefix);
 extern int cmd_pack_refs(int argc, const char **argv, const char *prefix);
 extern int cmd_replace(int argc, const char **argv, const char *prefix);
+extern int cmd_skew(int argc, const char **argv, const char *prefix);
 
 #endif
diff --git a/builtin/skew.c b/builtin/skew.c
new file mode 100644
index 0000000..796b02a
--- /dev/null
+++ b/builtin/skew.c
@@ -0,0 +1,56 @@
+#include "cache.h"
+#include "commit.h"
+#include "diff.h"
+#include "revision.h"
+#include "notes-cache.h"
+
+struct notes_cache bogus_timestamps;
+
+static unsigned long check_skew(struct commit *c)
+{
+	struct commit_list *p;
+	unsigned long most_recent;
+
+	if (c->util)
+		return (unsigned long)c->util;
+
+	if (parse_commit(c) < 0)
+		die("unable to parse commit: %s", sha1_to_hex(c->object.sha1));
+
+	most_recent = 0;
+	for (p = c->parents; p; p = p->next) {
+		unsigned long timestamp = check_skew(p->item);
+		if (timestamp > most_recent)
+			most_recent = timestamp;
+	}
+
+	if (c->date + 86400 < most_recent)
+		notes_cache_put(&bogus_timestamps, c->object.sha1, "", 0);
+	else
+		most_recent = c->date;
+
+	c->util = (void *)most_recent;
+	return most_recent;
+}
+
+int cmd_skew(int argc, const char **argv, const char *prefix) {
+	struct rev_info revs;
+	int i;
+
+	git_config(git_default_config, NULL);
+	init_revisions(&revs, prefix);
+	argc = setup_revisions(argc, argv, &revs, NULL);
+
+	notes_cache_init(&bogus_timestamps, "traversal-cutoff-ignore", "v1");
+
+	for (i = 0; i < revs.pending.nr; i++) {
+		struct object *o = revs.pending.objects[i].item;
+		if (o->type == OBJ_COMMIT)
+			check_skew((struct commit *)o);
+	}
+
+	if (notes_cache_write(&bogus_timestamps) < 0)
+		die_errno("unable to write notes tree");
+
+	return 0;
+}
diff --git a/commit.c b/commit.c
index 6647609..1bec48b 100644
--- a/commit.c
+++ b/commit.c
@@ -5,7 +5,7 @@
 #include "utf8.h"
 #include "diff.h"
 #include "revision.h"
-#include "notes.h"
+#include "notes-cache.h"
 
 int core_clock_skew = 86400;
 int save_commit_buffer = 1;
@@ -888,6 +888,19 @@ static int in_commit_list(const struct commit_list *want, struct commit *c)
 	return 0;
 }
 
+static int has_bogus_timestamp(unsigned char sha1[20])
+{
+	static int initialized;
+	static struct notes_cache bogus;
+
+	if (!initialized) {
+		notes_cache_init(&bogus, "traversal-cutoff-ignore", "v1");
+		initialized = 1;
+	}
+
+	return get_note(&bogus.tree, sha1) != NULL;
+}
+
 static int contains_recurse(struct commit *candidate,
 			    const struct commit_list *want,
 			    unsigned long cutoff)
@@ -908,8 +921,9 @@ static int contains_recurse(struct commit *candidate,
 		return 0;
 
 	/* stop searching if we go too far back in time */
-	if (candidate->date < cutoff)
-		return 0;
+	if (candidate->date < cutoff &&
+	    !has_bogus_timestamp(candidate->object.sha1))
+			return 0;
 
 	/* Otherwise recurse and mark ourselves for future traversals. */
 	for (p = candidate->parents; p; p = p->next) {
diff --git a/git.c b/git.c
index 8828c18..7adeba7 100644
--- a/git.c
+++ b/git.c
@@ -408,6 +408,7 @@ static void handle_internal_command(int argc, const char **argv)
 		{ "show", cmd_show, RUN_SETUP },
 		{ "show-branch", cmd_show_branch, RUN_SETUP },
 		{ "show-ref", cmd_show_ref, RUN_SETUP },
+		{ "skew", cmd_skew, RUN_SETUP },
 		{ "stage", cmd_add, RUN_SETUP | NEED_WORK_TREE },
 		{ "status", cmd_status, RUN_SETUP | NEED_WORK_TREE },
 		{ "stripspace", cmd_stripspace },
-- 
1.7.6.20.g45f3f.dirty

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] Speed up git tag --contains
  2011-07-06  6:40 ` [PATCH 0/4] Speed up git tag --contains Jeff King
  2011-07-06  6:54   ` Jeff King
@ 2011-07-06  6:56   ` Jonathan Nieder
  2011-07-06  7:03     ` Jeff King
  2018-01-12 18:56   ` [PATCH 0/4] Speed up git tag --contains csilvers
  2 siblings, 1 reply; 28+ messages in thread
From: Jonathan Nieder @ 2011-07-06  6:56 UTC (permalink / raw)
  To: Jeff King
  Cc: Ævar Arnfjörð Bjarmason, Clemens Buchacher, git,
	Junio C Hamano

Jeff King wrote:

> The problem is that existing objects don't have this generation number.
> It's easy to calculate, though, and we could in theory use a notes-cache
> to store it externally. Obviously the complexity and performance aren't
> going to be as good as if it were just in the commit object, but we're
> sadly 6 years too late to make that decision.

I am still digesting the rest of what you wrote, but wouldn't this be
easy to do today?  One could just use a notes-cache while prototyping
and if it seems to work well, introduce new loose and packed object
formats that include a field for the cached generation number.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] Speed up git tag --contains
  2011-07-06  6:56   ` Jonathan Nieder
@ 2011-07-06  7:03     ` Jeff King
  2011-07-06 14:26       ` generation numbers (was: [PATCH 0/4] Speed up git tag --contains) Jakub Narebski
  0 siblings, 1 reply; 28+ messages in thread
From: Jeff King @ 2011-07-06  7:03 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Ævar Arnfjörð Bjarmason, Clemens Buchacher, git,
	Junio C Hamano

On Wed, Jul 06, 2011 at 01:56:23AM -0500, Jonathan Nieder wrote:

> Jeff King wrote:
> 
> > The problem is that existing objects don't have this generation number.
> > It's easy to calculate, though, and we could in theory use a notes-cache
> > to store it externally. Obviously the complexity and performance aren't
> > going to be as good as if it were just in the commit object, but we're
> > sadly 6 years too late to make that decision.
> 
> I am still digesting the rest of what you wrote, but wouldn't this be
> easy to do today?  One could just use a notes-cache while prototyping
> and if it seems to work well, introduce new loose and packed object
> formats that include a field for the cached generation number.

Yes, that's exactly how to do it. I'm just not sure "introduce new loose
and packed object formats" is "easy to do". Though I'm not sure we need
new formats. It is really just a new header in the commit object. And if
we write the code carefully, we should be able to transparently use
newly-generated objects with the field, and fall back to a notes-cache
(with autogeneration) when it isn't there.

Existing git will ignore the new generation field. It does mean that old
and new git will generate different sha1s for the exact same commit. I
don't know how big a deal this is in practice. It matters a lot more for
blobs and trees. But for commits, even if you are replaying a commit,
you should be updating the commit timestamp, which is going to give a
new sha1.

The other thing I worry about is performance. You are building a full
notes tree and looking up every commit in the traversal. I don't know
how bad that will be (though from my other back-of-the-envelope tests,
it may not actually be that bad; notes were designed to be fast for
exactly this case).

-Peff

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: generation numbers (was: [PATCH 0/4] Speed up git tag --contains)
  2011-07-06  7:03     ` Jeff King
@ 2011-07-06 14:26       ` Jakub Narebski
  2011-07-06 15:01         ` Ted Ts'o
  0 siblings, 1 reply; 28+ messages in thread
From: Jakub Narebski @ 2011-07-06 14:26 UTC (permalink / raw)
  To: Jeff King
  Cc: Jonathan Nieder, Ævar Arnfjörð Bjarmason,
	Clemens Buchacher, git, Junio C Hamano, Jakub Narebski

Jeff King <peff@peff.net> writes:
> On Wed, Jul 06, 2011 at 01:56:23AM -0500, Jonathan Nieder wrote:
> > Jeff King wrote:
> > 
> > > The problem is that existing objects don't have this generation number.
> > > It's easy to calculate, though, and we could in theory use a notes-cache
> > > to store it externally. Obviously the complexity and performance aren't
> > > going to be as good as if it were just in the commit object, but we're
> > > sadly 6 years too late to make that decision.
> > 
> > I am still digesting the rest of what you wrote, but wouldn't this be
> > easy to do today?  One could just use a notes-cache while prototyping
> > and if it seems to work well, introduce new loose and packed object
> > formats that include a field for the cached generation number.
> 
> Yes, that's exactly how to do it. I'm just not sure "introduce new loose
> and packed object formats" is "easy to do". Though I'm not sure we need
> new formats. It is really just a new header in the commit object. And if
> we write the code carefully, we should be able to transparently use
> newly-generated objects with the field, and fall back to a notes-cache
> (with autogeneration) when it isn't there.

I understand that you would do autogeneration at least when you create
a commit, and at least one of parents does not have generation number.

You can also autogenerate notes-cache when following commits, and
encountering commit object without generation number.  

Or make "git gc" autogenerate cache-notes for generation number,
perhaps with an option (i.e. probably not for "git gc --auto").
 
> Existing git will ignore the new generation field. It does mean that old
> and new git will generate different sha1s for the exact same commit. I
> don't know how big a deal this is in practice. It matters a lot more for
> blobs and trees. But for commits, even if you are replaying a commit,
> you should be updating the commit timestamp, which is going to give a
> new sha1.
> 
> The other thing I worry about is performance. You are building a full
> notes tree and looking up every commit in the traversal. I don't know
> how bad that will be (though from my other back-of-the-envelope tests,
> it may not actually be that bad; notes were designed to be fast for
> exactly this case).

Well, one thing that it would test our notes infrastructure...

-- 
Jakub Narebski
Poland
ShadeHawk on #git

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: generation numbers (was: [PATCH 0/4] Speed up git tag --contains)
  2011-07-06 14:26       ` generation numbers (was: [PATCH 0/4] Speed up git tag --contains) Jakub Narebski
@ 2011-07-06 15:01         ` Ted Ts'o
  2011-07-06 18:12           ` Jeff King
  0 siblings, 1 reply; 28+ messages in thread
From: Ted Ts'o @ 2011-07-06 15:01 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Jeff King, Jonathan Nieder,
	Ævar Arnfjörð Bjarmason, Clemens Buchacher, git,
	Junio C Hamano

Is it worth it to try to replicate this information across repositories?

Why not just simply have a cache file in the git directory which is
managed somewhat like gitk.cache; call it generation.cache?

						- Ted

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: generation numbers (was: [PATCH 0/4] Speed up git tag --contains)
  2011-07-06 15:01         ` Ted Ts'o
@ 2011-07-06 18:12           ` Jeff King
  2011-07-06 18:46             ` Jakub Narebski
  2011-07-06 23:22             ` Junio C Hamano
  0 siblings, 2 replies; 28+ messages in thread
From: Jeff King @ 2011-07-06 18:12 UTC (permalink / raw)
  To: Ted Ts'o
  Cc: Jakub Narebski, Jonathan Nieder,
	Ævar Arnfjörð Bjarmason, Clemens Buchacher, git,
	Junio C Hamano

On Wed, Jul 06, 2011 at 11:01:03AM -0400, Ted Ts'o wrote:

> Is it worth it to try to replicate this information across repositories?

Probably not. I suggested notes-cache just because the amount of code is
very trivial.

One problem with notes storage is that it's not well optimized for tiny
pieces of data like this (e.g., the generation number should fit in a
32-bit unsigned int, as its max is the size of the longest single path
in the history graph). But notes are much more general; we will actually
map each commit to a blob object containing the generation number, which
is pretty wasteful.

> Why not just simply have a cache file in the git directory which is
> managed somewhat like gitk.cache; call it generation.cache?

Yeah, that would be fine. With a sorted list of binary sha1s and 32-bit
generation numbers, you're talking about 24 bytes per commit. Or a 6
megabyte cache for linux-2.6.

You'd probably want to be a little clever with updates. If I have
calculated the generation number of every commit, and then do "git
commit; git tag --contains HEAD", you probably don't want to rewrite the
entire cache. You could probably journal a fixed number of entries in an
unsorted file (or even in a parallel directory structure to loose
objects), and then periodically write out the whole sorted list when the
journal gets too big. Or choose a more clever data structure that can do
in-place updates.

-Peff

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: generation numbers (was: [PATCH 0/4] Speed up git tag --contains)
  2011-07-06 18:12           ` Jeff King
@ 2011-07-06 18:46             ` Jakub Narebski
  2011-07-07 18:59               ` Jeff King
  2011-07-06 23:22             ` Junio C Hamano
  1 sibling, 1 reply; 28+ messages in thread
From: Jakub Narebski @ 2011-07-06 18:46 UTC (permalink / raw)
  To: Jeff King
  Cc: Ted Ts'o, Jonathan Nieder,
	Ævar Arnfjörð Bjarmason, Clemens Buchacher, git,
	Junio C Hamano

On Wed, 6 Jul 2011, Jeff King wrote:
> On Wed, Jul 06, 2011 at 11:01:03AM -0400, Ted Ts'o wrote:
> 
> > Is it worth it to try to replicate this information across repositories?
> 
> Probably not. I suggested notes-cache just because the amount of code is
> very trivial.

Well, generation numbers are universal and would help everybody.  For
new commits with 'generation' header those would be always replicated,
for old commits with 'generation' notes / notes-cache the can be
replicated.
 
> One problem with notes storage is that it's not well optimized for tiny
> pieces of data like this (e.g., the generation number should fit in a
> 32-bit unsigned int, as its max is the size of the longest single path
> in the history graph). But notes are much more general; we will actually
> map each commit to a blob object containing the generation number, which
> is pretty wasteful.

Wasn't textconv-cache using commit-less notes?  The same can be done
for generation notes-cache.  Though it is still wasteful...  By the
way, would we be using text representation (like in 'generation'
commit header) or 32-bit integer binary representation in some
ordering, or variable-length integer (I think git uses them somewhere)?

Nb. I wonder if 32-bit unsigned int would always be enough, for example
Linux kernel + history.

> > Why not just simply have a cache file in the git directory which is
> > managed somewhat like gitk.cache; call it generation.cache?
> 
> Yeah, that would be fine. With a sorted list of binary sha1s and 32-bit
> generation numbers, you're talking about 24 bytes per commit. Or a 6
> megabyte cache for linux-2.6.
> 
> You'd probably want to be a little clever with updates. If I have
> calculated the generation number of every commit, and then do "git
> commit; git tag --contains HEAD", you probably don't want to rewrite the
> entire cache. You could probably journal a fixed number of entries in an
> unsorted file (or even in a parallel directory structure to loose
> objects), and then periodically write out the whole sorted list when the
> journal gets too big. Or choose a more clever data structure that can do
> in-place updates.

And that is the difference between gitk.cache (generated _once_ when starting
gitk, and regenerated on request), and idea of generation.cache

I think it would be simpler to use generation header + generation notes.
Or start with generation notes only.

-- 
Jakub Narebski
Poland

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] Speed up git tag --contains
  2011-07-06  6:54   ` Jeff King
@ 2011-07-06 19:06     ` Clemens Buchacher
  0 siblings, 0 replies; 28+ messages in thread
From: Clemens Buchacher @ 2011-07-06 19:06 UTC (permalink / raw)
  To: Jeff King
  Cc: Ævar Arnfjörð Bjarmason, Jonathan Nieder, git,
	Junio C Hamano, Thomas Rast

On Wed, Jul 06, 2011 at 02:54:52AM -0400, Jeff King wrote:
>
> From what we've seen, it seems like skewing into the past is more
> common. It seems to come from importing old commits and using their
> timestamps as the commit timestamps. It would be nice to find a more
> accurate set (I _think_ with future skew like the second example above,
> the patch below will not give wrong answers; it will just be overly
> pessimal and traverse more commits than it needs to).

Yes, and that was indeed my only concern. Since we cannot tell with
certainty if we have skew into the past or into the future, it's
not wrong to always assume skew into the past. It just does not
always produce the shortest run of skewed commits, as you said. And
if skews into the future are rare, then that should not be an
issue.

But considering the complexity behind the timestamp based approach,
which you have demonstrated in your analysis, the generation number
concept looks very attractive to me.

It even has potential for the push/pull transport protocol.
(Unreliable) commit timestamps are currently used while searching
for common commits. And there is still the problem of searching
down the wrong branch, which can be especially bad for repos with
multiple disjoint histories. For example, we shouldn't send any
HAVEs for commits with generation numbers greater than the
generation number of the wanted ref. Or smaller than half that (in
which case downloading the complete pack would probably be faster).

Thomas, IIRC you were working on this. Do you think this could
help?

Clemens

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: generation numbers
  2011-07-06 18:12           ` Jeff King
  2011-07-06 18:46             ` Jakub Narebski
@ 2011-07-06 23:22             ` Junio C Hamano
  2011-07-07 19:08               ` Jeff King
  1 sibling, 1 reply; 28+ messages in thread
From: Junio C Hamano @ 2011-07-06 23:22 UTC (permalink / raw)
  To: Jeff King
  Cc: Ted Ts'o, Jakub Narebski, Jonathan Nieder,
	Ævar Arnfjörð Bjarmason, Clemens Buchacher, git

Jeff King <peff@peff.net> writes:

> Yeah, that would be fine. With a sorted list of binary sha1s and 32-bit
> generation numbers, you're talking about 24 bytes per commit. Or a 6
> megabyte cache for linux-2.6.
>
> You'd probably want to be a little clever with updates. If I have
> calculated the generation number of every commit, and then do "git
> commit; git tag --contains HEAD", you probably don't want to rewrite the
> entire cache. You could probably journal a fixed number of entries in an
> unsorted file (or even in a parallel directory structure to loose
> objects), and then periodically write out the whole sorted list when the
> journal gets too big. Or choose a more clever data structure that can do
> in-place updates.

As to the low level implementation detail I agree everything you said, but
I have been wondering how the generation number should intereact with
grafts and replaces. It certainly would be safest whenever you change
grafts (which should be a rare event anyway).

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: generation numbers (was: [PATCH 0/4] Speed up git tag --contains)
  2011-07-06 18:46             ` Jakub Narebski
@ 2011-07-07 18:59               ` Jeff King
  2011-07-07 19:34                 ` generation numbers Junio C Hamano
  0 siblings, 1 reply; 28+ messages in thread
From: Jeff King @ 2011-07-07 18:59 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Ted Ts'o, Jonathan Nieder,
	Ævar Arnfjörð Bjarmason, Clemens Buchacher, git,
	Junio C Hamano

On Wed, Jul 06, 2011 at 08:46:42PM +0200, Jakub Narebski wrote:

> > > Is it worth it to try to replicate this information across repositories?
> > 
> > Probably not. I suggested notes-cache just because the amount of code is
> > very trivial.
> 
> Well, generation numbers are universal and would help everybody.  For
> new commits with 'generation' header those would be always replicated,
> for old commits with 'generation' notes / notes-cache the can be
> replicated.

Sure. But it's not worth trying to transfer them between repositories,
when it only takes a few seconds to generate them locally (approximately
the same amount of time that "git rev-list --all >/dev/null" takes).

> > One problem with notes storage is that it's not well optimized for tiny
> > pieces of data like this (e.g., the generation number should fit in a
> > 32-bit unsigned int, as its max is the size of the longest single path
> > in the history graph). But notes are much more general; we will actually
> > map each commit to a blob object containing the generation number, which
> > is pretty wasteful.
> 
> Wasn't textconv-cache using commit-less notes?  The same can be done
> for generation notes-cache.

No, textconv actually uses parentless commits. But the issue I'm talking
about is not the history storage. It's the value storage. Instead of
pointing to a 32-bit int, we point to a 160-bit sha1 that references an
object containing the value (and the object header is going to be as big
as the value itself). So it's wasteful in storage, and it's wasteful in
the amount of work to do a lookup.

You could "cheat" and instead of storing the sha1 of a blob object in
the notes tree, use the lower 32 bits to store an actual value. I don't
think that currently breaks any assumptions in the notes code, but it
definitely is against the intent of it.

I wrote a patch to calculate and cache commit generation numbers on the
fly, and output them via the "%G" format placeholder. So we can get some
timings (these are from git.git):

  # baseline to compare against; print the commiter timestamp of every
  # commit, which is about how expensive it would be to parse and print
  # an embedded generation number
  $ time git log --format=%ct >/dev/null
  real    0m0.388s
  user    0m0.380s
  sys     0m0.004s

  # and the baseline amount of storage used (fully packed)
  $ du -s .git/objects
  47072   .git/objects

  # now the cost to generate the whole cache; slower, obviously, but it
  # only needs to happen once
  $ time git log --format=%G >/dev/null
  real    0m2.180s
  user    0m1.204s
  sys     0m0.960s

  # at which point everything is loose, and we are wasting tons of space
  $ du -s .git/objects
  171692  .git/objects

  # and traversing with generation lookup is still a bit slower than
  # without it
  $ time git log --format=%G >/dev/null
  real    0m0.822s
  user    0m0.544s
  sys     0m0.272s

  # but repacking helps with the space; now we're using only ~3M
  $ git gc
  $ du -s .git/objects
  50236   .git/objects

  # and traversal is faster, but still about 33% slower than our
  # baseline
  real    0m0.490s
  user    0m0.468s
  sys     0m0.020s

So I suspect we could do better with a data structure optimized for this
type of storage.

> Though it is still wasteful...  By the way, would we be using text
> representation (like in 'generation' commit header) or 32-bit integer
> binary representation in some ordering, or variable-length integer (I
> think git uses them somewhere)?

For a local lookup cache, I would use a fixed-size binary integer just
to keep the lookup data structure simple (then you know the width of
each record ahead of time). For a generation commit header, obviously we
would go with the ascii representation as we do for other headers.

> Nb. I wonder if 32-bit unsigned int would always be enough, for example
> Linux kernel + history.

Yeah, it should be plenty. There are only 250K commits in linux-2.6, and
the absolute worst case generation number is 250K (if they are all
linear). But the history is an n-ary tree, and the highest generation is
actually the height of the tree. So the more branchy the history, the
smaller the height.

The generation of v3.0-rc6, representing 6 years of history, is 79044.
At that rate, the linux-2.6 repository will overflow in a mere 320,000
years.

> I think it would be simpler to use generation header + generation notes.
> Or start with generation notes only.

The patch implementing generation notes is below. The implementation is
quite simple, but it would be nice if it were faster.

---
 commit.c |   81 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 commit.h |    2 +
 pretty.c |    3 ++
 3 files changed, 86 insertions(+), 0 deletions(-)

diff --git a/commit.c b/commit.c
index ac337c7..493517c 100644
--- a/commit.c
+++ b/commit.c
@@ -6,6 +6,7 @@
 #include "diff.h"
 #include "revision.h"
 #include "notes.h"
+#include "notes-cache.h"
 
 int save_commit_buffer = 1;
 
@@ -878,3 +879,83 @@ int commit_tree(const char *msg, unsigned char *tree,
 	strbuf_release(&buffer);
 	return result;
 }
+
+static struct notes_cache generations;
+
+static int generation_from_cache(struct commit *c, unsigned long *g)
+{
+	char *buf, *end;
+	size_t len;
+
+	buf = notes_cache_get(&generations, c->object.sha1, &len);
+	if (!buf)
+		return -1;
+
+	errno = 0;
+	*g = strtoul(buf, &end, 10);
+	if (errno == ERANGE || *end != '\0') {
+		free(buf);
+		return -1;
+	}
+
+	free(buf);
+	return 0;
+}
+
+static void generation_to_cache(struct commit *c, unsigned long g)
+{
+	char buf[64];
+	int len;
+
+	len = snprintf(buf, sizeof(buf), "%lu", g);
+	notes_cache_put(&generations, c->object.sha1, buf, len);
+}
+
+static unsigned long commit_generation_recurse(struct commit *c)
+{
+	struct commit_list *p;
+	unsigned long r;
+
+	if (!generation_from_cache(c, &r))
+		return r;
+
+	if (parse_commit(c) < 0)
+		die("unable to parse commit: %s", sha1_to_hex(c->object.sha1));
+
+	if (!c->parents)
+		return 0;
+
+	r = 0;
+	for (p = c->parents; p; p = p->next) {
+		unsigned long pgen = commit_generation_recurse(p->item);
+		if (pgen > r)
+			r = pgen;
+	}
+	r++;
+
+	generation_to_cache(c, r);
+	return r;
+}
+
+int installed_generation_writer;
+static void write_generation_cache(void)
+{
+	notes_cache_write(&generations);
+}
+
+unsigned long commit_generation(const struct commit *commit)
+{
+	unsigned long r;
+
+	if (!generations.tree.initialized)
+		notes_cache_init(&generations, "generations", "v1");
+
+	r = commit_generation_recurse((struct commit *)commit);
+
+	if (!installed_generation_writer) {
+		atexit(write_generation_cache);
+		installed_generation_writer = 1;
+	}
+
+	return r;
+}
diff --git a/commit.h b/commit.h
index a2d571b..bff6b36 100644
--- a/commit.h
+++ b/commit.h
@@ -176,4 +176,6 @@ extern int commit_tree(const char *msg, unsigned char *tree,
 		struct commit_list *parents, unsigned char *ret,
 		const char *author);
 
+unsigned long commit_generation(const struct commit *commit);
+
 #endif /* COMMIT_H */
diff --git a/pretty.c b/pretty.c
index f45eb54..8f1b321 100644
--- a/pretty.c
+++ b/pretty.c
@@ -965,6 +965,9 @@ static size_t format_commit_one(struct strbuf *sb, const char *placeholder,
 			return 2;
 		}
 		return 0;	/* unknown %g placeholder */
+	case 'G':
+		strbuf_addf(sb, "%lu", commit_generation(commit));
+		return 1;
 	case 'N':
 		if (c->pretty_ctx->show_notes) {
 			format_display_notes(commit->object.sha1, sb,
-- 
1.7.6.7.ge7132.dirty

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: generation numbers
  2011-07-06 23:22             ` Junio C Hamano
@ 2011-07-07 19:08               ` Jeff King
  2011-07-07 20:10                 ` Jakub Narebski
  0 siblings, 1 reply; 28+ messages in thread
From: Jeff King @ 2011-07-07 19:08 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Ted Ts'o, Jakub Narebski, Jonathan Nieder,
	Ævar Arnfjörð Bjarmason, Clemens Buchacher, git

On Wed, Jul 06, 2011 at 04:22:23PM -0700, Junio C Hamano wrote:

> As to the low level implementation detail I agree everything you said, but
> I have been wondering how the generation number should intereact with
> grafts and replaces. It certainly would be safest whenever you change
> grafts (which should be a rare event anyway).

Ugh. I hadn't even considered grafting. Yeah, grafting or replacing
could make the generation numbers totally wrong. And not just for the
replaced commit, but for everything that builds on top. That's perhaps
an argument against putting them into the commit header at all; once you
graft, everything after will have bogus generation numbers.

So yeah, you would want to clear the cache any time you tweak
replacements or grafts (which I think is what you were saying in your
final sentence).

You could do a hybrid solution, in which you have generation numbers in
the commit header, and an external cache. You need the cache anyway to
support older commits without the header. And then you could use the
built-in generation numbers when there's no grafting or replacing going
on, and the cache otherwise. That keeps the common case (no grafts)
faster.

Still, if we can get the external lookup to be faster than my initial
notes attempt (which really should not be that hard), the performance
difference may not end up that big, and it won't even be worth putting
them into the header at all.

-Peff

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: generation numbers
  2011-07-07 18:59               ` Jeff King
@ 2011-07-07 19:34                 ` Junio C Hamano
  2011-07-07 20:31                   ` Jakub Narebski
  2011-07-08 22:57                   ` Jeff King
  0 siblings, 2 replies; 28+ messages in thread
From: Junio C Hamano @ 2011-07-07 19:34 UTC (permalink / raw)
  To: Jeff King
  Cc: Jakub Narebski, Ted Ts'o, Jonathan Nieder,
	Ævar Arnfjörð Bjarmason, Clemens Buchacher, git,
	Junio C Hamano

Jeff King <peff@peff.net> writes:

> You could "cheat" and instead of storing the sha1 of a blob object in
> the notes tree, use the lower 32 bits to store an actual value. I don't
> think that currently breaks any assumptions in the notes code, but it
> definitely is against the intent of it.

I highly suspect that it would break fsck rather badly.  You may not even
be able to repack a repository with such a notes tree.

> For a local lookup cache, I would use a fixed-size binary integer just
> to keep the lookup data structure simple (then you know the width of
> each record ahead of time). For a generation commit header, obviously we
> would go with the ascii representation as we do for other headers.

Yes.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: generation numbers
  2011-07-07 19:08               ` Jeff King
@ 2011-07-07 20:10                 ` Jakub Narebski
  0 siblings, 0 replies; 28+ messages in thread
From: Jakub Narebski @ 2011-07-07 20:10 UTC (permalink / raw)
  To: Jeff King
  Cc: Junio C Hamano, Ted Ts'o, Jonathan Nieder,
	Ævar Arnfjörð Bjarmason, Clemens Buchacher, git

On Thu, 7 Jul 2011, Jeff King wrote:
> On Wed, Jul 06, 2011 at 04:22:23PM -0700, Junio C Hamano wrote:
> 
> > As to the low level implementation detail I agree everything you said, but
> > I have been wondering how the generation number should intereact with
> > grafts and replaces. It certainly would be safest whenever you change
> > grafts (which should be a rare event anyway).
> 
> Ugh. I hadn't even considered grafting. Yeah, grafting or replacing
> could make the generation numbers totally wrong. And not just for the
> replaced commit, but for everything that builds on top. That's perhaps
> an argument against putting them into the commit header at all; once you
> graft, everything after will have bogus generation numbers.
> 
> So yeah, you would want to clear the cache any time you tweak
> replacements or grafts (which I think is what you were saying in your
> final sentence).
> 
> You could do a hybrid solution, in which you have generation numbers in
> the commit header, and an external cache. You need the cache anyway to
> support older commits without the header. And then you could use the
> built-in generation numbers when there's no grafting or replacing going
> on, and the cache otherwise. That keeps the common case (no grafts)
> faster.

Or we could enhance pack protocol (new capability) to send generation
notes cache as a separate stream perhaps.

Or make generation notes cache part of post-downloading work, after
(or while) generating pack index.

> Still, if we can get the external lookup to be faster than my initial
> notes attempt (which really should not be that hard), the performance
> difference may not end up that big, and it won't even be worth putting
> them into the header at all.

I wonder if we can reuse pack index code / format somewhat.

Or perhaps some kind of on-disk hash table; we need O(1) fast lookup,
and ability to update structure 'in place'.

-- 
Jakub Narebski
Poland

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: generation numbers
  2011-07-07 19:34                 ` generation numbers Junio C Hamano
@ 2011-07-07 20:31                   ` Jakub Narebski
  2011-07-07 20:52                     ` A Large Angry SCM
  2011-07-08 22:57                   ` Jeff King
  1 sibling, 1 reply; 28+ messages in thread
From: Jakub Narebski @ 2011-07-07 20:31 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff King, Ted Ts'o, Jonathan Nieder,
	Ævar Arnfjörð Bjarmason, Clemens Buchacher, git

On Thu, 7 Jul 2011, Junio C Hamano wrote:
> Jeff King <peff@peff.net> writes:
> 
> > You could "cheat" and instead of storing the sha1 of a blob object in
> > the notes tree, use the lower 32 bits to store an actual value. I don't
> > think that currently breaks any assumptions in the notes code, but it
> > definitely is against the intent of it.
> 
> I highly suspect that it would break fsck rather badly.  You may not even
> be able to repack a repository with such a notes tree.

Well, we could (ab)use file mode to mark that what would be sha1 actually
stores fixed-width content of a file, like we do with submodules.

This technique is I think quite similar in idea to filesystems storing
contents of small files in file inode, isn't it?

-- 
Jakub Narebski
Poland

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: generation numbers
  2011-07-07 20:31                   ` Jakub Narebski
@ 2011-07-07 20:52                     ` A Large Angry SCM
  2011-07-08  0:29                       ` Junio C Hamano
  0 siblings, 1 reply; 28+ messages in thread
From: A Large Angry SCM @ 2011-07-07 20:52 UTC (permalink / raw)
  To: Jakub Narebski
  Cc: Junio C Hamano, Jeff King, Ted Ts'o, Jonathan Nieder,
	Ævar Arnfjörð Bjarmason, Clemens Buchacher, git

On 07/07/2011 04:31 PM, Jakub Narebski wrote:
> On Thu, 7 Jul 2011, Junio C Hamano wrote:
>> Jeff King<peff@peff.net>  writes:
>>
>>> You could "cheat" and instead of storing the sha1 of a blob object in
>>> the notes tree, use the lower 32 bits to store an actual value. I don't
>>> think that currently breaks any assumptions in the notes code, but it
>>> definitely is against the intent of it.
>>
>> I highly suspect that it would break fsck rather badly.  You may not even
>> be able to repack a repository with such a notes tree.
>
> Well, we could (ab)use file mode to mark that what would be sha1 actually
> stores fixed-width content of a file, like we do with submodules.
>
> This technique is I think quite similar in idea to filesystems storing
> contents of small files in file inode, isn't it?
>

Are the benefits really worth all these hacks?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: generation numbers
  2011-07-07 20:52                     ` A Large Angry SCM
@ 2011-07-08  0:29                       ` Junio C Hamano
  0 siblings, 0 replies; 28+ messages in thread
From: Junio C Hamano @ 2011-07-08  0:29 UTC (permalink / raw)
  To: gitzilla
  Cc: Jakub Narebski, Junio C Hamano, Jeff King, Ted Ts'o,
	Jonathan Nieder, Ævar Arnfjörð Bjarmason,
	Clemens Buchacher, git

A Large Angry SCM <gitzilla@gmail.com> writes:

> On 07/07/2011 04:31 PM, Jakub Narebski wrote:
>> On Thu, 7 Jul 2011, Junio C Hamano wrote:
>>> Jeff King<peff@peff.net>  writes:
>>>
>>>> You could "cheat" and instead of storing the sha1 of a blob object in
>>>> the notes tree, use the lower 32 bits to store an actual value. I don't
>>>> think that currently breaks any assumptions in the notes code, but it
>>>> definitely is against the intent of it.
>>>
>>> I highly suspect that it would break fsck rather badly.  You may not even
>>> be able to repack a repository with such a notes tree.
>>
>> Well, we could (ab)use file mode to mark that what would be sha1 actually
>> stores fixed-width content of a file, like we do with submodules.
>>
>> This technique is I think quite similar in idea to filesystems storing
>> contents of small files in file inode, isn't it?
>
> Are the benefits really worth all these hacks?

Not at all. Don't take everything everybody says about low level
implementation too seriously. Most people do not know what they are
talking about ;-).

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: generation numbers
  2011-07-07 19:34                 ` generation numbers Junio C Hamano
  2011-07-07 20:31                   ` Jakub Narebski
@ 2011-07-08 22:57                   ` Jeff King
  1 sibling, 0 replies; 28+ messages in thread
From: Jeff King @ 2011-07-08 22:57 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jakub Narebski, Ted Ts'o, Jonathan Nieder,
	Ævar Arnfjörð Bjarmason, Clemens Buchacher, git

On Thu, Jul 07, 2011 at 12:34:37PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > You could "cheat" and instead of storing the sha1 of a blob object in
> > the notes tree, use the lower 32 bits to store an actual value. I don't
> > think that currently breaks any assumptions in the notes code, but it
> > definitely is against the intent of it.
> 
> I highly suspect that it would break fsck rather badly.  You may not even
> be able to repack a repository with such a notes tree.

True. I think you would have to do the file-mode hack that Jakub
suggested. But that's getting pretty gross. If something isn't big
enough to be in a blob, and especially if we are just caching, it would
be nice to have some lighter-weight caching mechanism.

> > For a local lookup cache, I would use a fixed-size binary integer just
> > to keep the lookup data structure simple (then you know the width of
> > each record ahead of time). For a generation commit header, obviously we
> > would go with the ascii representation as we do for other headers.

So I implemented something like this today. In fact, it's a generic[1]
fast persistent object-data mapping for data of a fixed size. The
on-disk representation is a stream of pairs: binary sha1s followed by
their fixed-size data. Lookup is by binary search (using sha1_entry_pos,
which makes this more or less the same as pack-index lookups).

There's a separate in-memory lookaside table that receives updates.
These are stored as a hash[2] because except for the first run, this
will typically be much smaller than the disk version, and we care more
about insertion speed here. When git exits, the memory and disk versions
are merged into a new cache which atomically replaces the old version
via rename().

Here are the timings I came up with using it on top of my depth-first
contains algorithm.  All runs are for "git tag --contains HEAD~1000" in
the linux-2.6 repo. All times are best-of-five unless otherwise noted.

To get a baseline, I measured the algorithm with no cutoff at all (i.e.,
ffc4b80 in pu), and then with a cutoff based on timestamp with one day
of slop (i.e., de9f14e in pu):

  none:
    real    0m3.139s
    user    0m3.044s
    sys     0m0.092s

  timestamp:
    real    0m0.027s
    user    0m0.024s
    sys     0m0.000s

We can use the "timestamp" value as our goal; it's fast, but not
necessarily correct in the face of skew (and it's about as fast as we
would expect a generation header inside the commit to perform). We can
use "none" as a lower goalpost. It's correct, but slow. If we're slower
than it, then we have totally failed.

Then I tried doing a generation-based cutoff, caching the generations
via notes-cache. Here are those timings:

  notes (1st run):
    real    0m14.153s
    user    0m7.868s
    sys     0m5.392s

  notes (before repack):
    real    0m0.102s
    user    0m0.076s
    sys     0m0.024s

  notes (after repack):
    real    0m0.090s
    user    0m0.072s
    sys     0m0.016s

It's pretty painful to actually generate the cache, mostly because we
end up writing a ton of tree and blob objects. The objects directory
balloons from 503M to 1.1G after that run. Repacking brings that down to
a mere 524M, or 21M spent on the cache.  Not shown in these timings is
the painfully slow "git gc" it took to get there.

So there's a nice speedup over the no-cutoff case, but we're still 3
times as slow as the timestamp case. And the sheer amount of object
cruft (both in terms of wasted space, and wasted time writing and
repacking) is ugly.

Next up is the custom object-cache code:

  custom (1st run):
    real    0m3.769s
    user    0m3.404s
    sys     0m0.360s

  custom:
    real    0m0.035s
    user    0m0.028s
    sys     0m0.004s

You can see that the first run is a bit slower, as we have to touch
every commit to figure out the generations. But it also highlights how
much of the notes-cache version is spent not actually figuring out the
generations, but rather just writing the notes tree.

Subsequent runs are pretty darn fast. It's a tiny bit slower than using
the timestamps, but it's within the noise. The resulting cache file is
5.9M.

So it seems like a good direction to pursue. The only downside I see is
that we may be slower operating in a read-only repository in which
nobody has generated any cache yet. But that seems like a bit of a crazy
case, and even then, it's on par with the no-cutoff-at-all case, so it's
really not that bad. And it's guaranteed to be correct in the face of
skew, as opposed to the fast timestamp case.

-Peff

[1] I intentionally wrote the object caching code in a very generic,
    data-agnostic way. I have a patch series to speed up git-cherry by
    caching patch-ids of commits against their parents. It uses
    notes-cache and already provides some speedup, but I'd like to see
    if I can make it faster with the new code.

[2] Instead of writing my own hash, I hacked decorate.[ch] to have
    "value" semantics. I.e., you can now store values of arbitrary size.
    The existing semantics of storing a "void *" are easy to do on top
    of that. I noticed that fast-export is already encoding uint32_t's
    inside the pointers. This makes that a little more supported, and
    also means that the same hack will work for data larger than a void
    pointer (e.g., patch-id caching will need 20 bytes).

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] Speed up git tag --contains
  2011-07-06  6:40 ` [PATCH 0/4] Speed up git tag --contains Jeff King
  2011-07-06  6:54   ` Jeff King
  2011-07-06  6:56   ` Jonathan Nieder
@ 2018-01-12 18:56   ` csilvers
  2018-03-03  5:15     ` Jeff King
  2 siblings, 1 reply; 28+ messages in thread
From: csilvers @ 2018-01-12 18:56 UTC (permalink / raw)
  To: Jeff King; +Cc: avarab, jrnieder, drizzd, git, gitster

> This is a resubmission of Jeff King's patch series to speed up git tag
> --contains with some changes. It's been cooking for a while as:

Replying to this 6-year-old thread:

Is there any chance this could be resurrected?  We are using
phabricator, which uses `git branch --contains` as part of its
workflow.  Our repo has ~1000 branches on it, and the contains
operation is eating up all our CPU (and time).  It would be very
helpful to us to make this faster!

(The original thread is at
https://public-inbox.org/git/E1OU82h-0001xY-3b@closure.thunk.org/
)

craig

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] Speed up git tag --contains
  2018-01-12 18:56   ` [PATCH 0/4] Speed up git tag --contains csilvers
@ 2018-03-03  5:15     ` Jeff King
  2018-03-08 23:05       ` csilvers
  2018-03-12 13:45       ` Derrick Stolee
  0 siblings, 2 replies; 28+ messages in thread
From: Jeff King @ 2018-03-03  5:15 UTC (permalink / raw)
  To: csilvers; +Cc: avarab, jrnieder, drizzd, git, gitster, Derrick Stolee

On Fri, Jan 12, 2018 at 10:56:00AM -0800, csilvers wrote:

> > This is a resubmission of Jeff King's patch series to speed up git tag
> > --contains with some changes. It's been cooking for a while as:
> 
> Replying to this 6-year-old thread:
> 
> Is there any chance this could be resurrected?  We are using
> phabricator, which uses `git branch --contains` as part of its
> workflow.  Our repo has ~1000 branches on it, and the contains
> operation is eating up all our CPU (and time).  It would be very
> helpful to us to make this faster!
> 
> (The original thread is at
> https://public-inbox.org/git/E1OU82h-0001xY-3b@closure.thunk.org/

Sorry, this got thrown on my "to respond" pile and languished.

There are actually three things that make "git branch --contains" slow.

First, if you're filtering 1000 branches, we'll run 1000 merge-base
traversals, which may walk over the same commits multiple times.

These days "tag --contains" uses a different algorithm that can look at
all heads in a single traversal. But the downside is that it's
depth-first, so it tends to walk down to the roots. That's generally OK
for tags, since you often have ancient tags that mean getting close to
the roots anyway.

But for branches, they're more likely to be recent, and you can get away
without going very deep into the history.

So it's a tradeoff. There's no run-time switch to flip between them, but
a patch like this:

diff --git a/builtin/branch.c b/builtin/branch.c
index 8dcc2ed058..4d674e86d5 100644
--- a/builtin/branch.c
+++ b/builtin/branch.c
@@ -404,6 +404,7 @@ static void print_ref_list(struct ref_filter *filter, struct ref_sorting *sortin
 
 	memset(&array, 0, sizeof(array));
 
+	filter->with_commit_tag_algo = 1;
 	filter_refs(&array, filter, filter->kind | FILTER_REFS_INCLUDE_BROKEN);
 
 	if (filter->verbose)

drops my run of "git branch -a --contains HEAD~100" from 8.6s to
0.4s on a repo with ~1800 branches. That sounds good, but on a repo with
a smaller number of branches, we may actually end up slower (because we
dig further down in history, and don't benefit from the multiple-branch
speedup).

I tried to do a "best of both" algorithm in:

 https://public-inbox.org/git/20140625233429.GA20457@sigill.intra.peff.net/

which finds arbitrary numbers of merge bases in a single traversal.  It
did seem to work, but I felt uneasy about some of the corner cases.
I've been meaning to revisit it, but obviously have never gotten around
to it.

The second slow thing is that during the traversal we load each commit
object from disk. The solution there is to keep the parent information
in a faster cache. I had a few proposals over the years, but I won't
even bother to dig them up, because there's quite recent and promising
work in this area from Derrick Stolee:

  https://public-inbox.org/git/1519698787-190494-1-git-send-email-dstolee@microsoft.com/

And finally, the thing that the patches you linked are referencing is
about using commit timestamps as a proxy for generation numbers. And
Stolee's patches actually leave room for real, trustable generation
numbers.

Once we have the serialized commit graph and generation numbers, think
the final step would just be to teach the "tag --contains" algorithm to
stop walking down unproductive lines of history. And in fact, I think we
can forget about the best-of-both multi-tip merge-base idea entirely.
Because if you can use the generation numbers to avoid going too deep,
then a depth-first approach is fine. And we'd just want to flip
git-branch over to using that algorithm by default.

-Peff

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] Speed up git tag --contains
  2018-03-03  5:15     ` Jeff King
@ 2018-03-08 23:05       ` csilvers
  2018-03-12 13:45       ` Derrick Stolee
  1 sibling, 0 replies; 28+ messages in thread
From: csilvers @ 2018-03-08 23:05 UTC (permalink / raw)
  To: Jeff King; +Cc: avarab, jrnieder, drizzd, git, gitster, stolee

} I had a few proposals over the years, but I won't even bother to dig
} them up, because there's quite recent and promising work in this
} area from Derrick Stolee:

It sounds like the best thing to do is to wait for this, then.

We managed to convert a bunch of our branches to tags, so our
immediate problem has been resolved.  But I'm sure it will come up
again as more branches are created...

carig

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] Speed up git tag --contains
  2018-03-03  5:15     ` Jeff King
  2018-03-08 23:05       ` csilvers
@ 2018-03-12 13:45       ` Derrick Stolee
  2018-03-12 23:59         ` Jeff King
  1 sibling, 1 reply; 28+ messages in thread
From: Derrick Stolee @ 2018-03-12 13:45 UTC (permalink / raw)
  To: Jeff King, csilvers; +Cc: avarab, jrnieder, drizzd, git, gitster

On 3/3/2018 12:15 AM, Jeff King wrote:
> On Fri, Jan 12, 2018 at 10:56:00AM -0800, csilvers wrote:
>
>>> This is a resubmission of Jeff King's patch series to speed up git tag
>>> --contains with some changes. It's been cooking for a while as:
>> Replying to this 6-year-old thread:
>>
>> Is there any chance this could be resurrected?  We are using
>> phabricator, which uses `git branch --contains` as part of its
>> workflow.  Our repo has ~1000 branches on it, and the contains
>> operation is eating up all our CPU (and time).  It would be very
>> helpful to us to make this faster!
>>
>> (The original thread is at
>> https://public-inbox.org/git/E1OU82h-0001xY-3b@closure.thunk.org/
> Sorry, this got thrown on my "to respond" pile and languished.

Thanks for adding me to the thread. It's good to know the pain point 
people are having around commit graph walks.

> There are actually three things that make "git branch --contains" slow.
>
> First, if you're filtering 1000 branches, we'll run 1000 merge-base
> traversals, which may walk over the same commits multiple times.
>
> These days "tag --contains" uses a different algorithm that can look at
> all heads in a single traversal. But the downside is that it's
> depth-first, so it tends to walk down to the roots. That's generally OK
> for tags, since you often have ancient tags that mean getting close to
> the roots anyway.
>
> But for branches, they're more likely to be recent, and you can get away
> without going very deep into the history.
>
> So it's a tradeoff. There's no run-time switch to flip between them, but
> a patch like this:
>
> diff --git a/builtin/branch.c b/builtin/branch.c
> index 8dcc2ed058..4d674e86d5 100644
> --- a/builtin/branch.c
> +++ b/builtin/branch.c
> @@ -404,6 +404,7 @@ static void print_ref_list(struct ref_filter *filter, struct ref_sorting *sortin
>   
>   	memset(&array, 0, sizeof(array));
>   
> +	filter->with_commit_tag_algo = 1;
>   	filter_refs(&array, filter, filter->kind | FILTER_REFS_INCLUDE_BROKEN);
>   
>   	if (filter->verbose)
>
> drops my run of "git branch -a --contains HEAD~100" from 8.6s to
> 0.4s on a repo with ~1800 branches. That sounds good, but on a repo with
> a smaller number of branches, we may actually end up slower (because we
> dig further down in history, and don't benefit from the multiple-branch
> speedup).

It's good to know that we already have an algorithm for the multi-head 
approach. Things like `git branch -vv` are harder to tease out because 
the graph walk is called by the line-format code.

> I tried to do a "best of both" algorithm in:
>
>   https://public-inbox.org/git/20140625233429.GA20457@sigill.intra.peff.net/
>
> which finds arbitrary numbers of merge bases in a single traversal.  It
> did seem to work, but I felt uneasy about some of the corner cases.
> I've been meaning to revisit it, but obviously have never gotten around
> to it.
>
> The second slow thing is that during the traversal we load each commit
> object from disk. The solution there is to keep the parent information
> in a faster cache. I had a few proposals over the years, but I won't
> even bother to dig them up, because there's quite recent and promising
> work in this area from Derrick Stolee:
>
>    https://public-inbox.org/git/1519698787-190494-1-git-send-email-dstolee@microsoft.com/
>
> And finally, the thing that the patches you linked are referencing is
> about using commit timestamps as a proxy for generation numbers. And
> Stolee's patches actually leave room for real, trustable generation
> numbers.
>
> Once we have the serialized commit graph and generation numbers, think
> the final step would just be to teach the "tag --contains" algorithm to
> stop walking down unproductive lines of history. And in fact, I think we
> can forget about the best-of-both multi-tip merge-base idea entirely.
> Because if you can use the generation numbers to avoid going too deep,
> then a depth-first approach is fine. And we'd just want to flip
> git-branch over to using that algorithm by default.

I'll keep this in mind as a target for performance measurements in the 
serialized commit graph patch and the following generation number patch.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 0/4] Speed up git tag --contains
  2018-03-12 13:45       ` Derrick Stolee
@ 2018-03-12 23:59         ` Jeff King
  0 siblings, 0 replies; 28+ messages in thread
From: Jeff King @ 2018-03-12 23:59 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: csilvers, avarab, jrnieder, drizzd, git, gitster

On Mon, Mar 12, 2018 at 09:45:27AM -0400, Derrick Stolee wrote:

> > diff --git a/builtin/branch.c b/builtin/branch.c
> > index 8dcc2ed058..4d674e86d5 100644
> > --- a/builtin/branch.c
> > +++ b/builtin/branch.c
> > @@ -404,6 +404,7 @@ static void print_ref_list(struct ref_filter *filter, struct ref_sorting *sortin
> >   	memset(&array, 0, sizeof(array));
> > +	filter->with_commit_tag_algo = 1;
> >   	filter_refs(&array, filter, filter->kind | FILTER_REFS_INCLUDE_BROKEN);
> >   	if (filter->verbose)
> > 
> > drops my run of "git branch -a --contains HEAD~100" from 8.6s to
> > 0.4s on a repo with ~1800 branches. That sounds good, but on a repo with
> > a smaller number of branches, we may actually end up slower (because we
> > dig further down in history, and don't benefit from the multiple-branch
> > speedup).
> 
> It's good to know that we already have an algorithm for the multi-head
> approach. Things like `git branch -vv` are harder to tease out because the
> graph walk is called by the line-format code.

Yeah, the ahead/behind stuff will need some work. Part of it is just
code structuring. We know ahead of time which branches (and their
upstreams) are going to need this ahead/behind computation, so we should
be able to do collect them all for a single call.

But I'm not sure if a general multi-pair ahead/behind is going to be
easy. I don't have even experimental code for that. :)

We have a multi-pair ahead/behind command which we use at GitHub, but it
does each pair separately. It leans heavily on reachability bitmaps, so
the main advantage is that it's able to amortize the cost of loading the
bitmaps (both off disk, but also we sometimes have to walk to complete
the bitmaps).

-Peff

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, back to index

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-06-11 19:04 [PATCH 0/4] Speed up git tag --contains Ævar Arnfjörð Bjarmason
2011-06-11 19:04 ` [PATCH 1/4] tag: speed up --contains calculation Ævar Arnfjörð Bjarmason
2011-06-11 19:04 ` [PATCH 2/4] limit "contains" traversals based on commit timestamp Ævar Arnfjörð Bjarmason
2011-06-11 19:04 ` [PATCH 3/4] default core.clockskew variable to one day Ævar Arnfjörð Bjarmason
2011-06-11 19:04 ` [PATCH 4/4] Why is "git tag --contains" so slow? Ævar Arnfjörð Bjarmason
2011-07-06  6:40 ` [PATCH 0/4] Speed up git tag --contains Jeff King
2011-07-06  6:54   ` Jeff King
2011-07-06 19:06     ` Clemens Buchacher
2011-07-06  6:56   ` Jonathan Nieder
2011-07-06  7:03     ` Jeff King
2011-07-06 14:26       ` generation numbers (was: [PATCH 0/4] Speed up git tag --contains) Jakub Narebski
2011-07-06 15:01         ` Ted Ts'o
2011-07-06 18:12           ` Jeff King
2011-07-06 18:46             ` Jakub Narebski
2011-07-07 18:59               ` Jeff King
2011-07-07 19:34                 ` generation numbers Junio C Hamano
2011-07-07 20:31                   ` Jakub Narebski
2011-07-07 20:52                     ` A Large Angry SCM
2011-07-08  0:29                       ` Junio C Hamano
2011-07-08 22:57                   ` Jeff King
2011-07-06 23:22             ` Junio C Hamano
2011-07-07 19:08               ` Jeff King
2011-07-07 20:10                 ` Jakub Narebski
2018-01-12 18:56   ` [PATCH 0/4] Speed up git tag --contains csilvers
2018-03-03  5:15     ` Jeff King
2018-03-08 23:05       ` csilvers
2018-03-12 13:45       ` Derrick Stolee
2018-03-12 23:59         ` Jeff King

git@vger.kernel.org mailing list mirror (one of many)

Archives are clonable:
	git clone --mirror https://public-inbox.org/git
	git clone --mirror http://ou63pmih66umazou.onion/git
	git clone --mirror http://czquwvybam4bgbro.onion/git
	git clone --mirror http://hjrcffqmbrq6wope.onion/git

Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.version-control.git
	nntp://ou63pmih66umazou.onion/inbox.comp.version-control.git
	nntp://czquwvybam4bgbro.onion/inbox.comp.version-control.git
	nntp://hjrcffqmbrq6wope.onion/inbox.comp.version-control.git
	nntp://news.gmane.org/gmane.comp.version-control.git

 note: .onion URLs require Tor: https://www.torproject.org/
       or Tor2web: https://www.tor2web.org/

AGPL code for this site: git clone https://public-inbox.org/ public-inbox