git@vger.kernel.org mailing list mirror (one of many)
* Why is "git tag --contains" so slow?
@ 2010-07-01  0:54 Theodore Ts'o
  2010-07-01  0:58 ` Shawn O. Pearce
  2010-07-01  1:00 ` Avery Pennarun
  0 siblings, 2 replies; 46+ messages in thread
From: Theodore Ts'o @ 2010-07-01  0:54 UTC (permalink / raw)
  To: git


I'm trying to write a script that can determine the first kernel release
(i.e., a tag that matches v2.6.*) that contains a particular commit.

I can do this using "git tag --contains <commit-id>", but it's quite
slow.  It takes something like 8-9 seconds.  (8.5 seconds of user time,
8.6 seconds of wall clock, to be precise).

I can get the information much faster using "gitk -1 <commit-id>", which
finishes painting the screen in under 2 seconds, but that throws up a
GUI and then a human has to pull the information out using their eyes.
(Yeah, or I could figure out where in the 11,631 lines of Tcl script the
"preceeds" line is calculated, but I figured I'd ask here first.)

Is there a better way of calculating what I want from the command line
using the built-in native git tools?  And if so, why is git tag
--contains apparently 4 times slower than gitk at performing this task?

Thanks,

						- Ted

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Why is "git tag --contains" so slow?
  2010-07-01  0:54 Why is "git tag --contains" so slow? Theodore Ts'o
@ 2010-07-01  0:58 ` Shawn O. Pearce
  2010-07-03 23:27   ` Sam Vilain
  2010-07-01  1:00 ` Avery Pennarun
  1 sibling, 1 reply; 46+ messages in thread
From: Shawn O. Pearce @ 2010-07-01  0:58 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: git

Theodore Ts'o <tytso@mit.edu> wrote:
> 
> I'm trying to write a script that can determine the first kernel release
> (i.e., a tag that matches v2.6.*) that contains a particular commit.
> 
> I can do this using "git tag --contains <commit-id>", but it's quite
> slow.  It takes something like 8-9 seconds.  (8.5 seconds of user time,
> 8.6 seconds of wall clock, to be precise).
> 
> I can get the information much faster using "gitk -1 <commit-id>", which
> finishes painting the screen in under 2 seconds, but that throws up a
> GUI and then a human has to pull the information out using their eyes.
> (Yeah, or I could figure out where in the 11,631 lines of Tcl script the
> "preceeds" line is calculated, but I figured I'd ask here first.)
> 
> Is there a better way of calculating what I want from the command line
> using the built-in native git tools?  And if so, why is git tag
> --contains apparently 4 times slower than gitk at performing this task?

gitk keeps a cache of this stuff in .git/gitk.cache.  I'll bet it's
pulling from the cache here, which is why it snaps so fast.

Without the cache it is expensive, which is what 'git tag --contains'
is doing.  The code walks back from each tag, tracing along the commit
parent pointers and keeping track of the nearest tag name that can reach
any given commit.  When it finds your commit, it stops and outputs.

Since this stuff can't change unless the refs change, yes, it can be
cached easily.  But nobody has done caching for this in Git itself,
only in Tcl for gitk.  :-\

-- 
Shawn.


* Re: Why is "git tag --contains" so slow?
  2010-07-01  0:54 Why is "git tag --contains" so slow? Theodore Ts'o
  2010-07-01  0:58 ` Shawn O. Pearce
@ 2010-07-01  1:00 ` Avery Pennarun
  2010-07-01 12:17   ` tytso
  1 sibling, 1 reply; 46+ messages in thread
From: Avery Pennarun @ 2010-07-01  1:00 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: git

On Wed, Jun 30, 2010 at 8:54 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> I'm trying to write a script that can determine the first kernel release
> (i.e., a tag that matches v2.6.*) that contains a particular commit.
>
> I can do this using "git tag --contains <commit-id>", but it's quite
> slow.  It takes something like 8-9 seconds.  (8.5 seconds of user time,
> 8.6 seconds of wall clock, to be precise).
>
> I can get the information much faster using "gitk -1 <commit-id>", which
> finishes painting the screen in under 2 seconds, but that throws up a
> GUI and then a human has to pull the information out using their eyes.

There's a big difference between the two: the gitk command you're
using only works if the given commit is *itself* named by a tag, while
'git tag --contains' needs to search the entire history of every tag
to see if the given commit is *inside* it somewhere.

If all you want is to see if a given commit exactly matches a tag,
perhaps you want something like

   git describe --exact-match <commit-id>

Have fun,

Avery


* Re: Why is "git tag --contains" so slow?
  2010-07-01  1:00 ` Avery Pennarun
@ 2010-07-01 12:17   ` tytso
  2010-07-01 15:03     ` Jeff King
  0 siblings, 1 reply; 46+ messages in thread
From: tytso @ 2010-07-01 12:17 UTC (permalink / raw)
  To: Avery Pennarun; +Cc: git

On Wed, Jun 30, 2010 at 09:00:21PM -0400, Avery Pennarun wrote:
> 
> There's a big difference between the two: the gitk command you're
> using only works if the given commit is *itself* named by a tag, while
> 'git tag --contains' needs to search the entire history of every tag
> to see if the given commit is *inside* it somewhere.

Gitk provides both.  What I want is listed on the "Precedes" line in
the lower right hand box (the sixth line from the top in the
per-commit box).

> If all you want is to see if a given commit exactly matches a tag,
> perhaps you want something like
> 
>    git describe --exact-match <commit-id>

Yeah, I'm not talking about the tag and branch names that show up in
the top gitk box (and which you can also get via git log --decorate).
I'm specifically talking about what you get with:

	git tag --contains <commit-id>

							- Ted


* Re: Why is "git tag --contains" so slow?
  2010-07-01 12:17   ` tytso
@ 2010-07-01 15:03     ` Jeff King
  2010-07-01 15:38       ` Jeff King
  0 siblings, 1 reply; 46+ messages in thread
From: Jeff King @ 2010-07-01 15:03 UTC (permalink / raw)
  To: tytso; +Cc: Avery Pennarun, git

On Thu, Jul 01, 2010 at 08:17:11AM -0400, tytso@mit.edu wrote:

> Yeah, I'm not talking about the tag and branch names that show up in
> the top gitk box (and which you can also get via git log --decorate).
> I'm specifically talking about what you get with:
> 
> 	git tag --contains <commit-id>

I think there is room for improving the speed of "git tag --contains".
Consider these two similar operations:

  $ time git name-rev HEAD~200
  HEAD~200 tags/v1.7.1-rc2~2

  real    0m0.060s
  user    0m0.048s
  sys     0m0.012s

  $ time git tag --contains HEAD~200
  v1.7.1
  v1.7.1-rc2
  v1.7.1.1
  v1.7.2-rc0
  v1.7.2-rc1

  real    0m2.179s
  user    0m2.156s
  sys     0m0.020s

Obviously the latter is producing the full list of tags, but both should
have to do a traversal starting at each tag ref.

It looks like "git tag --contains" is implemented as something like:

  for ref in `git tag -l`; do
    if git merge-base $ref $commit | grep -q $commit; then
      echo $ref
    fi
  done

except in C. But you can see we'll do a full merge-base calculation for
each tag. And that merge-base calculation may potentially go quite far
back in history.

The name-rev code has a simple heuristic: it stops traversing when
the timestamp of the traversed commit gets more than a day behind that
of $commit. This doesn't do well with clock skew, but can save a lot of
time in practice. We should perhaps consider something like that here.

I suspect we could do even better by sharing information between
traversals. That is, walk the graph from each ref, but retain marks on
commits between each walk. For a commit c, if walking down each parent
gets us to a root without hitting $commit, then we can mark "c" as not
possibly containing the commit. We can cut short any walk when we hit a
"c" that is already marked. So by traversing from "v1.7.1", if we find
that we do not contain $commit, we will have already marked v1.7.0,
v1.6.*, etc, and will not need to traverse them again.

When you do find one that contains $commit, you can also mark it as
"definitely contains", and you can stop a traversal that definitely
contains. So if we see something is in v1.7.0, then the traversal from
v1.7.1 only needs to go as far back as v1.7.0.
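The mark-sharing idea above can be sketched outside of git. This is a
hedged illustration in Python over a toy parent map (the commit names
and the graph are invented, not git's actual data structures or code):

```python
def contains(candidate, want, parents, marks):
    """Return 1 if any commit in `want` is reachable from `candidate`.

    `marks` persists across calls, so a later traversal stops as soon
    as it reaches a commit whose answer is already known.
    """
    if candidate in marks:           # previously computed answer
        return marks[candidate]
    if candidate in want:            # we are a wanted commit
        return 1
    for p in parents.get(candidate, ()):
        if contains(p, want, parents, marks):
            marks[candidate] = 1     # definitely contains
            return 1
    marks[candidate] = 0             # definitely does not contain
    return 0

# Linear history a <- b <- c, plus an unrelated root x.
parents = {"c": ["b"], "b": ["a"], "a": [], "x": []}
marks = {}
assert contains("c", {"b"}, parents, marks) == 1
# Second query reuses marks left behind by the first walk.
assert contains("x", {"b"}, parents, marks) == 0
```

The point of passing the same `marks` dict to every call is exactly the
"retain marks between walks" trick: each tag's traversal benefits from
every earlier tag's answers.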

-Peff


* Re: Why is "git tag --contains" so slow?
  2010-07-01 15:03     ` Jeff King
@ 2010-07-01 15:38       ` Jeff King
  2010-07-02 19:26         ` tytso
  0 siblings, 1 reply; 46+ messages in thread
From: Jeff King @ 2010-07-01 15:38 UTC (permalink / raw)
  To: tytso; +Cc: Avery Pennarun, git

On Thu, Jul 01, 2010 at 11:03:31AM -0400, Jeff King wrote:

>   $ time git tag --contains HEAD~200
>   v1.7.1
>   v1.7.1-rc2
>   v1.7.1.1
>   v1.7.2-rc0
>   v1.7.2-rc1
> 
>   real    0m2.179s
>   user    0m2.156s
>   sys     0m0.020s
>
> [...]
>
> I suspect we could do even better by sharing information between
> traversals. That is, walk the graph from each ref, but retain marks on

Here is a quick and dirty patch to implement what I suggested. With it,
I get the same results as above, but it runs between 3 and 4 times as
fast:

  real    0m0.621s
  user    0m0.588s
  sys     0m0.032s

Implementing the timestamp-checking thing that name-rev does would
probably drop it even more (at the expense of less accuracy in the
face of clock skew).

diff --git a/builtin/tag.c b/builtin/tag.c
index d311491..d18c3ed 100644
--- a/builtin/tag.c
+++ b/builtin/tag.c
@@ -10,6 +10,8 @@
 #include "builtin.h"
 #include "refs.h"
 #include "tag.h"
+#include "diff.h"
+#include "revision.h"
 #include "run-command.h"
 #include "parse-options.h"
 
@@ -31,6 +33,42 @@ struct tag_filter {
 
 #define PGP_SIGNATURE "-----BEGIN PGP SIGNATURE-----"
 
+static int in_commit_list(const struct commit_list *want, struct commit *c)
+{
+	for (; want; want = want->next)
+		if (!hashcmp(want->item->object.sha1, c->object.sha1))
+			return 1;
+	return 0;
+}
+
+static int contains(const struct commit_list *want, struct commit *candidate)
+{
+	struct commit_list *p;
+
+	/* was it previously marked as containing a want commit? */
+	if (candidate->object.flags & TMP_MARK)
+		return 1;
+	/* or marked as not possibly containing a want commit? */
+	if (candidate->object.flags & UNINTERESTING)
+		return 0;
+	/* or are we it? */
+	if (in_commit_list(want, candidate))
+		return 1;
+
+	/* If not, then try parents, and be sure to mark ourselves
+	 * for future traversals. */
+	parse_commit(candidate);
+	for (p = candidate->parents; p; p = p->next) {
+		if (contains(want, p->item)) {
+			candidate->object.flags |= TMP_MARK;
+			return 1;
+		}
+	}
+
+	candidate->object.flags |= UNINTERESTING;
+	return 0;
+}
+
 static int show_reference(const char *refname, const unsigned char *sha1,
 			  int flag, void *cb_data)
 {
@@ -49,7 +87,7 @@ static int show_reference(const char *refname, const unsigned char *sha1,
 			commit = lookup_commit_reference_gently(sha1, 1);
 			if (!commit)
 				return 0;
-			if (!is_descendant_of(commit, filter->with_commit))
+			if (!contains(filter->with_commit, commit))
 				return 0;
 		}
 


* Re: Why is "git tag --contains" so slow?
  2010-07-01 15:38       ` Jeff King
@ 2010-07-02 19:26         ` tytso
  2010-07-03  8:06           ` Jeff King
  0 siblings, 1 reply; 46+ messages in thread
From: tytso @ 2010-07-02 19:26 UTC (permalink / raw)
  To: Jeff King; +Cc: Avery Pennarun, git

On Thu, Jul 01, 2010 at 11:38:42AM -0400, Jeff King wrote:
> 
> Here is a quick and dirty patch to implement what I suggested. With it,
> I get the same results as above, but it runs between 3 and 4 times as
> fast:
> 
>   real    0m0.621s
>   user    0m0.588s
>   sys     0m0.032s

I just tried your patch, and with a large number of tags (198 tags,
from v2.6.11 to v2.6.34 with all of the -rc releases of the linux
kernel), it is indeed faster: 8.5 seconds without the patch versus 2.3
seconds with the patch.

However, if I remove a large number of tags (since I know this is
something that was introduced since 2.6.33, I made a shared clone
of the repository and removed all of the tags from 2.6.11
through 2.6.33, so only 19 tags were in play), the time to
execute "git tag --contains" became 1.3 seconds without the patch,
versus 2.9 seconds with the patch.

So with the oldest tags removed, your patch actually made things run
*slower* (1.3 vs 2.9 seconds, which was counter-intuitive to me), and
the fastest way to speed things up was to restrict the tags that would
be searched.

Which gives me the technique I should use to solve my immediate
problem, but if the user knows that the patch was merged into mainline
within the last 4 months, maybe what would be better is either a way
to specify a regexp (or list) of tags that the user finds
"interesting" as --contains candidate, or a "--since=4 months"
argument.

	   	      	   	      	      - Ted


* Re: Why is "git tag --contains" so slow?
  2010-07-02 19:26         ` tytso
@ 2010-07-03  8:06           ` Jeff King
  2010-07-04  0:55             ` tytso
  0 siblings, 1 reply; 46+ messages in thread
From: Jeff King @ 2010-07-03  8:06 UTC (permalink / raw)
  To: tytso; +Cc: Avery Pennarun, git

On Fri, Jul 02, 2010 at 03:26:12PM -0400, tytso@mit.edu wrote:

> I just tried your patch, and with a large number of tags (198 tags,
> from v2.6.11 to v2.6.34 with all of the -rc releases of the linux
> kernel), it is indeed faster: 8.5 seconds without the patch versus 2.3
> seconds with the patch.
> 
> However, if I remove a large number of tags (since I know this is
> something that was introduced since 2.6.33, I made a shared clone
> of the repository and removed all of the tags from 2.6.11
> through 2.6.33, so only 19 tags were in play), the time to
> execute "git tag --contains" became 1.3 seconds without the patch,
> versus 2.9 seconds with the patch.
> 
> So with the oldest tags removed, your patch actually made things run
> *slower* (1.3 vs 2.9 seconds, which was counter-intuitive to me), and
> the fastest way to speed things up was to restrict the tags that would
> be searched.

I'm guessing that it is caused by the depth-first search that my patch
does. If we follow the "wrong" parent of a merge (i.e., the one that the
commit in question is not on), then we will end up hitting all commits
down to the roots before backtracking and looking down the right
parent.

I noticed that my improved time for "git tag --contains" was similar to
the total time for "git rev-list --all >/dev/null". Can you try timing
that? My suspicion is that it is going to be about 2.9 seconds for you.

So we could potentially improve my patch by doing a breadth-first
search, although that is a bit trickier (since the point is to mark and
prune whole subgraphs of history). But I'm not sure it is worth it in
practice. It will make some lookups quicker, but in most cases you will
end up going to the roots anyway for a negative lookup (in your case it
was only faster because you got rid of all the old tags). A better
strategy is to prune based on commit date so we _never_ go to the roots,
even for a negative hit. That should give similar speedups as a
breadth-first search would to your case, but also make the normal case
much faster by quickly discarding nonsensical paths (e.g., there is not
much point following a commit made 3 years ago to see if it contains a
commit made yesterday).
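The date-cutoff pruning can be sketched the same way (hedged, invented
timestamps and graph; git's real code reads dates from commit objects,
and the actual C patch follows below):

```python
def contains_cutoff(candidate, want, parents, dates, cutoff, marks):
    if candidate in marks:            # cached answer from an earlier walk
        return marks[candidate]
    if candidate in want:             # want check comes before the date
        return 1                      # check, as in the patch
    if dates[candidate] < cutoff:     # prune: too old to contain a newer
        return 0                      # commit (unmarked, like the patch)
    result = 0
    for p in parents.get(candidate, ()):
        if contains_cutoff(p, want, parents, dates, cutoff, marks):
            result = 1
            break
    marks[candidate] = result
    return result

# Normal history: the pruned walk still finds the want commit.
parents = {"tip": ["mid"], "mid": ["root"], "root": []}
dates = {"tip": 300, "mid": 250, "root": 100}
assert contains_cutoff("tip", {"mid"}, parents, dates, 240, {}) == 1

# Skewed history: a commit dated "too old" sits between tip and the
# want commit, so the pruned walk wrongly (but quickly) answers no.
parents = {"tip": ["skewed"], "skewed": ["target"], "target": []}
dates = {"tip": 300, "skewed": 100, "target": 250}
assert contains_cutoff("tip", {"target"}, parents, dates, 200, {}) == 0
```

The second assertion is the clock-skew failure mode discussed later in
this thread; setting `cutoff` to 0 restores the thorough (slow) answer.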

Try the patch below, which adds a date cutoff similar to the one used in
name-rev. It's much faster in my tests:

  # stock "git tag --contains HEAD~200"
  real    0m2.179s
  user    0m2.156s
  sys     0m0.020s

  # my first patch
  real    0m0.621s
  user    0m0.588s
  sys     0m0.032s

  # this patch
  real    0m0.041s
  user    0m0.036s
  sys     0m0.004s

For non-pathological cases, I expect it to perform equal to stock git at
worst, and in practice much better (for pathological cases, its worst
case is equivalent to "git rev-list --all", which is still not that
bad). Let me know how it does on your case.

The obvious downside is that it stops looking down a path in the face of
extreme clock skew. We could perhaps allow a "--contains=thorough" to
spend a little more time to achieve a better answer (i.e., ignore the
cutoff date).

diff --git a/builtin/tag.c b/builtin/tag.c
index d311491..5768a44 100644
--- a/builtin/tag.c
+++ b/builtin/tag.c
@@ -10,6 +10,8 @@
 #include "builtin.h"
 #include "refs.h"
 #include "tag.h"
+#include "diff.h"
+#include "revision.h"
 #include "run-command.h"
 #include "parse-options.h"
 
@@ -31,6 +33,61 @@ struct tag_filter {
 
 #define PGP_SIGNATURE "-----BEGIN PGP SIGNATURE-----"
 
+static int in_commit_list(const struct commit_list *want, struct commit *c)
+{
+	for (; want; want = want->next)
+		if (!hashcmp(want->item->object.sha1, c->object.sha1))
+			return 1;
+	return 0;
+}
+
+static int contains_recurse(const struct commit_list *want,
+			    struct commit *candidate,
+			    unsigned long cutoff)
+{
+	struct commit_list *p;
+
+	/* was it previously marked as containing a want commit? */
+	if (candidate->object.flags & TMP_MARK)
+		return 1;
+	/* or marked as not possibly containing a want commit? */
+	if (candidate->object.flags & UNINTERESTING)
+		return 0;
+	/* or are we it? */
+	if (in_commit_list(want, candidate))
+		return 1;
+
+	/* stop searching if we go too far back in time */
+	parse_commit(candidate);
+	if (candidate->date < cutoff)
+		return 0;
+
+	/* If not, then try parents, and be sure to mark ourselves
+	 * for future traversals. */
+	for (p = candidate->parents; p; p = p->next) {
+		if (contains_recurse(want, p->item, cutoff)) {
+			candidate->object.flags |= TMP_MARK;
+			return 1;
+		}
+	}
+
+	candidate->object.flags |= UNINTERESTING;
+	return 0;
+}
+
+static int contains(const struct commit_list *want, struct commit *candidate)
+{
+	const struct commit_list *c;
+	unsigned long min_date = ULONG_MAX;
+	for (c = want; c; c = c->next) {
+		parse_commit(c->item);
+		if (c->item->date < min_date)
+			min_date = c->item->date;
+	}
+	/* give a full day of clock skew slop */
+	return contains_recurse(want, candidate, min_date - 86400);
+}
+
 static int show_reference(const char *refname, const unsigned char *sha1,
 			  int flag, void *cb_data)
 {
@@ -49,7 +106,7 @@ static int show_reference(const char *refname, const unsigned char *sha1,
 			commit = lookup_commit_reference_gently(sha1, 1);
 			if (!commit)
 				return 0;
-			if (!is_descendant_of(commit, filter->with_commit))
+			if (!contains(filter->with_commit, commit))
 				return 0;
 		}
 


* Re: Why is "git tag --contains" so slow?
  2010-07-01  0:58 ` Shawn O. Pearce
@ 2010-07-03 23:27   ` Sam Vilain
  0 siblings, 0 replies; 46+ messages in thread
From: Sam Vilain @ 2010-07-03 23:27 UTC (permalink / raw)
  To: Shawn O. Pearce; +Cc: Theodore Ts'o, git, Nick Edelen

On Wed, 2010-06-30 at 17:58 -0700, Shawn O. Pearce wrote:
> > I can do this using "git tag --contains <commit-id>", but it's quite
> > slow.  It takes something like 8-9 seconds.  (8.5 seconds of user time,
> > 8.6 seconds of wall clock, to be precise).
> > 
> > I can get the information much faster using "gitk -1 <commit-id>", which
> > [...]
> > using the built-in native git tools?  And if so, why is git tag
> > --contains apparently 4 times slower than gitk at performing this task?
> 
> gitk keeps a cache of this stuff in .git/gitk.cache.  I'll bet it's
> pulling from the cache here, which is why it snaps so fast.
> 
> Without the cache it is expensive, which is what 'git tag --contains'
> is doing.  The code walks back from each tag, tracing along the commit
> parent pointers and keeping track of the nearest tag name that can reach
> any given commit.  When it finds your commit, it stops and outputs.
> 
> Since this stuff can't change unless the refs change, yes, it can be
> cached easily.  But nobody has done caching for this in Git itself,
> only in Tcl for gitk.  :-\

I think rev-cache could possibly improve this case.  Maybe it could also
be used for gitk.  Nick, perhaps you might like to say more and comment
on what work remains to be done?

Sam


* Re: Why is "git tag --contains" so slow?
  2010-07-03  8:06           ` Jeff King
@ 2010-07-04  0:55             ` tytso
  2010-07-05 12:27               ` Jeff King
  0 siblings, 1 reply; 46+ messages in thread
From: tytso @ 2010-07-04  0:55 UTC (permalink / raw)
  To: Jeff King; +Cc: Avery Pennarun, git

On Sat, Jul 03, 2010 at 04:06:19AM -0400, Jeff King wrote:
> 
> I noticed that my improved time for "git tag --contains" was similar to
> the total time for "git rev-list --all >/dev/null". Can you try timing
> that? My suspicion is that it is going to be about 2.9 seconds for you.

I'm at home, so getting access to my work machine is a bit of a pain,
so I replicated the experiment at home.

% time /tmp/git.1.7.2-rc1 tag --contains 307ae18 | wc -l
13

real	0m13.706s
user	0m13.542s
sys	0m0.150s

% time /tmp/git.patch.1 tag --contains 307ae18 | wc -l
13

real	0m2.869s
user	0m2.703s
sys	0m0.163s

% time /tmp/git.patch.1 rev-list --all  > /dev/null

real   0m3.074s
user   0m2.920s
sys    0m0.147s


> Try the patch below, which adds a date cutoff similar to the one used in
> name-rev. It's much faster in my tests:

Yep, much faster indeed:

% time /tmp/git.patch.2 tag --contains 307ae18 | wc -l
13

real	0m0.054s
user	0m0.030s
sys	0m0.023s

> The obvious downside is that it stops looking down a path in the face of
> extreme clock skew. We could perhaps allow a "--contains=thorough" to
> spend a little more time to achieve a better answer (i.e., ignore the
> cutoff date).

Yep, it does blow up in the face of the extreme clock skew in some of
the ext4 commits in the Linux kernel tree.  (Sorry about that; mea
culpa, I didn't realize at the time this was a problem, and it was my
workflow using the guilt program which happened to introduce them.)

In any case, because of the ext4 commits, I can show a test case which
doesn't work well with your date cutoff patch:

#!/bin/sh

for i in $(git log --reverse --oneline v2.6.32..v2.6.35-rc3 fs/ext4 fs/jbd2 |
      	       awk '{print $1}')
do
    echo -n "$i "
    /tmp/git.patch.2 tag --contains $i | head -1
done

You won't see any problems after v2.6.34; I fixed my workflow once I
was told it was causing git problems.

If you replace this with the unpatched git, or with git with your
first patch, it will be of course much slower, but it will print out
all of the tags that would be expected given a topological
examination of the commit graph.  Whether this is a bug in git or a bug
that I introduced into the Linux kernel git tree is of course an open
question.  However, if we do want to allow git operations to work
quickly --- and I agree this is a good thing; the speedups that this
can allow are quite significant --- maybe we should teach "git commit"
to either print a warning or outright refuse to introduce clock skews
in the first place.

(Or maybe we have git config options that can enable or disable
optimizations that depend on the lack of clock skews; but I could
understand people not wanting to maintain the extra code paths.)
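The warn-at-commit-time check would be cheap: compare the new commit's
timestamp against its parents'.  A hypothetical sketch (this is not git
code; the function name and slop parameter are invented for
illustration):

```python
def parent_postdates_child(child_date, parent_dates, slop=0):
    """Return True if any parent's commit timestamp is later than the
    child's by more than `slop` seconds, i.e. the new commit would
    introduce clock skew into the history."""
    return any(p > child_date + slop for p in parent_dates)

# A parent "from the future" relative to its child is flagged:
assert parent_postdates_child(1000, [900, 2000]) is True
# Normal monotonically increasing history is not:
assert parent_postdates_child(1000, [900, 999]) is False
```

A commit-creating tool could run this check against HEAD's parents and
warn (or refuse, per configuration) before writing the commit object.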

	   	      	      	 	  - Ted


* Re: Why is "git tag --contains" so slow?
  2010-07-04  0:55             ` tytso
@ 2010-07-05 12:27               ` Jeff King
  2010-07-05 12:33                 ` [RFC/PATCH 1/4] tag: speed up --contains calculation Jeff King
                                   ` (5 more replies)
  0 siblings, 6 replies; 46+ messages in thread
From: Jeff King @ 2010-07-05 12:27 UTC (permalink / raw)
  To: tytso; +Cc: Avery Pennarun, git

On Sat, Jul 03, 2010 at 08:55:43PM -0400, tytso@mit.edu wrote:

> > I noticed that my improved time for "git tag --contains" was similar to
> > the total time for "git rev-list --all >/dev/null". Can you try timing
> > that? My suspicion is that it is going to be about 2.9 seconds for you.
> 
> I'm at home, so getting access to my work machine is a bit of a pain,
> so I replicated the experiment at home.

Thanks. Those numbers confirm what I had been thinking.

> Yep, it does blow up in the face of the extreme clock skew in some of
> the ext4 commits in the Linux kernel tree.  (Sorry about that; mea
> culpa, I didn't realize at the time this was a problem, and it was my
> workflow using the guilt program which happened to introduce them.)

Yes, I was thinking specifically of those commits when I warned about
clock skew. :)

> In any case, because of the ext4 commits, I can show a test case which
> doesn't work well with your date cutoff patch:

Not surprising. I think you will find that "git name-rev" (or "git
describe --contains", which simply calls name-rev) will have similar
problems.

> (Or maybe we have git config options that can enable or disable
> optimizations that depend on the lack of clock skews; but I could
> understand people not wanting to maintian the extra code paths.)

I think the best thing we can do is provide a "how much clock skew to
tolerate" variable, and give it a sane default. Then people who know
they have skewed repositories can make the correctness-optimization
tradeoff as they see fit.

The extra code is very minor. It's really only a line or two of code
when calculating the cutoff date to convert "be thorough" into a cutoff
date of 0.

The real question is what that default should be. Name-rev already uses
86400 seconds. The worst skew in git.git is 8 seconds. The worst skew in
linux-2.6.git is 8622098 (about 100 days). For reference, here are my
timings on "git tag --contains HEAD~200" for various allowable clock
skew values:

  0 (don't allow even a second of clock skew): .035s
  86400 (one day of clock skew allowed): .034s
  8622098 (the worst skew in linux-2.6): .252s
  infinite (never cutoff for clock skew): 5.373s

So anything below a day is pointless and lost in the noise. Even 100
days yields quite a satisfactory speedup from the current code, but
obviously that number is totally arbitrary based on one repo.

As you probably guessed from the specificity of the number, I wrote a
short program to actually traverse and find the worst skew. It takes
about 5 seconds to run (unsurprisingly, since it is doing the same full
traversal that we end up doing in the above numbers). So we could
"autoskew" by setting up the configuration on clone, and then
periodically updating it as part of "git gc".

That is perhaps over-engineering (and would add a few seconds to a
clone), but I like that it would Just Work without the user doing
anything.
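The skew-measuring program itself isn't shown in the thread; roughly,
it just visits every parent edge and records how far any parent's
timestamp exceeds its child's.  A small sketch with invented data:

```python
def worst_skew(parents, dates):
    """Largest number of seconds by which any parent postdates its
    child (0 if timestamps are monotonic along every edge)."""
    skew = 0
    for child, ps in parents.items():
        for p in ps:
            skew = max(skew, dates[p] - dates[child])
    return skew

# "b" claims a timestamp 1000 seconds after its child "c".
parents = {"c": ["b"], "b": ["a"], "a": []}
dates = {"c": 1000, "b": 2000, "a": 500}
assert worst_skew(parents, dates) == 1000
```

The "autoskew" idea would run such a scan during gc and store the
result as the tolerated-skew configuration value.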

I'll follow this mail up with a series that implements a cleaned-up
version of the previous patches in this thread, and we'll see what
others think.

-Peff


* [RFC/PATCH 1/4] tag: speed up --contains calculation
  2010-07-05 12:27               ` Jeff King
@ 2010-07-05 12:33                 ` Jeff King
  2010-10-13 22:07                   ` Jonathan Nieder
                                     ` (2 more replies)
  2010-07-05 12:34                 ` [RFC/PATCH 2/4] limit "contains" traversals based on commit timestamp Jeff King
                                   ` (4 subsequent siblings)
  5 siblings, 3 replies; 46+ messages in thread
From: Jeff King @ 2010-07-05 12:33 UTC (permalink / raw)
  To: tytso; +Cc: Avery Pennarun, git

When we want to know if commit A contains commit B (or any
one of a set of commits, B through Z), we generally
calculate the merge bases and see if B is a merge base of A
(or for a set, if any of the commits B through Z have that
property).

When we are going to check a series of commits A1 through An
to see whether each contains B (e.g., because we are
deciding which tags to show with "git tag --contains"), we
do a series of merge base calculations. This can be very
expensive, as we repeat a lot of traversal work.

Instead, let's leverage the fact that we are going to use
the same --contains list for each tag, and mark areas of the
commit graph as definitely containing those commits or
definitely not containing those commits. Later tags can then
stop traversing as soon as they see a previously calculated
answer.

This sped up "git tag --contains HEAD~200" in the linux-2.6
repository from:

  real    0m15.417s
  user    0m15.197s
  sys     0m0.220s

to:

  real    0m5.329s
  user    0m5.144s
  sys     0m0.184s

Signed-off-by: Jeff King <peff@peff.net>
---
Note that this is a depth-first search, whereas we generally traverse in
a more breadth-first way. So it can actually make things slightly slower
than the current merge-base code, if:

  1. You don't have any merge base calculations that involve going very
     far back in history.

  2. We depth-first down the wrong side of a merge.

However, in the usual cases, I think it will perform much better.
Comparing an old tag with a recent commit means we have to look at a lot
of the commit graph, anyway.

If anybody has a clever idea for making it breadth-first, I would be
happy to hear it.

Other caveats:

  1. This can probably be a one-liner change to use in "git branch
     --contains", as well. I didn't measure it, as that already tends to
     be reasonably fast.

  2. It uses commit marks, which means it doesn't behave well with other
     traversals. I have no idea if I should be using alternate marks or
     not.

  3. We never clear the marks, or check them against the "want" list. So
     we just assume that repeated calls to contains will have the same
     "want" list.

 builtin/tag.c |    2 +-
 commit.c      |   42 ++++++++++++++++++++++++++++++++++++++++++
 commit.h      |    2 ++
 3 files changed, 45 insertions(+), 1 deletions(-)

diff --git a/builtin/tag.c b/builtin/tag.c
index d311491..c200e1e 100644
--- a/builtin/tag.c
+++ b/builtin/tag.c
@@ -49,7 +49,7 @@ static int show_reference(const char *refname, const unsigned char *sha1,
 			commit = lookup_commit_reference_gently(sha1, 1);
 			if (!commit)
 				return 0;
-			if (!is_descendant_of(commit, filter->with_commit))
+			if (!contains(commit, filter->with_commit))
 				return 0;
 		}
 
diff --git a/commit.c b/commit.c
index e9b0750..20354c6 100644
--- a/commit.c
+++ b/commit.c
@@ -845,3 +845,45 @@ int commit_tree(const char *msg, unsigned char *tree,
 	strbuf_release(&buffer);
 	return result;
 }
+
+static int in_commit_list(const struct commit_list *want, struct commit *c)
+{
+	for (; want; want = want->next)
+		if (!hashcmp(want->item->object.sha1, c->object.sha1))
+			return 1;
+	return 0;
+}
+
+static int contains_recurse(struct commit *candidate,
+			    const struct commit_list *want)
+{
+	struct commit_list *p;
+
+	/* was it previously marked as containing a want commit? */
+	if (candidate->object.flags & TMP_MARK)
+		return 1;
+	/* or marked as not possibly containing a want commit? */
+	if (candidate->object.flags & UNINTERESTING)
+		return 0;
+	/* or are we it? */
+	if (in_commit_list(want, candidate))
+		return 1;
+
+	if (parse_commit(candidate) < 0)
+		return 0;
+
+	/* Otherwise recurse and mark ourselves for future traversals. */
+	for (p = candidate->parents; p; p = p->next) {
+		if (contains_recurse(p->item, want)) {
+			candidate->object.flags |= TMP_MARK;
+			return 1;
+		}
+	}
+	candidate->object.flags |= UNINTERESTING;
+	return 0;
+}
+
+int contains(struct commit *candidate, const struct commit_list *want)
+{
+	return contains_recurse(candidate, want);
+}
diff --git a/commit.h b/commit.h
index eb2b8ac..1a7299e 100644
--- a/commit.h
+++ b/commit.h
@@ -153,6 +153,8 @@ extern struct commit_list *get_shallow_commits(struct object_array *heads,
 int is_descendant_of(struct commit *, struct commit_list *);
 int in_merge_bases(struct commit *, struct commit **, int);
 
+int contains(struct commit *, const struct commit_list *);
+
 extern int interactive_add(int argc, const char **argv, const char *prefix);
 extern int run_add_interactive(const char *revision, const char *patch_mode,
 			       const char **pathspec);
-- 
1.7.2.rc1.209.g2a36c

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC/PATCH 2/4] limit "contains" traversals based on commit timestamp
  2010-07-05 12:27               ` Jeff King
  2010-07-05 12:33                 ` [RFC/PATCH 1/4] tag: speed up --contains calculation Jeff King
@ 2010-07-05 12:34                 ` Jeff King
  2010-10-13 23:21                   ` Jonathan Nieder
  2010-07-05 12:35                 ` [RFC/PATCH 3/4] default core.clockskew variable to one day Jeff King
                                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 46+ messages in thread
From: Jeff King @ 2010-07-05 12:34 UTC (permalink / raw)
  To: tytso; +Cc: Avery Pennarun, git

When looking for commits that contain other commits (e.g.,
via "git tag --contains"), we can end up traversing useless
portions of the graph. For example, if I am looking for a
tag that contains a commit made last week, there is not much
point in traversing portions of the history graph made five
years ago.

This optimization can provide massive speedups. For example,
doing "git tag --contains HEAD~200" in the linux-2.6
repository goes from:

  real    0m5.302s
  user    0m5.116s
  sys     0m0.184s

to:

  real    0m0.030s
  user    0m0.020s
  sys     0m0.008s

The downside is that we will no longer find some answers in
the face of extreme clock skew, as we will stop the
traversal early when seeing commits skewed too far into the
past.

Name-rev already implements a similar optimization, using a
"slop" of one day to allow for a certain amount of clock
skew in commit timestamps. This patch introduces a
"core.clockskew" variable, which allows specifying the
allowable amount of clock skew in seconds.  For safety, it
defaults to "none", causing a full traversal (i.e., no
change in behavior from previous versions).

Signed-off-by: Jeff King <peff@peff.net>
---
 cache.h  |    1 +
 commit.c |   27 ++++++++++++++++++++++++---
 config.c |    8 ++++++++
 3 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/cache.h b/cache.h
index c9fa3df..2dfb636 100644
--- a/cache.h
+++ b/cache.h
@@ -551,6 +551,7 @@ extern int read_replace_refs;
 extern int fsync_object_files;
 extern int core_preload_index;
 extern int core_apply_sparse_checkout;
+extern int core_clock_skew;
 
 enum safe_crlf {
 	SAFE_CRLF_FALSE = 0,
diff --git a/commit.c b/commit.c
index 20354c6..c849b50 100644
--- a/commit.c
+++ b/commit.c
@@ -7,6 +7,7 @@
 #include "revision.h"
 #include "notes.h"
 
+int core_clock_skew = -1;
 int save_commit_buffer = 1;
 
 const char *commit_type = "commit";
@@ -855,7 +856,8 @@ static int in_commit_list(const struct commit_list *want, struct commit *c)
 }
 
 static int contains_recurse(struct commit *candidate,
-			    const struct commit_list *want)
+			    const struct commit_list *want,
+			    unsigned long cutoff)
 {
 	struct commit_list *p;
 
@@ -872,9 +874,13 @@ static int contains_recurse(struct commit *candidate,
 	if (parse_commit(candidate) < 0)
 		return 0;
 
+	/* stop searching if we go too far back in time */
+	if (candidate->date < cutoff)
+		return 0;
+
 	/* Otherwise recurse and mark ourselves for future traversals. */
 	for (p = candidate->parents; p; p = p->next) {
-		if (contains_recurse(p->item, want)) {
+		if (contains_recurse(p->item, want, cutoff)) {
 			candidate->object.flags |= TMP_MARK;
 			return 1;
 		}
@@ -885,5 +891,20 @@ static int contains_recurse(struct commit *candidate,
 
 int contains(struct commit *candidate, const struct commit_list *want)
 {
-	return contains_recurse(candidate, want);
+	unsigned long cutoff = 0;
+
+	if (core_clock_skew >= 0) {
+		const struct commit_list *c;
+		unsigned long min_date = ULONG_MAX;
+		for (c = want; c; c = c->next) {
+			if (parse_commit(c->item) < 0)
+				continue;
+			if (c->item->date < min_date)
+				min_date = c->item->date;
+		}
+		if (min_date > core_clock_skew)
+			cutoff = min_date - core_clock_skew;
+	}
+
+	return contains_recurse(candidate, want, cutoff);
 }
diff --git a/config.c b/config.c
index cdcf583..7a18bc9 100644
--- a/config.c
+++ b/config.c
@@ -595,6 +595,14 @@ static int git_default_core_config(const char *var, const char *value)
 		return 0;
 	}
 
+	if (!strcmp(var, "core.clockskew")) {
+		if (!value || !strcmp(value, "none"))
+			core_clock_skew = -1;
+		else
+			core_clock_skew = git_config_int(var, value);
+		return 0;
+	}
+
 	/* Add other config variables here and to Documentation/config.txt. */
 	return 0;
 }
-- 
1.7.2.rc1.209.g2a36c

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC/PATCH 3/4] default core.clockskew variable to one day
  2010-07-05 12:27               ` Jeff King
  2010-07-05 12:33                 ` [RFC/PATCH 1/4] tag: speed up --contains calculation Jeff King
  2010-07-05 12:34                 ` [RFC/PATCH 2/4] limit "contains" traversals based on commit timestamp Jeff King
@ 2010-07-05 12:35                 ` Jeff King
  2010-07-05 12:36                 ` [RFC/PATCH 4/4] name-rev: respect core.clockskew Jeff King
                                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 46+ messages in thread
From: Jeff King @ 2010-07-05 12:35 UTC (permalink / raw)
  To: tytso; +Cc: Avery Pennarun, git

This is the slop value used by name-rev, so presumably is a
reasonable default.

Signed-off-by: Jeff King <peff@peff.net>
---
Actually, I think auto-detecting the skew in a repo during "gc" might be
an even better default. See:

  http://article.gmane.org/gmane.comp.version-control.git/150260

earlier in this thread for some numbers and discussion.

 commit.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/commit.c b/commit.c
index c849b50..4cbe756 100644
--- a/commit.c
+++ b/commit.c
@@ -7,7 +7,7 @@
 #include "revision.h"
 #include "notes.h"
 
-int core_clock_skew = -1;
+int core_clock_skew = 86400;
 int save_commit_buffer = 1;
 
 const char *commit_type = "commit";
-- 
1.7.2.rc1.209.g2a36c

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC/PATCH 4/4] name-rev: respect core.clockskew
  2010-07-05 12:27               ` Jeff King
                                   ` (2 preceding siblings ...)
  2010-07-05 12:35                 ` [RFC/PATCH 3/4] default core.clockskew variable to one day Jeff King
@ 2010-07-05 12:36                 ` Jeff King
  2010-07-05 12:39                 ` Why is "git tag --contains" so slow? Jeff King
  2010-07-05 14:10                 ` tytso
  5 siblings, 0 replies; 46+ messages in thread
From: Jeff King @ 2010-07-05 12:36 UTC (permalink / raw)
  To: tytso; +Cc: Avery Pennarun, git

Before core.clockskew existed, we were just hard-coded to
86400 seconds. This allows users to tweak the value as
appropriate.

Signed-off-by: Jeff King <peff@peff.net>
---
I think it makes sense to adapt this to core.clockskew. Especially if
our default stays at 86400, then there is no change in behavior. But
even if it does change, I think having it tweakable (and especially if
we automatically calculate per-repo skew) makes sense.

 builtin/name-rev.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/builtin/name-rev.c b/builtin/name-rev.c
index 06a38ac..5c6d712 100644
--- a/builtin/name-rev.c
+++ b/builtin/name-rev.c
@@ -5,8 +5,6 @@
 #include "refs.h"
 #include "parse-options.h"
 
-#define CUTOFF_DATE_SLOP 86400 /* one day */
-
 typedef struct rev_name {
 	const char *tip_name;
 	int generation;
@@ -270,8 +268,10 @@ int cmd_name_rev(int argc, const char **argv, const char *prefix)
 		add_object_array((struct object *)commit, *argv, &revs);
 	}
 
-	if (cutoff)
-		cutoff = cutoff - CUTOFF_DATE_SLOP;
+	if (cutoff && core_clock_skew >= 0)
+		cutoff = cutoff - core_clock_skew;
+	else
+		cutoff = 0;
 	for_each_ref(name_ref, &data);
 
 	if (transform_stdin) {
-- 
1.7.2.rc1.209.g2a36c

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: Why is "git tag --contains" so slow?
  2010-07-05 12:27               ` Jeff King
                                   ` (3 preceding siblings ...)
  2010-07-05 12:36                 ` [RFC/PATCH 4/4] name-rev: respect core.clockskew Jeff King
@ 2010-07-05 12:39                 ` Jeff King
  2010-10-14 18:59                   ` Jonathan Nieder
  2010-07-05 14:10                 ` tytso
  5 siblings, 1 reply; 46+ messages in thread
From: Jeff King @ 2010-07-05 12:39 UTC (permalink / raw)
  To: tytso; +Cc: Avery Pennarun, git

On Mon, Jul 05, 2010 at 08:27:23AM -0400, Jeff King wrote:

> As you probably guessed from the specificity of the number, I wrote a
> short program to actually traverse and find the worst skew. It takes
> about 5 seconds to run (unsurprisingly, since it is doing the same full
> traversal that we end up doing in the above numbers). So we could
> "autoskew" by setting up the configuration on clone, and then
> periodically updating it as part of "git gc".

This patch doesn't implement auto-detection of skew, but is the program
I used to calculate, and would provide the basis for such
auto-detection. It would be interesting to see average skew numbers for
popular repositories. You can run it as "git skew --all".

diff --git a/.gitignore b/.gitignore
index 14e2b6b..90aff17 100644
--- a/.gitignore
+++ b/.gitignore
@@ -132,6 +132,7 @@
 /git-show-branch
 /git-show-index
 /git-show-ref
+/git-skew
 /git-stage
 /git-stash
 /git-status
diff --git a/Makefile b/Makefile
index 9aca8a1..e673bdf 100644
--- a/Makefile
+++ b/Makefile
@@ -725,6 +725,7 @@ BUILTIN_OBJS += builtin/send-pack.o
 BUILTIN_OBJS += builtin/shortlog.o
 BUILTIN_OBJS += builtin/show-branch.o
 BUILTIN_OBJS += builtin/show-ref.o
+BUILTIN_OBJS += builtin/skew.o
 BUILTIN_OBJS += builtin/stripspace.o
 BUILTIN_OBJS += builtin/symbolic-ref.o
 BUILTIN_OBJS += builtin/tag.o
diff --git a/builtin.h b/builtin.h
index ed6ee26..5f5dc0a 100644
--- a/builtin.h
+++ b/builtin.h
@@ -141,5 +141,6 @@ extern int cmd_verify_pack(int argc, const char **argv, const char *prefix);
 extern int cmd_show_ref(int argc, const char **argv, const char *prefix);
 extern int cmd_pack_refs(int argc, const char **argv, const char *prefix);
 extern int cmd_replace(int argc, const char **argv, const char *prefix);
+extern int cmd_skew(int argc, const char **argv, const char *prefix);
 
 #endif
diff --git a/builtin/skew.c b/builtin/skew.c
new file mode 100644
index 0000000..1046f5f
--- /dev/null
+++ b/builtin/skew.c
@@ -0,0 +1,50 @@
+#include "cache.h"
+#include "commit.h"
+#include "diff.h"
+#include "revision.h"
+
+unsigned long worst_skew = 0;
+
+static void check_skew_recurse(struct commit *c, unsigned long when)
+{
+	struct commit_list *p;
+
+	if (c->object.flags & SEEN)
+		return;
+	c->object.flags |= SEEN;
+
+	if (parse_commit(c) < 0)
+		return;
+
+	if (c->date > when) {
+		unsigned long skew = c->date - when;
+		if (skew > worst_skew)
+			worst_skew = skew;
+	}
+
+	for (p = c->parents; p; p = p->next)
+		check_skew_recurse(p->item, c->date < when ? c->date : when);
+}
+
+static void check_skew(struct commit *c)
+{
+	check_skew_recurse(c, time(NULL));
+}
+
+int cmd_skew(int argc, const char **argv, const char *prefix) {
+	struct rev_info revs;
+	int i;
+
+	git_config(git_default_config, NULL);
+	init_revisions(&revs, prefix);
+	argc = setup_revisions(argc, argv, &revs, NULL);
+
+	for (i = 0; i < revs.pending.nr; i++) {
+		struct object *o = revs.pending.objects[i].item;
+		if (o->type == OBJ_COMMIT)
+			check_skew((struct commit *)o);
+	}
+
+	printf("%lu\n", worst_skew);
+	return 0;
+}
diff --git a/git.c b/git.c
index 265fa09..8a77fe3 100644
--- a/git.c
+++ b/git.c
@@ -399,6 +399,7 @@ static void handle_internal_command(int argc, const char **argv)
 		{ "verify-pack", cmd_verify_pack },
 		{ "show-ref", cmd_show_ref, RUN_SETUP },
 		{ "pack-refs", cmd_pack_refs, RUN_SETUP },
+		{ "skew", cmd_skew, RUN_SETUP },
 	};
 	int i;
 	static const char ext[] = STRIP_EXTENSION;

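[Editor's note: the core of the measurement above can be modeled in Python; this is a sketch of the same idea (walk from a tip carrying the minimum date seen so far along the path, and record the largest amount by which a commit's date exceeds it), not the C code itself. The Commit class and dates are invented for illustration.]

```python
class Commit:
    """Toy stand-in for a git commit: just a date and parent links."""
    def __init__(self, date, parents=()):
        self.date, self.parents = date, list(parents)

def worst_skew(tip, now):
    """Largest amount by which any commit's date exceeds the minimum
    date seen on the path down to it (0 if history is monotonic)."""
    worst = 0
    seen = set()
    stack = [(tip, now)]
    while stack:
        c, when = stack.pop()
        if id(c) in seen:
            continue
        seen.add(id(c))
        if c.date > when:
            worst = max(worst, c.date - when)
        for p in c.parents:
            # carry the smallest date seen so far down the path
            stack.append((p, min(c.date, when)))
    return worst

# Monotonic history: no skew.
a = Commit(100)
b = Commit(200, [a])
assert worst_skew(b, 300) == 0

# A parent dated 50 seconds after its child shows up as skew of 50.
x = Commit(250)
y = Commit(200, [x])
assert worst_skew(y, 300) == 50
```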
^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: Why is "git tag --contains" so slow?
  2010-07-05 12:27               ` Jeff King
                                   ` (4 preceding siblings ...)
  2010-07-05 12:39                 ` Why is "git tag --contains" so slow? Jeff King
@ 2010-07-05 14:10                 ` tytso
  2010-07-06 11:58                   ` Jeff King
  5 siblings, 1 reply; 46+ messages in thread
From: tytso @ 2010-07-05 14:10 UTC (permalink / raw)
  To: Jeff King; +Cc: Avery Pennarun, git

On Mon, Jul 05, 2010 at 08:27:23AM -0400, Jeff King wrote:
> As you probably guessed from the specificity of the number, I wrote a
> short program to actually traverse and find the worst skew. It takes
> about 5 seconds to run (unsurprisingly, since it is doing the same full
> traversal that we end up doing in the above numbers). So we could
> "autoskew" by setting up the configuration on clone, and then
> periodically updating it as part of "git gc".
> 
> That is perhaps over-engineering (and would add a few seconds to a
> clone), but I like that it would Just Work without the user doing
> anything.

As time progresses, the clock skew breakage should be less likely to
be hit by a typical developer, right?  That is, unless you are
specifically referencing one of the commits which were skewed, two
years from now, the chances of someone (who isn't doing code
archeology) getting hit by a problem should be small, right?  This
seems to be definitely the case with "git tag --contains"; would it be
true for git name-rev and the other places that depend on (roughly)
increasing commit times?

If so, I could imagine the automagic scheme choosing a default that
only finds the worst skew in the past N months.  This would speed
things up for users who are using repositories that have skews in the
distant past, at the cost of introducing potentially confusing edge
cases for people doing code archeology.

I'm not sure this is a good tradeoff, but given in practice how rarely
most developers go back in time more than say, 12-24 months, maybe
it's worth doing.  What do you think?

     	   	   				- Ted

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Why is "git tag --contains" so slow?
  2010-07-05 14:10                 ` tytso
@ 2010-07-06 11:58                   ` Jeff King
  2010-07-06 15:31                     ` Will Palmer
  0 siblings, 1 reply; 46+ messages in thread
From: Jeff King @ 2010-07-06 11:58 UTC (permalink / raw)
  To: tytso; +Cc: Avery Pennarun, git

On Mon, Jul 05, 2010 at 10:10:12AM -0400, tytso@mit.edu wrote:

> As time progresses, the clock skew breakage should be less likely to
> be hit by a typical developer, right?  That is, unless you are
> specifically referencing one of the commits which were skewed, two
> years from now, the chances of someone (who isn't doing code
> archeology) of getting hit by a problem should be small, right?  This

It's not about directly referencing skewed commits. It's about
traversing history that contains skewed commits. So if I have a history
like:

  A -- B -- C -- D

and "B" is skewed, then I will generally give up on finding "A" when
searching backwards from "C" or "D", or their descendants. So as time
moves forward, you will continue to have your old tags pointing to "C"
or "D", but also tags pointing to their descendants. Doing "git tag
--contains A" will continue to be inaccurate, since it will continue to
look for "A" from "C" and "D", but also from newer tags, all of which
involve traversing the skewed "B".

What I think is true is that people will be less likely to look at "A"
as time goes on, as code it introduced presumably becomes less relevant
(either bugs are shaken out, or it gets replaced, or whatever). And
obviously looking at "C" from "D", the skew in "B" will be irrelevant.

So I think typical developers become less likely to hit the issue as
time goes on, but software archaeologists will hit it forever.
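
[Editor's note: the effect described above can be sketched in Python; this is an illustrative model of a cutoff-based traversal, not git's actual C code, and the commit names and timestamps are made up.]

```python
# Minimal model of a history A -- B -- C -- D where B's committer
# timestamp is skewed far into the past.

class Commit:
    def __init__(self, name, date, parents=()):
        self.name, self.date, self.parents = name, date, list(parents)

def contains(candidate, want, cutoff=0):
    """Does `candidate`'s history contain `want`? Gives up below `cutoff`."""
    if candidate is want:
        return True
    if candidate.date < cutoff:
        return False  # the early exit that a skewed commit can fool
    return any(contains(p, want, cutoff) for p in candidate.parents)

DAY = 86400
a = Commit("A", 100 * DAY)
b = Commit("B", 1 * DAY, [a])      # skewed ~99 days into the past
c = Commit("C", 101 * DAY, [b])
d = Commit("D", 102 * DAY, [c])

# A full traversal finds A behind D:
assert contains(d, a)
# With a one-day slop below A's own date, the search gives up at B
# and never reaches A:
assert not contains(d, a, cutoff=a.date - DAY)
```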

> If so, I could imagine the automagic scheme choosing a default that
> only finds the worst skew in the past N months.  This would speed up
> things up for users who are using repositories that have skews in the
> distant past, at the cost of introducing potentially confusuing edge
> cases for people doing code archeology.

How do you decide, when looking for commits that have bogus timestamps,
which ones happened in the past N months? Certainly you can do some
statistical analysis to pick out anomalous ones. And you could perhaps
favor future skewing over past skewing, since that skew doesn't tend to
impact traversal cutoffs (and large past skewing seems to be more
common). But that is getting kind of complex.

> I'm not sure this is a good tradeoff, but given in practice how rarely
> most developers go back in time more than say, 12-24 months, maybe
> it's worth doing.  What do you think?

I'm not sure. I am tempted to just default it to 86400 and go no
further.  People who care about archaeology can turn off traversal
cutoffs if they like, and as the skewed history ages, people get less
likely to look at it. We could also pick half a year or some high number
as the default allowable skew. The performance increase is still quite
noticeable there, and it covers the only large skew we know about. I'd
be curious to see if other projects have skew, and how much.

-Peff

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Why is "git tag --contains" so slow?
  2010-07-06 11:58                   ` Jeff King
@ 2010-07-06 15:31                     ` Will Palmer
  2010-07-06 16:53                       ` tytso
                                         ` (2 more replies)
  0 siblings, 3 replies; 46+ messages in thread
From: Will Palmer @ 2010-07-06 15:31 UTC (permalink / raw)
  To: Jeff King; +Cc: tytso, Avery Pennarun, git

On Tue, 2010-07-06 at 07:58 -0400, Jeff King wrote:
> On Mon, Jul 05, 2010 at 10:10:12AM -0400, tytso@mit.edu wrote:
> > I'm not sure this is a good tradeoff, but given in practice how rarely
> > most developers go back in time more than say, 12-24 months, maybe
> > it's worth doing.  What do you think?
> 
> I'm not sure. I am tempted to just default it to 86400 and go no
> further.  People who care about archaeology can turn off traversal
> cutoffs if they like, and as the skewed history ages, people get less
> likely to look at it. We could also pick half a year or some high number
> as the default allowable. The performance increase is still quite
> noticeable there, and it covers the only large skew we know about. I'd
> be curious to see if other projects have skew, and how much.
> 
> -Peff

Is it wrong to expect that git perform poorly in the edge-cases (hugely
skewed timestamps), but that it perform /correctly/ in all cases?

Clearly, marking already-traversed histories was the right thing to do,
and if I read correctly, made a good improvement on its own. But you
seem to have crossed a line at some point between "optimization" and
"potentially giving the wrong answer because it's faster"

Or am I just misunderstanding here?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Why is "git tag --contains" so slow?
  2010-07-06 15:31                     ` Will Palmer
@ 2010-07-06 16:53                       ` tytso
  2010-07-08 11:28                         ` Jeff King
  2010-07-07 17:45                       ` Jeff King
  2010-07-07 17:50                       ` Jeff King
  2 siblings, 1 reply; 46+ messages in thread
From: tytso @ 2010-07-06 16:53 UTC (permalink / raw)
  To: Will Palmer; +Cc: Jeff King, Avery Pennarun, git

On Tue, Jul 06, 2010 at 04:31:43PM +0100, Will Palmer wrote:
> Is it wrong to expect that git perform poorly in the edge-cases (hugely
> skewed timestamps), but that it perform /correctly/ in all cases?
> 
> Clearly, marking already-traversed histories was the right thing to do,
> and if I read correctly, made a good improvement on its own. But you
> seem to have crossed a line at some point between "optimization" and
> "potentially giving the wrong answer because it's faster"

When "it's faster" is between 100-1000 times faster, I think we have
to look at things a bit more closely.  That's the difference between a
command being usable and not usable.

We would be much better off if our tools enforced the fact that
committer times were always increasing.  If from the beginning, we had
introduced checks so that "git commit" refused to create new commits
where the committer time was before its parent commit(s), and
git-receive-pack refused to accept packs that contained
non-monotonically increasing commits or commits that occurred in the
future according to its system clock, then these optimizations would
be completely valid.
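
[Editor's note: a check of the kind described above can be sketched in Python; the helper name and parameters are hypothetical, and git itself performs no such enforcement.]

```python
import time

def monotonic_ok(new_commit_time, parent_times, now=None, max_skew=0):
    """Reject commits dated before any parent (minus an allowed skew)
    or dated in the future relative to the local clock."""
    now = time.time() if now is None else now
    if new_commit_time > now + max_skew:
        return False  # commit claims to be from the future
    return all(new_commit_time >= p - max_skew for p in parent_times)

# A child committed after its parent is fine:
assert monotonic_ok(4600, [1000], now=10000)
# A child dated before its parent is rejected...
assert not monotonic_ok(500, [1000], now=10000)
# ...unless it falls within the allowed skew:
assert monotonic_ok(500, [1000], now=10000, max_skew=600)
```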

But we didn't, and we do have skew in some repositories.  So the
question is what to do going forward?  One solution might be to
enforce this moving forward, and to have varying levels of strictness
in enforcing this constraint.  So for completely new repositories,
this becomes a non-brainer.  For older repositories, Jeff's idea of
having a tunable parameter so that results are correct given a maximum
clock skew --- which can be determined --- will allow us to have
correctness _and_ performance.

						- Ted

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Why is "git tag --contains" so slow?
  2010-07-06 15:31                     ` Will Palmer
  2010-07-06 16:53                       ` tytso
@ 2010-07-07 17:45                       ` Jeff King
  2010-07-08 10:29                         ` Theodore Tso
  2010-07-07 17:50                       ` Jeff King
  2 siblings, 1 reply; 46+ messages in thread
From: Jeff King @ 2010-07-07 17:45 UTC (permalink / raw)
  To: Will Palmer; +Cc: tytso, Avery Pennarun, git

On Tue, Jul 06, 2010 at 04:31:43PM +0100, Will Palmer wrote:

> Is it wrong to expect that git perform poorly in the edge-cases (hugely
> skewed timestamps), but that it perform /correctly/ in all cases?

When you put it that way, yes, I think the ideal behavior is that we are
always correct, fast in the common case, and slow for edge cases.

But there are some fuzzy areas here:

  1. Is clock skew more than a day (or some number of seconds, whatever
     it may be) an edge case, or is it simply breakage in the repo? In
     an ideal world, it would simply be an edge case that causes
     slowdown. But git already implements traversal-cutoff based on
     date, at least in "git name-rev", so this is not a new issue. I
     know we also look at the date for some other traversals (in fact, I
     see lots of other cutoffs that don't even seem to have the one-day
     slop). But I haven't looked closely enough to see just what will
     break with a huge skew.

  2. Not implementing traversal-cutoff based on date doesn't slow just
     edge cases. It slows _all_ cases, by a factor of a hundred, on the
     off chance that you might have an edge case.

     By pre-calculating the per-repo max skew during clone and gc, you
     can turn it into your "always correct, slower for edge cases". But:

       a. It makes clone and gc a bit slower, even for non-edge cases.

       b. It is not _always_ correct, as you have a lag period between
          when skew is introduced into your repo (either by commit or by
          fetch) and when you gc it. But it's closer.

     And of course it's just complex, and I tend to shy away from
     complexity when I can. The question to me comes back to (1) above.
     Is massive clock skew a breakage that should produce a few
     incorrect results, or is it something we should always handle?

-Peff

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Why is "git tag --contains" so slow?
  2010-07-06 15:31                     ` Will Palmer
  2010-07-06 16:53                       ` tytso
  2010-07-07 17:45                       ` Jeff King
@ 2010-07-07 17:50                       ` Jeff King
  2 siblings, 0 replies; 46+ messages in thread
From: Jeff King @ 2010-07-07 17:50 UTC (permalink / raw)
  To: Will Palmer; +Cc: tytso, Avery Pennarun, git

On Tue, Jul 06, 2010 at 04:31:43PM +0100, Will Palmer wrote:

> Clearly, marking already-traversed histories was the right thing to do,
> and if I read correctly, made a good improvement on its own. But you
> seem to have crossed a line at some point between "optimization" and
> "potentially giving the wrong answer because it's faster"

As a side-note to what I said in my other email, the sinful thing here
is not this optimization (my patch 2/4). It's _defaulting_ the
optimization to on (my patch 3/4). With just 2/4, it's something that a
user can enable to get better speed if they feel confident there is no
large skew in their repo.

-Peff

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Why is "git tag --contains" so slow?
  2010-07-07 17:45                       ` Jeff King
@ 2010-07-08 10:29                         ` Theodore Tso
  2010-07-08 11:12                           ` Jakub Narebski
                                             ` (3 more replies)
  0 siblings, 4 replies; 46+ messages in thread
From: Theodore Tso @ 2010-07-08 10:29 UTC (permalink / raw)
  To: Jeff King; +Cc: Will Palmer, Avery Pennarun, git


On Jul 7, 2010, at 1:45 PM, Jeff King wrote:

>     And of course it's just complex, and I tend to shy away from
>     complexity when I can. The question to me comes back to (1) above.
>     Is massive clock skew a breakage that should produce a few
>     incorrect results, or is it something we should always handle?

Going back to the question that kicked off this thread, I wonder if there
is some way that caching could be used to speed up all cases,
or at least the edge cases, without imposing as much latency as tracking
the max skew?   i.e., something like gitk's gitk.cache file.  For bonus
points, it could be a cache file that is used by both gitk and git tag
--contains, git branch --contains, and git name-rev.

Does that sound like a reasonable idea?

--Ted

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Why is "git tag --contains" so slow?
  2010-07-08 10:29                         ` Theodore Tso
@ 2010-07-08 11:12                           ` Jakub Narebski
  2010-07-08 19:29                             ` Nicolas Pitre
  2010-07-08 11:31                           ` Jeff King
                                             ` (2 subsequent siblings)
  3 siblings, 1 reply; 46+ messages in thread
From: Jakub Narebski @ 2010-07-08 11:12 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Jeff King, Will Palmer, Avery Pennarun, git

Theodore Tso <tytso@MIT.EDU> writes:
> On Jul 7, 2010, at 1:45 PM, Jeff King wrote:
> 
> >     And of course it's just complex, and I tend to shy away from
> >     complexity when I can. The question to me comes back to (1) above.
> >     Is massive clock skew a breakage that should produce a few
> >     incorrect results, or is it something we should always handle?
> 
> Going back to the question that kicked off this thread, I wonder if there
> is some way that caching could be used to speed up all cases,
> or at least the edge cases, without imposing as much latency as tracking
> the max skew?   i.e., something like gitk's gitk.cache file.  For bonus
> points, it could be a cache file that is used by both gitk and git tag
> --contains, git branch --contains, and git name-rev.
> 
> Does that sound like a reasonable idea?

By the way, what had happened to the rev-cache project from GSoC 2009?

-- 
Jakub Narebski
Poland
ShadeHawk on #git

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Why is "git tag --contains" so slow?
  2010-07-06 16:53                       ` tytso
@ 2010-07-08 11:28                         ` Jeff King
  2010-07-08 13:21                           ` Will Palmer
  0 siblings, 1 reply; 46+ messages in thread
From: Jeff King @ 2010-07-08 11:28 UTC (permalink / raw)
  To: tytso; +Cc: Will Palmer, Avery Pennarun, git

On Tue, Jul 06, 2010 at 12:53:36PM -0400, tytso@mit.edu wrote:

> We would be much better off if our tools enforced the fact that
> committer times were always increasing.  If from the beginning, we had
> introduced checks so that "git commit" refused to create new commits
> where the committer time was before its parent commit(s), and
> git-receive-pack refused to accept packs that contained
> non-monotonically increasing commits or commits that occurred in the
> future according to its system clock, then these optimizations would
> be completely valid.
> 
> But we didn't, and we do have skew in some repositories.  So the
> question is what to do going forward?  One solution might be to
> enforce this moving forward, and to have varying levels of strictness
> in enforcing this constraint.  So for completely new repositories,

Whatever we do with the optimization, I do agree with your suggestion at
least for "git commit" to avoid making such commits.  Rejecting them
during fetches and pushes would be nice, too, but should probably just
be a warning at first, in case you have to pull from somebody with an
older git. You would also have to consider whether some small skew is
acceptable during a pull. Something like a 5-second skew is not a big
deal. If you had thousands of such skews in a row, yes, it would be bad,
but that is not likely to happen (and I can't think of some way to
maliciously create such a history that would not otherwise be
unusable).

> this becomes a no-brainer.  For older repositories, Jeff's idea of
> having a tunable parameter so that results are correct given a maximum
> clock skew --- which can be determined --- will allow us to have
> correctness _and_ performance.

Yeah. I think the real question is what the default for that parameter
should be: pessimistic but always correct, optimistic but possibly
incorrect in the face of skew, or auto-tuned per-repository.

-Peff

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Why is "git tag --contains" so slow?
  2010-07-08 10:29                         ` Theodore Tso
  2010-07-08 11:12                           ` Jakub Narebski
@ 2010-07-08 11:31                           ` Jeff King
  2010-07-08 14:35                           ` Johan Herland
  2010-07-08 19:06                           ` Nicolas Pitre
  3 siblings, 0 replies; 46+ messages in thread
From: Jeff King @ 2010-07-08 11:31 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Nick Edelen, Will Palmer, Avery Pennarun, git

On Thu, Jul 08, 2010 at 06:29:04AM -0400, Theodore Tso wrote:

> On Jul 7, 2010, at 1:45 PM, Jeff King wrote:
> 
> >     And of course it's just complex, and I tend to shy away from
> >     complexity when I can. The question to me comes back to (1) above.
> >     Is massive clock skew a breakage that should produce a few
> >     incorrect results, or is it something we should always handle?
> 
> Going back to the question that kicked off this thread, I wonder if there
> is some way that caching could be used to speed up all cases,
> or at least the edge cases, without imposing as much latency as tracking
> the max skew?   i.e., something like gitk's gitk.cache file.  For bonus
> points, it could be a cache file that is used by both gitk and git tag
> --contains, git branch --contains, and git name-rev.
> 
> Does that sound like a reasonable idea?

It sounds reasonable to me. It just needs somebody to design and
implement it. :)

I don't know anything about how gitk.cache works. Caching tag refs is
not that hard, because the refs tend not to change. Doing arbitrary "X
contains Y" caching is much harder, and I think you would have to build
it on Nick's rev-cache work. Maybe he (cc'd) has some ideas.

-Peff

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Why is "git tag --contains" so slow?
  2010-07-08 11:28                         ` Jeff King
@ 2010-07-08 13:21                           ` Will Palmer
  2010-07-08 13:54                             ` tytso
  0 siblings, 1 reply; 46+ messages in thread
From: Will Palmer @ 2010-07-08 13:21 UTC (permalink / raw)
  To: Jeff King; +Cc: tytso, Avery Pennarun, git

On Thu, 2010-07-08 at 07:28 -0400, Jeff King wrote:
> On Tue, Jul 06, 2010 at 12:53:36PM -0400, tytso@mit.edu wrote:
> Whatever we do with the optimization, I do agree with your suggestion at
> least for "git commit" to avoid making such commits.  Rejecting them
> during fetches and pushes would be nice, too, but should probably just
> be a warning at first, in case you have to pull from somebody with an
> older git...[snip]
> Yeah. I think the real question is what the default for that parameter
> should be: pessimistic but always correct, optimistic but possibly
> incorrect in the face of skew, or auto-tuned per-repository.
> -Peff

I think these two go hand-in-hand, and would resolve most of my issues
with it. Auto-tune, starting pessimistically, but then using something
more-optimized after something like gc has detected that it's okay. On
pull from an older repository (which I see as happening very frequently,
I add remotes much more often than I do a straight "clone"), a warning
and an auto-tune to something which would account for the newly-fetched
bad data.

My only other objection is more wishy-washy and/or lazy: currently a
"commit" doesn't need to know anything at all about what it references
in order to be considered a valid object, but saying "the time of commit
needs to be equal to or greater than the parent commit" means that a
tool.. and by "tool" I mean "wretched abuse of cat-file and sed", which
is sometimes just faster to throw-together than filter-branch ..needs to
be more aware of what it's doing. Yes, it's a horrible abuse, but I was
always under the impression that low-level abuse of the system is
something which git supports, by virtue of having such a simple model.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Why is "git tag --contains" so slow?
  2010-07-08 13:21                           ` Will Palmer
@ 2010-07-08 13:54                             ` tytso
  0 siblings, 0 replies; 46+ messages in thread
From: tytso @ 2010-07-08 13:54 UTC (permalink / raw)
  To: Will Palmer; +Cc: Jeff King, Avery Pennarun, git

On Thu, Jul 08, 2010 at 02:21:35PM +0100, Will Palmer wrote:
> I think these two go hand-in-hand, and would resolve most of my issues
> with it. Auto-tune, starting pessimistically, but then using something
> more-optimized after something like gc has detected that it's okay. On
> pull from an older repository (which I see as happening very frequently,
> I add remotes much more often than I do a straight "clone"), a warning
> and an auto-tune to something which would account for the newly-fetched
> bad data.

Well, keep in mind that we've been using a one-day "maximum clock skew
or you start getting incorrect results" for quite a while.  We just
haven't necessarily publicized this fact or added enforcement to "git
commit" and git-receive-pack.  And I've been introducing clock skews
of about 3 months (roughly equal to the time between Linux stable
releases) into the Linux kernel repository for some 1-2 years (because
I just didn't know about the clock skew dependency and there was no
enforcement of the same), and the number of times people have noticed,
or at least complained, has been relatively small.  So like it or not,
the default of "one day clock skew or you get incorrect results" is
the status quo.

Your idea of being utterly pessimistic until the next "git gc" is
interesting, since it doesn't slow down the "git clone" step.
Unfortunately, it also means that commands like "git name-rev" will be
several hundred times slower than they currently are (which probably
makes them unusable) until the first "git gc" --- this follows from the
fact that we've been depending on the clock skew being less than one day
for quite some time already; again, it _is_ the status quo.

And, "git gc" takes quite a bit longer than scanning for the maximum
skew after doing a "git clone".  Given that "git clone" generally
takes a good long time anyway, adding 1-3 seconds after a clone seems
to be the better way to go.  I think the biggest issue with it is the
complexity concern which Jeff has raised.
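For illustration only, the post-clone skew scan amounts to a single
linear pass over the commit graph.  This sketch is not git's code; the
in-memory graph representation here is an assumption:

```python
def max_clock_skew(commits):
    """Return the largest observed clock skew, in seconds.

    `commits` maps commit id -> (commit_date, [parent_ids]).  Skew is
    how far a parent's timestamp runs *ahead* of its child's; a
    skew-tolerant traversal could use this value as its cutoff slop
    instead of a hard-coded one day.
    """
    skew = 0
    for date, parents in commits.values():
        for parent in parents:
            parent_date = commits[parent][0]
            if parent_date > date:
                skew = max(skew, parent_date - date)
    return skew
```

Because it touches each commit-parent edge exactly once, the cost is
bounded by a single history walk, which is consistent with doing it
right after a clone.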

Given that, it would seem to me that (a) auto-tuning when cloning, (b)
defaulting to one-day and then auto-tuning on "git gc", or (c)
defaulting to one-day and not auto-tuning at all (the status quo) seem
to be the three most reasonable, listed in order of my preference.

As far as enforcement is concerned, we have choices of (a) add
warnings, which later become hard errors if the skew is greater than
some configurable window, (b) scan for skews during commits and
receive-packs, and update the auto-tune based on the skews found, or
(c) do nothing (the status quo).  (b) is the most user friendly, but
it adds more complexity, which is why I think (a) edges it out as a
better choice, but fortunately, I'm not the one who's paid the big
bucks to make these sorts of decisions.  :-)

> My only other objection is more wishy-washy and/or lazy: currently a
> "commit" doesn't need to know anything at all about what it references
> in order to be considered a valid object, but saying "the time of commit
> needs to be equal to or greater than the parent commit" means that a
> tool.. and by "tool" I mean "wretched abuse of cat-file and sed", which
> is sometimes just faster to throw-together than filter-branch ..needs to
> be more aware of what it's doing. Yes, it's a horrible abuse, but I was
> always under the impression that low-level abuse of the system is
> something which git supports, by virtue of having such a simple model.

I don't know that it's that much more difficult.  See the patch I
proposed for "guilt"; it was a ten line patch, and if you were willing
to be even more "quick and wretched", I could have made it a 3 or 4
line patch.
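The guilt patch itself isn't quoted in this thread, but the essence of
such a fix, sketched here in Python with a hypothetical helper name, is
just a clamp on the committer date:

```python
def monotonic_commit_date(now, parent_dates):
    """Pick a committer timestamp that never precedes any parent.

    Clamping the new commit's date to the newest parent keeps history
    non-decreasing even when the local clock is behind, which is all
    the "equal to or greater than the parent" rule asks of a tool
    that fabricates commits.
    """
    return max([now] + list(parent_dates))
```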

						- Ted

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Why is "git tag --contains" so slow?
  2010-07-08 10:29                         ` Theodore Tso
  2010-07-08 11:12                           ` Jakub Narebski
  2010-07-08 11:31                           ` Jeff King
@ 2010-07-08 14:35                           ` Johan Herland
  2010-07-08 19:06                           ` Nicolas Pitre
  3 siblings, 0 replies; 46+ messages in thread
From: Johan Herland @ 2010-07-08 14:35 UTC (permalink / raw)
  To: Theodore Tso; +Cc: git, Jeff King, Will Palmer, Avery Pennarun

[-- Attachment #1: Type: text/plain, Size: 1638 bytes --]

On Thursday 08 July 2010, Theodore Tso wrote:
> On Jul 7, 2010, at 1:45 PM, Jeff King wrote:
> >     And of course it's just complex, and I tend to shy away from
> >     complexity when I can. The question to me comes back to (1)
> > above. Is massive clock skew a breakage that should produce a few
> > incorrect results, or is it something we should always handle?
>
> Going back to the question that kicked off this thread, I wonder if
> there is some way that caching could be used to speed up all the
> cases, or at least the edge cases, without imposing as much latency as
> tracking the max skew?   i.e., something like gitk's gitk.cache
> file.  For bonus points, it could be a cache file that is used by
> gitk as well as by git tag --contains, git branch --contains, and git
> name-rev.
>
> Does that sound like a reasonable idea?

Here's a quick-and-dirty POC which builds a mapping from commits to 
their children and stores it using git notes [1], and then uses that to 
implement 'git tag --contains <commit>' by traversing _forwards_ from 
<commit> and printing all tags we encounter along the way [2].


[1]: The attached "build_childnotes.py" script builds this mapping. 
Invoke as follows:

	git log --all --format="%H,%P" |
	./build_childnotes.py |
	git fast-import

[2]: The attached "git_tag_contains.py" script traverses the notes 
printing out tags along the way. Invoke it as follows:

	git_tag_contains.py <commit>

The second script is way too slow, and really needs to use "git 
cat-file --batch" to not fork a process for every commit in history...
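For readers without the attachments, the first step can be sketched as
follows (an illustrative reconstruction, not the attached script):
invert the `%H,%P` records from `git log` into a parent-to-children
map.

```python
def build_child_map(log_lines):
    """Invert parent pointers into a parent -> children mapping.

    Each input line is one record of `git log --format="%H,%P"`: the
    commit id, a comma, then zero or more space-separated parent ids.
    Forward traversal from a commit then becomes a dictionary lookup
    instead of a full backwards history walk.
    """
    children = {}
    for line in log_lines:
        commit, _, parents = line.strip().partition(",")
        for parent in parents.split():
            children.setdefault(parent, []).append(commit)
    return children
```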


...Johan

-- 
Johan Herland, <johan@herland.net>
www.herland.net

[-- Attachment #2: build_childnotes.py --]
[-- Type: application/x-python, Size: 551 bytes --]

[-- Attachment #3: git_tag_contains.py --]
[-- Type: application/x-python, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Why is "git tag --contains" so slow?
  2010-07-08 10:29                         ` Theodore Tso
                                             ` (2 preceding siblings ...)
  2010-07-08 14:35                           ` Johan Herland
@ 2010-07-08 19:06                           ` Nicolas Pitre
  3 siblings, 0 replies; 46+ messages in thread
From: Nicolas Pitre @ 2010-07-08 19:06 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Jeff King, Will Palmer, Avery Pennarun, git

On Thu, 8 Jul 2010, Theodore Tso wrote:

> 
> On Jul 7, 2010, at 1:45 PM, Jeff King wrote:
> 
> >     And of course it's just complex, and I tend to shy away from
> >     complexity when I can. The question to me comes back to (1) above.
> >     Is massive clock skew a breakage that should produce a few
> >     incorrect results, or is it something we should always handle?
> 
> Going back to the question that kicked off this thread, I wonder if there
> is some way that caching could be used to speed up all the cases,
> or at least the edge cases, without imposing as much latency as tracking
> the max skew?   i.e., something like gitk's gitk.cache file.  For bonus
> points, it could be a cache file that is used by gitk as well as by git
> tag --contains, git branch --contains, and git name-rev.
> 
> Does that sound like a reasonable idea?

I don't think any caching would be as good as fixing the fundamental 
issue.

Git is fast, sure.  But it could be way faster yet in its graph 
traversal.  And my pack v4 format is meant to overcome all those 
obstacles that Git currently has to work through in order to walk its 
commit graph.  Once one realize that most of the commit object headers 
are SHA1 reference which need no be compressed with zlib as it is done 
now, and that the author and committer info can be factored out in a 
dictionary table, and that even those SHA1 references can be substituted 
with an index value into the pack index file (a bit like the OFS variant 
of the delta object), meaning that even the object lookup could be 
bypassed, then it would be possible to make graph traversal a magnitude 
cheaper in terms of computing cycles and memory touched.
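To make that argument concrete, here is a toy model (not the real pack
v4 encoding, whose on-disk format is still being designed) of replacing
SHA1 parent references with positions in a sorted index:

```python
def encode_parents_as_indices(commits):
    """Replace SHA1 parent references with integer positions.

    `commits` is a list of (sha1, [parent_sha1s]) records, e.g. in
    pack order.  After encoding, following a parent edge is a plain
    list index -- no binary search in the .idx file and no zlib
    inflate of the commit header.
    """
    position = {sha1: i for i, (sha1, _) in enumerate(commits)}
    return [[position[p] for p in parents] for _, parents in commits]
```

Traversal over the encoded form then touches only small integer arrays,
which is the kind of saving being described.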

The pack format v4 has been brewing in my head for... well... years now.  
And that is good because I've improved on the original v4 design even 
more since then. And I even found some time to write more code lately.  
I have the new object encoding code almost working for trees and 
commits.  My Git hacking time is still limited so this is progressing 
slowly though.

Just to say that I don't think any kind of caching will be necessary in 
the end, as it is possible to encode object data in a pack in a way that 
ought to be about as fast to access as a separate cache would be.  So if 
someone is pondering working on a cache layer, then I'd have an 
alternate suggestion or two for that person.  ;-)


Nicolas

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Why is "git tag --contains" so slow?
  2010-07-08 11:12                           ` Jakub Narebski
@ 2010-07-08 19:29                             ` Nicolas Pitre
  2010-07-08 19:39                               ` Avery Pennarun
  0 siblings, 1 reply; 46+ messages in thread
From: Nicolas Pitre @ 2010-07-08 19:29 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Theodore Tso, Jeff King, Will Palmer, Avery Pennarun, git

On Thu, 8 Jul 2010, Jakub Narebski wrote:

> Theodore Tso <tytso@MIT.EDU> writes:
> > On Jul 7, 2010, at 1:45 PM, Jeff King wrote:
> > 
> > >     And of course it's just complex, and I tend to shy away from
> > >     complexity when I can. The question to me comes back to (1) above.
> > >     Is massive clock skew a breakage that should produce a few
> > >     incorrect results, or is it something we should always handle?
> > 
> > Going back to the question that kicked off this thread, I wonder if there
> > is some way that caching could be used to speed up all the cases,
> > or at least the edge cases, without imposing as much latency as tracking
> > the max skew?   i.e., something like gitk's gitk.cache file.  For bonus
> > points, it could be a cache file that is used by gitk as well as by git
> > tag --contains, git branch --contains, and git name-rev.
> > 
> > Does that sound like a reasonable idea?
> 
> By the way, what had happened to the rev-cache project from GSoC 2009?

I think the person who worked on it did so in good faith, and that the 
result is well done.

But I personally cannot convince myself this is fundamentally the right 
solution to the issue.  Maintaining another data structure to work 
around deficiencies in the primary data structure doesn't sound right 
to me.

I might be looking at this from my own perspective as one of the few 
people who hacked extensively on the Git pack format from the very 
beginning.  But I do see a way for the pack format to encode commit and 
tree objects so that walking them would be a simple lookup in the pack 
index file where both the SHA1 and offset in the pack for each parent 
can be immediately retrieved.  Same for tree references.  No deflating 
required, no binary search, just simple dereferences.  And the pack size 
would even shrink as a side effect.


Nicolas

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Why is "git tag --contains" so slow?
  2010-07-08 19:29                             ` Nicolas Pitre
@ 2010-07-08 19:39                               ` Avery Pennarun
  2010-07-08 20:13                                 ` Nicolas Pitre
  0 siblings, 1 reply; 46+ messages in thread
From: Avery Pennarun @ 2010-07-08 19:39 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Jakub Narebski, Theodore Tso, Jeff King, Will Palmer, git

On Thu, Jul 8, 2010 at 3:29 PM, Nicolas Pitre <nico@fluxnic.net> wrote:
> On Thu, 8 Jul 2010, Jakub Narebski wrote:
>> By the way, what had happened to the rev-cache project from GSoC 2009?
>
> I think the person who worked on it did so in good faith, and that the
> result is well done.
>
> But I personally cannot convince myself this is fundamentally the right
> solution to the issue.  Maintaining another data structure to work
> around deficiencies in the primary data structure doesn't sound right
> to me.
>
> I might be looking at this from my own perspective as one of the few
> people who hacked extensively on the Git pack format from the very
> beginning.  But I do see a way for the pack format to encode commit and
> tree objects so that walking them would be a simple lookup in the pack
> index file where both the SHA1 and offset in the pack for each parent
> can be immediately retrieved.  Same for tree references.  No deflating
> required, no binary search, just simple dereferences.  And the pack size
> would even shrink as a side effect.

One trick that bup uses is an additional file that sits alongside the
pack and acts as an index.  In bup's case, this is to work around
deficiencies in the .idx file format when using ridiculously huge
numbers of objects (hundreds of millions) across a large number of
packfiles.  But the same concept could apply here: instead of doing
something like rev-cache, you could just construct the "efficient"
part of the packv4 format (which I gather is entirely related to
commit messages), and store it alongside each pack.

This would allow people to incrementally modify git to use the new,
efficient commit object storage, without breaking backward
compatibility with earlier versions of git.  (Just as bup can index
huge numbers of packed objects but still stores them in the plain git
pack format.)

Just a thought.  Thinking of it this way might make it easier to get
over the inertia of an entirely new packfile format.

Have fun,

Avery

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Why is "git tag --contains" so slow?
  2010-07-08 19:39                               ` Avery Pennarun
@ 2010-07-08 20:13                                 ` Nicolas Pitre
  2010-07-08 21:20                                   ` Jakub Narebski
  0 siblings, 1 reply; 46+ messages in thread
From: Nicolas Pitre @ 2010-07-08 20:13 UTC (permalink / raw)
  To: Avery Pennarun; +Cc: Jakub Narebski, Theodore Tso, Jeff King, Will Palmer, git

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2425 bytes --]

On Thu, 8 Jul 2010, Avery Pennarun wrote:

> On Thu, Jul 8, 2010 at 3:29 PM, Nicolas Pitre <nico@fluxnic.net> wrote:
> > I might be looking at this from my own perspective as one of the few
> > people who hacked extensively on the Git pack format from the very
> > beginning.  But I do see a way for the pack format to encode commit and
> > tree objects so that walking them would be a simple lookup in the pack
> > index file where both the SHA1 and offset in the pack for each parent
> > can be immediately retrieved.  Same for tree references.  No deflating
> > required, no binary search, just simple dereferences.  And the pack size
> > would even shrink as a side effect.
> 
> One trick that bup uses is an additional file that sits alongside the
> pack and acts as an index.  In bup's case, this is to work around
> deficiencies in the .idx file format when using ridiculously huge
> numbers of objects (hundreds of millions) across a large number of
> packfiles.  But the same concept could apply here: instead of doing
> something like rev-cache, you could just construct the "efficient"
> part of the packv4 format (which I gather is entirely related to
> commit messages), and store it alongside each pack.

No.  I want the essential information in an efficient encoding _inside_ 
the pack, actually replacing the existing encoding.  One of the goals is 
also to reduce repository size, not to grow it.

> This would allow people to incrementally modify git to use the new,
> efficient commit object storage, without breaking backward
> compatibility with earlier versions of git.  (Just as bup can index
> huge numbers of packed objects but still stores them in the plain git
> pack format.)

Initially, what I'm aiming for is for pack-objects to produce the new 
format, for index-pack to grok it, and for sha1_file:unpack_entry() to 
simply regenerate the canonical object format whenever a pack v4 object 
is encountered.  Also pack-objects would be able to revert the object 
encoding to the current format on the fly when it is serving a fetch 
request to a client which is not pack v4 aware, just like we do now with 
the ofs-delta capability.

Once that stage is reached, I'll submit the lot and hope that other 
people will help incrementally converting part of Git to benefit from 
native access to the pack v4 data.  The tree object walk code would be 
the first obvious candidate.  And so on.


Nicolas

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Why is "git tag --contains" so slow?
  2010-07-08 20:13                                 ` Nicolas Pitre
@ 2010-07-08 21:20                                   ` Jakub Narebski
  2010-07-08 21:30                                     ` Sverre Rabbelier
  2010-07-08 23:15                                     ` Nicolas Pitre
  0 siblings, 2 replies; 46+ messages in thread
From: Jakub Narebski @ 2010-07-08 21:20 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Avery Pennarun, Theodore Tso, Jeff King, Will Palmer, git

On Thursday, 8 July 2010 at 22:13, Nicolas Pitre wrote:
> On Thu, 8 Jul 2010, Avery Pennarun wrote:
> > On Thu, Jul 8, 2010 at 3:29 PM, Nicolas Pitre <nico@fluxnic.net> wrote:

> > > I might be looking at this from my own perspective as one of the few
> > > people who hacked extensively on the Git pack format from the very
> > > beginning.  But I do see a way for the pack format to encode commit and
> > > tree objects so that walking them would be a simple lookup in the pack
> > > index file where both the SHA1 and offset in the pack for each parent
> > > can be immediately retrieved.  Same for tree references.  No deflating
> > > required, no binary search, just simple dereferences.  And the pack size
> > > would even shrink as a side effect.
> > 
> > One trick that bup uses is an additional file that sits alongside the
> > pack and acts as an index.  In bup's case, this is to work around
> > deficiencies in the .idx file format when using ridiculously huge
> > numbers of objects (hundreds of millions) across a large number of
> > packfiles.  But the same concept could apply here: instead of doing
> > something like rev-cache, you could just construct the "efficient"
> > part of the packv4 format (which I gather is entirely related to
> > commit messages), and store it alongside each pack.
> 
> No.  I want the essential information in an efficient encoding _inside_ 
> the pack, actually replacing the existing encoding.  One of the goals is 
> also to reduce repository size, not to grow it.

That's a good idea.
 
> > This would allow people to incrementally modify git to use the new,
> > efficient commit object storage, without breaking backward
> > compatibility with earlier versions of git.  (Just as bup can index
> > huge numbers of packed objects but still stores them in the plain git
> > pack format.)
> 
> Initially, what I'm aiming for is for pack-objects to produce the new 
> format, for index-pack to grok it, and for sha1_file:unpack_entry() to 
> simply regenerate the canonical object format whenever a pack v4 object 
> is encountered.  Also pack-objects would be able to revert the object 
> encoding to the current format on the fly when it is serving a fetch 
> request to a client which is not pack v4 aware, just like we do now with 
> the ofs-delta capability.
> 
> Once that stage is reached, I'll submit the lot and hope that other 
> people will help incrementally converting part of Git to benefit from 
> native access to the pack v4 data.  The tree object walk code would be 
> the first obvious candidate.  And so on.

If I remember correctly, with pack v4 some operations like getting the
size of a tree object need re-encoding to the current format, so they
are slower than they should be (and perhaps a bit slower than the
current implementation).  But that should be rare, I think (well,
unless one streams to 'git cat-file --batch / --batch-check').

Would pack v4 need index v4?

By the way, rev-cache project was started mainly to make "counting
objects" part of clone / fetch faster.  Would pack v4 offer the same
without rev-cache?

-- 
Jakub Narebski
Poland

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Why is "git tag --contains" so slow?
  2010-07-08 21:20                                   ` Jakub Narebski
@ 2010-07-08 21:30                                     ` Sverre Rabbelier
  2010-07-08 23:10                                       ` Nicolas Pitre
  2010-07-08 23:15                                     ` Nicolas Pitre
  1 sibling, 1 reply; 46+ messages in thread
From: Sverre Rabbelier @ 2010-07-08 21:30 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Jakub Narebski, Avery Pennarun, Theodore Tso, Jeff King,
	Will Palmer, git

Heya,

2010/7/8 Jakub Narebski <jnareb@gmail.com>:
> By the way, rev-cache project was started mainly to make "counting
> objects" part of clone / fetch faster.  Would pack v4 offer the same
> without rev-cache?

Speaking of pack v4 features...

I remember from GitTogether08 that one hope was that the metadata
would be in a different file, such that it could be retrieved
separately (allowing shallow clones to create and push new commits).
Is this not something that is part of the design goals for packv4?

-- 
Cheers,

Sverre Rabbelier

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Why is "git tag --contains" so slow?
  2010-07-08 21:30                                     ` Sverre Rabbelier
@ 2010-07-08 23:10                                       ` Nicolas Pitre
  0 siblings, 0 replies; 46+ messages in thread
From: Nicolas Pitre @ 2010-07-08 23:10 UTC (permalink / raw)
  To: Sverre Rabbelier
  Cc: Jakub Narebski, Avery Pennarun, Theodore Tso, Jeff King,
	Will Palmer, git

[-- Attachment #1: Type: TEXT/PLAIN, Size: 674 bytes --]

On Thu, 8 Jul 2010, Sverre Rabbelier wrote:

> Heya,
> 
> 2010/7/8 Jakub Narebski <jnareb@gmail.com>:
> > By the way, rev-cache project was started mainly to make "counting
> > objects" part of clone / fetch faster.  Would pack v4 offer the same
> > without rev-cache?
> 
> Speaking of pack v4 features...
> 
> I remember from GitTogether08 that one hope was that the metadata
> would be in a different file, such that it could be retrieved
> separately (allowing shallow clones to create and push new commits).
> Is this not something that is part of the design goals for packv4?

No.  And I don't see why a pack format change would need to be involved 
either.


Nicolas

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: Why is "git tag --contains" so slow?
  2010-07-08 21:20                                   ` Jakub Narebski
  2010-07-08 21:30                                     ` Sverre Rabbelier
@ 2010-07-08 23:15                                     ` Nicolas Pitre
  1 sibling, 0 replies; 46+ messages in thread
From: Nicolas Pitre @ 2010-07-08 23:15 UTC (permalink / raw)
  To: Jakub Narebski; +Cc: Avery Pennarun, Theodore Tso, Jeff King, Will Palmer, git

On Thu, 8 Jul 2010, Jakub Narebski wrote:

> If I remember correctly with pack v4 some operations like getting size
> of tree object needs encoding to current format, so they are slower than
> they should be (and perhaps a bit slower than current implementation).
> But that should be I think rare (well, unless one streams to 
> 'git cat-file --batch / --batch-check').

I didn't write the code to recreate the canonical object format yet, 
but I suspect that the pack v4 object encoding will have to carry the 
size of the canonical version.

> Would pack v4 need index v4?

So far, no.

> By the way, rev-cache project was started mainly to make "counting
> objects" part of clone / fetch faster.  Would pack v4 offer the same
> without rev-cache?

That's indeed the main goal.


Nicolas

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC/PATCH 1/4] tag: speed up --contains calculation
  2010-07-05 12:33                 ` [RFC/PATCH 1/4] tag: speed up --contains calculation Jeff King
@ 2010-10-13 22:07                   ` Jonathan Nieder
  2010-10-13 22:56                   ` Clemens Buchacher
  2011-02-23 15:51                   ` Ævar Arnfjörð Bjarmason
  2 siblings, 0 replies; 46+ messages in thread
From: Jonathan Nieder @ 2010-10-13 22:07 UTC (permalink / raw)
  To: Jeff King; +Cc: tytso, Avery Pennarun, git, Junio C Hamano

Hi,

Jeff King wrote:

> When we want to know if commit A contains commit B (or any
> one of a set of commits, B through Z),

Let's revive this old thread.  Sorry for the long silence (the last
time this topic was visited was a couple months ago [1] afaict).

> Instead, let's leverage the fact that we are going to use
> the same --contains list for each tag, and mark areas of the
> commit graph is definitely containing those commits, or
> definitely not containing those commits. Later tags can then
> stop traversing as soon as they see a previously calculated
> answer.

Makes sense.  And the resulting timings are exciting.

> ---
> Note that this is a depth first search, whereas we generally traverse in
> a more breadth-first way. So it can actually make things slightly slower
> than the current merge-base code, if:
> 
>   1. You don't have any merge base calculations that involve going very
>      far back in history.
> 
>   2. We depth-first down the wrong side of a merge.
> 
> However, in the usual cases, I think it will perform much better.

Nit: Wouldn't this (or a summary of it) belong in the log message, too,
to avoid headscratching when a person tries to bisect a performance
regression down the line?

>   1. This can probably be a one-liner change to use in "git branch
>      --contains", as well. I didn't measure it, as that already tends to
>      be reasonably fast.

These days "branch -r --contains" has been _sloow_ for me.  So once
this is tuned, that wouldn't be a bad idea, methinks.

>   2. It uses commit marks, which means it doesn't behave well with other
>      traversals. I have no idea if I should be using alternate marks or
>      not.

This is what Junio was referring to with "the use of TMP_MARK smelled
somewhat bad to me", right?

>   3. We never clear the marks, or check them against the "want" list.

This too, I suppose.

> So
>      we just assume that repeated calls to contains will have the same
>      "want" list.

As long as this is advertised, I think it is fine.  If someone needs
to use --contains for multiple purposes in a builtin, that can be
dealt with then (maybe with a separate function to reset the marks).

Maybe it would make sense to keep this static in builtin/tag.c for
now, and defer cleaning up the interface to a separate patch.

> --- a/commit.c
> +++ b/commit.c

Now the meat of the patch.  git has a few algorithms for checking if
one commit is an ancestor of another.

is_descendant_of::

	1. find merge bases.  This is breadth-first search in
	   date order, but it walks through all ancestors as far
	   as I can tell.
	2. make them independent (seems pointless...).
	3. check if ancestor is one of them.

get_revision after setting UNINTERESTING (i.e., rev-list -1 ^tag cmit)::

	1. find interesting and uninteresting commits.  This is
	   breadth-first search in date order.
	2. when the first interesting commit is found, emit it
	   and stop.

Neither is optimized to check a batch of commits to find which contain
the commit of interest.

join_revs of show-branch::

	1. color the tag and cmit.
	2. let colors propagate back to all their ancestors, stopping
	   propagation along a given path when a commit has all colors.
	   This is breadth-first traversal in date order.
	3. for each commit with all colors, propagate that knowledge
	   to all parents we've encountered before.
	4. check what colors cmit has.

limit_to_ancestry::

	1. compute rev-list ^cmit tag in the usual way.
	2. reverse the revision list.
	3. color cmit with TMP_MARK and let that color permeate
	   up the backwards revision list.
	4. repeat until no progress is made.
	5. check what color tag has.

None of these seem to tolerate timestamp skew. :(

Am I understanding correctly?

If so, just making is_descendant_of less stupid (e.g., not traversing
all history) would presumably help a lot.  Another approach would be
to try the show-branch tactic repeatedly with small enough batches of
tags to fit in the flag bitfield.  Yet another approach would be to
take the limit_to_ancestry one, but if the revision list is not
topologically sorted, it could be slow.

Hopefully your approach will be even better...

> @@ -845,3 +845,45 @@ int commit_tree(const char *msg, unsigned char *tree,
[...]
> +static int contains_recurse(struct commit *candidate,
> +			    const struct commit_list *want)
> +{
> +	struct commit_list *p;
> +
> +	/* was it previously marked as containing a want commit? */
> +	if (candidate->object.flags & TMP_MARK)
> +		return 1;
> +	/* or marked as not possibly containing a want commit? */
> +	if (candidate->object.flags & UNINTERESTING)
> +		return 0;
> +	/* or are we it? */
> +	if (in_commit_list(want, candidate))
> +		return 1;
> +
> +	if (parse_commit(candidate) < 0)
> +		return 0;
> +
> +	/* Otherwise recurse and mark ourselves for future traversals. */
> +	for (p = candidate->parents; p; p = p->next) {
> +		if (contains_recurse(p->item, want)) {
> +			candidate->object.flags |= TMP_MARK;
> +			return 1;
> +		}
> +	}
> +	candidate->object.flags |= UNINTERESTING;

Nice.  We cache the "is wanted" result using the TMP_MARK and
UNINTERESTING flags.  If I understood correctly, it cannot be slower
than is_descendant_of, since the latter is a little crazy.

Possible unprincipled approach:

 1. Mark the wanted commits with TMP_MARK.
 2. Perform a breadth-first, date-order search starting at
    candidate to find a commit with TMP_MARK.
    a. If not found, mark the commits involved as UNINTERESTING
    b. If found, let the TMP_MARK permeate up using the
       limit_to_ancestry algorithm.
 3. Repeat with each other candidate.
 4. (If needed) walk starting at each candidate to clear the
    temporary flags.

I am not sure if that would be any faster, though.  What do you think?
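To make the caching effect concrete, here is a self-contained toy
version of your contains_recurse (invented structs; the step counter
exists only to show the cache at work):

```c
#include <assert.h>

#define TMP_MARK	(1u << 0)	/* known to contain a wanted commit */
#define UNINTERESTING	(1u << 1)	/* known not to */

struct ccommit { struct ccommit *parent[2]; unsigned flags; };

/* Same shape as the quoted contains_recurse(), on toy structs; "steps"
 * counts how many commits each query examines. */
static int contains_cached(struct ccommit *c, struct ccommit *want, int *steps)
{
	int i;

	if (!c)
		return 0;
	(*steps)++;
	if (c->flags & TMP_MARK)
		return 1;
	if (c->flags & UNINTERESTING)
		return 0;
	if (c == want)
		return 1;
	for (i = 0; i < 2; i++)
		if (contains_cached(c->parent[i], want, steps)) {
			c->flags |= TMP_MARK;
			return 1;
		}
	c->flags |= UNINTERESTING;
	return 0;
}

/* On an 8-commit chain, the first tag's query walks all 8 commits; the
 * second tag hits the TMP_MARK cache after a single step. */
static int contains_demo(void)
{
	static struct ccommit c[8];
	int i, first = 0, second = 0;

	for (i = 1; i < 8; i++)
		c[i].parent[0] = &c[i - 1];
	contains_cached(&c[7], &c[0], &first);	/* 8 steps */
	contains_cached(&c[6], &c[0], &second);	/* 1 step: cached */
	return first * 100 + second;
}
```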

[...]
> +int contains(struct commit *, const struct commit_list *);

Thanks for a pleasant read.

[1] http://thread.gmane.org/gmane.comp.version-control.git/152607/focus=152701


* Re: [RFC/PATCH 1/4] tag: speed up --contains calculation
  2010-07-05 12:33                 ` [RFC/PATCH 1/4] tag: speed up --contains calculation Jeff King
  2010-10-13 22:07                   ` Jonathan Nieder
@ 2010-10-13 22:56                   ` Clemens Buchacher
  2011-02-23 15:51                   ` Ævar Arnfjörð Bjarmason
  2 siblings, 0 replies; 46+ messages in thread
From: Clemens Buchacher @ 2010-10-13 22:56 UTC (permalink / raw)
  To: Jeff King; +Cc: tytso, Avery Pennarun, git


On Mon, Jul 05, 2010 at 08:33:36AM -0400, Jeff King wrote:
> 
> This sped up "git tag --contains HEAD~200" in the linux-2.6
> repository from:
> 
>   real    0m15.417s
>   user    0m15.197s
>   sys     0m0.220s

FWIW, the staging tree adds a new tag almost every day, and my linux
repo currently has 677 tags. Without the patch this same command would
probably take hours; I have not yet had the patience to let it finish.

With your patch I get this.

$ time git tag --contains HEAD~200
v2.6.36-rc3
v2.6.36-rc4
v2.6.36-rc5

real	0m8.047s
user	0m7.755s
sys	0m0.268s

Clemens



* Re: [RFC/PATCH 2/4] limit "contains" traversals based on commit timestamp
  2010-07-05 12:34                 ` [RFC/PATCH 2/4] limit "contains" traversals based on commit timestamp Jeff King
@ 2010-10-13 23:21                   ` Jonathan Nieder
  0 siblings, 0 replies; 46+ messages in thread
From: Jonathan Nieder @ 2010-10-13 23:21 UTC (permalink / raw)
  To: Jeff King; +Cc: tytso, Avery Pennarun, git, Junio C Hamano

Jeff King wrote:

> Name-rev already implements a similar optimization, using a
> "slop" of one day to allow for a certain amount of clock
> skew in commit timestamps. This patch introduces a
> "core.clockskew" variable, which allows specifying the
> allowable amount of clock skew in seconds.  For safety, it
> defaults to "none", causing a full traversal (i.e., no
> change in behavior from previous versions).

Tests?

Actually just a short example script to try would be helpful,
if anyone has one handy (yes, I am terribly lazy).  Such a script
would be useful for figuring out which commands ought to be
updated to respect core_clock_skew.  rev-list is one.

> --- a/commit.c
> +++ b/commit.c
[...]
> @@ -872,9 +874,13 @@ static int contains_recurse(struct commit *candidate,
>  	if (parse_commit(candidate) < 0)
>  		return 0;
>  
> +	/* stop searching if we go too far back in time */
> +	if (candidate->date < cutoff)
> +		return 0;
> +

Nice idea.

> @@ -885,5 +891,20 @@ static int contains_recurse(struct commit *candidate,
>  
>  int contains(struct commit *candidate, const struct commit_list *want)
>  {
> -	return contains_recurse(candidate, want);
> +	unsigned long cutoff = 0;
> +
> +	if (core_clock_skew >= 0) {
> +		const struct commit_list *c;
> +		unsigned long min_date = ULONG_MAX;
> +		for (c = want; c; c = c->next) {
> +			if (parse_commit(c->item) < 0)
> +				continue;

Why ignore these errors?  Will they be noticed later?

The rest of the patch looks good to me.  I am not thrilled with
making the user figure out an acceptable "[core] clockskew" value
(and am not sure it makes much sense as a tunable setting), but
it is better than the status quo, so...
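As a side note, the cutoff arithmetic from the hunk above can be checked
in isolation with a tiny helper (a sketch; the function name and
signature are mine, not the patch's):

```c
#include <assert.h>
#include <limits.h>

/* Term-for-term the same arithmetic as the quoted hunk: with a
 * non-negative core.clockskew, the cutoff is the smallest wanted commit
 * date minus the allowed skew; a negative setting means "none" and
 * disables the cutoff (full traversal). */
static unsigned long cutoff_for(const unsigned long *want_dates, int n,
				long clock_skew)
{
	unsigned long min_date = ULONG_MAX;
	int i;

	if (clock_skew < 0)
		return 0;	/* no cutoff */
	for (i = 0; i < n; i++)
		if (want_dates[i] < min_date)
			min_date = want_dates[i];
	/* clamp at zero rather than wrapping around */
	return min_date > (unsigned long)clock_skew
		? min_date - clock_skew : 0;
}
```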


* Re: Why is "git tag --contains" so slow?
  2010-07-05 12:39                 ` Why is "git tag --contains" so slow? Jeff King
@ 2010-10-14 18:59                   ` Jonathan Nieder
  2010-10-16 14:32                     ` Clemens Buchacher
  0 siblings, 1 reply; 46+ messages in thread
From: Jonathan Nieder @ 2010-10-14 18:59 UTC (permalink / raw)
  To: Jeff King; +Cc: tytso, Avery Pennarun, git

Jeff King wrote:

> This patch doesn't implement auto-detection of skew, but is the program
> I used to calculate it, and would provide the basis for such
> auto-detection. It would be interesting to see average skew numbers for
> popular repositories. You can run it as "git skew --all".

Fun.  The story this tells me is that most repositories would benefit
more from special-casing a few commits than from a "[core] clockskew"
setting.

Caveat: the repos mentioned below are not canonical at all.  They are
just some repos I happened to have around, some tracking multiple upstreams.

project # skewed        maximum skew    notes
------- --------        ------------    -----
gtk+    13              13 hrs          worst example seems to be tz related

systemd 3               1 hr            all three come from a single 65min
                                        jump.  Same committer before+after.

dpkg    2               76 days         automatic import from manually
                                        written changelog

glibc   2               31 months       (tag) commits imported to wrong
                                        branches by "git cvsimport", I think

git     4               7 minutes       all four come from a single merge
                                        commit, parents have same committer
                                        as the merge.  probably clock drift
                                        corrected with something like ntp.

guilt   22              96 minutes      from the known (fixed) bug, I assume

linux-2.6 1226          99 days         probably the old guilt bug (e.g., in
                                        ext4, blackfin trees)

sed     0


* Re: Why is "git tag --contains" so slow?
  2010-10-14 18:59                   ` Jonathan Nieder
@ 2010-10-16 14:32                     ` Clemens Buchacher
  2010-10-27 17:11                       ` Jeff King
  0 siblings, 1 reply; 46+ messages in thread
From: Clemens Buchacher @ 2010-10-16 14:32 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Jeff King, tytso, Avery Pennarun, git


On Thu, Oct 14, 2010 at 01:59:45PM -0500, Jonathan Nieder wrote:
> 
> project # skewed        maximum skew    notes
> ------- --------        ------------    -----
> gtk+    13              13 hrs          worst example seems to be tz related

It really is kind of fun.

wine                    1       14 days
mesa                    4930    150 days

xf86-video-ati          13      ~2.5 hrs
xf86-video-intel        26      8 hrs
xf86-video-nouveau      12      10 hrs

fluxbox                 0       0
metacity                0       0
openbox                 8       4 hrs

giggle                  2       21 min
glibc                   2       931 days



* Re: Why is "git tag --contains" so slow?
  2010-10-16 14:32                     ` Clemens Buchacher
@ 2010-10-27 17:11                       ` Jeff King
  2010-10-28  8:07                         ` Clemens Buchacher
  0 siblings, 1 reply; 46+ messages in thread
From: Jeff King @ 2010-10-27 17:11 UTC (permalink / raw)
  To: Clemens Buchacher; +Cc: Jonathan Nieder, tytso, Avery Pennarun, git

On Sat, Oct 16, 2010 at 04:32:26PM +0200, Clemens Buchacher wrote:

> On Thu, Oct 14, 2010 at 01:59:45PM -0500, Jonathan Nieder wrote:
> > 
> > project # skewed        maximum skew    notes
> > ------- --------        ------------    -----
> > gtk+    13              13 hrs          worst example seems to be tz related
> 
> It really is kind of fun.
> 
> wine                    1       14 days

Thanks both of you for the extra data points. If you don't mind, would
you consider running my updated git-skew below on your test cases (or
tweaking your skew detectors, since you both seem to be getting a '#
skewed' column that mine didn't output)? Specifically, I am interested
in long runs of skewed commits, since one potential solution would be to
just accept a fixed number of slop commits (rather than accepting
commits within a certain slop time).

For the kernel the longest run looks to be about 80 commits. Limiting
by run length should therefore perform much better than limiting by the
worst skew time (which was 100 days, a window that covers far more than
80 commits).

Patch below. NB: thinking about this more, I believe my program's
results are slightly inaccurate. There are corner cases of skew we can
miss because commits are marked SEEN (i.e., skew you would only see
when reaching a commit by another path). However, the problem becomes
intractably large if you don't mark commits (you end up traversing
every possible path through the graph).
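For the curious, the SEEN corner case can be reproduced on a four-commit
diamond (toy structs mirroring the recursion in the patch below; nothing
here is git's real code):

```c
#include <assert.h>

struct scommit { struct scommit *parent[2]; unsigned long date; int seen; };

static unsigned long worst_skew_found;

/* Same shape as check_skew_recurse() in the patch, on toy structs. */
static void check_skew_toy(struct scommit *c, unsigned long when)
{
	int i;

	if (!c || c->seen)
		return;
	c->seen = 1;
	if (c->date > when && c->date - when > worst_skew_found)
		worst_skew_found = c->date - when;
	for (i = 0; i < 2; i++)
		check_skew_toy(c->parent[i], c->date < when ? c->date : when);
}

/* Diamond: m(100) -> a(95) -> c(60) and m(100) -> b(40) -> c(60).
 * Along m->b->c the clock jumps from 40 up to 60 (a skew of 20), but c
 * is reached first via m->a->c, where its date looks fine, and the
 * SEEN mark then hides the skewed path.  Result: no skew reported. */
static unsigned long skew_demo(void)
{
	static struct scommit m, a, b, c;

	m.date = 100; a.date = 95; b.date = 40; c.date = 60;
	m.seen = a.seen = b.seen = c.seen = 0;
	m.parent[0] = &a; m.parent[1] = &b;
	a.parent[0] = &c; b.parent[0] = &c;
	worst_skew_found = 0;
	check_skew_toy(&m, 200);
	return worst_skew_found;
}
```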

---
 .gitignore     |    1 +
 Makefile       |    1 +
 builtin.h      |    1 +
 builtin/skew.c |   62 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 git.c          |    1 +
 5 files changed, 66 insertions(+), 0 deletions(-)
 create mode 100644 builtin/skew.c

diff --git a/.gitignore b/.gitignore
index 20560b8..d696c78 100644
--- a/.gitignore
+++ b/.gitignore
@@ -132,6 +132,7 @@
 /git-show-branch
 /git-show-index
 /git-show-ref
+/git-skew
 /git-stage
 /git-stash
 /git-status
diff --git a/Makefile b/Makefile
index 1f1ce04..25f94b0 100644
--- a/Makefile
+++ b/Makefile
@@ -739,6 +739,7 @@ BUILTIN_OBJS += builtin/send-pack.o
 BUILTIN_OBJS += builtin/shortlog.o
 BUILTIN_OBJS += builtin/show-branch.o
 BUILTIN_OBJS += builtin/show-ref.o
+BUILTIN_OBJS += builtin/skew.o
 BUILTIN_OBJS += builtin/stripspace.o
 BUILTIN_OBJS += builtin/symbolic-ref.o
 BUILTIN_OBJS += builtin/tag.o
diff --git a/builtin.h b/builtin.h
index f2a25a0..e01ac4c 100644
--- a/builtin.h
+++ b/builtin.h
@@ -140,5 +140,6 @@ extern int cmd_verify_pack(int argc, const char **argv, const char *prefix);
 extern int cmd_show_ref(int argc, const char **argv, const char *prefix);
 extern int cmd_pack_refs(int argc, const char **argv, const char *prefix);
 extern int cmd_replace(int argc, const char **argv, const char *prefix);
+extern int cmd_skew(int argc, const char **argv, const char *prefix);
 
 #endif
diff --git a/builtin/skew.c b/builtin/skew.c
new file mode 100644
index 0000000..169a9f4
--- /dev/null
+++ b/builtin/skew.c
@@ -0,0 +1,62 @@
+#include "cache.h"
+#include "commit.h"
+#include "diff.h"
+#include "revision.h"
+
+unsigned long worst_skew = 0;
+unsigned char worst_skew_sha1[20];
+unsigned long worst_run = 0;
+unsigned char worst_run_sha1[20];
+
+static void check_skew_recurse(struct commit *c, unsigned long when, int counter)
+{
+	struct commit_list *p;
+
+	if (c->object.flags & SEEN)
+		return;
+	c->object.flags |= SEEN;
+
+	if (parse_commit(c) < 0)
+		return;
+
+	if (c->date > when) {
+		unsigned long skew = c->date - when;
+		if (skew > worst_skew) {
+			worst_skew = skew;
+			hashcpy(worst_skew_sha1, c->object.sha1);
+		}
+		if (++counter > worst_run) {
+			worst_run = counter;
+			hashcpy(worst_run_sha1, c->object.sha1);
+		}
+	}
+	else
+		counter = 0;
+
+	for (p = c->parents; p; p = p->next)
+		check_skew_recurse(p->item, c->date < when ? c->date : when, counter);
+}
+
+static void check_skew(struct commit *c)
+{
+	check_skew_recurse(c, time(NULL), 0);
+}
+
+int cmd_skew(int argc, const char **argv, const char *prefix) {
+	struct rev_info revs;
+	int i;
+
+	git_config(git_default_config, NULL);
+	init_revisions(&revs, prefix);
+	argc = setup_revisions(argc, argv, &revs, NULL);
+
+	for (i = 0; i < revs.pending.nr; i++) {
+		struct object *o = revs.pending.objects[i].item;
+		if (o->type == OBJ_COMMIT)
+			check_skew((struct commit *)o);
+	}
+
+	printf("worst skew: %lu (%s)\n", worst_skew, sha1_to_hex(worst_skew_sha1));
+	printf("longest run: %lu (%s)\n", worst_run, sha1_to_hex(worst_run_sha1));
+	return 0;
+}
diff --git a/git.c b/git.c
index 0409ac9..1041858 100644
--- a/git.c
+++ b/git.c
@@ -405,6 +405,7 @@ static void handle_internal_command(int argc, const char **argv)
 		{ "verify-pack", cmd_verify_pack },
 		{ "show-ref", cmd_show_ref, RUN_SETUP },
 		{ "pack-refs", cmd_pack_refs, RUN_SETUP },
+		{ "skew", cmd_skew, RUN_SETUP },
 	};
 	int i;
 	static const char ext[] = STRIP_EXTENSION;
-- 
1.7.3.2.216.g61ab7.dirty


* Re: Why is "git tag --contains" so slow?
  2010-10-27 17:11                       ` Jeff King
@ 2010-10-28  8:07                         ` Clemens Buchacher
  0 siblings, 0 replies; 46+ messages in thread
From: Clemens Buchacher @ 2010-10-28  8:07 UTC (permalink / raw)
  To: Jeff King; +Cc: Jonathan Nieder, tytso, Avery Pennarun, git


Hi Jeff,

On Wed, Oct 27, 2010 at 10:11:45AM -0700, Jeff King wrote:
> 
> Thanks both of you for the extra data points. If you don't mind, would
> you consider running my updated git-skew below on your test cases (or
> tweaking your skew detectors, since you both seem to be getting a '#
> skewed' column that mine didn't output).

I think this works well for small skews and for positive skews,
i.e., if the date goes back to the future. But if the date goes
into the past, that time is used as the new reference, and all
subsequently traversed commits are considered skewed. So for
example, if I add the following commit to git.git's current master,
the longest run becomes 1092 commits.

$ GIT_COMMITTER_DATE=$(date -d "now-1 year") git commit --allow-empty -m add-skew

To solve this, we could keep track of a window of the last n
commits traversed and take as a new time reference the largest date
within that window which is smaller than the previous reference.
That way we would fail to detect skew only if all of the commits in
that window are skewed into the past or into the future.
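In code, one reading of that window idea could be (a sketch; the
function name is invented, and the fallback when nothing in the window
qualifies is my assumption):

```c
#include <assert.h>

/* From the last n dates traversed, pick as the new reference the
 * largest one that is not newer than the previous reference; if none
 * qualifies (the whole window is skewed), keep the old reference. */
static unsigned long window_ref(const unsigned long *window, int n,
				unsigned long prev_ref)
{
	unsigned long best = 0;
	int i;

	for (i = 0; i < n; i++)
		if (window[i] <= prev_ref && window[i] > best)
			best = window[i];
	return best ? best : prev_ref;
}
```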

This particular scenario does not seem to occur in any of the repos I
tested, except maybe for the mesa repo. But manual inspection shows
that that repo has very long runs of skewed commits anyway.

project                 # skewed        max skew        longest run
-------                 --------        --------        -----------
linux                           99 days         80
wine                    1       14 days         1
mesa                    4930    150 days        1520

xf86-video-ati          13      ~2.5 hrs        4
xf86-video-intel        26      8 hrs           4
xf86-video-nouveau      12      10 hrs          9

fluxbox                 0       0               0
metacity                0       0               0
openbox                 8       4 hrs           4

giggle                  2       21 min          2
glibc                   2       931 days        1

Clemens



* Re: [RFC/PATCH 1/4] tag: speed up --contains calculation
  2010-07-05 12:33                 ` [RFC/PATCH 1/4] tag: speed up --contains calculation Jeff King
  2010-10-13 22:07                   ` Jonathan Nieder
  2010-10-13 22:56                   ` Clemens Buchacher
@ 2011-02-23 15:51                   ` Ævar Arnfjörð Bjarmason
  2011-02-23 16:39                     ` Jeff King
  2 siblings, 1 reply; 46+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2011-02-23 15:51 UTC (permalink / raw)
  To: Jeff King; +Cc: tytso, Avery Pennarun, git, Junio C Hamano

On Mon, Jul 5, 2010 at 14:33, Jeff King <peff@peff.net> wrote:
> When we want to know if commit A contains commit B (or any
> one of a set of commits, B through Z), we generally
> calculate the merge bases and see if B is a merge base of A
> (or for a set, if any of the commits B through Z have that
> property).

On a work repo with around 10k tags after and before this patch:

    $ time ~/g/git/git tag --contains HEAD~200 | wc -l
    113

    real    0m0.421s
    user    0m0.380s
    sys     0m0.042s

    $ time git tag --contains HEAD~200 | wc -l
    113

    real    2m18.861s
    user    2m18.750s
    sys     0m0.092s

I'd love to have this merged downwards from pu. It's the single
biggest usability improvement in my Git workflow for as long as I can
remember.


* Re: [RFC/PATCH 1/4] tag: speed up --contains calculation
  2011-02-23 15:51                   ` Ævar Arnfjörð Bjarmason
@ 2011-02-23 16:39                     ` Jeff King
  0 siblings, 0 replies; 46+ messages in thread
From: Jeff King @ 2011-02-23 16:39 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: tytso, Avery Pennarun, git, Junio C Hamano

On Wed, Feb 23, 2011 at 04:51:02PM +0100, Ævar Arnfjörð Bjarmason wrote:

> On Mon, Jul 5, 2010 at 14:33, Jeff King <peff@peff.net> wrote:
> > When we want to know if commit A contains commit B (or any
> > one of a set of commits, B through Z), we generally
> > calculate the merge bases and see if B is a merge base of A
> > (or for a set, if any of the commits B through Z have that
> > property).
> 
> On a work repo with around 10k tags after and before this patch:
> [...much speed improvement...]
> I'd love to have this merged downwards from pu. It's the single
> biggest usability improvement in my Git workflow for as long as I can
> remember.

Yeah, I have been meaning to revisit the topic. I got sidetracked by
looking at clock skew issues, and there was some debate over whether the
strict depth-first traversal was the best way to do it (although if you
trust the commit timestamps, I think you stop going down dead-ends
pretty quickly in practice; and I think we do implicitly trust the
commit timestamps in the regular merge base traversals). If we can do a
multi-headed merge base, that would be better, but I'll have to think on
it.

I'll try to take a look this week and see what I can come up with.

-Peff


end of thread, other threads:[~2011-02-23 16:39 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
2010-07-01  0:54 Why is "git tag --contains" so slow? Theodore Ts'o
2010-07-01  0:58 ` Shawn O. Pearce
2010-07-03 23:27   ` Sam Vilain
2010-07-01  1:00 ` Avery Pennarun
2010-07-01 12:17   ` tytso
2010-07-01 15:03     ` Jeff King
2010-07-01 15:38       ` Jeff King
2010-07-02 19:26         ` tytso
2010-07-03  8:06           ` Jeff King
2010-07-04  0:55             ` tytso
2010-07-05 12:27               ` Jeff King
2010-07-05 12:33                 ` [RFC/PATCH 1/4] tag: speed up --contains calculation Jeff King
2010-10-13 22:07                   ` Jonathan Nieder
2010-10-13 22:56                   ` Clemens Buchacher
2011-02-23 15:51                   ` Ævar Arnfjörð Bjarmason
2011-02-23 16:39                     ` Jeff King
2010-07-05 12:34                 ` [RFC/PATCH 2/4] limit "contains" traversals based on commit timestamp Jeff King
2010-10-13 23:21                   ` Jonathan Nieder
2010-07-05 12:35                 ` [RFC/PATCH 3/4] default core.clockskew variable to one day Jeff King
2010-07-05 12:36                 ` [RFC/PATCH 4/4] name-rev: respect core.clockskew Jeff King
2010-07-05 12:39                 ` Why is "git tag --contains" so slow? Jeff King
2010-10-14 18:59                   ` Jonathan Nieder
2010-10-16 14:32                     ` Clemens Buchacher
2010-10-27 17:11                       ` Jeff King
2010-10-28  8:07                         ` Clemens Buchacher
2010-07-05 14:10                 ` tytso
2010-07-06 11:58                   ` Jeff King
2010-07-06 15:31                     ` Will Palmer
2010-07-06 16:53                       ` tytso
2010-07-08 11:28                         ` Jeff King
2010-07-08 13:21                           ` Will Palmer
2010-07-08 13:54                             ` tytso
2010-07-07 17:45                       ` Jeff King
2010-07-08 10:29                         ` Theodore Tso
2010-07-08 11:12                           ` Jakub Narebski
2010-07-08 19:29                             ` Nicolas Pitre
2010-07-08 19:39                               ` Avery Pennarun
2010-07-08 20:13                                 ` Nicolas Pitre
2010-07-08 21:20                                   ` Jakub Narebski
2010-07-08 21:30                                     ` Sverre Rabbelier
2010-07-08 23:10                                       ` Nicolas Pitre
2010-07-08 23:15                                     ` Nicolas Pitre
2010-07-08 11:31                           ` Jeff King
2010-07-08 14:35                           ` Johan Herland
2010-07-08 19:06                           ` Nicolas Pitre
2010-07-07 17:50                       ` Jeff King

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git
