git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jeff King <peff@peff.net>
To: Junio C Hamano <gitster@pobox.com>
Cc: git@vger.kernel.org, "Jakub Narebski" <jnareb@gmail.com>,
	"Ted Ts'o" <tytso@mit.edu>,
	"Jonathan Nieder" <jrnieder@gmail.com>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"Clemens Buchacher" <drizzd@aon.at>,
	"Shawn O. Pearce" <spearce@spearce.org>,
	"Linus Torvalds" <torvalds@linux-foundation.org>
Subject: Re: [RFC/PATCHv2 6/6] limit "contains" traversals based on commit generation
Date: Fri, 15 Jul 2011 16:40:02 -0400	[thread overview]
Message-ID: <20110715204002.GC356@sigill.intra.peff.net> (raw)
In-Reply-To: <7vpqlb1k1g.fsf@alter.siamese.dyndns.org>

On Fri, Jul 15, 2011 at 11:22:03AM -0700, Junio C Hamano wrote:

> > +	/* stop searching if we go too far back in time */
> > +	if (commit_generation(candidate) < cutoff)
> > +		return 0;
> > +
> 
> Here, the "generation number" was the commit timestamp of the commit in
> your earlier round, but now it is not.

Yes.

> [...]
> What you are trying to say in this series is that "If a tag points at a
> commit X (i.e. candidate), another commit Y (i.e. "want") that is much
> younger than cannot possibly be included by it, because a tag was made on
> commit X way before Y was created". You cut off by "age".

More or less. Given two ages, X and Y, you cannot know right off the bat
whether one is an ancestor of the other. You can only know that the
higher generation cannot possibly be an ancestor of the lower
generation.

So yes, in some cases we will see that the tag has generation 5, and the
"want" commit has generation 10, and we will know there is no point in
traversing backwards from 5.

But we may also see a tag with generation 10, and a want commit with
generation 5. In that case, we have to explore backwards, looking to
either find the commit, or to pass a point where we hit a commit with
generation less than 5, at which point we know the want commit cannot be
an ancestor.

So if you have a history like:

    B
   /
  A--C--D--...--Z

and you want to know if "B" is in "Z", you will have to traverse all the
way from Z to A before realizing that B cannot be in it. That will have
the same complexity as finding a merge base (because the merge-base
would breadth-first traverse, looking at the highest age first, and
would therefore touch the same sequence of commits).

> The heuristics would work efficiently to check what tags point at a
> relatively recent commit (e.g. when trying to see which releases are
> affected and needing a fix by a recently discovered bug introduced by
> commit Y) by discarding tags for ancient releases. In such a scenario, the
> timestamp of a tagged and ancient commit X on a side branch that was never
> merged in the mainline that leads to commit Y (i.e. "want"), in an ideal
> world without clock skews, will be way older than the timestamp of commit
> Y. In other words, if you use timestamp as "age", even though X and Y do
> not relate to each other directly, except that they may share an ancient
> common ancestor, their "age"s can be compared and you could apply your
> heuristics to optimize.
> 
> But if you use the generation number as "age", even in an ideal world
> without clock skew nor miscomputed generation number, you no longer can
> compare "age" of X and Y.  The ancient side branch that led to X may have
> tons more commits than the history leading to Y.

Yes, there are cases where commit timestamps can save us some traversal
work over generation numbers. But I think there are also cases where the
reverse is true.

Your example is of a long side branch with very old commits. But you
could also have a short side branch with a very recent commit (think
recent bugfix forked from an old version). Then with commit timestamps
we need to go to the merge base to realize we are talking about
something very old. With a generation number, you can easily see that
the short branch's generation is very low.

Using both as a cutoff (if we assumed we had both, and that they were
both implicitly accurate) would let you optimize for either situation.
In practice, timestamps are either good enough that we don't need
generation numbers, or are considered to be strictly less accurate than
generation numbers.

So I think we will end up with one or the other. Of all of the
complaints I have seen over the use of timestamps for cutoffs, I don't
think performance of these corner cases has been an issue.

>    So how about marking commits (using the metainfo-cache facility) that
>    has an ancestor (not necessarily its direct parent) that records a
>    younger timestamp (e.g. 1 is such a commit, as its ancestors include
>    things like 2 and 4)? There should be relatively small number of them,
>    and still_interesting() logic can be told to dig through such commits
>    even if everybody is uninteresting in the active list.

You don't even need metainfo-cache for this. The number of commits is
very small, so the existing "notes-cache" is plenty fast. I even posted
a patch for this already:

  http://article.gmane.org/gmane.comp.version-control.git/176642

>  * As to "tag --contains", when timestamp based heuristics breaks down is
>    when a tagged commit incorrectly records way young timestamp or the
>    "want" commit records way old timetsamp. I haven't thought things
>    through, but the same metainfo-cache may be useful to detect which
>    commit to dig through ignoring the cutoff heuristics.

It can also break down if intermediate commits are wrong, because we
have to traverse backwards, and we may erroneously cutoff early.

For example:

   A--B--C

   timestamp(A) = 2
   timestamp(B) = 1 # skewed!
   timestamp(C) = 3

If tag=C and want=A, then we traverse backwards from C. We can't stop
immediately because we know that 2 < 3. But we go back to B, and see
that 2 > 1, and assume that A cannot possibly be an ancestor of B.

You can push through another N commits of slop (where N=1 would be fine
in this case). But you can always have a run of N+1 skewed commits (and
the higher N, the more time you are wasting). From earlier measurements
posted to the list, most repos have runs less than a dozen commits.
Linux-2.6 is actually the second highest of those measured, with a run
of 80. The mesa repo apparently has a run of 1520. See:

  http://article.gmane.org/gmane.comp.version-control.git/160163

and the surrounding thread (Jonathan did some measurements, too; he
didn't include a "longest run", but he does include "total number of
skewed commits", which obviously provides an upper bound on a run).

-Peff

  reply	other threads:[~2011-07-15 20:40 UTC|newest]

Thread overview: 57+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-07-13  6:47 [RFC/PATCHv2 0/6] generation numbers for faster traversals Jeff King
2011-07-13  6:57 ` [RFC/PATCHv2 1/6] decorate: allow storing values instead of pointers Jeff King
2011-07-13 17:52   ` Jonathan Nieder
2011-07-13 20:08     ` Jeff King
2011-07-14 17:34       ` Jeff King
2011-07-14 17:51         ` [PATCH 1/3] implement generic key/value map Jeff King
2011-07-14 18:52           ` Bert Wesarg
2011-07-14 18:54             ` Bert Wesarg
2011-07-14 18:55               ` Jeff King
2011-07-14 19:07                 ` Bert Wesarg
2011-07-14 19:14                   ` Jeff King
2011-07-14 19:18                     ` Bert Wesarg
2011-07-14 17:52         ` [PATCH 2/3] fast-export: use object to uint32 map instead of "decorate" Jeff King
2011-07-15  9:40           ` Sverre Rabbelier
2011-07-15 20:00             ` Jeff King
2011-07-14 17:53         ` [PATCH 3/3] decorate: use "map" for the underlying implementation Jeff King
2011-07-14 21:06         ` [RFC/PATCHv2 1/6] decorate: allow storing values instead of pointers Junio C Hamano
2011-08-04 22:43           ` [RFC/PATCH 0/5] macro-based key/value maps Jeff King
2011-08-04 22:45             ` [PATCH 1/5] implement generic key/value map Jeff King
2011-08-04 22:46             ` [PATCH 2/5] fast-export: use object to uint32 map instead of "decorate" Jeff King
2011-08-04 22:46             ` [PATCH 3/5] decorate: use "map" for the underlying implementation Jeff King
2011-08-04 22:46             ` [PATCH 4/5] map: implement persistent maps Jeff King
2011-08-04 22:46             ` [PATCH 5/5] implement metadata cache subsystem Jeff King
2011-08-04 22:48             ` [RFC/PATCH 0/2] patch-id caching Jeff King
2011-08-04 22:49               ` [PATCH 1/2] cherry: read default config Jeff King
2011-08-04 22:49               ` [PATCH 2/2] cache patch ids on disk Jeff King
2011-08-04 22:52                 ` Jeff King
2011-08-05 11:03             ` [RFC/PATCH 0/5] macro-based key/value maps Jeff King
2011-08-05 15:31               ` René Scharfe
2011-08-06  6:30                 ` Jeff King
2011-07-13  7:04 ` [RFC/PATCHv2 2/6] add metadata-cache infrastructure Jeff King
2011-07-13  8:18   ` Bert Wesarg
2011-07-13  8:31     ` Jeff King
2011-07-13  8:45       ` Bert Wesarg
2011-07-13 19:18         ` Jeff King
2011-07-13 19:40       ` Junio C Hamano
2011-07-13 19:33   ` Junio C Hamano
2011-07-13 20:25     ` Jeff King
2011-07-13  7:05 ` [RFC/PATCHv2 3/6] commit: add commit_generation function Jeff King
2011-07-13 14:26   ` Eric Sunshine
2011-07-13  7:05 ` [RFC/PATCHv2 4/6] pretty: support %G to show the generation number of a commit Jeff King
2011-07-13  7:06 ` [RFC/PATCHv2 5/6] check commit generation cache validity against grafts Jeff King
2011-07-13 14:26   ` Eric Sunshine
2011-07-13 19:35     ` Jeff King
2011-07-13  7:06 ` [RFC/PATCHv2 6/6] limit "contains" traversals based on commit generation Jeff King
2011-07-13  7:23   ` Jeff King
2011-07-13 20:33     ` Junio C Hamano
2011-07-13 20:58       ` Jeff King
2011-07-13 21:12         ` Junio C Hamano
2011-07-13 21:18           ` Jeff King
2011-07-15 18:22   ` Junio C Hamano
2011-07-15 20:40     ` Jeff King [this message]
2011-07-15 21:04       ` Junio C Hamano
2011-07-15 21:14         ` Jeff King
2011-07-15 21:01 ` Generation numbers and replacement objects Jakub Narebski
2011-07-15 21:10   ` Jeff King
2011-07-16 21:10     ` Jakub Narebski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20110715204002.GC356@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=avarab@gmail.com \
    --cc=drizzd@aon.at \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jnareb@gmail.com \
    --cc=jrnieder@gmail.com \
    --cc=spearce@spearce.org \
    --cc=torvalds@linux-foundation.org \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).