git@vger.kernel.org mailing list mirror (one of many)
From: Jeff King <peff@peff.net>
To: tytso@mit.edu
Cc: Avery Pennarun <apenwarr@gmail.com>, git@vger.kernel.org
Subject: Re: Why is "git tag --contains" so slow?
Date: Tue, 6 Jul 2010 07:58:28 -0400
Message-ID: <20100706115826.GA15413@sigill.intra.peff.net>
In-Reply-To: <20100705141012.GA25518@thunk.org>

On Mon, Jul 05, 2010 at 10:10:12AM -0400, tytso@mit.edu wrote:

> As time progresses, the clock skew breakage should be less likely to
> be hit by a typical developer, right?  That is, unless you are
> specifically referencing one of the commits which were skewed, two
> years from now, the chances of someone (who isn't doing code
> archeology) of getting hit by a problem should be small, right?  This

It's not about directly referencing skewed commits. It's about
traversing history that contains skewed commits. So if I have a history
like:

  A -- B -- C -- D

and "B" is skewed, then the timestamp cutoff stops the traversal at "B",
and I will generally give up on finding "A" when searching backwards
from "C" or "D", or their descendants. So as time
moves forward, you will continue to have your old tags pointing to "C"
or "D", but also tags pointing to their descendants. Doing "git tag
--contains A" will continue to be inaccurate, since it will continue to
look for "A" from "C" and "D", but also from newer tags, all of which
involve traversing the skewed "B".
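
To make the failure concrete, here is a minimal Python sketch of a
"contains" walk with a timestamp cutoff. It is an illustration only, not
git's implementation: the Commit class, the contains() helper, and the
timestamps are all made up for the example, and "B" is given a timestamp
90 days older than its parent "A".

```python
from dataclasses import dataclass, field

DAY = 86400  # seconds


@dataclass
class Commit:
    # Hypothetical stand-in for a commit object; not git's data structure.
    name: str
    timestamp: int
    parents: list = field(default_factory=list)


def contains(start, target, allowed_skew=None):
    """Walk backwards from `start` looking for `target`.

    With `allowed_skew` set, do not descend past any commit whose
    timestamp is older than `target.timestamp - allowed_skew`.
    With `allowed_skew=None`, the cutoff is disabled.
    """
    cutoff = None if allowed_skew is None else target.timestamp - allowed_skew
    seen = set()
    stack = [start]
    while stack:
        commit = stack.pop()
        if commit is target:
            return True
        if commit.name in seen:
            continue
        seen.add(commit.name)
        if cutoff is not None and commit.timestamp < cutoff:
            continue  # looks too old to have `target` as an ancestor
        stack.extend(commit.parents)
    return False


# History A -- B -- C -- D, where "B" has a badly skewed (too old) timestamp.
a = Commit("A", 100 * DAY)
b = Commit("B", 10 * DAY, [a])
c = Commit("C", 102 * DAY, [b])
d = Commit("D", 103 * DAY, [c])

found_with_cutoff = contains(d, a, allowed_skew=DAY)
found_without_cutoff = contains(d, a)
found_c_from_d = contains(d, c, allowed_skew=DAY)
```

With a one-day allowance the walk from "D" stops at "B" and never
reaches "A"; with the cutoff disabled it finds "A"; and looking for "C"
from "D" never touches "B" at all.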

What I think is true is that people will be less likely to look at "A"
as time goes on, as code it introduced presumably becomes less relevant
(either bugs are shaken out, or it gets replaced, or whatever). And
obviously looking at "C" from "D", the skew in "B" will be irrelevant.

So I think typical developers become less likely to hit the issue as
time goes on, but software archaeologists will hit it forever.

> If so, I could imagine the automagic scheme choosing a default that
> only finds the worst skew in the past N months.  This would speed
> things up for users who are using repositories that have skews in the
> distant past, at the cost of introducing potentially confusing edge
> cases for people doing code archeology.

How do you decide, when looking for commits that have bogus timestamps,
which ones happened in the past N months? Certainly you can do some
statistical analysis to pick out anomalous ones. And you could perhaps
favor future skewing over past skewing, since that skew doesn't tend to
impact traversal cutoffs (and large past skewing seems to be more
common). But that is getting kind of complex.

> I'm not sure this is a good tradeoff, but given in practice how rarely
> most developers go back in time more than say, 12-24 months, maybe
> it's worth doing.  What do you think?

I'm not sure. I am tempted to just default it to 86400 seconds (one
day) and go no further. People who care about archaeology can turn off
traversal cutoffs if they like, and as the skewed history ages, people
get less likely to look at it. We could also pick half a year or some
other high number as the default allowable skew. The performance
increase is still quite noticeable there, and it covers the only large
skew we know about. I'd be curious to see if other projects have skew,
and how much.
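
For anyone who wants to check their own project, a rough Python sketch
along these lines should give a number. It assumes one particular
definition of skew (the largest amount by which any commit's committer
timestamp precedes that of one of its parents), and the helper names
worst_past_skew() and read_git_log() are made up for the example. The
input is what "git log --all --format='%H %ct %P'" prints.

```python
import subprocess


def worst_past_skew(log_lines):
    """Return the largest (parent timestamp - child timestamp), in seconds.

    Each line is '<hash> <committer-timestamp> [<parent-hash> ...]'.
    A result of 0 means no commit is older than any of its parents.
    """
    timestamps = {}
    parents = {}
    for line in log_lines:
        fields = line.split()
        if not fields:
            continue
        commit = fields[0]
        timestamps[commit] = int(fields[1])
        parents[commit] = fields[2:]

    worst = 0
    for commit, parent_list in parents.items():
        for parent in parent_list:
            # Parents missing from the input (e.g. shallow clones) are skipped.
            if parent in timestamps:
                worst = max(worst, timestamps[parent] - timestamps[commit])
    return worst


def read_git_log(repo="."):
    """Collect hash, committer timestamp, and parent hashes for all refs."""
    result = subprocess.run(
        ["git", "-C", repo, "log", "--all", "--format=%H %ct %P"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.splitlines()


# Toy input matching A -- B -- C -- D with "B" 90 seconds older than "A".
sample = ["a 100", "b 10 a", "c 102 b", "d 103 c"]
sample_skew = worst_past_skew(sample)
```

On a real repository, worst_past_skew(read_git_log()) would report the
figure in seconds. Note that this cannot tell whether the child was
stamped too early or the parent too late; it only measures the gap.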

-Peff
