git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jeff King <peff@peff.net>
To: "SZEDER Gábor" <szeder.dev@gmail.com>
Cc: Taylor Blau <me@ttaylorr.com>,
	git@vger.kernel.org, dstolee@microsoft.com
Subject: looking up object types quickly, was Re: [PATCH 1/1] commit-graph.c: avoid unnecessary tag dereference when merging
Date: Sun, 22 Mar 2020 14:45:19 -0400	[thread overview]
Message-ID: <20200322184519.GA654322@coredump.intra.peff.net> (raw)
In-Reply-To: <20200322110424.GC2224@szeder.dev>

On Sun, Mar 22, 2020 at 12:04:24PM +0100, SZEDER Gábor wrote:

> > > Where the last step is taking all commits listed in any pack, which is
> > > cheap to iterate.
> > 
> > I'm not sure it's all that cheap. It has to find the type of every
> > object in every pack. And finding types involves walking delta chains.
> 
> Indeed, that's just about the most expensive way to find all commits,
> in my experience it can be well over a magnitude slower than
> '--reachable'.

They don't necessarily produce the same set of commits, of course, but
for the purposes of "commit-graph write" I think they can generally be
considered interchangeable.

The speed depends on a couple of factors. One is just the makeup of the
repository: tree size, how many are deltas, etc. But a big one is
whether commit-graphs are already enabled. :)

On linux.git without commit-graphs, both of these take about 7.5s for
me:

  git cat-file --batch-all-objects --unordered --batch-check='%(objecttype) %(objectname)'

  git rev-list --all

But with graphs, the latter takes about 700ms.

I was coincidentally thinking about this problem ("show me all commits")
for something unrelated recently, and I do think that cat-file
invocation could be sped up in some cases. Here are a few ideas:

1. If the pack has a .bitmap file, it contains 4 "type" bitmaps that you
   can use to quickly determine the type of a given object. But:

   - the current code for loading bitmaps files is slower than we'd
     need; it indexes all of the commit bitmaps, but we wouldn't use
     them. So we'd want something that just loads the type bitmaps and
     stops.

   - you have to know the bit position of the object you're interested
     in, which implies computing the revidx. We could provide a "bit
     position" cache in the bitmap file (basically one uint32_t for each
     object giving the packfile order position, but stored in pack .idx
     order).

   Neither is too hard a problem, but it's clearly not just a small
   patch to consult the bitmap.

   It could be patched directly into oid_object_info_extended(), which
   could do individual type lookups this way. But it would be faster
   still if callers could say "I'm only interested in commits", in which
   case we could just quickly walk the commit-type bitmap in one pass
   (to get the commits in just that packfile, and then go through the
   other objects the old-fashioned way).

2. The commit-graph file can quickly tell us whether an object is a
   commit. But unfortunately it can give us a definite "yes", but not a
   definite "no" (because it might be a commit that just isn't (yet) in
   the graph file). So we'd have to find the real type of any
   non-commits, and I think that's where we spend the bulk of our time
   anyway.

3. To find the type of a delta object, we have to walk back to a base
   object with a concrete type. When we access the actual object
   contents, we cache intermediate bases to avoid digging down the same
   chain over and over. But I don't think there's any such cache for the
   type lookup. So if you have a 50-deep delta chain and want the type
   of each object, you'll walk down 49 links, then 48 links, and so on.
   It probably wouldn't be that hard to have a small type cache (or just
   allow the existing cache to have entries for "I know the type, but
   don't have the contents").

I suspect (1) would give the biggest speedup, but is more work and
requires bitmaps. I think (3) is likely to give a moderate win and
probably isn't that big a patch.

-Peff

  reply	other threads:[~2020-03-22 18:45 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-03-21  3:44 [PATCH 0/1] commit-graph: avoid unnecessary tag deference when merging Taylor Blau
2020-03-21  3:44 ` [PATCH 1/1] commit-graph.c: avoid unnecessary tag dereference " Taylor Blau
2020-03-21  5:00   ` Jeff King
2020-03-21  6:11     ` Taylor Blau
2020-03-21  6:24       ` Taylor Blau
2020-03-21  7:03       ` Jeff King
2020-03-21 17:27         ` Taylor Blau
2020-03-22  5:36           ` Jeff King
2020-03-22 11:04             ` SZEDER Gábor
2020-03-22 18:45               ` Jeff King [this message]
2020-03-22 19:18                 ` looking up object types quickly, was " Jeff King
2020-03-23 20:15               ` Taylor Blau
2020-03-22 16:45             ` Taylor Blau
2020-03-24  6:06               ` Jeff King
2020-03-21 18:50         ` Junio C Hamano
2020-03-22  0:03           ` Derrick Stolee
2020-03-22  0:20             ` Taylor Blau
2020-03-22  0:23               ` Derrick Stolee
2020-03-22  5:49                 ` Jeff King
2020-03-22  6:04                   ` Jeff King
2020-03-22 15:47                     ` Taylor Blau
2020-03-24  6:11                       ` Jeff King
2020-03-24 23:08                         ` Taylor Blau
2020-03-27  8:42                           ` Jeff King
2020-03-27 15:03                             ` Taylor Blau
2020-03-22 15:44                   ` Taylor Blau
2020-03-24  6:14                     ` Jeff King
2020-03-21  5:01   ` Junio C Hamano
2020-03-21  4:56 ` [PATCH 0/1] commit-graph: avoid unnecessary tag deference " Junio C Hamano
2020-03-21  5:04   ` Jeff King
2020-03-21  6:12     ` Taylor Blau

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200322184519.GA654322@coredump.intra.peff.net \
    --to=peff@peff.net \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=me@ttaylorr.com \
    --cc=szeder.dev@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).