git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Jeff King <peff@peff.net>
To: Jonathan Tan <jonathantanmy@google.com>
Cc: git@vger.kernel.org
Subject: Re: [RFC PATCH] packfile: iterate packed objects in pack order
Date: Wed, 8 Aug 2018 19:25:16 -0400	[thread overview]
Message-ID: <20180808232515.GC21882@sigill.intra.peff.net> (raw)
In-Reply-To: <20180808231210.242120-1-jonathantanmy@google.com>

On Wed, Aug 08, 2018 at 04:12:10PM -0700, Jonathan Tan wrote:

> Many invocations of for_each_object_in_pack() and
> for_each_packed_object() (which invokes the former) subsequently check
> at least the type of the packed object, necessitating accessing the
> packfile itself. For locality reasons, it is thus better to iterate in
> pack order, instead of index order. Teach for_each_object_in_pack() to
> iterate in pack order by first creating a reverse index.
> 
> This is based off work by Jeff King.
> 
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
> After writing this patch and looking at it further, I'm not sure if this
> is a clear benefit, but here's the patch anyway. In particular,
> builtin/fsck.c and builtin/cat-file.c just deal with the OID directly
> and does not access the packfile at all (at least at the time of
> invoking for_each_packed_object). And revision.c, if we are excluding
> promisor objects, parses each packed promisor object, but it seems
> possible to avoid doing that (replacing the parse_object() by
> lookup_unknown_object() still passes tests).

Even if you just use the oid to do a separate lookup in the object
database, there's still a benefit in accessing the objects in pack
order. The case in cat-file needs more than this, though, since it
separately sorts the output (it has to, because it has to merge and
de-dup the output from several packs plus loose objects).

With the patch below on top of yours, I get:

  $ time git.v2.18.0 cat-file --batch-all-objects --buffer --batch | wc -c
  6938365964

  real	0m44.686s
  user	0m42.932s
  sys	0m5.283s

  $ time git.compile cat-file --batch-all-objects --buffer --batch | wc -c
  8289859070

  real	0m7.007s
  user	0m5.542s
  sys	0m4.005s

But:

  - it needs to de-duplicate using a hashmap (which is why the output is
    so much bigger in the second case)

  - it probably needs to be enabled explicitly by the user, since
    cat-file is plumbing and callers may be relying on the existing sort
    order

I can try to pick this up and carry the cat-file bits to completion if
you want, but probably not until tomorrow or Friday.

-Peff

  reply	other threads:[~2018-08-08 23:25 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-08-08 23:12 [RFC PATCH] packfile: iterate packed objects in pack order Jonathan Tan
2018-08-08 23:25 ` Jeff King [this message]
2018-08-09 22:03   ` Jonathan Tan
2018-08-10 22:59     ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180808232515.GC21882@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=git@vger.kernel.org \
    --cc=jonathantanmy@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).