git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [RFC PATCH] packfile: iterate packed objects in pack order
@ 2018-08-08 23:12 Jonathan Tan
  2018-08-08 23:25 ` Jeff King
  0 siblings, 1 reply; 4+ messages in thread
From: Jonathan Tan @ 2018-08-08 23:12 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, peff

Many invocations of for_each_object_in_pack() and
for_each_packed_object() (which invokes the former) subsequently check
at least the type of the packed object, necessitating accessing the
packfile itself. For locality reasons, it is thus better to iterate in
pack order, instead of index order. Teach for_each_object_in_pack() to
iterate in pack order by first creating a reverse index.

This is based off work by Jeff King.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
After writing this patch and looking at it further, I'm not sure if this
is a clear benefit, but here's the patch anyway. In particular,
builtin/fsck.c and builtin/cat-file.c just deal with the OID directly
and does not access the packfile at all (at least at the time of
invoking for_each_packed_object). And revision.c, if we are excluding
promisor objects, parses each packed promisor object, but it seems
possible to avoid doing that (replacing the parse_object() by
lookup_unknown_object() still passes tests).
---
 packfile.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/packfile.c b/packfile.c
index 6974903e5..371b64e9b 100644
--- a/packfile.c
+++ b/packfile.c
@@ -15,6 +15,7 @@
 #include "tree-walk.h"
 #include "tree.h"
 #include "object-store.h"
+#include "pack-revindex.h"
 
 char *odb_pack_name(struct strbuf *buf,
 		    const unsigned char *sha1,
@@ -1890,14 +1891,17 @@ int for_each_object_in_pack(struct packed_git *p, each_packed_object_fn cb, void
 	uint32_t i;
 	int r = 0;
 
+	load_pack_revindex(p);
+
 	for (i = 0; i < p->num_objects; i++) {
+		uint32_t pack_nr = p->revindex[i].nr;
 		struct object_id oid;
 
-		if (!nth_packed_object_oid(&oid, p, i))
+		if (!nth_packed_object_oid(&oid, p, pack_nr))
 			return error("unable to get sha1 of object %u in %s",
-				     i, p->pack_name);
+				     pack_nr, p->pack_name);
 
-		r = cb(&oid, p, i, data);
+		r = cb(&oid, p, pack_nr, data);
 		if (r)
 			break;
 	}
-- 
2.18.0.597.ga71716f1ad-goog


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [RFC PATCH] packfile: iterate packed objects in pack order
  2018-08-08 23:12 [RFC PATCH] packfile: iterate packed objects in pack order Jonathan Tan
@ 2018-08-08 23:25 ` Jeff King
  2018-08-09 22:03   ` Jonathan Tan
  0 siblings, 1 reply; 4+ messages in thread
From: Jeff King @ 2018-08-08 23:25 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git

On Wed, Aug 08, 2018 at 04:12:10PM -0700, Jonathan Tan wrote:

> Many invocations of for_each_object_in_pack() and
> for_each_packed_object() (which invokes the former) subsequently check
> at least the type of the packed object, necessitating accessing the
> packfile itself. For locality reasons, it is thus better to iterate in
> pack order, instead of index order. Teach for_each_object_in_pack() to
> iterate in pack order by first creating a reverse index.
> 
> This is based off work by Jeff King.
> 
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
> After writing this patch and looking at it further, I'm not sure if this
> is a clear benefit, but here's the patch anyway. In particular,
> builtin/fsck.c and builtin/cat-file.c just deal with the OID directly
> and does not access the packfile at all (at least at the time of
> invoking for_each_packed_object). And revision.c, if we are excluding
> promisor objects, parses each packed promisor object, but it seems
> possible to avoid doing that (replacing the parse_object() by
> lookup_unknown_object() still passes tests).

Even if you just use the oid to do a separate lookup in the object
database, there's still a benefit in accessing the objects in pack
order. The case in cat-file needs more than this, though, since it
separately sorts the output (it has to, because it has to merge and
de-dup the output from several packs plus loose objects).

With the patch below on top of yours, I get:

  $ time git.v2.18.0 cat-file --batch-all-objects --buffer --batch | wc -c
  6938365964

  real	0m44.686s
  user	0m42.932s
  sys	0m5.283s

  $ time git.compile cat-file --batch-all-objects --buffer --batch | wc -c
  8289859070

  real	0m7.007s
  user	0m5.542s
  sys	0m4.005s

But:

  - it needs to de-duplicate using a hashmap (which is why the output is
    so much bigger in the second case)

  - it probably needs to be enabled explicitly by the user, since
    cat-file is plumbing and callers may be relying on the existing sort
    order

I can try to pick this up and carry the cat-file bits to completion if
you want, but probably not until tomorrow or Friday.

-Peff

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [RFC PATCH] packfile: iterate packed objects in pack order
  2018-08-08 23:25 ` Jeff King
@ 2018-08-09 22:03   ` Jonathan Tan
  2018-08-10 22:59     ` Jeff King
  0 siblings, 1 reply; 4+ messages in thread
From: Jonathan Tan @ 2018-08-09 22:03 UTC (permalink / raw)
  To: Jeff King; +Cc: Git mailing list

On Wed, Aug 8, 2018 at 4:25 PM, Jeff King <peff@peff.net> wrote:
> Even if you just use the oid to do a separate lookup in the object
> database, there's still a benefit in accessing the objects in pack
> order.

You're probably right, but I don't immediately see what the benefit is.

On a not completely unrelated note, I just realized that in my patch,
i should be pack_nr (ordinal in pack order) and pack_nr should be
index_nr (ordinal in index order, i.e. alphabetical order). If you run
with this, feel free to write your own patch and maybe I'll learn how
accessing objects in pack order benefits looking up the object
database through the commit message.

> I can try to pick this up and carry the cat-file bits to completion if
> you want, but probably not until tomorrow or Friday.

Thanks - feel free to do this.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [RFC PATCH] packfile: iterate packed objects in pack order
  2018-08-09 22:03   ` Jonathan Tan
@ 2018-08-10 22:59     ` Jeff King
  0 siblings, 0 replies; 4+ messages in thread
From: Jeff King @ 2018-08-10 22:59 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Git mailing list

On Thu, Aug 09, 2018 at 03:03:24PM -0700, Jonathan Tan wrote:

> On Wed, Aug 8, 2018 at 4:25 PM, Jeff King <peff@peff.net> wrote:
> > Even if you just use the oid to do a separate lookup in the object
> > database, there's still a benefit in accessing the objects in pack
> > order.
> 
> You're probably right, but I don't immediately see what the benefit is.
> 
> On a not completely unrelated note, I just realized that in my patch,
> i should be pack_nr (ordinal in pack order) and pack_nr should be
> index_nr (ordinal in index order, i.e. alphabetical order). If you run
> with this, feel free to write your own patch and maybe I'll learn how
> accessing objects in pack order benefits looking up the object
> database through the commit message.

It probably would have been more apparent if when I said "like in the
patch below" I actually remembered to paste in the patch. ;)

> > I can try to pick this up and carry the cat-file bits to completion if
> > you want, but probably not until tomorrow or Friday.
> 
> Thanks - feel free to do this.

I've got a series which I'll post momentarily which hopefully explains
what's going on, along with various timings.

-Peff

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2018-08-10 22:59 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-08-08 23:12 [RFC PATCH] packfile: iterate packed objects in pack order Jonathan Tan
2018-08-08 23:25 ` Jeff King
2018-08-09 22:03   ` Jonathan Tan
2018-08-10 22:59     ` Jeff King

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).