git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH 0/7] speeding up cat-file by reordering object access
@ 2018-08-10 23:07 Jeff King
  2018-08-10 23:09 ` [PATCH 1/7] for_each_*_object: store flag definitions in a single location Jeff King
                   ` (8 more replies)
  0 siblings, 9 replies; 27+ messages in thread
From: Jeff King @ 2018-08-10 23:07 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan

This series is meant to replace the RFC discussion in:

  https://public-inbox.org/git/20180808231210.242120-1-jonathantanmy@google.com/

and

  https://public-inbox.org/git/20180808155045.GB1607@sigill.intra.peff.net/

The general idea is that accessing objects in packfile order is way
kinder to the delta base cache, and thus way more efficient. See patches
4 and 7 in particular for discussion and numbers.

I'm primarily interested in cat-file, so this series is focused there.
But there may be other callers of for_each_packed_object() who could
benefit. Most of the existing ones just care about getting the oid, so
they're better off as-is. It's possible the call in is_promisor_object()
could benefit, since it calls parse_object() on each entry it visits. I
didn't experiment with it.

  [1/7]: for_each_*_object: store flag definitions in a single location
  [2/7]: for_each_*_object: take flag arguments as enum
  [3/7]: for_each_*_object: give more comprehensive docstrings
  [4/7]: for_each_packed_object: support iterating in pack-order
  [5/7]: t1006: test cat-file --batch-all-objects with duplicates
  [6/7]: cat-file: rename batch_{loose,packed}_object callbacks
  [7/7]: cat-file: support "unordered" output for --batch-all-objects

 Documentation/git-cat-file.txt |  8 ++++
 builtin/cat-file.c             | 70 ++++++++++++++++++++++++++++------
 cache.h                        | 29 +++++++++++---
 commit-graph.c                 |  2 +-
 packfile.c                     | 24 +++++++++---
 packfile.h                     | 23 ++++++-----
 sha1-file.c                    |  3 +-
 t/t1006-cat-file.sh            | 17 ++++++++-
 8 files changed, 139 insertions(+), 37 deletions(-)

-Peff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 1/7] for_each_*_object: store flag definitions in a single location
  2018-08-10 23:07 [PATCH 0/7] speeding up cat-file by reordering object access Jeff King
@ 2018-08-10 23:09 ` Jeff King
  2018-08-10 23:27   ` Stefan Beller
  2018-08-10 23:09 ` [PATCH 2/7] for_each_*_object: take flag arguments as enum Jeff King
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 27+ messages in thread
From: Jeff King @ 2018-08-10 23:09 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan

These flags were split between cache.h and packfile.h,
because some of the flags apply only to packs. However, they
share a single numeric namespace, since both are respected
for the packed variant. Let's make sure they're defined
together so that nobody accidentally adds a new flag in one
location that duplicates the other.

While we're here, let's also put them in an enum (which
helps debugger visibility) and use "(1<<n)" rather than
counting powers of 2 manually.

Signed-off-by: Jeff King <peff@peff.net>
---
Arguably, all of these for_each_*_object() functions should stay
together. Even though some are related to packfiles and some to loose,
they are meant to be a unified API. So I'd be fine to do that on top,
but this at least reduces the chance of a mistake in the meantime.

 cache.h    | 13 ++++++++++++-
 packfile.h |  8 ++------
 2 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/cache.h b/cache.h
index 8dc7134f00..4187238ecf 100644
--- a/cache.h
+++ b/cache.h
@@ -1623,12 +1623,23 @@ int for_each_loose_file_in_objdir_buf(struct strbuf *path,
 				      each_loose_subdir_fn subdir_cb,
 				      void *data);
 
+/*
+ * Flags for for_each_*_object(), including for_each_loose below and
+ * for_each_packed in packfile.h.
+ */
+enum for_each_object_flags {
+	/* Iterate only over local objects, not alternates. */
+	FOR_EACH_OBJECT_LOCAL_ONLY = (1<<0),
+
+	/* Only iterate over packs obtained from the promisor remote. */
+	FOR_EACH_OBJECT_PROMISOR_ONLY = (1<<1),
+};
+
 /*
  * Iterate over loose objects in both the local
  * repository and any alternates repositories (unless the
  * LOCAL_ONLY flag is set).
  */
-#define FOR_EACH_OBJECT_LOCAL_ONLY 0x1
 extern int for_each_loose_object(each_loose_object_fn, void *, unsigned flags);
 
 /*
diff --git a/packfile.h b/packfile.h
index cc7eaffe1b..6ddc6a2e91 100644
--- a/packfile.h
+++ b/packfile.h
@@ -148,15 +148,11 @@ extern int has_object_pack(const struct object_id *oid);
 
 extern int has_pack_index(const unsigned char *sha1);
 
-/*
- * Only iterate over packs obtained from the promisor remote.
- */
-#define FOR_EACH_OBJECT_PROMISOR_ONLY 2
-
 /*
  * Iterate over packed objects in both the local
  * repository and any alternates repositories (unless the
- * FOR_EACH_OBJECT_LOCAL_ONLY flag, defined in cache.h, is set).
+ * FOR_EACH_OBJECT_LOCAL_ONLY flag is set). See cache.h for the complete list
+ * of flags.
  */
 typedef int each_packed_object_fn(const struct object_id *oid,
 				  struct packed_git *pack,
-- 
2.18.0.1058.g7433f71063


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 2/7] for_each_*_object: take flag arguments as enum
  2018-08-10 23:07 [PATCH 0/7] speeding up cat-file by reordering object access Jeff King
  2018-08-10 23:09 ` [PATCH 1/7] for_each_*_object: store flag definitions in a single location Jeff King
@ 2018-08-10 23:09 ` Jeff King
  2018-08-10 23:11 ` [PATCH 3/7] for_each_*_object: give more comprehensive docstrings Jeff King
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 27+ messages in thread
From: Jeff King @ 2018-08-10 23:09 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan

It's not wrong to pass our flags in an "unsigned", as we
know it will be at least as large as the enum.  However,
using the enum in the declaration makes it more obvious
where to find the list of flags.

While we're here, let's also drop the "extern" noise-words
from the declarations, per our modern coding style.

Signed-off-by: Jeff King <peff@peff.net>
---
 cache.h     | 3 ++-
 packfile.c  | 3 ++-
 packfile.h  | 5 +++--
 sha1-file.c | 3 ++-
 4 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/cache.h b/cache.h
index 4187238ecf..9e02fc494e 100644
--- a/cache.h
+++ b/cache.h
@@ -1640,7 +1640,8 @@ enum for_each_object_flags {
  * repository and any alternates repositories (unless the
  * LOCAL_ONLY flag is set).
  */
-extern int for_each_loose_object(each_loose_object_fn, void *, unsigned flags);
+int for_each_loose_object(each_loose_object_fn, void *,
+			  enum for_each_object_flags flags);
 
 /*
  * Set this to 0 to prevent sha1_object_info_extended() from fetching missing
diff --git a/packfile.c b/packfile.c
index 6974903e58..9da8f6d728 100644
--- a/packfile.c
+++ b/packfile.c
@@ -1904,7 +1904,8 @@ int for_each_object_in_pack(struct packed_git *p, each_packed_object_fn cb, void
 	return r;
 }
 
-int for_each_packed_object(each_packed_object_fn cb, void *data, unsigned flags)
+int for_each_packed_object(each_packed_object_fn cb, void *data,
+			   enum for_each_object_flags flags)
 {
 	struct packed_git *p;
 	int r = 0;
diff --git a/packfile.h b/packfile.h
index 6ddc6a2e91..9861728514 100644
--- a/packfile.h
+++ b/packfile.h
@@ -158,8 +158,9 @@ typedef int each_packed_object_fn(const struct object_id *oid,
 				  struct packed_git *pack,
 				  uint32_t pos,
 				  void *data);
-extern int for_each_object_in_pack(struct packed_git *p, each_packed_object_fn, void *data);
-extern int for_each_packed_object(each_packed_object_fn, void *, unsigned flags);
+int for_each_object_in_pack(struct packed_git *p, each_packed_object_fn, void *data);
+int for_each_packed_object(each_packed_object_fn, void *,
+			   enum for_each_object_flags flags);
 
 /*
  * Return 1 if an object in a promisor packfile is or refers to the given
diff --git a/sha1-file.c b/sha1-file.c
index dfa8a35d68..cc0b57a751 100644
--- a/sha1-file.c
+++ b/sha1-file.c
@@ -2146,7 +2146,8 @@ static int loose_from_alt_odb(struct alternate_object_database *alt,
 	return r;
 }
 
-int for_each_loose_object(each_loose_object_fn cb, void *data, unsigned flags)
+int for_each_loose_object(each_loose_object_fn cb, void *data,
+			  enum for_each_object_flags flags)
 {
 	struct loose_alt_odb_data alt;
 	int r;
-- 
2.18.0.1058.g7433f71063


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 3/7] for_each_*_object: give more comprehensive docstrings
  2018-08-10 23:07 [PATCH 0/7] speeding up cat-file by reordering object access Jeff King
  2018-08-10 23:09 ` [PATCH 1/7] for_each_*_object: store flag definitions in a single location Jeff King
  2018-08-10 23:09 ` [PATCH 2/7] for_each_*_object: take flag arguments as enum Jeff King
@ 2018-08-10 23:11 ` Jeff King
  2018-08-10 23:15 ` [PATCH 4/7] for_each_packed_object: support iterating in pack-order Jeff King
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 27+ messages in thread
From: Jeff King @ 2018-08-10 23:11 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan

We already mention the local/alternate behavior of these
functions, but we can help clarify a few other behaviors:

 - there's no need to mention LOCAL_ONLY specifically, since
   we already reference the flags by type (and as we add
   more flags, we don't want to have to mention each)

 - clarify that reachability doesn't matter here; this is
   all accessible objects

 - what ordering/uniqueness guarantees we give

 - how pack-specific flags are handled for the loose case

Signed-off-by: Jeff King <peff@peff.net>
---
 cache.h    |  8 +++++---
 packfile.h | 12 ++++++++----
 2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/cache.h b/cache.h
index 9e02fc494e..f438540f9b 100644
--- a/cache.h
+++ b/cache.h
@@ -1636,9 +1636,11 @@ enum for_each_object_flags {
 };
 
 /*
- * Iterate over loose objects in both the local
- * repository and any alternates repositories (unless the
- * LOCAL_ONLY flag is set).
+ * Iterate over all accessible loose objects without respect to
+ * reachability. By default, this includes both local and alternate objects.
+ * The order in which objects are visited is unspecified.
+ *
+ * Any flags specific to packs are ignored.
  */
 int for_each_loose_object(each_loose_object_fn, void *,
 			  enum for_each_object_flags flags);
diff --git a/packfile.h b/packfile.h
index 9861728514..c86a8c2716 100644
--- a/packfile.h
+++ b/packfile.h
@@ -149,10 +149,14 @@ extern int has_object_pack(const struct object_id *oid);
 extern int has_pack_index(const unsigned char *sha1);
 
 /*
- * Iterate over packed objects in both the local
- * repository and any alternates repositories (unless the
- * FOR_EACH_OBJECT_LOCAL_ONLY flag is set). See cache.h for the complete list
- * of flags.
+ * Iterate over all accessible packed objects without respect to reachability.
+ * By default, this includes both local and alternate packs.
+ *
+ * Note that some objects may appear twice if they are found in multiple packs.
+ * Each pack is visited in an unspecified order. Objects within a pack are
+ * visited in pack-idx order (i.e., sorted by oid).
+ *
+ * The list of flags can be found in cache.h.
  */
 typedef int each_packed_object_fn(const struct object_id *oid,
 				  struct packed_git *pack,
-- 
2.18.0.1058.g7433f71063


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 4/7] for_each_packed_object: support iterating in pack-order
  2018-08-10 23:07 [PATCH 0/7] speeding up cat-file by reordering object access Jeff King
                   ` (2 preceding siblings ...)
  2018-08-10 23:11 ` [PATCH 3/7] for_each_*_object: give more comprehensive docstrings Jeff King
@ 2018-08-10 23:15 ` Jeff King
  2018-08-15 13:28   ` Derrick Stolee
  2018-08-10 23:16 ` [PATCH 5/7] t1006: test cat-file --batch-all-objects with duplicates Jeff King
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 27+ messages in thread
From: Jeff King @ 2018-08-10 23:15 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan

We currently iterate over objects within a pack in .idx
order, which uses the object hashes. That means that it
is effectively random with respect to the location of the
object within the pack. If you're going to access the actual
object data, there are two reasons to move linearly through
the pack itself:

  1. It improves the locality of access in the packfile. In
     the cold-cache case, this may mean fewer disk seeks, or
     better usage of disk cache.

  2. We store related deltas together in the packfile. Which
     means that the delta base cache can operate much more
     efficiently if we visit all of those related deltas in
     sequence, as the earlier items are likely to still be
     in the cache.  Whereas if we visit the objects in
     random order, our cache entries are much more likely to
     have been evicted by unrelated deltas in the meantime.

So in general, if you're going to access the object contents
pack order is generally going to end up more efficient.

But if you're simply generating a list of object names, or
if you're going to end up sorting the result anyway, you're
better off just using the .idx order, as finding the pack
order means generating the in-memory pack-revindex.
According to the numbers in 8b8dfd5132 (pack-revindex:
radix-sort the revindex, 2013-07-11), that takes about 200ms
for linux.git, and 20ms for git.git (those numbers are a few
years old but are still a good ballpark).

That makes it a good optimization for some cases (we can
save tens of seconds in git.git by having good locality of
delta access, for a 20ms cost), but a bad one for others
(e.g., right now "cat-file --batch-all-objects
--batch-check="%(objectname)" is 170ms in git.git, so adding
20ms to that is noticeable).

Hence this patch makes it an optional flag. You can't
actually do any interesting timings yet, as it's not plumbed
through to any user-facing tools like cat-file. That will
come in a later patch.

Signed-off-by: Jeff King <peff@peff.net>
---
 cache.h        |  5 +++++
 commit-graph.c |  2 +-
 packfile.c     | 21 ++++++++++++++++-----
 packfile.h     |  8 +++++---
 4 files changed, 27 insertions(+), 9 deletions(-)

diff --git a/cache.h b/cache.h
index f438540f9b..6d14702df2 100644
--- a/cache.h
+++ b/cache.h
@@ -1633,6 +1633,11 @@ enum for_each_object_flags {
 
 	/* Only iterate over packs obtained from the promisor remote. */
 	FOR_EACH_OBJECT_PROMISOR_ONLY = (1<<1),
+
+	/*
+	 * Visit objects within a pack in packfile order rather than .idx order
+	 */
+	FOR_EACH_OBJECT_PACK_ORDER = (1<<2),
 };
 
 /*
diff --git a/commit-graph.c b/commit-graph.c
index b0a55ad128..69a0d1c203 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -730,7 +730,7 @@ void write_commit_graph(const char *obj_dir,
 				die("error adding pack %s", packname.buf);
 			if (open_pack_index(p))
 				die("error opening index for %s", packname.buf);
-			for_each_object_in_pack(p, add_packed_commits, &oids);
+			for_each_object_in_pack(p, add_packed_commits, &oids, 0);
 			close_pack(p);
 		}
 		strbuf_release(&packname);
diff --git a/packfile.c b/packfile.c
index 9da8f6d728..ebcb5742ec 100644
--- a/packfile.c
+++ b/packfile.c
@@ -1885,19 +1885,30 @@ int has_pack_index(const unsigned char *sha1)
 	return 1;
 }
 
-int for_each_object_in_pack(struct packed_git *p, each_packed_object_fn cb, void *data)
+int for_each_object_in_pack(struct packed_git *p,
+			    each_packed_object_fn cb, void *data,
+			    enum for_each_object_flags flags)
 {
 	uint32_t i;
 	int r = 0;
 
+	if (flags & FOR_EACH_OBJECT_PACK_ORDER)
+		load_pack_revindex(p);
+
 	for (i = 0; i < p->num_objects; i++) {
+		uint32_t pos;
 		struct object_id oid;
 
-		if (!nth_packed_object_oid(&oid, p, i))
+		if (flags & FOR_EACH_OBJECT_PACK_ORDER)
+			pos = p->revindex[i].nr;
+		else
+			pos = i;
+
+		if (!nth_packed_object_oid(&oid, p, pos))
 			return error("unable to get sha1 of object %u in %s",
-				     i, p->pack_name);
+				     pos, p->pack_name);
 
-		r = cb(&oid, p, i, data);
+		r = cb(&oid, p, pos, data);
 		if (r)
 			break;
 	}
@@ -1922,7 +1933,7 @@ int for_each_packed_object(each_packed_object_fn cb, void *data,
 			pack_errors = 1;
 			continue;
 		}
-		r = for_each_object_in_pack(p, cb, data);
+		r = for_each_object_in_pack(p, cb, data, flags);
 		if (r)
 			break;
 	}
diff --git a/packfile.h b/packfile.h
index c86a8c2716..99411bdd85 100644
--- a/packfile.h
+++ b/packfile.h
@@ -153,8 +153,8 @@ extern int has_pack_index(const unsigned char *sha1);
  * By default, this includes both local and alternate packs.
  *
  * Note that some objects may appear twice if they are found in multiple packs.
- * Each pack is visited in an unspecified order. Objects within a pack are
- * visited in pack-idx order (i.e., sorted by oid).
+ * Each pack is visited in an unspecified order. By default, objects within a
+ * pack are visited in pack-idx order (i.e., sorted by oid).
  *
  * The list of flags can be found in cache.h.
  */
@@ -162,7 +162,9 @@ typedef int each_packed_object_fn(const struct object_id *oid,
 				  struct packed_git *pack,
 				  uint32_t pos,
 				  void *data);
-int for_each_object_in_pack(struct packed_git *p, each_packed_object_fn, void *data);
+int for_each_object_in_pack(struct packed_git *p,
+			    each_packed_object_fn, void *data,
+			    enum for_each_object_flags flags);
 int for_each_packed_object(each_packed_object_fn, void *,
 			   enum for_each_object_flags flags);
 
-- 
2.18.0.1058.g7433f71063


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 5/7] t1006: test cat-file --batch-all-objects with duplicates
  2018-08-10 23:07 [PATCH 0/7] speeding up cat-file by reordering object access Jeff King
                   ` (3 preceding siblings ...)
  2018-08-10 23:15 ` [PATCH 4/7] for_each_packed_object: support iterating in pack-order Jeff King
@ 2018-08-10 23:16 ` Jeff King
  2018-08-10 23:17 ` [PATCH 6/7] cat-file: rename batch_{loose,packed}_object callbacks Jeff King
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 27+ messages in thread
From: Jeff King @ 2018-08-10 23:16 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan

The test for --batch-all-objects in t1006 covers a variety
of object storage situations, but one thing it doesn't cover
is that we avoid mentioning duplicate objects. We won't have
any because running "git repack -ad" will have packed them
all and deleted the loose ones.

This does work (because we sort and de-dup the output list),
but it's good to include it in our test. And doubly so for
when we add an unordered mode which has to de-dup in a
different way.

Note that we cannot just re-create one of the objects, as
Git will omit the write of an object that is already
present. However, we can create a new pack with one of the
objects, which forces the duplication.

One alternative would be to just use "git repack -a" instead
of "-ad". But then _every_ object would be duplicated as
loose and packed, and we might miss a bug that omits packed
objects (because we'd show their loose counterparts).

Signed-off-by: Jeff King <peff@peff.net>
---
 t/t1006-cat-file.sh | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
index 13dd510b2e..4fb5e098f2 100755
--- a/t/t1006-cat-file.sh
+++ b/t/t1006-cat-file.sh
@@ -550,8 +550,8 @@ test_expect_success 'git cat-file --batch --follow-symlink returns correct sha a
 test_expect_success 'cat-file --batch-all-objects shows all objects' '
 	# make new repos so we know the full set of objects; we will
 	# also make sure that there are some packed and some loose
-	# objects, some referenced and some not, and that there are
-	# some available only via alternates.
+	# objects, some referenced and some not, some duplicates, and that
+	# there are some available only via alternates.
 	git init all-one &&
 	(
 		cd all-one &&
@@ -567,6 +567,8 @@ test_expect_success 'cat-file --batch-all-objects shows all objects' '
 		cd all-two &&
 		echo local-unref | git hash-object -w --stdin
 	) >>expect.unsorted &&
+	git -C all-two rev-parse HEAD:file |
+		git -C all-two pack-objects .git/objects/pack/pack &&
 	sort <expect.unsorted >expect &&
 	git -C all-two cat-file --batch-all-objects \
 				--batch-check="%(objectname)" >actual &&
-- 
2.18.0.1058.g7433f71063


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 6/7] cat-file: rename batch_{loose,packed}_object callbacks
  2018-08-10 23:07 [PATCH 0/7] speeding up cat-file by reordering object access Jeff King
                   ` (4 preceding siblings ...)
  2018-08-10 23:16 ` [PATCH 5/7] t1006: test cat-file --batch-all-objects with duplicates Jeff King
@ 2018-08-10 23:17 ` Jeff King
  2018-08-10 23:24 ` [PATCH 7/7] cat-file: support "unordered" output for --batch-all-objects Jeff King
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 27+ messages in thread
From: Jeff King @ 2018-08-10 23:17 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan

We're not really doing the batch-show operation in these
callbacks, but just collecting the set of objects. That
distinction will become more important in a future patch, so
let's rename them now to avoid cluttering that diff.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/cat-file.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 4a44b2404f..2d34f3b867 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -420,18 +420,18 @@ static int batch_object_cb(const struct object_id *oid, void *vdata)
 	return 0;
 }
 
-static int batch_loose_object(const struct object_id *oid,
-			      const char *path,
-			      void *data)
+static int collect_loose_object(const struct object_id *oid,
+				const char *path,
+				void *data)
 {
 	oid_array_append(data, oid);
 	return 0;
 }
 
-static int batch_packed_object(const struct object_id *oid,
-			       struct packed_git *pack,
-			       uint32_t pos,
-			       void *data)
+static int collect_packed_object(const struct object_id *oid,
+				 struct packed_git *pack,
+				 uint32_t pos,
+				 void *data)
 {
 	oid_array_append(data, oid);
 	return 0;
@@ -476,8 +476,8 @@ static int batch_objects(struct batch_options *opt)
 		struct oid_array sa = OID_ARRAY_INIT;
 		struct object_cb_data cb;
 
-		for_each_loose_object(batch_loose_object, &sa, 0);
-		for_each_packed_object(batch_packed_object, &sa, 0);
+		for_each_loose_object(collect_loose_object, &sa, 0);
+		for_each_packed_object(collect_packed_object, &sa, 0);
 		if (repository_format_partial_clone)
 			warning("This repository has extensions.partialClone set. Some objects may not be loaded.");
 
-- 
2.18.0.1058.g7433f71063


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 7/7] cat-file: support "unordered" output for --batch-all-objects
  2018-08-10 23:07 [PATCH 0/7] speeding up cat-file by reordering object access Jeff King
                   ` (5 preceding siblings ...)
  2018-08-10 23:17 ` [PATCH 6/7] cat-file: rename batch_{loose,packed}_object callbacks Jeff King
@ 2018-08-10 23:24 ` Jeff King
  2018-08-13 18:45 ` [PATCH 0/7] speeding up cat-file by reordering object access Jonathan Tan
  2018-08-15 14:05 ` [PATCH 0/7] speeding up cat-file by reordering object access Derrick Stolee
  8 siblings, 0 replies; 27+ messages in thread
From: Jeff King @ 2018-08-10 23:24 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan

If you're going to access the contents of every object in a
packfile, it's generally much more efficient to do so in
pack order, rather than in hash order. That increases the
locality of access within the packfile, which in turn is
friendlier to the delta base cache, since the packfile puts
related deltas next to each other. By contrast, hash order
is effectively random, since the sha1 has no discernible
relationship to the content.

This patch introduces an "--unordered" option to cat-file
which iterates over packs in pack-order under the hood. You
can see the results when dumping all of the file content:

  $ time ./git cat-file --batch-all-objects --buffer --batch | wc -c
  6883195596

  real	0m44.491s
  user	0m42.902s
  sys	0m5.230s

  $ time ./git cat-file --unordered \
                        --batch-all-objects --buffer --batch | wc -c
  6883195596

  real	0m6.075s
  user	0m4.774s
  sys	0m3.548s

Same output, different order, way faster. The same speed-up
applies even if you end up accessing the object content in a
different process, like:

  git cat-file --batch-all-objects --buffer --batch-check |
  grep blob |
  git cat-file --batch='%(objectname) %(rest)' |
  wc -c

Adding "--unordered" to the first command drops the runtime
in git.git from 24s to 3.5s.

  Side note: there are actually further speedups available
  for doing it all in-process now. Since we are outputting
  the object content during the actual pack iteration, we
  know where to find the object and could skip the extra
  lookup done by oid_object_info(). This patch stops short
  of that optimization since the underlying API isn't ready
  for us to make those sorts of direct requests.

So if --unordered is so much better, why not make it the
default? Two reasons:

  1. We've promised in the documentation that --batch-all-objects
     outputs in hash order. Since cat-file is plumbing,
     people may be relying on that default, and we can't
     change it.

  2. It's actually _slower_ for some cases. We have to
     compute the pack revindex to walk in pack order. And
     our de-duplication step uses an oidset, rather than a
     sort-and-dedup, which can end up being more expensive.
     If we're just accessing the type and size of each
     object, for example, like:

       git cat-file --batch-all-objects --buffer --batch-check

     my best-of-five warm cache timings go from 900ms to
     1100ms using --unordered. Though it's possible in a
     cold-cache or under memory pressure that we could do
     better, since we'd have better locality within the
     packfile.

And one final question: why is it "--unordered" and not
"--pack-order"? The answer is again two-fold:

  1. "pack order" isn't a well-defined thing across the
     whole set of objects. We're hitting loose objects, as
     well as objects in multiple packs, and the only
     ordering we're promising is _within_ a single pack. The
     rest is apparently random.

  2. The point here is optimization. So we don't want to
     promise any particular ordering, but only to say that
     we will choose an ordering which is likely to be
     efficient for accessing the object content. That leaves
     the door open for further changes in the future without
     having to add another compatibility option.

Signed-off-by: Jeff King <peff@peff.net>
---
 Documentation/git-cat-file.txt | 10 ++++++
 builtin/cat-file.c             | 56 +++++++++++++++++++++++++++++++---
 t/t1006-cat-file.sh            | 11 +++++++
 3 files changed, 72 insertions(+), 5 deletions(-)

diff --git a/Documentation/git-cat-file.txt b/Documentation/git-cat-file.txt
index f90f09b03f..74013335a1 100644
--- a/Documentation/git-cat-file.txt
+++ b/Documentation/git-cat-file.txt
@@ -104,6 +104,16 @@ OPTIONS
 	buffering; this is much more efficient when invoking
 	`--batch-check` on a large number of objects.
 
+--unordered::
+	When `--batch-all-objects` is in use, visit objects in an
+	order which may be more efficient for accessing the object
+	contents than hash order. The exact details of the order are
+	unspecified, but if you do not require a specific order, this
+	should generally result in faster output, especially with
+	`--batch`.  Note that `cat-file` will still show each object
+	only once, even if it is stored multiple times in the
+	repository.
+
 --allow-unknown-type::
 	Allow -s or -t to query broken/corrupt objects of unknown type.
 
diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 2d34f3b867..45992c9be9 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -21,6 +21,7 @@ struct batch_options {
 	int print_contents;
 	int buffer_output;
 	int all_objects;
+	int unordered;
 	int cmdmode; /* may be 'w' or 'c' for --filters or --textconv */
 	const char *format;
 };
@@ -410,6 +411,7 @@ static void batch_one_object(const char *obj_name, struct batch_options *opt,
 struct object_cb_data {
 	struct batch_options *opt;
 	struct expand_data *expand;
+	struct oidset *seen;
 };
 
 static int batch_object_cb(const struct object_id *oid, void *vdata)
@@ -437,6 +439,32 @@ static int collect_packed_object(const struct object_id *oid,
 	return 0;
 }
 
+static int batch_unordered_object(const struct object_id *oid, void *vdata)
+{
+	struct object_cb_data *data = vdata;
+
+	if (oidset_contains(data->seen, oid))
+		return 0;
+	oidset_insert(data->seen, oid);
+
+	return batch_object_cb(oid, data);
+}
+
+static int batch_unordered_loose(const struct object_id *oid,
+				 const char *path,
+				 void *data)
+{
+	return batch_unordered_object(oid, data);
+}
+
+static int batch_unordered_packed(const struct object_id *oid,
+				  struct packed_git *pack,
+				  uint32_t pos,
+				  void *data)
+{
+	return batch_unordered_object(oid, data);
+}
+
 static int batch_objects(struct batch_options *opt)
 {
 	struct strbuf buf = STRBUF_INIT;
@@ -473,19 +501,35 @@ static int batch_objects(struct batch_options *opt)
 		data.info.typep = &data.type;
 
 	if (opt->all_objects) {
-		struct oid_array sa = OID_ARRAY_INIT;
 		struct object_cb_data cb;
 
-		for_each_loose_object(collect_loose_object, &sa, 0);
-		for_each_packed_object(collect_packed_object, &sa, 0);
 		if (repository_format_partial_clone)
 			warning("This repository has extensions.partialClone set. Some objects may not be loaded.");
 
 		cb.opt = opt;
 		cb.expand = &data;
-		oid_array_for_each_unique(&sa, batch_object_cb, &cb);
 
-		oid_array_clear(&sa);
+		if (opt->unordered) {
+			struct oidset seen = OIDSET_INIT;
+
+			cb.seen = &seen;
+
+			for_each_loose_object(batch_unordered_loose, &cb, 0);
+			for_each_packed_object(batch_unordered_packed, &cb,
+					       FOR_EACH_OBJECT_PACK_ORDER);
+
+			oidset_clear(&seen);
+		} else {
+			struct oid_array sa = OID_ARRAY_INIT;
+
+			for_each_loose_object(collect_loose_object, &sa, 0);
+			for_each_packed_object(collect_packed_object, &sa, 0);
+
+			oid_array_for_each_unique(&sa, batch_object_cb, &cb);
+
+			oid_array_clear(&sa);
+		}
+
 		return 0;
 	}
 
@@ -586,6 +630,8 @@ int cmd_cat_file(int argc, const char **argv, const char *prefix)
 			 N_("follow in-tree symlinks (used with --batch or --batch-check)")),
 		OPT_BOOL(0, "batch-all-objects", &batch.all_objects,
 			 N_("show all objects with --batch or --batch-check")),
+		OPT_BOOL(0, "unordered", &batch.unordered,
+			 N_("do not order --batch-all-objects output")),
 		OPT_END()
 	};
 
diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh
index 4fb5e098f2..7f19d591f2 100755
--- a/t/t1006-cat-file.sh
+++ b/t/t1006-cat-file.sh
@@ -575,4 +575,15 @@ test_expect_success 'cat-file --batch-all-objects shows all objects' '
 	test_cmp expect actual
 '
 
+# The only user-visible difference is that the objects are no longer sorted,
+# and the resulting sort order is undefined. So we can only check that it
+# produces the same objects as the ordered case, but that at least exercises
+# the code.
+test_expect_success 'cat-file --unordered works' '
+	git -C all-two cat-file --batch-all-objects --unordered \
+				--batch-check="%(objectname)" >actual.unsorted &&
+	sort <actual.unsorted >actual &&
+	test_cmp expect actual
+'
+
 test_done
-- 
2.18.0.1058.g7433f71063


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/7] for_each_*_object: store flag definitions in a single location
  2018-08-10 23:09 ` [PATCH 1/7] for_each_*_object: store flag definitions in a single location Jeff King
@ 2018-08-10 23:27   ` Stefan Beller
  2018-08-10 23:31     ` Jeff King
  0 siblings, 1 reply; 27+ messages in thread
From: Stefan Beller @ 2018-08-10 23:27 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Jonathan Tan

On Fri, Aug 10, 2018 at 4:09 PM Jeff King <peff@peff.net> wrote:
>
> These flags were split between cache.h and packfile.h,
> because some of the flags apply only to packs. However, they
> share a single numeric namespace, since both are respected
> for the packed variant. Let's make sure they're defined
> together so that nobody accidentally adds a new flag in one
> location that duplicates the other.
>
> While we're here, let's also put them in an enum (which
> helps debugger visibility) and use "(1<<n)" rather than
> counting powers of 2 manually.

All good, but ...

>  cache.h    | 13 ++++++++++++-
>  packfile.h |  8 ++------
>  2 files changed, 14 insertions(+), 7 deletions(-)

rubs me the wrong way. ;-)

cache.h is such a misnomer of a name, and a kitchen sink
of a file in the Git project that in an ideal world it would
be way smaller and contain only things related to some
caching related code.

I would suggest object.h or object-store.h instead.
Probably the object-store as that will be the only external
exposure and hopefully we'd get the objects in a similar
shape as the refs subsystem eventually?

I might be biased by commits such as 4f39cd821d1
(pack: move pack name-related functions, 2017-08-18)

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/7] for_each_*_object: store flag definitions in a single location
  2018-08-10 23:27   ` Stefan Beller
@ 2018-08-10 23:31     ` Jeff King
  2018-08-10 23:33       ` Jeff King
  0 siblings, 1 reply; 27+ messages in thread
From: Jeff King @ 2018-08-10 23:31 UTC (permalink / raw)
  To: Stefan Beller; +Cc: git, Jonathan Tan

On Fri, Aug 10, 2018 at 04:27:25PM -0700, Stefan Beller wrote:

> >  cache.h    | 13 ++++++++++++-
> >  packfile.h |  8 ++------
> >  2 files changed, 14 insertions(+), 7 deletions(-)
> 
> rubs me the wrong way. ;-)
> 
> cache.h is such a misnomer of a name, and a kitchen sink
> of a file in the Git project that in an ideal world it would
> be way smaller and contain only things related to some
> caching related code.
> 
> I would suggest object.h or object-store.h instead.
> Probably the object-store as that will be the only external
> exposure and hopefully we'd get the objects in a similar
> shape as the refs subsystem eventually?

Yes, for_each_loose_object() ought to be in loose.h to match packfile.h,
or the whole thing should go into object-store.h.

This series was already getting long, though, so I'd much rather do this
now and other reorganization later (in particular, wherever they end up,
we want the flags to move as a unit).

-Peff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/7] for_each_*_object: store flag definitions in a single location
  2018-08-10 23:31     ` Jeff King
@ 2018-08-10 23:33       ` Jeff King
  2018-08-10 23:39         ` Stefan Beller
  0 siblings, 1 reply; 27+ messages in thread
From: Jeff King @ 2018-08-10 23:33 UTC (permalink / raw)
  To: Stefan Beller; +Cc: git, Jonathan Tan

On Fri, Aug 10, 2018 at 07:31:33PM -0400, Jeff King wrote:

> On Fri, Aug 10, 2018 at 04:27:25PM -0700, Stefan Beller wrote:
> 
> > >  cache.h    | 13 ++++++++++++-
> > >  packfile.h |  8 ++------
> > >  2 files changed, 14 insertions(+), 7 deletions(-)
> > 
> > rubs me the wrong way. ;-)
> > 
> > cache.h is such a misnomer of a name, and a kitchen sink
> > of a file in the Git project that in an ideal world it would
> > be way smaller and contain only things related to some
> > caching related code.
> > 
> > I would suggest object.h or object-store.h instead.
> > Probably the object-store as that will be the only external
> > exposure and hopefully we'd get the objects in a similar
> > shape as the refs subsystem eventually?
> 
> Yes, for_each_loose_object() ought to be in loose.h to match packfile.h,
> or the whole thing should go into object-store.h.

Heh, I thought you were making up a hypothetical object-store.h, but I
see it has already come to pass.

IMHO the whole for_each_*_object() interface should go in there (it even
has packed_git defined there already!). I think I'd still just as soon
do it on top of this series, but it might not be too bad to do as part
of a re-roll.

-Peff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/7] for_each_*_object: store flag definitions in a single location
  2018-08-10 23:33       ` Jeff King
@ 2018-08-10 23:39         ` Stefan Beller
  2018-08-11  0:33           ` Jeff King
  0 siblings, 1 reply; 27+ messages in thread
From: Stefan Beller @ 2018-08-10 23:39 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Jonathan Tan

On Fri, Aug 10, 2018 at 4:33 PM Jeff King <peff@peff.net> wrote:
>
> On Fri, Aug 10, 2018 at 07:31:33PM -0400, Jeff King wrote:
>
> > On Fri, Aug 10, 2018 at 04:27:25PM -0700, Stefan Beller wrote:
> >
> > > >  cache.h    | 13 ++++++++++++-
> > > >  packfile.h |  8 ++------
> > > >  2 files changed, 14 insertions(+), 7 deletions(-)
> > >
> > > rubs me the wrong way. ;-)
> > >
> > > cache.h is such a misnomer of a name, and a kitchen sink
> > > of a file in the Git project that in an ideal world it would
> > > be way smaller and contain only things related to some
> > > caching related code.
> > >
> > > I would suggest object.h or object-store.h instead.
> > > Probably the object-store as that will be the only external
> > > exposure and hopefully we'd get the objects in a similar
> > > shape as the refs subsystem eventually?
> >
> > Yes, for_each_loose_object() ought to be in loose.h to match packfile.h,
> > or the whole thing should go into object-store.h.
>
> Heh, I thought you were making up a hypothetical object-store.h, but I
> see it has already come to pass.
>
> IMHO the whole for_each_*_object() interface should go in there (it even
> has packed_git defined there already!). I think I'd still just as soon
> do it on top of this series, but it might not be too bad to do as part
> of a re-roll.

Yeah, I realize that I distracted myself and ranted about a different thing
other than the quality of this patch. (We had a couple of internal discussions
about project velocity and contributor happiness and I personally think this
derailing is some sort of anti pattern as fixing things like these is easy
as compared to user visible things such as file formats or configs.
Sorry for that.)

Stefan

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/7] for_each_*_object: store flag definitions in a single location
  2018-08-10 23:39         ` Stefan Beller
@ 2018-08-11  0:33           ` Jeff King
  0 siblings, 0 replies; 27+ messages in thread
From: Jeff King @ 2018-08-11  0:33 UTC (permalink / raw)
  To: Stefan Beller; +Cc: git, Jonathan Tan

On Fri, Aug 10, 2018 at 04:39:14PM -0700, Stefan Beller wrote:

> > IMHO the whole for_each_*_object() interface should go in there (it even
> > has packed_git defined there already!). I think I'd still just as soon
> > do it on top of this series, but it might not be too bad to do as part
> > of a re-roll.
> 
> Yeah, I realize that I distracted myself and ranted about a different thing
> other than the quality of this patch. (We had a couple of internal discussions
> about project velocity and contributor happiness and I personally think this
> derailing is some sort of anti pattern as fixing things like these is easy
> as compared to user visible things such as file formats or configs.
> Sorry for that.)

It's a tough line to draw sometimes. This kind of ancillary discussion
is often what spurs further work, so I think the discussions are good to
have. And sometimes the right answer is "yeah, while we're here, let's
clean this up, too". This may even be one of those cases.

But sometimes the right answer is to push back a little and say "you're
right, but let's deal with it later". And maybe later never even
happens, but in that case maybe it wasn't that important in the first
place. :) Or maybe it takes the same point coming up a few times to
decide it's worth pursuing.

I wish I had a good guideline for when to start such a discussion and
when to push back. I mostly just follow my instincts, and my answer (on
either side of that conversation) might change from day to day.  I think
the most important guideline is for everybody to be accepting of both
sides of the conversation (i.e., it's OK to prod a little about
ancillary issues as long as "yes, but not right now" is an acceptable
answer).

And then sometimes you catch me in a philosophical mood...

-Peff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/7] speeding up cat-file by reordering object access
  2018-08-10 23:07 [PATCH 0/7] speeding up cat-file by reordering object access Jeff King
                   ` (6 preceding siblings ...)
  2018-08-10 23:24 ` [PATCH 7/7] cat-file: support "unordered" output for --batch-all-objects Jeff King
@ 2018-08-13 18:45 ` Jonathan Tan
  2018-08-14 18:13   ` [PATCH 0/4] finishing touches on jk/for-each-object-iteration Jeff King
  2018-08-15 14:05 ` [PATCH 0/7] speeding up cat-file by reordering object access Derrick Stolee
  8 siblings, 1 reply; 27+ messages in thread
From: Jonathan Tan @ 2018-08-13 18:45 UTC (permalink / raw)
  To: peff; +Cc: git, jonathantanmy

>   [1/7]: for_each_*_object: store flag definitions in a single location
>   [2/7]: for_each_*_object: take flag arguments as enum
>   [3/7]: for_each_*_object: give more comprehensive docstrings
>   [4/7]: for_each_packed_object: support iterating in pack-order
>   [5/7]: t1006: test cat-file --batch-all-objects with duplicates
>   [6/7]: cat-file: rename batch_{loose,packed}_object callbacks
>   [7/7]: cat-file: support "unordered" output for --batch-all-objects

Thanks for laying all the patches out so cleanly! All of them are:
Reviewed-by: Jonathan Tan <jonathantanmy@google.com>

Normally I would re-explain the patches to demonstrate that I understand
them, but in this case, I think they are simple enough - patches 1, 2,
3, and 6 are refactorings that I agree with, patch 5 just makes a test
more comprehensive, and patches 4 and 7 do what their commit messages
say.

Stefan brought up the concern that cache.h is increasing in size, but I
agree with the patch as written that it's probably best that we
centralize all the flags somewhere, and we can deal with the location in
a future patch.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 0/4] finishing touches on jk/for-each-object-iteration
  2018-08-13 18:45 ` [PATCH 0/7] speeding up cat-file by reordering object access Jonathan Tan
@ 2018-08-14 18:13   ` Jeff King
  2018-08-14 18:14     ` [PATCH 1/4] cat-file: use oidset check-and-insert Jeff King
                       ` (3 more replies)
  0 siblings, 4 replies; 27+ messages in thread
From: Jeff King @ 2018-08-14 18:13 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: René Scharfe, git

On Mon, Aug 13, 2018 at 11:45:06AM -0700, Jonathan Tan wrote:

> >   [1/7]: for_each_*_object: store flag definitions in a single location
> >   [2/7]: for_each_*_object: take flag arguments as enum
> >   [3/7]: for_each_*_object: give more comprehensive docstrings
> >   [4/7]: for_each_packed_object: support iterating in pack-order
> >   [5/7]: t1006: test cat-file --batch-all-objects with duplicates
> >   [6/7]: cat-file: rename batch_{loose,packed}_object callbacks
> >   [7/7]: cat-file: support "unordered" output for --batch-all-objects
> 
> Thanks for laying all the patches out so cleanly! All of them are:
> Reviewed-by: Jonathan Tan <jonathantanmy@google.com>
> 
> Normally I would re-explain the patches to demonstrate that I understand
> them, but in this case, I think they are simple enough - patches 1, 2,
> 3, and 6 are refactorings that I agree with, patch 5 just makes a test
> more comprehensive, and patches 4 and 7 do what their commit messages
> say.
> 
> Stefan brought up the concern that cache.h is increasing in size, but I
> agree with the patch as written that it's probably best that we
> centralize all the flags somewhere, and we can deal with the location in
> a future patch.

Thanks for the review. Here are a few patches on top to deal with the
cache.h thing, as well as some optimizations that came out of discussing
oidset in another thread (I left out for now the "big" optimization of
moving oidset to a different data structure; that's complicated enough
to be dealt with on its own, I think).

The first patch here could arguably be squashed into the final patch of
the original series, but I'm OK with it either way.

  [1/4]: cat-file: use oidset check-and-insert
  [2/4]: cat-file: split batch "buf" into two variables
  [3/4]: cat-file: use a single strbuf for all output
  [4/4]: for_each_*_object: move declarations to object-store.h

 builtin/cat-file.c     | 43 +++++++++++---------
 builtin/prune-packed.c |  1 +
 cache.h                | 75 -----------------------------------
 object-store.h         | 90 ++++++++++++++++++++++++++++++++++++++++++
 packfile.h             | 20 ----------
 5 files changed, 116 insertions(+), 113 deletions(-)

-Peff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 1/4] cat-file: use oidset check-and-insert
  2018-08-14 18:13   ` [PATCH 0/4] finishing touches on jk/for-each-object-iteration Jeff King
@ 2018-08-14 18:14     ` Jeff King
  2018-08-14 18:18     ` [PATCH 2/4] cat-file: split batch "buf" into two variables Jeff King
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 27+ messages in thread
From: Jeff King @ 2018-08-14 18:14 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: René Scharfe, git

We don't need to check if the oidset has our object before
we insert it; that's done as part of the insertion. We can
just rely on the return value from oidset_insert(), which
saves one hash lookup per object.

This measurable speedup is tiny and within the run-to-run
noise, but the result is simpler to read, too.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/cat-file.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 45992c9be9..04b5cda191 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -443,9 +443,8 @@ static int batch_unordered_object(const struct object_id *oid, void *vdata)
 {
 	struct object_cb_data *data = vdata;
 
-	if (oidset_contains(data->seen, oid))
+	if (oidset_insert(data->seen, oid))
 		return 0;
-	oidset_insert(data->seen, oid);
 
 	return batch_object_cb(oid, data);
 }
-- 
2.18.0.1066.g0d97f3a098


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 2/4] cat-file: split batch "buf" into two variables
  2018-08-14 18:13   ` [PATCH 0/4] finishing touches on jk/for-each-object-iteration Jeff King
  2018-08-14 18:14     ` [PATCH 1/4] cat-file: use oidset check-and-insert Jeff King
@ 2018-08-14 18:18     ` Jeff King
  2018-08-14 18:20     ` [PATCH 3/4] cat-file: use a single strbuf for all output Jeff King
  2018-08-14 18:21     ` [PATCH 4/4] for_each_*_object: move declarations to object-store.h Jeff King
  3 siblings, 0 replies; 27+ messages in thread
From: Jeff King @ 2018-08-14 18:18 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: René Scharfe, git

We use the "buf" strbuf for two things: to read incoming
lines, and as a scratch space for test-expanding the
user-provided format. Let's split this into two variables
with descriptive names, which makes their purpose and
lifetime more clear.

It will also help in a future patch when we start using the
"output" buffer for more expansions.

Signed-off-by: Jeff King <peff@peff.net>
---
René, in the patch you sent earlier, I noticed that for the
non-batch-all-objects case we use the same strbuf for input and output.
That'd probably be OK most of the time (the first thing we do is resolve
the input to an oid), but I suspect it could be pretty bad with %(rest).
We'd write over or even realloc the string it points into as part of the
output.

This patch just clarifies the names; your reuse idea is in the next one.

 builtin/cat-file.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 04b5cda191..3ed1d0be80 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -466,7 +466,8 @@ static int batch_unordered_packed(const struct object_id *oid,
 
 static int batch_objects(struct batch_options *opt)
 {
-	struct strbuf buf = STRBUF_INIT;
+	struct strbuf input = STRBUF_INIT;
+	struct strbuf output = STRBUF_INIT;
 	struct expand_data data;
 	int save_warning;
 	int retval = 0;
@@ -481,8 +482,9 @@ static int batch_objects(struct batch_options *opt)
 	 */
 	memset(&data, 0, sizeof(data));
 	data.mark_query = 1;
-	strbuf_expand(&buf, opt->format, expand_format, &data);
+	strbuf_expand(&output, opt->format, expand_format, &data);
 	data.mark_query = 0;
+	strbuf_release(&output);
 	if (opt->cmdmode)
 		data.split_on_whitespace = 1;
 
@@ -542,14 +544,14 @@ static int batch_objects(struct batch_options *opt)
 	save_warning = warn_on_object_refname_ambiguity;
 	warn_on_object_refname_ambiguity = 0;
 
-	while (strbuf_getline(&buf, stdin) != EOF) {
+	while (strbuf_getline(&input, stdin) != EOF) {
 		if (data.split_on_whitespace) {
 			/*
 			 * Split at first whitespace, tying off the beginning
 			 * of the string and saving the remainder (or NULL) in
 			 * data.rest.
 			 */
-			char *p = strpbrk(buf.buf, " \t");
+			char *p = strpbrk(input.buf, " \t");
 			if (p) {
 				while (*p && strchr(" \t", *p))
 					*p++ = '\0';
@@ -557,10 +559,10 @@ static int batch_objects(struct batch_options *opt)
 			data.rest = p;
 		}
 
-		batch_one_object(buf.buf, opt, &data);
+		batch_one_object(input.buf, opt, &data);
 	}
 
-	strbuf_release(&buf);
+	strbuf_release(&input);
 	warn_on_object_refname_ambiguity = save_warning;
 	return retval;
 }
-- 
2.18.0.1066.g0d97f3a098


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 3/4] cat-file: use a single strbuf for all output
  2018-08-14 18:13   ` [PATCH 0/4] finishing touches on jk/for-each-object-iteration Jeff King
  2018-08-14 18:14     ` [PATCH 1/4] cat-file: use oidset check-and-insert Jeff King
  2018-08-14 18:18     ` [PATCH 2/4] cat-file: split batch "buf" into two variables Jeff King
@ 2018-08-14 18:20     ` Jeff King
  2018-08-14 19:30       ` René Scharfe
  2018-08-14 18:21     ` [PATCH 4/4] for_each_*_object: move declarations to object-store.h Jeff King
  3 siblings, 1 reply; 27+ messages in thread
From: Jeff King @ 2018-08-14 18:20 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: René Scharfe, git

When we're in batch mode, we end up in batch_object_write()
for each object, which allocates its own strbuf for each
call. Instead, we can provide a single "scratch" buffer that
gets reused for each output. When running:

  git cat-file --batch-all-objects --batch-check='%(objectname)'

on git.git, my best-of-five time drops from:

  real	0m0.171s
  user	0m0.159s
  sys	0m0.012s

to:

  real	0m0.133s
  user	0m0.121s
  sys	0m0.012s

Note that we could do this just by putting the "scratch"
pointer into "struct expand_data", but I chose instead to
add an extra parameter to the callstack. That's more
verbose, but it makes it a bit more obvious what is going
on, which in turn makes it easy to see where we need to be
releasing the string in the caller (right after the loop
which uses it in each case).

Based-on-a-patch-by: René Scharfe <l.s.r@web.de>
Signed-off-by: Jeff King <peff@peff.net>
---
It also made it easy to see that without the prior patch,
we'd have been using "buf" for two parameters. :)

 builtin/cat-file.c | 28 +++++++++++++++++-----------
 1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 3ed1d0be80..08dced2614 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -338,11 +338,11 @@ static void print_object_or_die(struct batch_options *opt, struct expand_data *d
 	}
 }
 
-static void batch_object_write(const char *obj_name, struct batch_options *opt,
+static void batch_object_write(const char *obj_name,
+			       struct strbuf *scratch,
+			       struct batch_options *opt,
 			       struct expand_data *data)
 {
-	struct strbuf buf = STRBUF_INIT;
-
 	if (!data->skip_object_info &&
 	    oid_object_info_extended(the_repository, &data->oid, &data->info,
 				     OBJECT_INFO_LOOKUP_REPLACE) < 0) {
@@ -352,10 +352,10 @@ static void batch_object_write(const char *obj_name, struct batch_options *opt,
 		return;
 	}
 
-	strbuf_expand(&buf, opt->format, expand_format, data);
-	strbuf_addch(&buf, '\n');
-	batch_write(opt, buf.buf, buf.len);
-	strbuf_release(&buf);
+	strbuf_reset(scratch);
+	strbuf_expand(scratch, opt->format, expand_format, data);
+	strbuf_addch(scratch, '\n');
+	batch_write(opt, scratch->buf, scratch->len);
 
 	if (opt->print_contents) {
 		print_object_or_die(opt, data);
@@ -363,7 +363,9 @@ static void batch_object_write(const char *obj_name, struct batch_options *opt,
 	}
 }
 
-static void batch_one_object(const char *obj_name, struct batch_options *opt,
+static void batch_one_object(const char *obj_name,
+			     struct strbuf *scratch,
+			     struct batch_options *opt,
 			     struct expand_data *data)
 {
 	struct object_context ctx;
@@ -405,20 +407,21 @@ static void batch_one_object(const char *obj_name, struct batch_options *opt,
 		return;
 	}
 
-	batch_object_write(obj_name, opt, data);
+	batch_object_write(obj_name, scratch, opt, data);
 }
 
 struct object_cb_data {
 	struct batch_options *opt;
 	struct expand_data *expand;
 	struct oidset *seen;
+	struct strbuf *scratch;
 };
 
 static int batch_object_cb(const struct object_id *oid, void *vdata)
 {
 	struct object_cb_data *data = vdata;
 	oidcpy(&data->expand->oid, oid);
-	batch_object_write(NULL, data->opt, data->expand);
+	batch_object_write(NULL, data->scratch, data->opt, data->expand);
 	return 0;
 }
 
@@ -509,6 +512,7 @@ static int batch_objects(struct batch_options *opt)
 
 		cb.opt = opt;
 		cb.expand = &data;
+		cb.scratch = &output;
 
 		if (opt->unordered) {
 			struct oidset seen = OIDSET_INIT;
@@ -531,6 +535,7 @@ static int batch_objects(struct batch_options *opt)
 			oid_array_clear(&sa);
 		}
 
+		strbuf_release(&output);
 		return 0;
 	}
 
@@ -559,10 +564,11 @@ static int batch_objects(struct batch_options *opt)
 			data.rest = p;
 		}
 
-		batch_one_object(input.buf, opt, &data);
+		batch_one_object(input.buf, &output, opt, &data);
 	}
 
 	strbuf_release(&input);
+	strbuf_release(&output);
 	warn_on_object_refname_ambiguity = save_warning;
 	return retval;
 }
-- 
2.18.0.1066.g0d97f3a098


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 4/4] for_each_*_object: move declarations to object-store.h
  2018-08-14 18:13   ` [PATCH 0/4] finishing touches on jk/for-each-object-iteration Jeff King
                       ` (2 preceding siblings ...)
  2018-08-14 18:20     ` [PATCH 3/4] cat-file: use a single strbuf for all output Jeff King
@ 2018-08-14 18:21     ` Jeff King
  3 siblings, 0 replies; 27+ messages in thread
From: Jeff King @ 2018-08-14 18:21 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: René Scharfe, git

The for_each_loose_object() and for_each_packed_object()
functions are meant to be part of a unified interface: they
use the same set of for_each_object_flags, and it's not
inconceivable that we might one day add a single
for_each_object() wrapper around them.

Let's put them together in a single file, so we can avoid
awkwardness like saying "the flags for this function are
over in cache.h". Moving the loose functions to packfile.h
is silly. Moving the packed functions to cache.h works, but
makes the "cache.h is a kitchen sink" problem worse. The
best place is the recently-created object-store.h, since
these are quite obviously related to object storage.

The for_each_*_in_objdir() functions do not use the same
flags, but they are logically part of the same interface as
for_each_loose_object(), and share callback signatures. So
we'll move those, as well, as they also make sense in
object-store.h.

Signed-off-by: Jeff King <peff@peff.net>
---
This patch also happens to be a nice showcase for --color-moved.

 builtin/prune-packed.c |  1 +
 cache.h                | 75 -----------------------------------
 object-store.h         | 90 ++++++++++++++++++++++++++++++++++++++++++
 packfile.h             | 20 ----------
 4 files changed, 91 insertions(+), 95 deletions(-)

diff --git a/builtin/prune-packed.c b/builtin/prune-packed.c
index 4ff525e50f..a9e7b552b9 100644
--- a/builtin/prune-packed.c
+++ b/builtin/prune-packed.c
@@ -3,6 +3,7 @@
 #include "progress.h"
 #include "parse-options.h"
 #include "packfile.h"
+#include "object-store.h"
 
 static const char * const prune_packed_usage[] = {
 	N_("git prune-packed [-n | --dry-run] [-q | --quiet]"),
diff --git a/cache.h b/cache.h
index 6d14702df2..aee36afa54 100644
--- a/cache.h
+++ b/cache.h
@@ -1575,81 +1575,6 @@ extern int odb_mkstemp(struct strbuf *temp_filename, const char *pattern);
  */
 extern int odb_pack_keep(const char *name);
 
-/*
- * Iterate over the files in the loose-object parts of the object
- * directory "path", triggering the following callbacks:
- *
- *  - loose_object is called for each loose object we find.
- *
- *  - loose_cruft is called for any files that do not appear to be
- *    loose objects. Note that we only look in the loose object
- *    directories "objects/[0-9a-f]{2}/", so we will not report
- *    "objects/foobar" as cruft.
- *
- *  - loose_subdir is called for each top-level hashed subdirectory
- *    of the object directory (e.g., "$OBJDIR/f0"). It is called
- *    after the objects in the directory are processed.
- *
- * Any callback that is NULL will be ignored. Callbacks returning non-zero
- * will end the iteration.
- *
- * In the "buf" variant, "path" is a strbuf which will also be used as a
- * scratch buffer, but restored to its original contents before
- * the function returns.
- */
-typedef int each_loose_object_fn(const struct object_id *oid,
-				 const char *path,
-				 void *data);
-typedef int each_loose_cruft_fn(const char *basename,
-				const char *path,
-				void *data);
-typedef int each_loose_subdir_fn(unsigned int nr,
-				 const char *path,
-				 void *data);
-int for_each_file_in_obj_subdir(unsigned int subdir_nr,
-				struct strbuf *path,
-				each_loose_object_fn obj_cb,
-				each_loose_cruft_fn cruft_cb,
-				each_loose_subdir_fn subdir_cb,
-				void *data);
-int for_each_loose_file_in_objdir(const char *path,
-				  each_loose_object_fn obj_cb,
-				  each_loose_cruft_fn cruft_cb,
-				  each_loose_subdir_fn subdir_cb,
-				  void *data);
-int for_each_loose_file_in_objdir_buf(struct strbuf *path,
-				      each_loose_object_fn obj_cb,
-				      each_loose_cruft_fn cruft_cb,
-				      each_loose_subdir_fn subdir_cb,
-				      void *data);
-
-/*
- * Flags for for_each_*_object(), including for_each_loose below and
- * for_each_packed in packfile.h.
- */
-enum for_each_object_flags {
-	/* Iterate only over local objects, not alternates. */
-	FOR_EACH_OBJECT_LOCAL_ONLY = (1<<0),
-
-	/* Only iterate over packs obtained from the promisor remote. */
-	FOR_EACH_OBJECT_PROMISOR_ONLY = (1<<1),
-
-	/*
-	 * Visit objects within a pack in packfile order rather than .idx order
-	 */
-	FOR_EACH_OBJECT_PACK_ORDER = (1<<2),
-};
-
-/*
- * Iterate over all accessible loose objects without respect to
- * reachability. By default, this includes both local and alternate objects.
- * The order in which objects are visited is unspecified.
- *
- * Any flags specific to packs are ignored.
- */
-int for_each_loose_object(each_loose_object_fn, void *,
-			  enum for_each_object_flags flags);
-
 /*
  * Set this to 0 to prevent sha1_object_info_extended() from fetching missing
  * blobs. This has a difference only if extensions.partialClone is set.
diff --git a/object-store.h b/object-store.h
index e481f7ad41..162156f5dc 100644
--- a/object-store.h
+++ b/object-store.h
@@ -262,4 +262,94 @@ int oid_object_info_extended(struct repository *r,
 			     const struct object_id *,
 			     struct object_info *, unsigned flags);
 
+/*
+ * Iterate over the files in the loose-object parts of the object
+ * directory "path", triggering the following callbacks:
+ *
+ *  - loose_object is called for each loose object we find.
+ *
+ *  - loose_cruft is called for any files that do not appear to be
+ *    loose objects. Note that we only look in the loose object
+ *    directories "objects/[0-9a-f]{2}/", so we will not report
+ *    "objects/foobar" as cruft.
+ *
+ *  - loose_subdir is called for each top-level hashed subdirectory
+ *    of the object directory (e.g., "$OBJDIR/f0"). It is called
+ *    after the objects in the directory are processed.
+ *
+ * Any callback that is NULL will be ignored. Callbacks returning non-zero
+ * will end the iteration.
+ *
+ * In the "buf" variant, "path" is a strbuf which will also be used as a
+ * scratch buffer, but restored to its original contents before
+ * the function returns.
+ */
+typedef int each_loose_object_fn(const struct object_id *oid,
+				 const char *path,
+				 void *data);
+typedef int each_loose_cruft_fn(const char *basename,
+				const char *path,
+				void *data);
+typedef int each_loose_subdir_fn(unsigned int nr,
+				 const char *path,
+				 void *data);
+int for_each_file_in_obj_subdir(unsigned int subdir_nr,
+				struct strbuf *path,
+				each_loose_object_fn obj_cb,
+				each_loose_cruft_fn cruft_cb,
+				each_loose_subdir_fn subdir_cb,
+				void *data);
+int for_each_loose_file_in_objdir(const char *path,
+				  each_loose_object_fn obj_cb,
+				  each_loose_cruft_fn cruft_cb,
+				  each_loose_subdir_fn subdir_cb,
+				  void *data);
+int for_each_loose_file_in_objdir_buf(struct strbuf *path,
+				      each_loose_object_fn obj_cb,
+				      each_loose_cruft_fn cruft_cb,
+				      each_loose_subdir_fn subdir_cb,
+				      void *data);
+
+/* Flags for for_each_*_object() below. */
+enum for_each_object_flags {
+	/* Iterate only over local objects, not alternates. */
+	FOR_EACH_OBJECT_LOCAL_ONLY = (1<<0),
+
+	/* Only iterate over packs obtained from the promisor remote. */
+	FOR_EACH_OBJECT_PROMISOR_ONLY = (1<<1),
+
+	/*
+	 * Visit objects within a pack in packfile order rather than .idx order
+	 */
+	FOR_EACH_OBJECT_PACK_ORDER = (1<<2),
+};
+
+/*
+ * Iterate over all accessible loose objects without respect to
+ * reachability. By default, this includes both local and alternate objects.
+ * The order in which objects are visited is unspecified.
+ *
+ * Any flags specific to packs are ignored.
+ */
+int for_each_loose_object(each_loose_object_fn, void *,
+			  enum for_each_object_flags flags);
+
+/*
+ * Iterate over all accessible packed objects without respect to reachability.
+ * By default, this includes both local and alternate packs.
+ *
+ * Note that some objects may appear twice if they are found in multiple packs.
+ * Each pack is visited in an unspecified order. By default, objects within a
+ * pack are visited in pack-idx order (i.e., sorted by oid).
+ */
+typedef int each_packed_object_fn(const struct object_id *oid,
+				  struct packed_git *pack,
+				  uint32_t pos,
+				  void *data);
+int for_each_object_in_pack(struct packed_git *p,
+			    each_packed_object_fn, void *data,
+			    enum for_each_object_flags flags);
+int for_each_packed_object(each_packed_object_fn, void *,
+			   enum for_each_object_flags flags);
+
 #endif /* OBJECT_STORE_H */
diff --git a/packfile.h b/packfile.h
index 99411bdd85..d91e43fe73 100644
--- a/packfile.h
+++ b/packfile.h
@@ -148,26 +148,6 @@ extern int has_object_pack(const struct object_id *oid);
 
 extern int has_pack_index(const unsigned char *sha1);
 
-/*
- * Iterate over all accessible packed objects without respect to reachability.
- * By default, this includes both local and alternate packs.
- *
- * Note that some objects may appear twice if they are found in multiple packs.
- * Each pack is visited in an unspecified order. By default, objects within a
- * pack are visited in pack-idx order (i.e., sorted by oid).
- *
- * The list of flags can be found in cache.h.
- */
-typedef int each_packed_object_fn(const struct object_id *oid,
-				  struct packed_git *pack,
-				  uint32_t pos,
-				  void *data);
-int for_each_object_in_pack(struct packed_git *p,
-			    each_packed_object_fn, void *data,
-			    enum for_each_object_flags flags);
-int for_each_packed_object(each_packed_object_fn, void *,
-			   enum for_each_object_flags flags);
-
 /*
  * Return 1 if an object in a promisor packfile is or refers to the given
  * object, 0 otherwise.
-- 
2.18.0.1066.g0d97f3a098

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/4] cat-file: use a single strbuf for all output
  2018-08-14 18:20     ` [PATCH 3/4] cat-file: use a single strbuf for all output Jeff King
@ 2018-08-14 19:30       ` René Scharfe
  2018-08-14 19:39         ` Jeff King
  0 siblings, 1 reply; 27+ messages in thread
From: René Scharfe @ 2018-08-14 19:30 UTC (permalink / raw)
  To: Jeff King, Jonathan Tan; +Cc: git

Am 14.08.2018 um 20:20 schrieb Jeff King:
> When we're in batch mode, we end up in batch_object_write()
> for each object, which allocates its own strbuf for each
> call. Instead, we can provide a single "scratch" buffer that
> gets reused for each output. When running:
> 
>   git cat-file --batch-all-objects --batch-check='%(objectname)'
> 
> on git.git, my best-of-five time drops from:
> 
>   real	0m0.171s
>   user	0m0.159s
>   sys	0m0.012s
> 
> to:
> 
>   real	0m0.133s
>   user	0m0.121s
>   sys	0m0.012s
> 
> Note that we could do this just by putting the "scratch"
> pointer into "struct expand_data", but I chose instead to
> add an extra parameter to the callstack. That's more
> verbose, but it makes it a bit more obvious what is going
> on, which in turn makes it easy to see where we need to be
> releasing the string in the caller (right after the loop
> which uses it in each case).
> 
> Based-on-a-patch-by: René Scharfe <l.s.r@web.de>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
> It also made it easy to see that without the prior patch,
> we'd have been using "buf" for two parameters. :)

Good catch.

> 
>  builtin/cat-file.c | 28 +++++++++++++++++-----------
>  1 file changed, 17 insertions(+), 11 deletions(-)
> 
> diff --git a/builtin/cat-file.c b/builtin/cat-file.c
> index 3ed1d0be80..08dced2614 100644
> --- a/builtin/cat-file.c
> +++ b/builtin/cat-file.c
> @@ -338,11 +338,11 @@ static void print_object_or_die(struct batch_options *opt, struct expand_data *d
>  	}
>  }
>  
> -static void batch_object_write(const char *obj_name, struct batch_options *opt,
> +static void batch_object_write(const char *obj_name,
> +			       struct strbuf *scratch,
> +			       struct batch_options *opt,
>  			       struct expand_data *data)
>  {
> -	struct strbuf buf = STRBUF_INIT;

We could also avoid passing that buffer around by making it static.  I
shy away from adding static variables because the resulting code won't
be thread-safe, but that fear might be irrational, especially with
cat-file.

> -
>  	if (!data->skip_object_info &&
>  	    oid_object_info_extended(the_repository, &data->oid, &data->info,
>  				     OBJECT_INFO_LOOKUP_REPLACE) < 0) {
> @@ -352,10 +352,10 @@ static void batch_object_write(const char *obj_name, struct batch_options *opt,
>  		return;
>  	}
>  
> -	strbuf_expand(&buf, opt->format, expand_format, data);
> -	strbuf_addch(&buf, '\n');
> -	batch_write(opt, buf.buf, buf.len);
> -	strbuf_release(&buf);
> +	strbuf_reset(scratch);
> +	strbuf_expand(scratch, opt->format, expand_format, data);
> +	strbuf_addch(scratch, '\n');
> +	batch_write(opt, scratch->buf, scratch->len);
>  
>  	if (opt->print_contents) {
>  		print_object_or_die(opt, data);
> @@ -363,7 +363,9 @@ static void batch_object_write(const char *obj_name, struct batch_options *opt,
>  	}
>  }
>  
> -static void batch_one_object(const char *obj_name, struct batch_options *opt,
> +static void batch_one_object(const char *obj_name,
> +			     struct strbuf *scratch,
> +			     struct batch_options *opt,
>  			     struct expand_data *data)
>  {
>  	struct object_context ctx;
> @@ -405,20 +407,21 @@ static void batch_one_object(const char *obj_name, struct batch_options *opt,
>  		return;
>  	}
>  
> -	batch_object_write(obj_name, opt, data);
> +	batch_object_write(obj_name, scratch, opt, data);
>  }
>  
>  struct object_cb_data {
>  	struct batch_options *opt;
>  	struct expand_data *expand;
>  	struct oidset *seen;
> +	struct strbuf *scratch;
>  };
>  
>  static int batch_object_cb(const struct object_id *oid, void *vdata)
>  {
>  	struct object_cb_data *data = vdata;
>  	oidcpy(&data->expand->oid, oid);
> -	batch_object_write(NULL, data->opt, data->expand);
> +	batch_object_write(NULL, data->scratch, data->opt, data->expand);
>  	return 0;
>  }
>  
> @@ -509,6 +512,7 @@ static int batch_objects(struct batch_options *opt)
>  
>  		cb.opt = opt;
>  		cb.expand = &data;
> +		cb.scratch = &output;
>  
>  		if (opt->unordered) {
>  			struct oidset seen = OIDSET_INIT;
> @@ -531,6 +535,7 @@ static int batch_objects(struct batch_options *opt)
>  			oid_array_clear(&sa);
>  		}
>  
> +		strbuf_release(&output);
>  		return 0;
>  	}
>  
> @@ -559,10 +564,11 @@ static int batch_objects(struct batch_options *opt)
>  			data.rest = p;
>  		}
>  
> -		batch_one_object(input.buf, opt, &data);
> +		batch_one_object(input.buf, &output, opt, &data);
>  	}
>  
>  	strbuf_release(&input);
> +	strbuf_release(&output);
>  	warn_on_object_refname_ambiguity = save_warning;
>  	return retval;
>  }
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/4] cat-file: use a single strbuf for all output
  2018-08-14 19:30       ` René Scharfe
@ 2018-08-14 19:39         ` Jeff King
  0 siblings, 0 replies; 27+ messages in thread
From: Jeff King @ 2018-08-14 19:39 UTC (permalink / raw)
  To: René Scharfe; +Cc: Jonathan Tan, git

On Tue, Aug 14, 2018 at 09:30:57PM +0200, René Scharfe wrote:

> > -static void batch_object_write(const char *obj_name, struct batch_options *opt,
> > +static void batch_object_write(const char *obj_name,
> > +			       struct strbuf *scratch,
> > +			       struct batch_options *opt,
> >  			       struct expand_data *data)
> >  {
> > -	struct strbuf buf = STRBUF_INIT;
> 
> We could also avoid passing that buffer around by making it static.  I
> shy away from adding static variables because the resulting code won't
> be thread-safe, but that fear might be irrational, especially with
> cat-file.

True, I didn't even think of that after your original got me in the
mindset of passing the buffer down. It's not too bad to do it this way,
and I agree with you that we are better avoiding static variables if we
can. Five years ago I might have said the opposite, but we've cleaned up
a lot of confusing hidden-static bits in that time. Let's not go in the
opposite direction. :)

-Peff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 4/7] for_each_packed_object: support iterating in pack-order
  2018-08-10 23:15 ` [PATCH 4/7] for_each_packed_object: support iterating in pack-order Jeff King
@ 2018-08-15 13:28   ` Derrick Stolee
  2018-08-16 17:36     ` Jeff King
  0 siblings, 1 reply; 27+ messages in thread
From: Derrick Stolee @ 2018-08-15 13:28 UTC (permalink / raw)
  To: Jeff King, git; +Cc: Jonathan Tan

On 8/10/2018 7:15 PM, Jeff King wrote:
> diff --git a/commit-graph.c b/commit-graph.c
> index b0a55ad128..69a0d1c203 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -730,7 +730,7 @@ void write_commit_graph(const char *obj_dir,
>   				die("error adding pack %s", packname.buf);
>   			if (open_pack_index(p))
>   				die("error opening index for %s", packname.buf);
> -			for_each_object_in_pack(p, add_packed_commits, &oids);
> +			for_each_object_in_pack(p, add_packed_commits, &oids, 0);
>   			close_pack(p);
>   		}

This use in write_commit_graph() is actually a good candidate for 
pack-order, since we are checking each object to see if it is a commit. 
This is only used when running `git commit-graph write --stdin-packs`, 
which is how VFS for Git maintains the commit-graph.

I have a note to run performance tests on this case and follow up with a 
change on top of this series that adds the FOR_EACH_OBJECT_PACK_ORDER flag.

Thanks,

-Stolee


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/7] speeding up cat-file by reordering object access
  2018-08-10 23:07 [PATCH 0/7] speeding up cat-file by reordering object access Jeff King
                   ` (7 preceding siblings ...)
  2018-08-13 18:45 ` [PATCH 0/7] speeding up cat-file by reordering object access Jonathan Tan
@ 2018-08-15 14:05 ` Derrick Stolee
  2018-08-16 17:39   ` Jeff King
  8 siblings, 1 reply; 27+ messages in thread
From: Derrick Stolee @ 2018-08-15 14:05 UTC (permalink / raw)
  To: Jeff King, git; +Cc: Jonathan Tan

On 8/10/2018 7:07 PM, Jeff King wrote:
> The general idea is that accessing objects in packfile order is way
> kinder to the delta base cache, and thus way more efficient. See patches
> 4 and 7 in particular for discussion and numbers.
>
> I'm primarily interested in cat-file, so this series is focused there.
> But there may be other callers of for_each_packed_object() who could
> benefit. Most of the existing ones just care about getting the oid, so
> they're better off as-is. It's possible the call in is_promisor_object()
> could benefit, since it calls parse_object() on each entry it visits. I
> didn't experiment with it.

I like this series, and the follow-up. I could not find any problems 
with it.

One thing that I realized while reading it is that the multi-pack-index 
is not integrated into the for_each_packed_object method. I was already 
going to work on some cleanups in that area [1][2].

When using the new flag with the multi-pack-index, I expect that we will 
want to load the pack-files that are covered by the multi-pack-index 
(simply, the 'packs' array) and use the same mechanism to traverse them 
in order. The only "strange" thing about this is that we would see 
duplicate objects when traversing the pack-files directly but not when 
traversing the multi-pack-index (since it de-duplicates when indexing).

I hope to have a series working on top of this series by end-of-week.

Thanks,

-Stolee

[1] 
https://public-inbox.org/git/CAPig+cTU--KrGcv4C_CwBZEuec4dgm_tJqL=CFWKT6vxxR016w@mail.gmail.com/

     Re: [PATCH v4 04/23] multi-pack-index: add 'write' verb

     (Recommends more user-friendly usage reporting in 'git 
multi-pack-index')

[2] 
https://public-inbox.org/git/20180814222846.GG142615@aiede.svl.corp.google.com/

     [PATCH] partial-clone: render design doc using asciidoc

     (The commit-graph and multi-pack-index docs are not in the 
Makefile, either.)


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 4/7] for_each_packed_object: support iterating in pack-order
  2018-08-15 13:28   ` Derrick Stolee
@ 2018-08-16 17:36     ` Jeff King
  0 siblings, 0 replies; 27+ messages in thread
From: Jeff King @ 2018-08-16 17:36 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, Jonathan Tan

On Wed, Aug 15, 2018 at 09:28:44AM -0400, Derrick Stolee wrote:

> On 8/10/2018 7:15 PM, Jeff King wrote:
> > diff --git a/commit-graph.c b/commit-graph.c
> > index b0a55ad128..69a0d1c203 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -730,7 +730,7 @@ void write_commit_graph(const char *obj_dir,
> >   				die("error adding pack %s", packname.buf);
> >   			if (open_pack_index(p))
> >   				die("error opening index for %s", packname.buf);
> > -			for_each_object_in_pack(p, add_packed_commits, &oids);
> > +			for_each_object_in_pack(p, add_packed_commits, &oids, 0);
> >   			close_pack(p);
> >   		}
> 
> This use in write_commit_graph() is actually a good candidate for
> pack-order, since we are checking each object to see if it is a commit. This
> is only used when running `git commit-graph write --stdin-packs`, which is
> how VFS for Git maintains the commit-graph.
> 
> I have a note to run performance tests on this case and follow up with a
> change on top of this series that adds the FOR_EACH_OBJECT_PACK_ORDER flag.

I doubt that it will show the dramatic improvement in CPU that I
mentioned in my commit message, because most of that comes from more
efficient use of the delta cache. But it's very rare for commits to be
deltas (usually it's just almost-twins due to cherry-picks and rebases).

So you may benefit from block cache efficiency on a cold-cache or on a
system under memory pressure, but I wouldn't expect much change at all
for the warm-cache case.

I doubt it will hurt, though; you'll pay for the revindex generation,
but that's probably not a big deal compared to walking all the objects.

One thing you _could_ do is stop walking through the pack when you see a
non-commit, since we stick all of the commits at the front. But that's
just what the code happens to do, and not a strict promise. So I think
it's a bad idea to rely on it (and in fact the delta-islands work under
discussion elsewhere will break that assumption).

-Peff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/7] speeding up cat-file by reordering object access
  2018-08-15 14:05 ` [PATCH 0/7] speeding up cat-file by reordering object access Derrick Stolee
@ 2018-08-16 17:39   ` Jeff King
  2018-08-16 18:52     ` Junio C Hamano
  0 siblings, 1 reply; 27+ messages in thread
From: Jeff King @ 2018-08-16 17:39 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: git, Jonathan Tan

On Wed, Aug 15, 2018 at 10:05:04AM -0400, Derrick Stolee wrote:

> One thing that I realized while reading it is that the multi-pack-index is
> not integrated into the for_each_packed_object method. I was already going
> to work on some cleanups in that area [1][2].
> 
> When using the new flag with the multi-pack-index, I expect that we will
> want to load the pack-files that are covered by the multi-pack-index
> (simply, the 'packs' array) and use the same mechanism to traverse them in
> order. The only "strange" thing about this is that we would see duplicate
> objects when traversing the pack-files directly but not when traversing the
> multi-pack-index (since it de-duplicates when indexing).

I think that makes sense. We already see duplicates from
for_each_packed_object() when they're in multiple packs, and callers
just need to be ready to deal with it (and depending on what you're
doing, you may actually _want_ the duplicates).

Thanks for thinking through the implications for other topics. I hadn't
even considered how this would interact with midx.

-Peff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/7] speeding up cat-file by reordering object access
  2018-08-16 17:39   ` Jeff King
@ 2018-08-16 18:52     ` Junio C Hamano
  2018-08-16 19:45       ` Jeff King
  0 siblings, 1 reply; 27+ messages in thread
From: Junio C Hamano @ 2018-08-16 18:52 UTC (permalink / raw)
  To: Jeff King; +Cc: Derrick Stolee, git, Jonathan Tan

Jeff King <peff@peff.net> writes:

> I think that makes sense. We already see duplicates from
> for_each_packed_object() when they're in multiple packs, and callers
> just need to be ready to deal with it (and depending on what you're
> doing, you may actually _want_ the duplicates).

You of course would also see dups between loose and packed until
prune-packed is run.  

I also was thinking about the same thing after Derrick's response,
but unless you are very specialized caller, it does not allow you do
very much to learn that object X exists as a loose object locally,
also as a loose object in our alternate, and also in pack A, but not
in other packs.  You need a way to say "Read the contents of object
X from that place, not from any other place", "Remove that copy of
object X at that place, but not at any other place" etc. to make
effective use of that kind of information.

The codepath that implements runtime access has "I found a copy
here, but it is unusable, so let's go on to look for another usable
copy" fallback.  This is a tangent but it is something we should not
lose in the midx-enabled world.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/7] speeding up cat-file by reordering object access
  2018-08-16 18:52     ` Junio C Hamano
@ 2018-08-16 19:45       ` Jeff King
  0 siblings, 0 replies; 27+ messages in thread
From: Jeff King @ 2018-08-16 19:45 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Derrick Stolee, git, Jonathan Tan

On Thu, Aug 16, 2018 at 11:52:13AM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > I think that makes sense. We already see duplicates from
> > for_each_packed_object() when they're in multiple packs, and callers
> > just need to be ready to deal with it (and depending on what you're
> > doing, you may actually _want_ the duplicates).
> 
> You of course would also see dups between loose and packed until
> prune-packed is run.

Yes, although there are two separate calls to look at those two sources,
so it's a bit more obvious that the caller has to de-dup. I'm not
opposed to a flag to ask the function to de-dup for us, but it really
only makes sense if we do a combined for_each_object() call.
De-duping is potentially expensive, and there's little point in
de-duping the packs if we have to then de-dup the result with the loose
objects.

One benefit of having the iterator function de-dup is that it _could_ be
done more efficiently. Right now, even before my recent patches the
cat-file de-dup happens by creating a secondary list, sorting it, and
then skipping the duplicates.

But if a function knew _all_ of the sources we were going to look at
(any loose directories, including alternates, and all available packs),
and it could walk each individual source in sorted order (easy for packs
by .idx, not too bad for loose by sorting the readdir() results), then
we could do it as an online list-merge, with a single walk through the
results.

In practice I don't know if it's worth the effort. If you're accessing
the object contents, sorted order really is bad. And if you're just
collecting the hashes, then you're likely generating some data structure
for future lookups, and you can just de-dup as part of that process.

> I also was thinking about the same thing after Derrick's response,
> but unless you are very specialized caller, it does not allow you do
> very much to learn that object X exists as a loose object locally,
> also as a loose object in our alternate, and also in pack A, but not
> in other packs.  You need a way to say "Read the contents of object
> X from that place, not from any other place", "Remove that copy of
> object X at that place, but not at any other place" etc. to make
> effective use of that kind of information.

We do have those functions, and use them. E.g., fsck uses
read_loose_object() to actually check that particular copy of the
object. That's obviously a specialized caller as you note, but then
anybody iterating over all of the objects is pretty specialized already.

> The codepath that implements runtime access has "I found a copy
> here, but it is unusable, so let's go on to look for another usable
> copy" fallback.  This is a tangent but it is something we should not
> lose in the midx-enabled world.

Yeah, good catch. That's orthogonal to most of this discussion, I think,
but it's a possible downside of the midx because it has literally
discarded those duplicate index entries (or at least that's my
understanding). If we kept the .idx for each pack as a fallback for
older Gits, then it's easy to solve: in the unlikely case the
.midx-referenced copy fails, you look in each individual pack for
another copy.

But I think eventually you'd want to stop generating those .idx files,
too, if you know that you'll only use a modern version of Git. I wonder
if the .midx format should have an extension for "here are all the
duplicates I found". They could even be part of the main index (it's
easy enough for a binary-search lookup to always point to the start of a
run of duplicates), but if you had a ton of duplicates they would slow
your binary searches a little.

I'll leave all that to Stolee to ponder. :)

-Peff

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2018-08-16 19:45 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-08-10 23:07 [PATCH 0/7] speeding up cat-file by reordering object access Jeff King
2018-08-10 23:09 ` [PATCH 1/7] for_each_*_object: store flag definitions in a single location Jeff King
2018-08-10 23:27   ` Stefan Beller
2018-08-10 23:31     ` Jeff King
2018-08-10 23:33       ` Jeff King
2018-08-10 23:39         ` Stefan Beller
2018-08-11  0:33           ` Jeff King
2018-08-10 23:09 ` [PATCH 2/7] for_each_*_object: take flag arguments as enum Jeff King
2018-08-10 23:11 ` [PATCH 3/7] for_each_*_object: give more comprehensive docstrings Jeff King
2018-08-10 23:15 ` [PATCH 4/7] for_each_packed_object: support iterating in pack-order Jeff King
2018-08-15 13:28   ` Derrick Stolee
2018-08-16 17:36     ` Jeff King
2018-08-10 23:16 ` [PATCH 5/7] t1006: test cat-file --batch-all-objects with duplicates Jeff King
2018-08-10 23:17 ` [PATCH 6/7] cat-file: rename batch_{loose,packed}_object callbacks Jeff King
2018-08-10 23:24 ` [PATCH 7/7] cat-file: support "unordered" output for --batch-all-objects Jeff King
2018-08-13 18:45 ` [PATCH 0/7] speeding up cat-file by reordering object access Jonathan Tan
2018-08-14 18:13   ` [PATCH 0/4] finishing touches on jk/for-each-object-iteration Jeff King
2018-08-14 18:14     ` [PATCH 1/4] cat-file: use oidset check-and-insert Jeff King
2018-08-14 18:18     ` [PATCH 2/4] cat-file: split batch "buf" into two variables Jeff King
2018-08-14 18:20     ` [PATCH 3/4] cat-file: use a single strbuf for all output Jeff King
2018-08-14 19:30       ` René Scharfe
2018-08-14 19:39         ` Jeff King
2018-08-14 18:21     ` [PATCH 4/4] for_each_*_object: move declarations to object-store.h Jeff King
2018-08-15 14:05 ` [PATCH 0/7] speeding up cat-file by reordering object access Derrick Stolee
2018-08-16 17:39   ` Jeff King
2018-08-16 18:52     ` Junio C Hamano
2018-08-16 19:45       ` Jeff King

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).