git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH] clone: do faster object check for partial clones
@ 2019-04-03 17:27 Josh Steadmon
  2019-04-03 18:58 ` Jonathan Tan
                   ` (4 more replies)
  0 siblings, 5 replies; 27+ messages in thread
From: Josh Steadmon @ 2019-04-03 17:27 UTC (permalink / raw)
  To: git

For partial clones, doing a full connectivity check is wasteful; we skip
promisor objects (which, for a partial clone, is all known objects), and
excluding them all from the connectivity check can take a significant
amount of time on large repos.

At most, we want to make sure that we get the objects referred to by any
wanted refs. For partial clones, just check that these objects were
transferred.

Signed-off-by: Josh Steadmon <steadmon@google.com>
---
 builtin/clone.c |  6 ++++--
 connected.c     | 15 +++++++++++++++
 connected.h     |  8 ++++++++
 3 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/builtin/clone.c b/builtin/clone.c
index 50bde99618..fdbbd8942a 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -657,7 +657,8 @@ static void update_remote_refs(const struct ref *refs,
 			       const char *branch_top,
 			       const char *msg,
 			       struct transport *transport,
-			       int check_connectivity)
+			       int check_connectivity,
+			       int check_refs_only)
 {
 	const struct ref *rm = mapped_refs;
 
@@ -666,6 +667,7 @@ static void update_remote_refs(const struct ref *refs,
 
 		opt.transport = transport;
 		opt.progress = transport->progress;
+		opt.check_refs_only = !!check_refs_only;
 
 		if (check_connected(iterate_ref_map, &rm, &opt))
 			die(_("remote did not send all necessary objects"));
@@ -1224,7 +1226,7 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 
 	update_remote_refs(refs, mapped_refs, remote_head_points_at,
 			   branch_top.buf, reflog_msg.buf, transport,
-			   !is_local);
+			   !is_local, filter_options.choice);
 
 	update_head(our_head_points_at, remote_head, reflog_msg.buf);
 
diff --git a/connected.c b/connected.c
index 1bba888eff..c297cdc5ab 100644
--- a/connected.c
+++ b/connected.c
@@ -1,4 +1,5 @@
 #include "cache.h"
+#include "object-store.h"
 #include "run-command.h"
 #include "sigchain.h"
 #include "connected.h"
@@ -49,6 +50,20 @@ int check_connected(oid_iterate_fn fn, void *cb_data,
 		strbuf_release(&idx_file);
 	}
 
+	if (opt->check_refs_only) {
+		/*
+		 * For partial clones, we don't want to walk the full commit
+		 * graph because we're skipping promisor objects anyway. We
+		 * should just check that objects referenced by wanted refs were
+		 * transferred.
+		 */
+		do {
+			if (!repo_has_object_file(the_repository, &oid))
+				return 1;
+		} while (!fn(cb_data, &oid));
+		return 0;
+	}
+
 	if (opt->shallow_file) {
 		argv_array_push(&rev_list.args, "--shallow-file");
 		argv_array_push(&rev_list.args, opt->shallow_file);
diff --git a/connected.h b/connected.h
index 8d5a6b3ad6..bb4afcb301 100644
--- a/connected.h
+++ b/connected.h
@@ -46,6 +46,14 @@ struct check_connected_options {
 	 * during a fetch.
 	 */
 	unsigned is_deepening_fetch : 1;
+
+	/*
+	 * If non-zero, only check the top-level objects referenced by the
+	 * wanted refs (passed in as cb_data). This is useful for partial
+	 * clones, where this can be much faster than excluding all promisor
+	 * objects prior to walking the commit graph.
+	 */
+	unsigned check_refs_only : 1;
 };
 
 #define CHECK_CONNECTED_INIT { 0 }
-- 
2.21.0.392.gf8f6787159e-goog


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH] clone: do faster object check for partial clones
  2019-04-03 17:27 [PATCH] clone: do faster object check for partial clones Josh Steadmon
@ 2019-04-03 18:58 ` Jonathan Tan
  2019-04-03 19:41 ` Jeff King
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 27+ messages in thread
From: Jonathan Tan @ 2019-04-03 18:58 UTC (permalink / raw)
  To: steadmon; +Cc: git, Jonathan Tan

> For partial clones, doing a full connectivity check is wasteful; we skip
> promisor objects (which, for a partial clone, is all known objects), and
> excluding them all from the connectivity check can take a significant
> amount of time on large repos.

Instead of "excluding them all", I would word this as "enumerating them
all so that we can exclude them" - the enumerating is the slow part, not
the excluding (which actually makes things faster).

> +	if (opt->check_refs_only) {
> +		/*
> +		 * For partial clones, we don't want to walk the full commit
> +		 * graph because we're skipping promisor objects anyway. We
> +		 * should just check that objects referenced by wanted refs were
> +		 * transferred.

The enumeration of promisor objects to be excluded is done through
for_each_packed_object() (see is_promisor_object()), not through walking
the commit graph. Maybe reword this comment to be similar to what I've
suggested for the commit message.

> @@ -46,6 +46,14 @@ struct check_connected_options {
>  	 * during a fetch.
>  	 */
>  	unsigned is_deepening_fetch : 1;
> +
> +	/*
> +	 * If non-zero, only check the top-level objects referenced by the
> +	 * wanted refs (passed in as cb_data). This is useful for partial
> +	 * clones, where this can be much faster than excluding all promisor
> +	 * objects prior to walking the commit graph.
> +	 */
> +	unsigned check_refs_only : 1;
>  };

Same enumerating vs excluding comment as before.

Aside from that: thinking from scratch, we want something that tells
check_connected() to avoid anything that enumerates the list of promised
objects, since the objects that we're checking are all promisor objects,
and thus any outgoing links are automatically promised. I would include
some of this explanation in the comment, but in the interest of trying
to avoid a bikeshedding discussion, I don't consider this necessary.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] clone: do faster object check for partial clones
  2019-04-03 17:27 [PATCH] clone: do faster object check for partial clones Josh Steadmon
  2019-04-03 18:58 ` Jonathan Tan
@ 2019-04-03 19:41 ` Jeff King
  2019-04-03 20:57   ` Jonathan Tan
  2019-04-04 22:53 ` [PATCH v2] rev-list: exclude promisor objects at walk time Josh Steadmon
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 27+ messages in thread
From: Jeff King @ 2019-04-03 19:41 UTC (permalink / raw)
  To: Josh Steadmon; +Cc: git

On Wed, Apr 03, 2019 at 10:27:21AM -0700, Josh Steadmon wrote:

> For partial clones, doing a full connectivity check is wasteful; we skip
> promisor objects (which, for a partial clone, is all known objects), and
> excluding them all from the connectivity check can take a significant
> amount of time on large repos.
> 
> At most, we want to make sure that we get the objects referred to by any
> wanted refs. For partial clones, just check that these objects were
> transferred.

This isn't strictly true, since we could get objects from elsewhere via
--shared or --reference. Those might not be promisor objects.

Shouldn't we be able to stop a traversal as soon as we see that an
object is in a promisor pack?

I.e., here:

> +	if (opt->check_refs_only) {
> +		/*
> +		 * For partial clones, we don't want to walk the full commit
> +		 * graph because we're skipping promisor objects anyway. We
> +		 * should just check that objects referenced by wanted refs were
> +		 * transferred.
> +		 */
> +		do {
> +			if (!repo_has_object_file(the_repository, &oid))
> +				return 1;
> +		} while (!fn(cb_data, &oid));
> +		return 0;
> +	}

for each object where you ask "do we have this?" we could, for the same
cost, ask "do we have this in a promisor pack?". And the answer would be
yes for each in a partial clone.

But that would also cleanly handle --shared, etc, that I mentioned. And
more importantly, it would _also_ work on fetches. If I fetch from you
and get a promisor pack, then there is no point in me enumerating every
tree you sent. As long as you gave me a tip tree, then you have promised
that you'd give me all the others if I ask.

So it seems like this should be a feature of the child rev-list, to stop
walking the graph at any object that is in a promisor pack.

-Peff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] clone: do faster object check for partial clones
  2019-04-03 19:41 ` Jeff King
@ 2019-04-03 20:57   ` Jonathan Tan
  2019-04-04  0:21     ` Josh Steadmon
  2019-04-04  1:33     ` Jeff King
  0 siblings, 2 replies; 27+ messages in thread
From: Jonathan Tan @ 2019-04-03 20:57 UTC (permalink / raw)
  To: peff; +Cc: steadmon, git, Jonathan Tan

> On Wed, Apr 03, 2019 at 10:27:21AM -0700, Josh Steadmon wrote:
> 
> > For partial clones, doing a full connectivity check is wasteful; we skip
> > promisor objects (which, for a partial clone, is all known objects), and
> > excluding them all from the connectivity check can take a significant
> > amount of time on large repos.
> > 
> > At most, we want to make sure that we get the objects referred to by any
> > wanted refs. For partial clones, just check that these objects were
> > transferred.
> 
> This isn't strictly true, since we could get objects from elsewhere via
> --shared or --reference. Those might not be promisor objects.

I don't think local clones (which --shared or --reference implies) can
be partial, but the bigger point is below.

> Shouldn't we be able to stop a traversal as soon as we see that an
> object is in a promisor pack?
> 
> I.e., here:
> 
> > +	if (opt->check_refs_only) {
> > +		/*
> > +		 * For partial clones, we don't want to walk the full commit
> > +		 * graph because we're skipping promisor objects anyway. We
> > +		 * should just check that objects referenced by wanted refs were
> > +		 * transferred.
> > +		 */
> > +		do {
> > +			if (!repo_has_object_file(the_repository, &oid))
> > +				return 1;
> > +		} while (!fn(cb_data, &oid));
> > +		return 0;
> > +	}
> 
> for each object where you ask "do we have this?" we could, for the same
> cost, ask "do we have this in a promisor pack?". And the answer would be
> yes for each in a partial clone.
> 
> But that would also cleanly handle --shared, etc, that I mentioned. And
> more importantly, it would _also_ work on fetches. If I fetch from you
> and get a promisor pack, then there is no point in me enumerating every
> tree you sent. As long as you gave me a tip tree, then you have promised
> that you'd give me all the others if I ask.
> 
> So it seems like this should be a feature of the child rev-list, to stop
> walking the graph at any object that is in a promisor pack.

We currently already do a less optimal version of this - we pass
--exclude-promisor-objects to rev-list, which indeed stops traversal at
any promisor objects (whether in a promisor pack or referenced by one).
As far as I know, the problem is that to do so, we currently enumerate
all the objects in all promisor packs, and all objects that those
objects reference (which means we inflate them too) - so that we have an
oidset that we can check objects against.

A partial solution is for is_promisor_object() to first check if the
given object is in a promisor pack, avoiding generating the set of
promisor objects until necessary. This would work in a blob:none clone
with the refs pointing all to commits or all to blobs, but would not
work in a tree:none clone (or maybe, in this case, the clone would be
small enough that performance is not a concern, hmm).

Maybe the ideal solution is for rev-list to check if an object is in a
promisor pack and if --exclude-promisor-objects is active, we do not
follow any outgoing links.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] clone: do faster object check for partial clones
  2019-04-03 20:57   ` Jonathan Tan
@ 2019-04-04  0:21     ` Josh Steadmon
  2019-04-04  1:33     ` Jeff King
  1 sibling, 0 replies; 27+ messages in thread
From: Josh Steadmon @ 2019-04-04  0:21 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: peff, git

On 2019.04.03 13:57, Jonathan Tan wrote:
> > On Wed, Apr 03, 2019 at 10:27:21AM -0700, Josh Steadmon wrote:
> > 
> > > For partial clones, doing a full connectivity check is wasteful; we skip
> > > promisor objects (which, for a partial clone, is all known objects), and
> > > excluding them all from the connectivity check can take a significant
> > > amount of time on large repos.
> > > 
> > > At most, we want to make sure that we get the objects referred to by any
> > > wanted refs. For partial clones, just check that these objects were
> > > transferred.
> > 
> > This isn't strictly true, since we could get objects from elsewhere via
> > --shared or --reference. Those might not be promisor objects.
> 
> I don't think local clones (which --shared or --reference implies) can
> be partial, but the bigger point is below.
> 
> > Shouldn't we be able to stop a traversal as soon as we see that an
> > object is in a promisor pack?
> > 
> > I.e., here:
> > 
> > > +	if (opt->check_refs_only) {
> > > +		/*
> > > +		 * For partial clones, we don't want to walk the full commit
> > > +		 * graph because we're skipping promisor objects anyway. We
> > > +		 * should just check that objects referenced by wanted refs were
> > > +		 * transferred.
> > > +		 */
> > > +		do {
> > > +			if (!repo_has_object_file(the_repository, &oid))
> > > +				return 1;
> > > +		} while (!fn(cb_data, &oid));
> > > +		return 0;
> > > +	}
> > 
> > for each object where you ask "do we have this?" we could, for the same
> > cost, ask "do we have this in a promisor pack?". And the answer would be
> > yes for each in a partial clone.
> > 
> > But that would also cleanly handle --shared, etc, that I mentioned. And
> > more importantly, it would _also_ work on fetches. If I fetch from you
> > and get a promisor pack, then there is no point in me enumerating every
> > tree you sent. As long as you gave me a tip tree, then you have promised
> > that you'd give me all the others if I ask.
> > 
> > So it seems like this should be a feature of the child rev-list, to stop
> > walking the graph at any object that is in a promisor pack.
> 
> We currently already do a less optimal version of this - we pass
> --exclude-promisor-objects to rev-list, which indeed stops traversal at
> any promisor objects (whether in a promisor pack or referenced by one).
> As far as I know, the problem is that to do so, we currently enumerate
> all the objects in all promisor packs, and all objects that those
> objects reference (which means we inflate them too) - so that we have an
> oidset that we can check objects against.
> 
> A partial solution is for is_promisor_object() to first check if the
> given object is in a promisor pack, avoiding generating the set of
> promisor objects until necessary. This would work in a blob:none clone
> with the refs pointing all to commits or all to blobs, but would not
> work in a tree:none clone (or maybe, in this case, the clone would be
> small enough that performance is not a concern, hmm).
> 
> Maybe the ideal solution is for rev-list to check if an object is in a
> promisor pack and if --exclude-promisor-objects is active, we do not
> follow any outgoing links.

I am not sure that this actually saves us any work vs. the original
method of marking promisor objects as uninteresting. Marking the objects
uninteresting involves calling for_each_packed_object(some_callback_fn, ...,
FOR_EACH_OBJECT_PROMISOR_ONLY). But this is also exactly the setup that
is_promisor_object() runs the first time it's called. In fact, it looks
to me like the callback function used by the latter is more expensive
than the former. Is there some alternative to is_promisor_object() that
doesn't involve this object enumeration?

I've tried implementing this approach anyway (diff against master is
below), but I can't get a version that passes t0410-partial-clone.sh
(cases 17 and 19). In case 17, the rev-list emits one extra line
corresponding to the blob for baz.t. I'm not sure why the original
implementation thinks that baz.t is a promisor object but my WIP version
doesn't.

I'll keep poking at this tomorrow, I just thought I should comment on
what I've found so far in case anyone can spot where I'm going wrong.



diff --git a/list-objects.c b/list-objects.c
index dc77361e11..1cb85f1662 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -326,6 +326,12 @@ static void traverse_trees_and_blobs(struct traversal_context *ctx,
 		struct object *obj = pending->item;
 		const char *name = pending->name;
 		const char *path = pending->path;
+		trace_printf("DEBUG: Examining object: %s\n", oid_to_hex(&obj->oid));
+		if (ctx->revs->exclude_promisor_objects &&
+		    is_promisor_object(&obj->oid)) {
+			trace_printf("DEBUG: Skipping object: %s\n", oid_to_hex(&obj->oid));
+			continue;
+		}
 		if (obj->flags & (UNINTERESTING | SEEN))
 			continue;
 		if (obj->type == OBJ_TAG) {
@@ -356,6 +362,13 @@ static void do_traverse(struct traversal_context *ctx)
 	strbuf_init(&csp, PATH_MAX);
 
 	while ((commit = get_revision(ctx->revs)) != NULL) {
+		trace_printf("DEBUG: Examining commit: %s\n", oid_to_hex(&commit->object.oid));
+		if (ctx->revs->exclude_promisor_objects &&
+		    is_promisor_object(&commit->object.oid)) {
+			trace_printf("DEBUG: Skipping commit: %s\n", oid_to_hex(&commit->object.oid));
+			continue;
+		}
+
 		/*
 		 * an uninteresting boundary commit may not have its tree
 		 * parsed yet, but we are not going to show them anyway
diff --git a/revision.c b/revision.c
index eb8e51bc63..85974e941d 100644
--- a/revision.c
+++ b/revision.c
@@ -3067,17 +3067,6 @@ void reset_revision_walk(void)
 	clear_object_flags(SEEN | ADDED | SHOWN);
 }
 
-static int mark_uninteresting(const struct object_id *oid,
-			      struct packed_git *pack,
-			      uint32_t pos,
-			      void *cb)
-{
-	struct rev_info *revs = cb;
-	struct object *o = parse_object(revs->repo, oid);
-	o->flags |= UNINTERESTING | SEEN;
-	return 0;
-}
-
 define_commit_slab(indegree_slab, int);
 define_commit_slab(author_date_slab, timestamp_t);
 
@@ -3316,11 +3305,6 @@ int prepare_revision_walk(struct rev_info *revs)
 	    (revs->limited && limiting_can_increase_treesame(revs)))
 		revs->treesame.name = "treesame";
 
-	if (revs->exclude_promisor_objects) {
-		for_each_packed_object(mark_uninteresting, revs,
-				       FOR_EACH_OBJECT_PROMISOR_ONLY);
-	}
-
 	if (revs->no_walk != REVISION_WALK_NO_WALK_UNSORTED)
 		commit_list_sort_by_date(&revs->commits);
 	if (revs->no_walk)
diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
index bce02788e6..55ac29f650 100755
--- a/t/t0410-partial-clone.sh
+++ b/t/t0410-partial-clone.sh
@@ -342,7 +342,7 @@ test_expect_success 'rev-list stops traversal at promisor commit, tree, and blob
 
 	git -C repo config core.repositoryformatversion 1 &&
 	git -C repo config extensions.partialclone "arbitrary string" &&
-	git -C repo rev-list --exclude-promisor-objects --objects HEAD >out &&
+	GIT_TRACE="$(pwd)/rev_trace" git -C repo rev-list --exclude-promisor-objects --objects HEAD >out &&
 	! grep $COMMIT out &&
 	! grep $TREE out &&
 	! grep $BLOB out &&

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH] clone: do faster object check for partial clones
  2019-04-03 20:57   ` Jonathan Tan
  2019-04-04  0:21     ` Josh Steadmon
@ 2019-04-04  1:33     ` Jeff King
  1 sibling, 0 replies; 27+ messages in thread
From: Jeff King @ 2019-04-04  1:33 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: steadmon, git

On Wed, Apr 03, 2019 at 01:57:48PM -0700, Jonathan Tan wrote:

> > This isn't strictly true, since we could get objects from elsewhere via
> > --shared or --reference. Those might not be promisor objects.
> 
> I don't think local clones (which --shared or --reference implies) can
> be partial, but the bigger point is below.

Yeah, you're right about --shared. But I don't see any reason a
--reference clone could not be partial.

> > So it seems like this should be a feature of the child rev-list, to stop
> > walking the graph at any object that is in a promisor pack.
> 
> We currently already do a less optimal version of this - we pass
> --exclude-promisor-objects to rev-list, which indeed stops traversal at
> any promisor objects (whether in a promisor pack or referenced by one).
> As far as I know, the problem is that to do so, we currently enumerate
> all the objects in all promisor packs, and all objects that those
> objects reference (which means we inflate them too) - so that we have an
> oidset that we can check objects against.
> 
> A partial solution is for is_promisor_object() to first check if the
> given object is in a promisor pack, avoiding generating the set of
> promisor objects until necessary. This would work in a blob:none clone
> with the refs pointing all to commits or all to blobs, but would not
> work in a tree:none clone (or maybe, in this case, the clone would be
> small enough that performance is not a concern, hmm).
> 
> Maybe the ideal solution is for rev-list to check if an object is in a
> promisor pack and if --exclude-promisor-objects is active, we do not
> follow any outgoing links.

I was thinking you could actually check it before even loading the
object. I.e., something like:

  struct object_info oi = OBJECT_INFO_INIT;

  if (!oid_object_info_extended(oid, &oi, 0) &&
      oi->whence = OI_PACKED &&
      oi->u.packed.pack->pack_promisor)) {
          /*
	   * no point in even looking at its links,
	   * since the promisor pack claims that we
	   * can get anything we need later from the
	   * remote
	   */
     return 0; /* or whatever, depending where this goes ;) */
  } else {
    /* not a promisor object, load it and traverse as normal */
  }

That doesn't quite work as an implementation of is_promisor_object(),
because it wouldn't know about items that we _don't_ have that are
promised. But I think it could work as part of the traversal in
list-objects.c, since we'd just be walking down a traversal from which
we presumably have all the objects.

I guess maybe it would be complicated if you had non-promisor objects
that refer indirectly to promisor ones. E.g., imagine ref A points to
object X, which is in a promisor pack pointing to Y (which we don't
have). But we also have ref B pointing to object Z which also refers to
Y, but _isn't_ in a promisor pack. I'm not sure that can actually happen
with the promisor mechanism, though (how did we get a Z without all of
its objects into a non-promisor pack?).

It's also a shame that it would incur an extra object lookup, since if
it _isn't_ in the promisor pack we'd then have to actually look it up
again via parse_object() or whatever. It may not be measurable though.
In an ideal world, we'd have an object access API that lets us open a
handle, ask things about it (like "which pack is this coming from") and
then load it if we want. But I don't think that needs to hold up this
particular topic.

-Peff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v2] rev-list: exclude promisor objects at walk time
  2019-04-03 17:27 [PATCH] clone: do faster object check for partial clones Josh Steadmon
  2019-04-03 18:58 ` Jonathan Tan
  2019-04-03 19:41 ` Jeff King
@ 2019-04-04 22:53 ` Josh Steadmon
  2019-04-04 23:08   ` Jeff King
  2019-04-08 21:06 ` [PATCH v3] " Josh Steadmon
  2019-04-19 21:00 ` [PATCH v4] clone: do faster object check for partial clones Josh Steadmon
  4 siblings, 1 reply; 27+ messages in thread
From: Josh Steadmon @ 2019-04-04 22:53 UTC (permalink / raw)
  To: git, peff, jonathantanmy, jrnieder

For large repositories, enumerating the list of all promisor objects (in
order to exclude them from a rev-list walk) can take a significant
amount of time).

When --exclude-promisor-objects is passed to rev-list, don't enumerate
the promisor objects. Instead, filter them (and any children objects)
during the actual graph walk.

Remove the mark_uninteresting() function as it's not used anywhere else.

Helped-By: Jonathan Tan <jonathantanmy@google.com>
Helped-By: Jeff King <peff@peff.net>
Helped-By: Jonathan Nieder <jrnieder@gmail.com>

Signed-off-by: Josh Steadmon <steadmon@google.com>
---

Re-implemented following Jonathan & Jeff's advice (and also previously
Jonathan Nieder's, although I didn't understand it at the time). Thanks
for the feedback all.


 list-objects.c | 26 ++++++++++++++++++++++++++
 revision.c     | 16 ----------------
 2 files changed, 26 insertions(+), 16 deletions(-)

diff --git a/list-objects.c b/list-objects.c
index dc77361e11..d1eaa0999e 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -30,6 +30,7 @@ static void process_blob(struct traversal_context *ctx,
 	struct object *obj = &blob->object;
 	size_t pathlen;
 	enum list_objects_filter_result r = LOFR_MARK_SEEN | LOFR_DO_SHOW;
+	struct object_info oi = OBJECT_INFO_INIT;
 
 	if (!ctx->revs->blob_objects)
 		return;
@@ -37,6 +38,11 @@ static void process_blob(struct traversal_context *ctx,
 		die("bad blob object");
 	if (obj->flags & (UNINTERESTING | SEEN))
 		return;
+	if (ctx->revs->exclude_promisor_objects &&
+	    !oid_object_info_extended(the_repository, &obj->oid, &oi, 0) &&
+	    oi.whence == OI_PACKED &&
+	    oi.u.packed.pack->pack_promisor)
+		return;
 
 	/*
 	 * Pre-filter known-missing objects when explicitly requested.
@@ -149,6 +155,7 @@ static void process_tree(struct traversal_context *ctx,
 	int baselen = base->len;
 	enum list_objects_filter_result r = LOFR_MARK_SEEN | LOFR_DO_SHOW;
 	int failed_parse;
+	struct object_info oi = OBJECT_INFO_INIT;
 
 	if (!revs->tree_objects)
 		return;
@@ -156,6 +163,11 @@ static void process_tree(struct traversal_context *ctx,
 		die("bad tree object");
 	if (obj->flags & (UNINTERESTING | SEEN))
 		return;
+	if (ctx->revs->exclude_promisor_objects &&
+	    !oid_object_info_extended(the_repository, &obj->oid, &oi, 0) &&
+	    oi.whence == OI_PACKED &&
+	    oi.u.packed.pack->pack_promisor)
+		return;
 
 	failed_parse = parse_tree_gently(tree, 1);
 	if (failed_parse) {
@@ -318,6 +330,7 @@ static void traverse_trees_and_blobs(struct traversal_context *ctx,
 				     struct strbuf *base)
 {
 	int i;
+	struct object_info oi = OBJECT_INFO_INIT;
 
 	assert(base->len == 0);
 
@@ -326,6 +339,12 @@ static void traverse_trees_and_blobs(struct traversal_context *ctx,
 		struct object *obj = pending->item;
 		const char *name = pending->name;
 		const char *path = pending->path;
+		if (ctx->revs->exclude_promisor_objects &&
+		    !oid_object_info_extended(the_repository, &obj->oid, &oi, 0) &&
+		    oi.whence == OI_PACKED &&
+		    oi.u.packed.pack->pack_promisor)
+			continue;
+
 		if (obj->flags & (UNINTERESTING | SEEN))
 			continue;
 		if (obj->type == OBJ_TAG) {
@@ -353,9 +372,16 @@ static void do_traverse(struct traversal_context *ctx)
 {
 	struct commit *commit;
 	struct strbuf csp; /* callee's scratch pad */
+	struct object_info oi = OBJECT_INFO_INIT;
 	strbuf_init(&csp, PATH_MAX);
 
 	while ((commit = get_revision(ctx->revs)) != NULL) {
+		if (ctx->revs->exclude_promisor_objects &&
+		    !oid_object_info_extended(the_repository, &commit->object.oid, &oi, 0) &&
+		    oi.whence == OI_PACKED &&
+		    oi.u.packed.pack->pack_promisor)
+			continue;
+
 		/*
 		 * an uninteresting boundary commit may not have its tree
 		 * parsed yet, but we are not going to show them anyway
diff --git a/revision.c b/revision.c
index eb8e51bc63..85974e941d 100644
--- a/revision.c
+++ b/revision.c
@@ -3067,17 +3067,6 @@ void reset_revision_walk(void)
 	clear_object_flags(SEEN | ADDED | SHOWN);
 }
 
-static int mark_uninteresting(const struct object_id *oid,
-			      struct packed_git *pack,
-			      uint32_t pos,
-			      void *cb)
-{
-	struct rev_info *revs = cb;
-	struct object *o = parse_object(revs->repo, oid);
-	o->flags |= UNINTERESTING | SEEN;
-	return 0;
-}
-
 define_commit_slab(indegree_slab, int);
 define_commit_slab(author_date_slab, timestamp_t);
 
@@ -3316,11 +3305,6 @@ int prepare_revision_walk(struct rev_info *revs)
 	    (revs->limited && limiting_can_increase_treesame(revs)))
 		revs->treesame.name = "treesame";
 
-	if (revs->exclude_promisor_objects) {
-		for_each_packed_object(mark_uninteresting, revs,
-				       FOR_EACH_OBJECT_PROMISOR_ONLY);
-	}
-
 	if (revs->no_walk != REVISION_WALK_NO_WALK_UNSORTED)
 		commit_list_sort_by_date(&revs->commits);
 	if (revs->no_walk)
-- 
2.21.0.392.gf8f6787159e-goog


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v2] rev-list: exclude promisor objects at walk time
  2019-04-04 22:53 ` [PATCH v2] rev-list: exclude promisor objects at walk time Josh Steadmon
@ 2019-04-04 23:08   ` Jeff King
  2019-04-04 23:47     ` Josh Steadmon
  0 siblings, 1 reply; 27+ messages in thread
From: Jeff King @ 2019-04-04 23:08 UTC (permalink / raw)
  To: Josh Steadmon; +Cc: git, jonathantanmy, jrnieder

On Thu, Apr 04, 2019 at 03:53:56PM -0700, Josh Steadmon wrote:

> For large repositories, enumerating the list of all promisor objects (in
> order to exclude them from a rev-list walk) can take a significant
> amount of time).
> 
> When --exclude-promisor-objects is passed to rev-list, don't enumerate
> the promisor objects. Instead, filter them (and any children objects)
> during the actual graph walk.

Yeah, this is definitely the approach I was thinking of.

Did you (or anybody else) have any thoughts on the case where a given
object is referred to both by a promisor and a non-promisor (and we
don't have it)? That's the "shortcut" I think we're taking here: we
would no longer realize that it's available via the promisor when we
traverse to it from the non-promisor. I'm just not clear on whether that
can ever happen.

> Helped-By: Jonathan Tan <jonathantanmy@google.com>
> Helped-By: Jeff King <peff@peff.net>
> Helped-By: Jonathan Nieder <jrnieder@gmail.com>
> 
> Signed-off-by: Josh Steadmon <steadmon@google.com>

Minor nit, but these should all be part of the same block.

> diff --git a/list-objects.c b/list-objects.c
> index dc77361e11..d1eaa0999e 100644
> --- a/list-objects.c
> +++ b/list-objects.c
> @@ -30,6 +30,7 @@ static void process_blob(struct traversal_context *ctx,
>  	struct object *obj = &blob->object;
>  	size_t pathlen;
>  	enum list_objects_filter_result r = LOFR_MARK_SEEN | LOFR_DO_SHOW;
> +	struct object_info oi = OBJECT_INFO_INIT;
>  
>  	if (!ctx->revs->blob_objects)
>  		return;
> @@ -37,6 +38,11 @@ static void process_blob(struct traversal_context *ctx,
>  		die("bad blob object");
>  	if (obj->flags & (UNINTERESTING | SEEN))
>  		return;
> +	if (ctx->revs->exclude_promisor_objects &&
> +	    !oid_object_info_extended(the_repository, &obj->oid, &oi, 0) &&
> +	    oi.whence == OI_PACKED &&
> +	    oi.u.packed.pack->pack_promisor)
> +		return;

This conditional gets repeated a lot in your patch. Perhaps it's worth a
helper so we can say:

  if (skip_promisor_object(&ctx->revs, &obj->oid))
	return;

in each place?

One other possible small optimization: we don't look up the object
unless the caller asked to exclude promisors, which is good. But we
could also keep a single flag for "is there a promisor pack at all?".
When there isn't, we know there's no point in looking for the object.

It might not matter much in practice. The main caller here is going to
be check_connected(), and it only passes --exclude-promisor-objects if
it's in a partial clone.

> [...]

I didn't see any tweaks to the callers, which makes sense; we're already
passing --exclude-promisor-objects as necessary. Which means by itself,
this patch should be making things faster, right? Do you have timings to
show that off?

-Peff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2] rev-list: exclude promisor objects at walk time
  2019-04-04 23:08   ` Jeff King
@ 2019-04-04 23:47     ` Josh Steadmon
  2019-04-05  0:00       ` Jeff King
  0 siblings, 1 reply; 27+ messages in thread
From: Josh Steadmon @ 2019-04-04 23:47 UTC (permalink / raw)
  To: Jeff King; +Cc: git, jonathantanmy, jrnieder

On 2019.04.04 19:08, Jeff King wrote:
> On Thu, Apr 04, 2019 at 03:53:56PM -0700, Josh Steadmon wrote:
> 
> > For large repositories, enumerating the list of all promisor objects (in
> > order to exclude them from a rev-list walk) can take a significant
> > amount of time).
> > 
> > When --exclude-promisor-objects is passed to rev-list, don't enumerate
> > the promisor objects. Instead, filter them (and any children objects)
> > during the actual graph walk.
> 
> Yeah, this is definitely the approach I was thinking of.
> 
> Did you (or anybody else) have any thoughts on the case where a given
> object is referred to both by a promisor and a non-promisor (and we
> don't have it)? That's the "shortcut" I think we're taking here: we
> would no longer realize that it's available via the promisor when we
> traverse to it from the non-promisor. I'm just not clear on whether that
> can ever happen.

I am not sure either. In process_blob() and process_tree() there are
additional checks for whether missing blobs/trees are promisor objects
using is_promisor_object()...  but if we call that we undo the
performance gains from this change.


> > Helped-By: Jonathan Tan <jonathantanmy@google.com>
> > Helped-By: Jeff King <peff@peff.net>
> > Helped-By: Jonathan Nieder <jrnieder@gmail.com>
> > 
> > Signed-off-by: Josh Steadmon <steadmon@google.com>
> 
> Minor nit, but these should all be part of the same block.

Will fix in v3.


> > diff --git a/list-objects.c b/list-objects.c
> > index dc77361e11..d1eaa0999e 100644
> > --- a/list-objects.c
> > +++ b/list-objects.c
> > @@ -30,6 +30,7 @@ static void process_blob(struct traversal_context *ctx,
> >  	struct object *obj = &blob->object;
> >  	size_t pathlen;
> >  	enum list_objects_filter_result r = LOFR_MARK_SEEN | LOFR_DO_SHOW;
> > +	struct object_info oi = OBJECT_INFO_INIT;
> >  
> >  	if (!ctx->revs->blob_objects)
> >  		return;
> > @@ -37,6 +38,11 @@ static void process_blob(struct traversal_context *ctx,
> >  		die("bad blob object");
> >  	if (obj->flags & (UNINTERESTING | SEEN))
> >  		return;
> > +	if (ctx->revs->exclude_promisor_objects &&
> > +	    !oid_object_info_extended(the_repository, &obj->oid, &oi, 0) &&
> > +	    oi.whence == OI_PACKED &&
> > +	    oi.u.packed.pack->pack_promisor)
> > +		return;
> 
> This conditional gets repeated a lot in your patch. Perhaps it's worth a
> helper so we can say:
> 
>   if (skip_promisor_object(&ctx->revs, &obj->oid))
> 	return;
> 
> in each place?

Will fix in v3.


> One other possible small optimization: we don't look up the object
> unless the caller asked to exclude promisors, which is good. But we
> could also keep a single flag for "is there a promisor pack at all?".
> When there isn't, we know there's no point in looking for the object.
> 
> It might not matter much in practice. The main caller here is going to
> be check_connected(), and it only passes --exclude-promisor-objects if
> it's in a partial clone.

I'm not necessarily opposed, but I'm leaning towards the "won't matter
much" side.

Where would such a flag live, in this case, and who would be responsible
for initializing it? I guess it would only matter for rev-list, so we
could initialize it in cmd_rev_list() if --exclude-promisor-objects is
passed?

> > [...]
> 
> I didn't see any tweaks to the callers, which makes sense; we're already
> passing --exclude-promisor-objects as necessary. Which means by itself,
> this patch should be making things faster, right? Do you have timings to
> show that off?

Yeah, for a partial clone of a large-ish Android repo [1], we see the
connectivity check go from >180s to ~7s.

[1]: https://android.googlesource.com/platform/frameworks/base/

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2] rev-list: exclude promisor objects at walk time
  2019-04-04 23:47     ` Josh Steadmon
@ 2019-04-05  0:00       ` Jeff King
  2019-04-05  0:09         ` Josh Steadmon
  0 siblings, 1 reply; 27+ messages in thread
From: Jeff King @ 2019-04-05  0:00 UTC (permalink / raw)
  To: Josh Steadmon; +Cc: git, jonathantanmy, jrnieder

On Thu, Apr 04, 2019 at 04:47:26PM -0700, Josh Steadmon wrote:

> > Did you (or anybody else) have any thoughts on the case where a given
> > object is referred to both by a promisor and a non-promisor (and we
> > don't have it)? That's the "shortcut" I think we're taking here: we
> > would no longer realize that it's available via the promisor when we
> > traverse to it from the non-promisor. I'm just not clear on whether that
> > can ever happen.
> 
> I am not sure either. In process_blob() and process_tree() there are
> additional checks for whether missing blobs/trees are promisor objects
> using is_promisor_object()...  but if we call that we undo the
> performance gains from this change.

Hmm. That might be a good outcome, though. If it never happens, we're
fast. If it does happen, then our worst case is that we fall back to the
current slower-but-more-thorough check. (And I think that happens with
your patch, without us having to do anything further).

> > One other possible small optimization: we don't look up the object
> > unless the caller asked to exclude promisors, which is good. But we
> > could also keep a single flag for "is there a promisor pack at all?".
> > When there isn't, we know there's no point in looking for the object.
> [...]
> I'm not necessarily opposed, but I'm leaning towards the "won't matter
> much" side.
> 
> Where would such a flag live, in this case, and who would be responsible
> for initializing it? I guess it would only matter for rev-list, so we
> could initialize it in cmd_rev_list() if --exclude-promisor-objects is
> passed?

The check is really something like:

  int have_promisor_pack() {
	for (p = packed_git; p; p = p->next) {
		if (p->pack_promisor)
			return 1;
	}
	return 0;
  }

That could be lazily cached as a single bit, but it would need to be
reset whenever we call reprepare_packed_git().

Let's just punt on it for now. I'm not convinced it would actually yield
any benefit, unless we have a partial-clone repo that doesn't have any
promisor packs (but then, I suspect whatever un-partial'd it should
probably be resetting the partial flag in the config).

> > I didn't see any tweaks to the callers, which makes sense; we're already
> > passing --exclude-promisor-objects as necessary. Which means by itself,
> > this patch should be making things faster, right? Do you have timings to
> > show that off?
> 
> Yeah, for a partial clone of a large-ish Android repo [1], we see the
> connectivity check go from >180s to ~7s.

Those are nice numbers. :) Worth mentioning in the commit message, I
think. How does it compare to your earlier patch? I'd hope they're about
the same.

-Peff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2] rev-list: exclude promisor objects at walk time
  2019-04-05  0:00       ` Jeff King
@ 2019-04-05  0:09         ` Josh Steadmon
  2019-04-08 20:59           ` Josh Steadmon
  0 siblings, 1 reply; 27+ messages in thread
From: Josh Steadmon @ 2019-04-05  0:09 UTC (permalink / raw)
  To: Jeff King; +Cc: git, jonathantanmy, jrnieder

On 2019.04.04 20:00, Jeff King wrote:
> On Thu, Apr 04, 2019 at 04:47:26PM -0700, Josh Steadmon wrote:
> 
> > > Did you (or anybody else) have any thoughts on the case where a given
> > > object is referred to both by a promisor and a non-promisor (and we
> > > don't have it)? That's the "shortcut" I think we're taking here: we
> > > would no longer realize that it's available via the promisor when we
> > > traverse to it from the non-promisor. I'm just not clear on whether that
> > > can ever happen.
> > 
> > I am not sure either. In process_blob() and process_tree() there are
> > additional checks for whether missing blobs/trees are promisor objects
> > using is_promisor_object()...  but if we call that we undo the
> > performance gains from this change.
> 
> Hmm. That might be a good outcome, though. If it never happens, we're
> fast. If it does happen, then our worst case is that we fall back to the
> current slower-but-more-thorough check. (And I think that happens with
> your patch, without us having to do anything further).
> 
> > > One other possible small optimization: we don't look up the object
> > > unless the caller asked to exclude promisors, which is good. But we
> > > could also keep a single flag for "is there a promisor pack at all?".
> > > When there isn't, we know there's no point in looking for the object.
> > [...]
> > I'm not necessarily opposed, but I'm leaning towards the "won't matter
> > much" side.
> > 
> > Where would such a flag live, in this case, and who would be responsible
> > for initializing it? I guess it would only matter for rev-list, so we
> > could initialize it in cmd_rev_list() if --exclude-promisor-objects is
> > passed?
> 
> The check is really something like:
> 
>   int have_promisor_pack() {
> 	for (p = packed_git; p; p = p->next) {
> 		if (p->pack_promisor)
> 			return 1;
> 	}
> 	return 0;
>   }
> 
> That could be lazily cached as a single bit, but it would need to be
> reset whenever we call reprepare_packed_git().
> 
> Let's just punt on it for now. I'm not convinced it would actually yield
> any benefit, unless we have a partial-clone repo that doesn't have any
> promisor packs (but then, I suspect whatever un-partial'd it should
> probably be resetting the partial flag in the config).
> 
> > > I didn't see any tweaks to the callers, which makes sense; we're already
> > > passing --exclude-promisor-objects as necessary. Which means by itself,
> > > this patch should be making things faster, right? Do you have timings to
> > > show that off?
> > 
> > Yeah, for a partial clone of a large-ish Android repo [1], we see the
> > connectivity check go from >180s to ~7s.
> 
> Those are nice numbers. :) Worth mentioning in the commit message, I
> think. How does it compare to your earlier patch? I'd hope they're about
> the same.

Thanks, will include them in the v3 commit message. Unfortunately it's
hard to compare against v1, because v1 doesn't call rev-list at all, and
thus we don't have a good measurement in the trace / trace2 output. The
rev-list timing has been pretty consistent at just a bit over 3 minutes,
but the overall clone takes anywhere from 12-20 minutes, so any
difference between v1 and v2 performance just gets lost in the noise. If
I get a chance on Monday I may go back to v1 and add some timing.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v2] rev-list: exclude promisor objects at walk time
  2019-04-05  0:09         ` Josh Steadmon
@ 2019-04-08 20:59           ` Josh Steadmon
  0 siblings, 0 replies; 27+ messages in thread
From: Josh Steadmon @ 2019-04-08 20:59 UTC (permalink / raw)
  To: Jeff King, git, jonathantanmy, jrnieder

On 2019.04.04 17:09, Josh Steadmon wrote:
> On 2019.04.04 20:00, Jeff King wrote:
> > On Thu, Apr 04, 2019 at 04:47:26PM -0700, Josh Steadmon wrote:
> > 
> > > > Did you (or anybody else) have any thoughts on the case where a given
> > > > object is referred to both by a promisor and a non-promisor (and we
> > > > don't have it)? That's the "shortcut" I think we're taking here: we
> > > > would no longer realize that it's available via the promisor when we
> > > > traverse to it from the non-promisor. I'm just not clear on whether that
> > > > can ever happen.
> > > 
> > > I am not sure either. In process_blob() and process_tree() there are
> > > additional checks for whether missing blobs/trees are promisor objects
> > > using is_promisor_object()...  but if we call that we undo the
> > > performance gains from this change.
> > 
> > Hmm. That might be a good outcome, though. If it never happens, we're
> > fast. If it does happen, then our worst case is that we fall back to the
> > current slower-but-more-thorough check. (And I think that happens with
> > your patch, without us having to do anything further).
> > 
> > > > One other possible small optimization: we don't look up the object
> > > > unless the caller asked to exclude promisors, which is good. But we
> > > > could also keep a single flag for "is there a promisor pack at all?".
> > > > When there isn't, we know there's no point in looking for the object.
> > > [...]
> > > I'm not necessarily opposed, but I'm leaning towards the "won't matter
> > > much" side.
> > > 
> > > Where would such a flag live, in this case, and who would be responsible
> > > for initializing it? I guess it would only matter for rev-list, so we
> > > could initialize it in cmd_rev_list() if --exclude-promisor-objects is
> > > passed?
> > 
> > The check is really something like:
> > 
> >   int have_promisor_pack() {
> > 	for (p = packed_git; p; p = p->next) {
> > 		if (p->pack_promisor)
> > 			return 1;
> > 	}
> > 	return 0;
> >   }
> > 
> > That could be lazily cached as a single bit, but it would need to be
> > reset whenever we call reprepare_packed_git().
> > 
> > Let's just punt on it for now. I'm not convinced it would actually yield
> > any benefit, unless we have a partial-clone repo that doesn't have any
> > promisor packs (but then, I suspect whatever un-partial'd it should
> > probably be resetting the partial flag in the config).
> > 
> > > > I didn't see any tweaks to the callers, which makes sense; we're already
> > > > passing --exclude-promisor-objects as necessary. Which means by itself,
> > > > this patch should be making things faster, right? Do you have timings to
> > > > show that off?
> > > 
> > > Yeah, for a partial clone of a large-ish Android repo [1], we see the
> > > connectivity check go from >180s to ~7s.
> > 
> > Those are nice numbers. :) Worth mentioning in the commit message, I
> > think. How does it compare to your earlier patch? I'd hope they're about
> > the same.
> 
> Thanks, will include them in the v3 commit message. Unfortunately it's
> hard to compare against v1, because v1 doesn't call rev-list at all, and
> thus we don't have a good measurement in the trace / trace2 output. The
> rev-list timing has been pretty consistent at just a bit over 3 minutes,
> but the overall clone takes anywhere from 12-20 minutes, so any
> difference between v1 and v2 performance just gets lost in the noise. If
> I get a chance on Monday I may go back to v1 and add some timing.

For the record, the V1 abbreviated check takes about 5ms

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v3] rev-list: exclude promisor objects at walk time
  2019-04-03 17:27 [PATCH] clone: do faster object check for partial clones Josh Steadmon
                   ` (2 preceding siblings ...)
  2019-04-04 22:53 ` [PATCH v2] rev-list: exclude promisor objects at walk time Josh Steadmon
@ 2019-04-08 21:06 ` Josh Steadmon
  2019-04-08 22:23   ` Christian Couder
                     ` (2 more replies)
  2019-04-19 21:00 ` [PATCH v4] clone: do faster object check for partial clones Josh Steadmon
  4 siblings, 3 replies; 27+ messages in thread
From: Josh Steadmon @ 2019-04-08 21:06 UTC (permalink / raw)
  To: git, peff, jonathantanmy, jrnieder

For large repositories, enumerating the list of all promisor objects (in
order to exclude them from a rev-list walk) can take a significant
amount of time).

When --exclude-promisor-objects is passed to rev-list, don't enumerate
the promisor objects. Instead, filter them (and any children objects)
during the actual graph walk.

Remove the mark_uninteresting() function as it's not used anywhere else.

When testing against a large repo [1], this patch reduces the
connectivity check runtime from 3 minutes to ~7 seconds.

[1]: https://android.googlesource.com/platform/frameworks/base/

Helped-By: Jonathan Tan <jonathantanmy@google.com>
Helped-By: Jeff King <peff@peff.net>
Helped-By: Jonathan Nieder <jrnieder@gmail.com>
Signed-off-by: Josh Steadmon <steadmon@google.com>

Signed-off-by: Josh Steadmon <steadmon@google.com>
---
Changes since V2:
* Pulled the "OK to skip?" logic into a separate function.

Changes since V1:
* Switched to alternate approach, we now do the regular rev-list walk
  but skip promisor objects at walk time, rather than pre-excluding
  them.

Range-diff against v2:
1:  9f327d6d8d ! 1:  9856e7fc74 rev-list: exclude promisor objects at walk time
    @@ -10,9 +10,15 @@
         the promisor objects. Instead, filter them (and any children objects)
         during the actual graph walk.
     
    +    When testing against a large repo [1], this reduces the connectivity
    +    check runtime from 3 minutes to ~7 seconds.
    +
    +    [1]: https://android.googlesource.com/platform/frameworks/base/
    +
         Helped-By: Jonathan Tan <jonathantanmy@google.com>
         Helped-By: Jeff King <peff@peff.net>
         Helped-By: Jonathan Nieder <jrnieder@gmail.com>
    +    Signed-off-by: Josh Steadmon <steadmon@google.com>
     
     
    @@ -20,78 +26,55 @@
      --- a/list-objects.c
      +++ b/list-objects.c
     @@
    - 	struct object *obj = &blob->object;
    - 	size_t pathlen;
    - 	enum list_objects_filter_result r = LOFR_MARK_SEEN | LOFR_DO_SHOW;
    -+	struct object_info oi = OBJECT_INFO_INIT;
    + 	void *filter_data;
    + };
      
    - 	if (!ctx->revs->blob_objects)
    - 		return;
    ++static int should_skip_promisor_object(const struct rev_info *revs,
    ++				       const struct object_id *oid)
    ++{
    ++	struct object_info oi = OBJECT_INFO_INIT;
    ++	return (revs->exclude_promisor_objects &&
    ++		!oid_object_info_extended(the_repository, oid, &oi, 0) &&
    ++		oi.whence == OI_PACKED &&
    ++		oi.u.packed.pack->pack_promisor);
    ++}
    ++
    + static void process_blob(struct traversal_context *ctx,
    + 			 struct blob *blob,
    + 			 struct strbuf *path,
     @@
      		die("bad blob object");
      	if (obj->flags & (UNINTERESTING | SEEN))
      		return;
    -+	if (ctx->revs->exclude_promisor_objects &&
    -+	    !oid_object_info_extended(the_repository, &obj->oid, &oi, 0) &&
    -+	    oi.whence == OI_PACKED &&
    -+	    oi.u.packed.pack->pack_promisor)
    ++	if (should_skip_promisor_object(ctx->revs, &obj->oid))
     +		return;
      
      	/*
      	 * Pre-filter known-missing objects when explicitly requested.
    -@@
    - 	int baselen = base->len;
    - 	enum list_objects_filter_result r = LOFR_MARK_SEEN | LOFR_DO_SHOW;
    - 	int failed_parse;
    -+	struct object_info oi = OBJECT_INFO_INIT;
    - 
    - 	if (!revs->tree_objects)
    - 		return;
     @@
      		die("bad tree object");
      	if (obj->flags & (UNINTERESTING | SEEN))
      		return;
    -+	if (ctx->revs->exclude_promisor_objects &&
    -+	    !oid_object_info_extended(the_repository, &obj->oid, &oi, 0) &&
    -+	    oi.whence == OI_PACKED &&
    -+	    oi.u.packed.pack->pack_promisor)
    ++	if (should_skip_promisor_object(ctx->revs, &obj->oid))
     +		return;
      
      	failed_parse = parse_tree_gently(tree, 1);
      	if (failed_parse) {
    -@@
    - 				     struct strbuf *base)
    - {
    - 	int i;
    -+	struct object_info oi = OBJECT_INFO_INIT;
    - 
    - 	assert(base->len == 0);
    - 
     @@
      		struct object *obj = pending->item;
      		const char *name = pending->name;
      		const char *path = pending->path;
    -+		if (ctx->revs->exclude_promisor_objects &&
    -+		    !oid_object_info_extended(the_repository, &obj->oid, &oi, 0) &&
    -+		    oi.whence == OI_PACKED &&
    -+		    oi.u.packed.pack->pack_promisor)
    ++		if (should_skip_promisor_object(ctx->revs, &obj->oid))
     +			continue;
     +
      		if (obj->flags & (UNINTERESTING | SEEN))
      			continue;
      		if (obj->type == OBJ_TAG) {
     @@
    - {
    - 	struct commit *commit;
    - 	struct strbuf csp; /* callee's scratch pad */
    -+	struct object_info oi = OBJECT_INFO_INIT;
      	strbuf_init(&csp, PATH_MAX);
      
      	while ((commit = get_revision(ctx->revs)) != NULL) {
    -+		if (ctx->revs->exclude_promisor_objects &&
    -+		    !oid_object_info_extended(the_repository, &commit->object.oid, &oi, 0) &&
    -+		    oi.whence == OI_PACKED &&
    -+		    oi.u.packed.pack->pack_promisor)
    ++		if (should_skip_promisor_object(ctx->revs, &commit->object.oid))
     +			continue;
     +
      		/*

 list-objects.c | 20 ++++++++++++++++++++
 revision.c     | 16 ----------------
 2 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/list-objects.c b/list-objects.c
index dc77361e11..c153ee5dfb 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -22,6 +22,16 @@ struct traversal_context {
 	void *filter_data;
 };
 
+static int should_skip_promisor_object(const struct rev_info *revs,
+				       const struct object_id *oid)
+{
+	struct object_info oi = OBJECT_INFO_INIT;
+	return (revs->exclude_promisor_objects &&
+		!oid_object_info_extended(the_repository, oid, &oi, 0) &&
+		oi.whence == OI_PACKED &&
+		oi.u.packed.pack->pack_promisor);
+}
+
 static void process_blob(struct traversal_context *ctx,
 			 struct blob *blob,
 			 struct strbuf *path,
@@ -37,6 +47,8 @@ static void process_blob(struct traversal_context *ctx,
 		die("bad blob object");
 	if (obj->flags & (UNINTERESTING | SEEN))
 		return;
+	if (should_skip_promisor_object(ctx->revs, &obj->oid))
+		return;
 
 	/*
 	 * Pre-filter known-missing objects when explicitly requested.
@@ -156,6 +168,8 @@ static void process_tree(struct traversal_context *ctx,
 		die("bad tree object");
 	if (obj->flags & (UNINTERESTING | SEEN))
 		return;
+	if (should_skip_promisor_object(ctx->revs, &obj->oid))
+		return;
 
 	failed_parse = parse_tree_gently(tree, 1);
 	if (failed_parse) {
@@ -326,6 +340,9 @@ static void traverse_trees_and_blobs(struct traversal_context *ctx,
 		struct object *obj = pending->item;
 		const char *name = pending->name;
 		const char *path = pending->path;
+		if (should_skip_promisor_object(ctx->revs, &obj->oid))
+			continue;
+
 		if (obj->flags & (UNINTERESTING | SEEN))
 			continue;
 		if (obj->type == OBJ_TAG) {
@@ -356,6 +373,9 @@ static void do_traverse(struct traversal_context *ctx)
 	strbuf_init(&csp, PATH_MAX);
 
 	while ((commit = get_revision(ctx->revs)) != NULL) {
+		if (should_skip_promisor_object(ctx->revs, &commit->object.oid))
+			continue;
+
 		/*
 		 * an uninteresting boundary commit may not have its tree
 		 * parsed yet, but we are not going to show them anyway
diff --git a/revision.c b/revision.c
index eb8e51bc63..85974e941d 100644
--- a/revision.c
+++ b/revision.c
@@ -3067,17 +3067,6 @@ void reset_revision_walk(void)
 	clear_object_flags(SEEN | ADDED | SHOWN);
 }
 
-static int mark_uninteresting(const struct object_id *oid,
-			      struct packed_git *pack,
-			      uint32_t pos,
-			      void *cb)
-{
-	struct rev_info *revs = cb;
-	struct object *o = parse_object(revs->repo, oid);
-	o->flags |= UNINTERESTING | SEEN;
-	return 0;
-}
-
 define_commit_slab(indegree_slab, int);
 define_commit_slab(author_date_slab, timestamp_t);
 
@@ -3316,11 +3305,6 @@ int prepare_revision_walk(struct rev_info *revs)
 	    (revs->limited && limiting_can_increase_treesame(revs)))
 		revs->treesame.name = "treesame";
 
-	if (revs->exclude_promisor_objects) {
-		for_each_packed_object(mark_uninteresting, revs,
-				       FOR_EACH_OBJECT_PROMISOR_ONLY);
-	}
-
 	if (revs->no_walk != REVISION_WALK_NO_WALK_UNSORTED)
 		commit_list_sort_by_date(&revs->commits);
 	if (revs->no_walk)
-- 
2.21.0.392.gf8f6787159e-goog


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v3] rev-list: exclude promisor objects at walk time
  2019-04-08 21:06 ` [PATCH v3] " Josh Steadmon
@ 2019-04-08 22:23   ` Christian Couder
  2019-04-08 23:12     ` Josh Steadmon
  2019-04-09 15:14   ` Junio C Hamano
  2019-04-09 18:04   ` SZEDER Gábor
  2 siblings, 1 reply; 27+ messages in thread
From: Christian Couder @ 2019-04-08 22:23 UTC (permalink / raw)
  To: Josh Steadmon; +Cc: git, Jeff King, Jonathan Tan, Jonathan Nieder

On Mon, Apr 8, 2019 at 11:46 PM Josh Steadmon <steadmon@google.com> wrote:
>
> Helped-By: Jonathan Tan <jonathantanmy@google.com>
> Helped-By: Jeff King <peff@peff.net>
> Helped-By: Jonathan Nieder <jrnieder@gmail.com>
> Signed-off-by: Josh Steadmon <steadmon@google.com>
>
> Signed-off-by: Josh Steadmon <steadmon@google.com>

Your S-o-B is duplicated an there is a spurious line between the
duplicated lines. Do you use an automated script/hook to add your
S-o-B?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3] rev-list: exclude promisor objects at walk time
  2019-04-08 22:23   ` Christian Couder
@ 2019-04-08 23:12     ` Josh Steadmon
  0 siblings, 0 replies; 27+ messages in thread
From: Josh Steadmon @ 2019-04-08 23:12 UTC (permalink / raw)
  To: Christian Couder; +Cc: git, Jeff King, Jonathan Tan, Jonathan Nieder

On 2019.04.09 00:23, Christian Couder wrote:
> On Mon, Apr 8, 2019 at 11:46 PM Josh Steadmon <steadmon@google.com> wrote:
> >
> > Helped-By: Jonathan Tan <jonathantanmy@google.com>
> > Helped-By: Jeff King <peff@peff.net>
> > Helped-By: Jonathan Nieder <jrnieder@gmail.com>
> > Signed-off-by: Josh Steadmon <steadmon@google.com>
> >
> > Signed-off-by: Josh Steadmon <steadmon@google.com>
> 
> Your S-o-B is duplicated an there is a spurious line between the
> duplicated lines. Do you use an automated script/hook to add your
> S-o-B?

I only use the --signoff flag to git-format-patch. It looks like the
cause is that I have a hook to add a Gerrit Change-Id to my commit
messages, and that hook added a blank line followed by Change-Id:blah
after my original Signed-off-by line. Then git-format-patch added
another Signed-off-by after the Change-Id. The Change-Id line was then
stripped out by a wrapper around git-send-email.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3] rev-list: exclude promisor objects at walk time
  2019-04-08 21:06 ` [PATCH v3] " Josh Steadmon
  2019-04-08 22:23   ` Christian Couder
@ 2019-04-09 15:14   ` Junio C Hamano
  2019-04-09 15:15     ` Jeff King
  2019-04-09 18:04   ` SZEDER Gábor
  2 siblings, 1 reply; 27+ messages in thread
From: Junio C Hamano @ 2019-04-09 15:14 UTC (permalink / raw)
  To: Josh Steadmon; +Cc: git, peff, jonathantanmy, jrnieder

Josh Steadmon <steadmon@google.com> writes:

> For large repositories, enumerating the list of all promisor objects (in
> order to exclude them from a rev-list walk) can take a significant
> amount of time).
>
> When --exclude-promisor-objects is passed to rev-list, don't enumerate
> the promisor objects. Instead, filter them (and any children objects)
> during the actual graph walk.
>
> Remove the mark_uninteresting() function as it's not used anywhere else.
>
> When testing against a large repo [1], this patch reduces the
> connectivity check runtime from 3 minutes to ~7 seconds.
>
> [1]: https://android.googlesource.com/platform/frameworks/base/
>
> Helped-By: Jonathan Tan <jonathantanmy@google.com>
> Helped-By: Jeff King <peff@peff.net>
> Helped-By: Jonathan Nieder <jrnieder@gmail.com>
> Signed-off-by: Josh Steadmon <steadmon@google.com>
>
> Signed-off-by: Josh Steadmon <steadmon@google.com>

I've dealt with the stray double-sign-off locally, but is there
anything else planned for v4 or later?  Is this performance-only
change, or does it have an externally observable behaviour change
that we can easily add to our test suite?

>  list-objects.c | 20 ++++++++++++++++++++
>  revision.c     | 16 ----------------
>  2 files changed, 20 insertions(+), 16 deletions(-)


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3] rev-list: exclude promisor objects at walk time
  2019-04-09 15:14   ` Junio C Hamano
@ 2019-04-09 15:15     ` Jeff King
  2019-04-09 15:43       ` Junio C Hamano
  0 siblings, 1 reply; 27+ messages in thread
From: Jeff King @ 2019-04-09 15:15 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Josh Steadmon, git, jonathantanmy, jrnieder

On Wed, Apr 10, 2019 at 12:14:41AM +0900, Junio C Hamano wrote:

> Josh Steadmon <steadmon@google.com> writes:
> 
> > For large repositories, enumerating the list of all promisor objects (in
> > order to exclude them from a rev-list walk) can take a significant
> > amount of time).
> >
> > When --exclude-promisor-objects is passed to rev-list, don't enumerate
> > the promisor objects. Instead, filter them (and any children objects)
> > during the actual graph walk.
> >
> > Remove the mark_uninteresting() function as it's not used anywhere else.
> >
> > When testing against a large repo [1], this patch reduces the
> > connectivity check runtime from 3 minutes to ~7 seconds.
> >
> > [1]: https://android.googlesource.com/platform/frameworks/base/
> >
> > Helped-By: Jonathan Tan <jonathantanmy@google.com>
> > Helped-By: Jeff King <peff@peff.net>
> > Helped-By: Jonathan Nieder <jrnieder@gmail.com>
> > Signed-off-by: Josh Steadmon <steadmon@google.com>
> >
> > Signed-off-by: Josh Steadmon <steadmon@google.com>
> 
> I've dealt with the stray double-sign-off locally, but is there
> anything else planned for v4 or later?  Is this performance-only
> change, or does it have an externally observable behaviour change
> that we can easily add to our test suite?

I am OK if we do not include it, but even if this is "just" a
performance-only change, we can always add to our perf regression suite.

-Peff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3] rev-list: exclude promisor objects at walk time
  2019-04-09 15:15     ` Jeff King
@ 2019-04-09 15:43       ` Junio C Hamano
  2019-04-09 16:35         ` Josh Steadmon
  0 siblings, 1 reply; 27+ messages in thread
From: Junio C Hamano @ 2019-04-09 15:43 UTC (permalink / raw)
  To: Jeff King; +Cc: Josh Steadmon, git, jonathantanmy, jrnieder

Jeff King <peff@peff.net> writes:

> On Wed, Apr 10, 2019 at 12:14:41AM +0900, Junio C Hamano wrote:
>
>> I've dealt with the stray double-sign-off locally, but is there
>> anything else planned for v4 or later?  Is this performance-only
>> change, or does it have an externally observable behaviour change
>> that we can easily add to our test suite?
>
> I am OK if we do not include it, but even if this is "just" a
> performance-only change, we can always add to our perf regression suite.

Hmph, that does not say much about a possible change in behaviour in
corner cases you guys were discuussing near the beginning of the
thread when an object can be reached from both a non-promisor and a
promisor object, does it?

Shouldn't we at least tweak the log message to record that we were
aware of the possibility even though we couldn't readily come up
with a case where this optimization breaks things?  I suspect that
it would help the next person who needs to deal with a possible
regression coming from this change to understand the problem better
and hopefully faster.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3] rev-list: exclude promisor objects at walk time
  2019-04-09 15:43       ` Junio C Hamano
@ 2019-04-09 16:35         ` Josh Steadmon
  0 siblings, 0 replies; 27+ messages in thread
From: Josh Steadmon @ 2019-04-09 16:35 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, git, jonathantanmy, jrnieder

On 2019.04.10 00:43, Junio C Hamano wrote:
> Jeff King <peff@peff.net> writes:
> 
> > On Wed, Apr 10, 2019 at 12:14:41AM +0900, Junio C Hamano wrote:
> >
> >> I've dealt with the stray double-sign-off locally, but is there
> >> anything else planned for v4 or later?  Is this performance-only
> >> change, or does it have an externally observable behaviour change
> >> that we can easily add to our test suite?
> >
> > I am OK if we do not include it, but even if this is "just" a
> > performance-only change, we can always add to our perf regression suite.
> 
> Hmph, that does not say much about a possible change in behaviour in
> corner cases you guys were discuussing near the beginning of the
> thread when an object can be reached from both a non-promisor and a
> promisor object, does it?
> 
> Shouldn't we at least tweak the log message to record that we were
> aware of the possibility even though we couldn't readily come up
> with a case where this optimization breaks things?  I suspect that
> it would help the next person who needs to deal with a possible
> regression coming from this change to understand the problem better
> and hopefully faster.
> 

I'll update the log message and send a v4 in a few minutes.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3] rev-list: exclude promisor objects at walk time
  2019-04-08 21:06 ` [PATCH v3] " Josh Steadmon
  2019-04-08 22:23   ` Christian Couder
  2019-04-09 15:14   ` Junio C Hamano
@ 2019-04-09 18:04   ` SZEDER Gábor
  2019-04-09 23:42     ` Josh Steadmon
  2 siblings, 1 reply; 27+ messages in thread
From: SZEDER Gábor @ 2019-04-09 18:04 UTC (permalink / raw)
  To: Josh Steadmon; +Cc: git, peff, jonathantanmy, jrnieder

On Mon, Apr 08, 2019 at 02:06:04PM -0700, Josh Steadmon wrote:
> For large repositories, enumerating the list of all promisor objects (in
> order to exclude them from a rev-list walk) can take a significant
> amount of time).
> 
> When --exclude-promisor-objects is passed to rev-list, don't enumerate
> the promisor objects. Instead, filter them (and any children objects)
> during the actual graph walk.
> 
> Remove the mark_uninteresting() function as it's not used anywhere else.
> 
> When testing against a large repo [1], this patch reduces the
> connectivity check runtime from 3 minutes to ~7 seconds.

This patch breaks test 'repack -d does not irreversibly delete
promisor objects' in 't0410-partial-clone.sh' when run with
GIT_TEST_COMMIT_GRAPH=1.

  +rm -rf repo
  +test_create_repo repo
  +test 1 = 1
  +repo=repo
  +mkdir -p repo
  +cd repo
  +/home/travis/build/git/git/t/../git init --template=/home/travis/build/git/git/t/../templates/blt/
  Initialized empty Git repository in /home/travis/build/git/git/t/trash directory.t0410-partial-clone/repo/.git/
  +mv .git/hooks .git/hooks-disabled
  +git -C repo config core.repositoryformatversion 1
  +git -C repo config extensions.partialclone arbitrary string
  +git -C repo commit --allow-empty -m one
  [master (root-commit) 71905df] one
   Author: A U Thor <author@example.com>
  +git -C repo commit --allow-empty -m two
  [master 202c4a3] two
   Author: A U Thor <author@example.com>
  +git -C repo commit --allow-empty -m three
  [master 4737577] three
   Author: A U Thor <author@example.com>
  +git -C repo commit --allow-empty -m four
  [master d6ba7e0] four
   Author: A U Thor <author@example.com>
  +git -C repo rev-parse HEAD^^^
  +ONE=71905dfcd543b7cbb0b4b66fbd20379e67220557
  +git -C repo rev-parse HEAD^^
  +TWO=202c4a3dd9a2dac927f056abb747cce9ea2eb67b
  +git -C repo rev-parse HEAD^
  +THREE=47375779ebcca4b422e3afdd14aa37a358081297
  +pack_as_from_promisor
  +printf 202c4a3dd9a2dac927f056abb747cce9ea2eb67b\n
  +git -C repo pack-objects .git/objects/pack/pack
  +HASH=2e675cd706e508d6c52a21d28cfcddde5ec02a06
  +
  +echo 2e675cd706e508d6c52a21d28cfcddde5ec02a06
  2e675cd706e508d6c52a21d28cfcddde5ec02a06
  +printf 47375779ebcca4b422e3afdd14aa37a358081297\n
  +pack_as_from_promisor
  +git -C repo pack-objects .git/objects/pack/pack
  +HASH=31f7d2797549ab9b1c425a9e60eb2030481170e5
  +
  +echo 31f7d2797549ab9b1c425a9e60eb2030481170e5
  31f7d2797549ab9b1c425a9e60eb2030481170e5
  +delete_object repo 71905dfcd543b7cbb0b4b66fbd20379e67220557
  +sed -e s|^..|&/|
  +echo 71905dfcd543b7cbb0b4b66fbd20379e67220557
  +rm repo/.git/objects/71/905dfcd543b7cbb0b4b66fbd20379e67220557
  +repack_and_check -a 202c4a3dd9a2dac927f056abb747cce9ea2eb67b 47375779ebcca4b422e3afdd14aa37a358081297
  +rm -rf repo2
  +cp -r repo repo2
  +git -C repo2 repack -a -d
  warning: reflog of 'HEAD' references pruned commits
  warning: reflog of 'refs/heads/master' references pruned commits
  fatal: unable to read 71905dfcd543b7cbb0b4b66fbd20379e67220557
  error: last command exited with $?=128
  not ok 23 - repack -d does not irreversibly delete promisor objects


  https://travis-ci.org/git/git/jobs/517874310#L5822


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3] rev-list: exclude promisor objects at walk time
  2019-04-09 18:04   ` SZEDER Gábor
@ 2019-04-09 23:42     ` Josh Steadmon
  2019-04-11  4:06       ` Jeff King
  0 siblings, 1 reply; 27+ messages in thread
From: Josh Steadmon @ 2019-04-09 23:42 UTC (permalink / raw)
  To: SZEDER Gábor; +Cc: git, peff, jonathantanmy, jrnieder

On 2019.04.09 20:04, SZEDER Gábor wrote:
> On Mon, Apr 08, 2019 at 02:06:04PM -0700, Josh Steadmon wrote:
> > For large repositories, enumerating the list of all promisor objects (in
> > order to exclude them from a rev-list walk) can take a significant
> > amount of time).
> > 
> > When --exclude-promisor-objects is passed to rev-list, don't enumerate
> > the promisor objects. Instead, filter them (and any children objects)
> > during the actual graph walk.
> > 
> > Remove the mark_uninteresting() function as it's not used anywhere else.
> > 
> > When testing against a large repo [1], this patch reduces the
> > connectivity check runtime from 3 minutes to ~7 seconds.
> 
> This patch breaks test 'repack -d does not irreversibly delete
> promisor objects' in 't0410-partial-clone.sh' when run with
> GIT_TEST_COMMIT_GRAPH=1.
> 
>   +rm -rf repo
>   +test_create_repo repo
>   +test 1 = 1
>   +repo=repo
>   +mkdir -p repo
>   +cd repo
>   +/home/travis/build/git/git/t/../git init --template=/home/travis/build/git/git/t/../templates/blt/
>   Initialized empty Git repository in /home/travis/build/git/git/t/trash directory.t0410-partial-clone/repo/.git/
>   +mv .git/hooks .git/hooks-disabled
>   +git -C repo config core.repositoryformatversion 1
>   +git -C repo config extensions.partialclone arbitrary string
>   +git -C repo commit --allow-empty -m one
>   [master (root-commit) 71905df] one
>    Author: A U Thor <author@example.com>
>   +git -C repo commit --allow-empty -m two
>   [master 202c4a3] two
>    Author: A U Thor <author@example.com>
>   +git -C repo commit --allow-empty -m three
>   [master 4737577] three
>    Author: A U Thor <author@example.com>
>   +git -C repo commit --allow-empty -m four
>   [master d6ba7e0] four
>    Author: A U Thor <author@example.com>
>   +git -C repo rev-parse HEAD^^^
>   +ONE=71905dfcd543b7cbb0b4b66fbd20379e67220557
>   +git -C repo rev-parse HEAD^^
>   +TWO=202c4a3dd9a2dac927f056abb747cce9ea2eb67b
>   +git -C repo rev-parse HEAD^
>   +THREE=47375779ebcca4b422e3afdd14aa37a358081297
>   +pack_as_from_promisor
>   +printf 202c4a3dd9a2dac927f056abb747cce9ea2eb67b\n
>   +git -C repo pack-objects .git/objects/pack/pack
>   +HASH=2e675cd706e508d6c52a21d28cfcddde5ec02a06
>   +
>   +echo 2e675cd706e508d6c52a21d28cfcddde5ec02a06
>   2e675cd706e508d6c52a21d28cfcddde5ec02a06
>   +printf 47375779ebcca4b422e3afdd14aa37a358081297\n
>   +pack_as_from_promisor
>   +git -C repo pack-objects .git/objects/pack/pack
>   +HASH=31f7d2797549ab9b1c425a9e60eb2030481170e5
>   +
>   +echo 31f7d2797549ab9b1c425a9e60eb2030481170e5
>   31f7d2797549ab9b1c425a9e60eb2030481170e5
>   +delete_object repo 71905dfcd543b7cbb0b4b66fbd20379e67220557
>   +sed -e s|^..|&/|
>   +echo 71905dfcd543b7cbb0b4b66fbd20379e67220557
>   +rm repo/.git/objects/71/905dfcd543b7cbb0b4b66fbd20379e67220557
>   +repack_and_check -a 202c4a3dd9a2dac927f056abb747cce9ea2eb67b 47375779ebcca4b422e3afdd14aa37a358081297
>   +rm -rf repo2
>   +cp -r repo repo2
>   +git -C repo2 repack -a -d
>   warning: reflog of 'HEAD' references pruned commits
>   warning: reflog of 'refs/heads/master' references pruned commits
>   fatal: unable to read 71905dfcd543b7cbb0b4b66fbd20379e67220557
>   error: last command exited with $?=128
>   not ok 23 - repack -d does not irreversibly delete promisor objects
> 
> 
>   https://travis-ci.org/git/git/jobs/517874310#L5822
> 

Thank you for catching this. I haven't yet figured out the cause. I will
look into this more tomorrow and then send a V4 once I've fixed it.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3] rev-list: exclude promisor objects at walk time
  2019-04-09 23:42     ` Josh Steadmon
@ 2019-04-11  4:06       ` Jeff King
  2019-04-12 22:38         ` Josh Steadmon
  0 siblings, 1 reply; 27+ messages in thread
From: Jeff King @ 2019-04-11  4:06 UTC (permalink / raw)
  To: Josh Steadmon, SZEDER Gábor, git, jonathantanmy, jrnieder

On Tue, Apr 09, 2019 at 04:42:55PM -0700, Josh Steadmon wrote:

> >   warning: reflog of 'HEAD' references pruned commits
> >   warning: reflog of 'refs/heads/master' references pruned commits
> >   fatal: unable to read 71905dfcd543b7cbb0b4b66fbd20379e67220557
> >   error: last command exited with $?=128
> >   not ok 23 - repack -d does not irreversibly delete promisor objects
> > 
> 
> Thank you for catching this. I haven't yet figured out the cause. I will
> look into this more tomorrow and then send a V4 once I've fixed it.

I'm concerned that this is a sign that the approach I suggested does not
actually work everywhere. I.e., could this be a case where we have some
non-promisor object that points to a sub-object that is reachable from
the promisor pack, but not a direct tip? Before your patch we'd consider
that sub-object a promisor (because we enumerate all of the graph that
we do have and mark each such object), but afterwards we would not.

And I wonder if that confuses pack-objects. Though I think it would
confuse it in the _opposite_ direction. I.e., using
--exclude-promisor-objects would count such an object as not-a-promisor
and would be more inclined to include it in the new pack.

It is curious that this only turns up with GIT_TEST_COMMIT_GRAPH=1, too.
It seems like any such problem ought to be independent of that.

Puzzling...

-Peff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3] rev-list: exclude promisor objects at walk time
  2019-04-11  4:06       ` Jeff King
@ 2019-04-12 22:38         ` Josh Steadmon
  2019-04-13  5:34           ` Jeff King
  0 siblings, 1 reply; 27+ messages in thread
From: Josh Steadmon @ 2019-04-12 22:38 UTC (permalink / raw)
  To: Jeff King; +Cc: SZEDER Gábor, git, jonathantanmy, jrnieder

On 2019.04.11 00:06, Jeff King wrote:
> On Tue, Apr 09, 2019 at 04:42:55PM -0700, Josh Steadmon wrote:
> 
> > >   warning: reflog of 'HEAD' references pruned commits
> > >   warning: reflog of 'refs/heads/master' references pruned commits
> > >   fatal: unable to read 71905dfcd543b7cbb0b4b66fbd20379e67220557
> > >   error: last command exited with $?=128
> > >   not ok 23 - repack -d does not irreversibly delete promisor objects
> > > 
> > 
> > Thank you for catching this. I haven't yet figured out the cause. I will
> > look into this more tomorrow and then send a V4 once I've fixed it.
> 
> I'm concerned that this is a sign that the approach I suggested does not
> actually work everywhere. I.e., could this be a case where we have some
> non-promisor object that points to a sub-object that is reachable from
> the promisor pack, but not a direct tip? Before your patch we'd consider
> that sub-object a promisor (because we enumerate all of the graph that
> we do have and mark each such object), but afterwards we would not.
> 
> And I wonder if that confuses pack-objects. Though I think it would
> confuse it in the _opposite_ direction. I.e., using
> --exclude-promisor-objects would count such an object as not-a-promisor
> and would be more inclined to include it in the new pack.
> 
> It is curious that this only turns up with GIT_TEST_COMMIT_GRAPH=1, too.
> It seems like any such problem ought to be independent of that.
> 
> Puzzling...
> 
> -Peff

Do you think this justifies going back to the V1 approach (only checking
presence of objects pointed to by refs when doing a partial clone)?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3] rev-list: exclude promisor objects at walk time
  2019-04-12 22:38         ` Josh Steadmon
@ 2019-04-13  5:34           ` Jeff King
  2019-04-19 20:26             ` Josh Steadmon
  0 siblings, 1 reply; 27+ messages in thread
From: Jeff King @ 2019-04-13  5:34 UTC (permalink / raw)
  To: Josh Steadmon, SZEDER Gábor, git, jonathantanmy, jrnieder

On Fri, Apr 12, 2019 at 03:38:47PM -0700, Josh Steadmon wrote:

> > > Thank you for catching this. I haven't yet figured out the cause. I will
> > > look into this more tomorrow and then send a V4 once I've fixed it.
> > 
> > I'm concerned that this is a sign that the approach I suggested does not
> > actually work everywhere. I.e., could this be a case where we have some
> > non-promisor object that points to a sub-object that is reachable from
> > the promisor pack, but not a direct tip? Before your patch we'd consider
> > that sub-object a promisor (because we enumerate all of the graph that
> > we do have and mark each such object), but afterwards we would not.
> > 
> > And I wonder if that confuses pack-objects. Though I think it would
> > confuse it in the _opposite_ direction. I.e., using
> > --exclude-promisor-objects would count such an object as not-a-promisor
> > and would be more inclined to include it in the new pack.
> > 
> > It is curious that this only turns up with GIT_TEST_COMMIT_GRAPH=1, too.
> > It seems like any such problem ought to be independent of that.
> > 
> > Puzzling...
> 
> Do you think this justifies going back to the V1 approach (only checking
> presence of objects pointed to by refs when doing a partial clone)?

Yes, I think it might. Especially coupled with your other report that
the V1 approach is 500ms compared to several seconds for this one. Which
I'd guess is probably because we actually parse the ref tip objects in
rev-list, whereas your V1 just skipped that step entirely (which is
perfectly fine for a clone, as we'd have just hashed the objects via
index-pack anyway).

It might be interesting to know if the problem is indeed insurmountable
with the V3 approach here, or if it's simply another bug. But diving
into it is going to be rather tricky, and I am not volunteering to do
it. :) So if you want to punt and go back to the more clearly correct V1
approach, I can live with that. We can always revisit this approach
later (it wouldn't be necessary for the clone case after your V1, but in
theory it could be helping other cases, too).

-Peff

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3] rev-list: exclude promisor objects at walk time
  2019-04-13  5:34           ` Jeff King
@ 2019-04-19 20:26             ` Josh Steadmon
  0 siblings, 0 replies; 27+ messages in thread
From: Josh Steadmon @ 2019-04-19 20:26 UTC (permalink / raw)
  To: Jeff King; +Cc: SZEDER Gábor, git, jonathantanmy, jrnieder

On 2019.04.13 01:34, Jeff King wrote:
> On Fri, Apr 12, 2019 at 03:38:47PM -0700, Josh Steadmon wrote:
> 
> > > > Thank you for catching this. I haven't yet figured out the cause. I will
> > > > look into this more tomorrow and then send a V4 once I've fixed it.
> > > 
> > > I'm concerned that this is a sign that the approach I suggested does not
> > > actually work everywhere. I.e., could this be a case where we have some
> > > non-promisor object that points to a sub-object that is reachable from
> > > the promisor pack, but not a direct tip? Before your patch we'd consider
> > > that sub-object a promisor (because we enumerate all of the graph that
> > > we do have and mark each such object), but afterwards we would not.
> > > 
> > > And I wonder if that confuses pack-objects. Though I think it would
> > > confuse it in the _opposite_ direction. I.e., using
> > > --exclude-promisor-objects would count such an object as not-a-promisor
> > > and would be more inclined to include it in the new pack.
> > > 
> > > It is curious that this only turns up with GIT_TEST_COMMIT_GRAPH=1, too.
> > > It seems like any such problem ought to be independent of that.
> > > 
> > > Puzzling...
> > 
> > Do you think this justifies going back to the V1 approach (only checking
> > presence of objects pointed to by refs when doing a partial clone)?
> 
> Yes, I think it might. Especially coupled with your other report that
> the V1 approach is 500ms compared to several seconds for this one. Which
> I'd guess is probably because we actually parse the ref tip objects in
> rev-list, whereas your V1 just skipped that step entirely (which is
> perfectly fine for a clone, as we'd have just hashed the objects via
> index-pack anyway).
> 
> It might be interesting to know if the problem is indeed insurmountable
> with the V3 approach here, or if it's simply another bug. But diving
> into it is going to be rather tricky, and I am not volunteering to do
> it. :) So if you want to punt and go back to the more clearly correct V1
> approach, I can live with that. We can always revisit this approach
> later (it wouldn't be necessary for the clone case after your V1, but in
> theory it could be helping other cases, too).
> 
> -Peff

I have not made any progress in figuring out the repack + commit-graph
failure, so I will resend V1.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v4] clone: do faster object check for partial clones
  2019-04-03 17:27 [PATCH] clone: do faster object check for partial clones Josh Steadmon
                   ` (3 preceding siblings ...)
  2019-04-08 21:06 ` [PATCH v3] " Josh Steadmon
@ 2019-04-19 21:00 ` Josh Steadmon
  2019-04-22 21:31   ` Jeff King
  4 siblings, 1 reply; 27+ messages in thread
From: Josh Steadmon @ 2019-04-19 21:00 UTC (permalink / raw)
  To: git; +Cc: jonathantanmy, peff

For partial clones, doing a full connectivity check is wasteful; we skip
promisor objects (which, for a partial clone, is all known objects), and
enumerating them all to exclude them from the connectivity check can
take a significant amount of time on large repos.

At most, we want to make sure that we get the objects referred to by any
wanted refs. For partial clones, just check that these objects were
transferred.

Signed-off-by: Josh Steadmon <steadmon@google.com>
---
This is an update of the original V1 approach (skipping the fully
connectivity check when doing a partial clone) with updated commit
message and comments to address the review concerns.


Range-diff against v1:
1:  9c29e1ce8d ! 1:  b4a285e2a1 clone: do faster object check for partial clones
    @@ -4,8 +4,8 @@
     
         For partial clones, doing a full connectivity check is wasteful; we skip
         promisor objects (which, for a partial clone, is all known objects), and
    -    excluding them all from the connectivity check can take a significant
    -    amount of time on large repos.
    +    enumerating them all to exclude them from the connectivity check can
    +    take a significant amount of time on large repos.
     
         At most, we want to make sure that we get the objects referred to by any
         wanted refs. For partial clones, just check that these objects were
    @@ -59,10 +59,12 @@
      
     +	if (opt->check_refs_only) {
     +		/*
    -+		 * For partial clones, we don't want to walk the full commit
    -+		 * graph because we're skipping promisor objects anyway. We
    -+		 * should just check that objects referenced by wanted refs were
    -+		 * transferred.
    ++		 * For partial clones, we don't want to have to do a regular
    ++		 * connectivity check because we have to enumerate and exclude
    ++		 * all promisor objects (slow), and then the connectivity check
    ++		 * itself becomes a no-op because in a partial clone every
    ++		 * object is a promisor object. Instead, just make sure we
    ++		 * received the objects pointed to by each wanted ref.
     +		 */
     +		do {
     +			if (!repo_has_object_file(the_repository, &oid))
    @@ -86,8 +88,8 @@
     +	/*
     +	 * If non-zero, only check the top-level objects referenced by the
     +	 * wanted refs (passed in as cb_data). This is useful for partial
    -+	 * clones, where this can be much faster than excluding all promisor
    -+	 * objects prior to walking the commit graph.
    ++	 * clones, where enumerating and excluding all promisor objects is very
    ++	 * slow and the commit-walk itself becomes a no-op.
     +	 */
     +	unsigned check_refs_only : 1;
      };

 builtin/clone.c |  6 ++++--
 connected.c     | 17 +++++++++++++++++
 connected.h     |  8 ++++++++
 3 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/builtin/clone.c b/builtin/clone.c
index 50bde99618..fdbbd8942a 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -657,7 +657,8 @@ static void update_remote_refs(const struct ref *refs,
 			       const char *branch_top,
 			       const char *msg,
 			       struct transport *transport,
-			       int check_connectivity)
+			       int check_connectivity,
+			       int check_refs_only)
 {
 	const struct ref *rm = mapped_refs;
 
@@ -666,6 +667,7 @@ static void update_remote_refs(const struct ref *refs,
 
 		opt.transport = transport;
 		opt.progress = transport->progress;
+		opt.check_refs_only = !!check_refs_only;
 
 		if (check_connected(iterate_ref_map, &rm, &opt))
 			die(_("remote did not send all necessary objects"));
@@ -1224,7 +1226,7 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 
 	update_remote_refs(refs, mapped_refs, remote_head_points_at,
 			   branch_top.buf, reflog_msg.buf, transport,
-			   !is_local);
+			   !is_local, filter_options.choice);
 
 	update_head(our_head_points_at, remote_head, reflog_msg.buf);
 
diff --git a/connected.c b/connected.c
index 1bba888eff..1ab481fed6 100644
--- a/connected.c
+++ b/connected.c
@@ -1,4 +1,5 @@
 #include "cache.h"
+#include "object-store.h"
 #include "run-command.h"
 #include "sigchain.h"
 #include "connected.h"
@@ -49,6 +50,22 @@ int check_connected(oid_iterate_fn fn, void *cb_data,
 		strbuf_release(&idx_file);
 	}
 
+	if (opt->check_refs_only) {
+		/*
+		 * For partial clones, we don't want to have to do a regular
+		 * connectivity check because we have to enumerate and exclude
+		 * all promisor objects (slow), and then the connectivity check
+		 * itself becomes a no-op because in a partial clone every
+		 * object is a promisor object. Instead, just make sure we
+		 * received the objects pointed to by each wanted ref.
+		 */
+		do {
+			if (!repo_has_object_file(the_repository, &oid))
+				return 1;
+		} while (!fn(cb_data, &oid));
+		return 0;
+	}
+
 	if (opt->shallow_file) {
 		argv_array_push(&rev_list.args, "--shallow-file");
 		argv_array_push(&rev_list.args, opt->shallow_file);
diff --git a/connected.h b/connected.h
index 8d5a6b3ad6..ce2e7d8f2e 100644
--- a/connected.h
+++ b/connected.h
@@ -46,6 +46,14 @@ struct check_connected_options {
 	 * during a fetch.
 	 */
 	unsigned is_deepening_fetch : 1;
+
+	/*
+	 * If non-zero, only check the top-level objects referenced by the
+	 * wanted refs (passed in as cb_data). This is useful for partial
+	 * clones, where enumerating and excluding all promisor objects is very
+	 * slow and the commit-walk itself becomes a no-op.
+	 */
+	unsigned check_refs_only : 1;
 };
 
 #define CHECK_CONNECTED_INIT { 0 }
-- 
2.21.0.593.g511ec345e18-goog


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v4] clone: do faster object check for partial clones
  2019-04-19 21:00 ` [PATCH v4] clone: do faster object check for partial clones Josh Steadmon
@ 2019-04-22 21:31   ` Jeff King
  0 siblings, 0 replies; 27+ messages in thread
From: Jeff King @ 2019-04-22 21:31 UTC (permalink / raw)
  To: Josh Steadmon; +Cc: git, jonathantanmy

On Fri, Apr 19, 2019 at 02:00:13PM -0700, Josh Steadmon wrote:

> For partial clones, doing a full connectivity check is wasteful; we skip
> promisor objects (which, for a partial clone, is all known objects), and
> enumerating them all to exclude them from the connectivity check can
> take a significant amount of time on large repos.
> 
> At most, we want to make sure that we get the objects referred to by any
> wanted refs. For partial clones, just check that these objects were
> transferred.
> 
> Signed-off-by: Josh Steadmon <steadmon@google.com>
> ---
> This is an update of the original V1 approach (skipping the fully
> connectivity check when doing a partial clone) with updated commit
> message and comments to address the review concerns.

This looks OK to me. Just trying to think of ways it could bite us, the
obvious line of thinking is where "--reference" is used. If we tell the
other side we already have object X, it will not be sent to us, and we
are relying on our local non-promisor alternate to have all of the
required objects. But I think this is OK:

  - for it to be mentioned in a ref, then the server must have been
    advertising it. Which means that it should similarly be on the hook
    for providing it to us via the promisor mechanism. That's a little
    hand-wavy, but then so is the entire promisor concept. We're
    inherently at the server's mercy, so if they're lying to us about
    what they're willing or able to provide, there's not much we can do
    anyway.

  - if we sent it as a "have" to the server, that means that our
    alternate was advertising it from a ref tip. Which means that unless
    it's corrupted, we're fine (which is no different than the
    connectivity promise we'd make for our own refs). I actually think
    that the connectivity check should "--not" any advertised alternate
    tips. I even have a patch to do that, but I need to polish it a
    little.

So I think this optimization will yield correct results. If we later
figure out a better way for rev-list to optimize this case itself, then
we can rip this out.

I think you had dug up some numbers to put in the commit message after
the last discussion? Likewise, is there a t/perf test which shows this
off (and will help us catch regressions)? If not, it might be worth
adding one (AFAICT a simple no-blobs partial-clone would be enough).

-Peff

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2019-04-22 21:31 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-03 17:27 [PATCH] clone: do faster object check for partial clones Josh Steadmon
2019-04-03 18:58 ` Jonathan Tan
2019-04-03 19:41 ` Jeff King
2019-04-03 20:57   ` Jonathan Tan
2019-04-04  0:21     ` Josh Steadmon
2019-04-04  1:33     ` Jeff King
2019-04-04 22:53 ` [PATCH v2] rev-list: exclude promisor objects at walk time Josh Steadmon
2019-04-04 23:08   ` Jeff King
2019-04-04 23:47     ` Josh Steadmon
2019-04-05  0:00       ` Jeff King
2019-04-05  0:09         ` Josh Steadmon
2019-04-08 20:59           ` Josh Steadmon
2019-04-08 21:06 ` [PATCH v3] " Josh Steadmon
2019-04-08 22:23   ` Christian Couder
2019-04-08 23:12     ` Josh Steadmon
2019-04-09 15:14   ` Junio C Hamano
2019-04-09 15:15     ` Jeff King
2019-04-09 15:43       ` Junio C Hamano
2019-04-09 16:35         ` Josh Steadmon
2019-04-09 18:04   ` SZEDER Gábor
2019-04-09 23:42     ` Josh Steadmon
2019-04-11  4:06       ` Jeff King
2019-04-12 22:38         ` Josh Steadmon
2019-04-13  5:34           ` Jeff King
2019-04-19 20:26             ` Josh Steadmon
2019-04-19 21:00 ` [PATCH v4] clone: do faster object check for partial clones Josh Steadmon
2019-04-22 21:31   ` Jeff King

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).