git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Josh Steadmon <steadmon@google.com>
To: Jonathan Tan <jonathantanmy@google.com>
Cc: peff@peff.net, git@vger.kernel.org
Subject: Re: [PATCH] clone: do faster object check for partial clones
Date: Wed, 3 Apr 2019 17:21:01 -0700	[thread overview]
Message-ID: <20190404002101.GF60888@google.com> (raw)
In-Reply-To: <20190403205748.107979-1-jonathantanmy@google.com>

On 2019.04.03 13:57, Jonathan Tan wrote:
> > On Wed, Apr 03, 2019 at 10:27:21AM -0700, Josh Steadmon wrote:
> > 
> > > For partial clones, doing a full connectivity check is wasteful; we skip
> > > promisor objects (which, for a partial clone, is all known objects), and
> > > excluding them all from the connectivity check can take a significant
> > > amount of time on large repos.
> > > 
> > > At most, we want to make sure that we get the objects referred to by any
> > > wanted refs. For partial clones, just check that these objects were
> > > transferred.
> > 
> > This isn't strictly true, since we could get objects from elsewhere via
> > --shared or --reference. Those might not be promisor objects.
> 
> I don't think local clones (which --shared or --reference implies) can
> be partial, but the bigger point is below.
> 
> > Shouldn't we be able to stop a traversal as soon as we see that an
> > object is in a promisor pack?
> > 
> > I.e., here:
> > 
> > > +	if (opt->check_refs_only) {
> > > +		/*
> > > +		 * For partial clones, we don't want to walk the full commit
> > > +		 * graph because we're skipping promisor objects anyway. We
> > > +		 * should just check that objects referenced by wanted refs were
> > > +		 * transferred.
> > > +		 */
> > > +		do {
> > > +			if (!repo_has_object_file(the_repository, &oid))
> > > +				return 1;
> > > +		} while (!fn(cb_data, &oid));
> > > +		return 0;
> > > +	}
> > 
> > for each object where you ask "do we have this?" we could, for the same
> > cost, ask "do we have this in a promisor pack?". And the answer would be
> > yes for each in a partial clone.
> > 
> > But that would also cleanly handle --shared, etc, that I mentioned. And
> > more importantly, it would _also_ work on fetches. If I fetch from you
> > and get a promisor pack, then there is no point in me enumerating every
> > tree you sent. As long as you gave me a tip tree, then you have promised
> > that you'd give me all the others if I ask.
> > 
> > So it seems like this should be a feature of the child rev-list, to stop
> > walking the graph at any object that is in a promisor pack.
> 
> We currently already do a less optimal version of this - we pass
> --exclude-promisor-objects to rev-list, which indeed stops traversal at
> any promisor objects (whether in a promisor pack or referenced by one).
> As far as I know, the problem is that to do so, we currently enumerate
> all the objects in all promisor packs, and all objects that those
> objects reference (which means we inflate them too) - so that we have an
> oidset that we can check objects against.
> 
> A partial solution is for is_promisor_object() to first check if the
> given object is in a promisor pack, avoiding generating the set of
> promisor objects until necessary. This would work in a blob:none clone
> with the refs pointing all to commits or all to blobs, but would not
> work in a tree:none clone (or maybe, in this case, the clone would be
> small enough that performance is not a concern, hmm).
> 
> Maybe the ideal solution is for rev-list to check if an object is in a
> promisor pack and if --exclude-promisor-objects is active, we do not
> follow any outgoing links.

I am not sure that this actually saves us any work vs. the original
method of marking promisor objects as uninteresting. Marking the objects
uninteresting involves calling for_each_packed_object(some_callback_fn, ...,
FOR_EACH_OBJECT_PROMISOR_ONLY). But this is also exactly the setup that
is_promisor_object() runs the first time it's called. In fact, it looks
to me like the callback function used by the latter is more expensive
than the former. Is there some alternative to is_promisor_object() that
doesn't involve this object enumeration?

I've tried implementing this approach anyway (diff against master is
below), but I can't get a version that passes t0410-partial-clone.sh
(cases 17 and 19). In case 17, the rev-list emits one extra line
corresponding to the blob for baz.t. I'm not sure why the original
implementation thinks that baz.t is a promisor object but my WIP version
doesn't.

I'll keep poking at this tomorrow, I just thought I should comment on
what I've found so far in case anyone can spot where I'm going wrong.



diff --git a/list-objects.c b/list-objects.c
index dc77361e11..1cb85f1662 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -326,6 +326,12 @@ static void traverse_trees_and_blobs(struct traversal_context *ctx,
 		struct object *obj = pending->item;
 		const char *name = pending->name;
 		const char *path = pending->path;
+		trace_printf("DEBUG: Examining object: %s\n", oid_to_hex(&obj->oid));
+		if (ctx->revs->exclude_promisor_objects &&
+		    is_promisor_object(&obj->oid)) {
+			trace_printf("DEBUG: Skipping object: %s\n", oid_to_hex(&obj->oid));
+			continue;
+		}
 		if (obj->flags & (UNINTERESTING | SEEN))
 			continue;
 		if (obj->type == OBJ_TAG) {
@@ -356,6 +362,13 @@ static void do_traverse(struct traversal_context *ctx)
 	strbuf_init(&csp, PATH_MAX);
 
 	while ((commit = get_revision(ctx->revs)) != NULL) {
+		trace_printf("DEBUG: Examining commit: %s\n", oid_to_hex(&commit->object.oid));
+		if (ctx->revs->exclude_promisor_objects &&
+		    is_promisor_object(&commit->object.oid)) {
+			trace_printf("DEBUG: Skipping commit: %s\n", oid_to_hex(&commit->object.oid));
+			continue;
+		}
+
 		/*
 		 * an uninteresting boundary commit may not have its tree
 		 * parsed yet, but we are not going to show them anyway
diff --git a/revision.c b/revision.c
index eb8e51bc63..85974e941d 100644
--- a/revision.c
+++ b/revision.c
@@ -3067,17 +3067,6 @@ void reset_revision_walk(void)
 	clear_object_flags(SEEN | ADDED | SHOWN);
 }
 
-static int mark_uninteresting(const struct object_id *oid,
-			      struct packed_git *pack,
-			      uint32_t pos,
-			      void *cb)
-{
-	struct rev_info *revs = cb;
-	struct object *o = parse_object(revs->repo, oid);
-	o->flags |= UNINTERESTING | SEEN;
-	return 0;
-}
-
 define_commit_slab(indegree_slab, int);
 define_commit_slab(author_date_slab, timestamp_t);
 
@@ -3316,11 +3305,6 @@ int prepare_revision_walk(struct rev_info *revs)
 	    (revs->limited && limiting_can_increase_treesame(revs)))
 		revs->treesame.name = "treesame";
 
-	if (revs->exclude_promisor_objects) {
-		for_each_packed_object(mark_uninteresting, revs,
-				       FOR_EACH_OBJECT_PROMISOR_ONLY);
-	}
-
 	if (revs->no_walk != REVISION_WALK_NO_WALK_UNSORTED)
 		commit_list_sort_by_date(&revs->commits);
 	if (revs->no_walk)
diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
index bce02788e6..55ac29f650 100755
--- a/t/t0410-partial-clone.sh
+++ b/t/t0410-partial-clone.sh
@@ -342,7 +342,7 @@ test_expect_success 'rev-list stops traversal at promisor commit, tree, and blob
 
 	git -C repo config core.repositoryformatversion 1 &&
 	git -C repo config extensions.partialclone "arbitrary string" &&
-	git -C repo rev-list --exclude-promisor-objects --objects HEAD >out &&
+	GIT_TRACE="$(pwd)/rev_trace" git -C repo rev-list --exclude-promisor-objects --objects HEAD >out &&
 	! grep $COMMIT out &&
 	! grep $TREE out &&
 	! grep $BLOB out &&

  reply	other threads:[~2019-04-04  0:21 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-04-03 17:27 [PATCH] clone: do faster object check for partial clones Josh Steadmon
2019-04-03 18:58 ` Jonathan Tan
2019-04-03 19:41 ` Jeff King
2019-04-03 20:57   ` Jonathan Tan
2019-04-04  0:21     ` Josh Steadmon [this message]
2019-04-04  1:33     ` Jeff King
2019-04-04 22:53 ` [PATCH v2] rev-list: exclude promisor objects at walk time Josh Steadmon
2019-04-04 23:08   ` Jeff King
2019-04-04 23:47     ` Josh Steadmon
2019-04-05  0:00       ` Jeff King
2019-04-05  0:09         ` Josh Steadmon
2019-04-08 20:59           ` Josh Steadmon
2019-04-08 21:06 ` [PATCH v3] " Josh Steadmon
2019-04-08 22:23   ` Christian Couder
2019-04-08 23:12     ` Josh Steadmon
2019-04-09 15:14   ` Junio C Hamano
2019-04-09 15:15     ` Jeff King
2019-04-09 15:43       ` Junio C Hamano
2019-04-09 16:35         ` Josh Steadmon
2019-04-09 18:04   ` SZEDER Gábor
2019-04-09 23:42     ` Josh Steadmon
2019-04-11  4:06       ` Jeff King
2019-04-12 22:38         ` Josh Steadmon
2019-04-13  5:34           ` Jeff King
2019-04-19 20:26             ` Josh Steadmon
2019-04-19 21:00 ` [PATCH v4] clone: do faster object check for partial clones Josh Steadmon
2019-04-22 21:31   ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190404002101.GF60888@google.com \
    --to=steadmon@google.com \
    --cc=git@vger.kernel.org \
    --cc=jonathantanmy@google.com \
    --cc=peff@peff.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).