git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [BUG?] fetch into shallow sends a large number of objects
@ 2016-03-07 22:15 Jeff King
  2016-03-07 23:47 ` Junio C Hamano
  0 siblings, 1 reply; 16+ messages in thread
From: Jeff King @ 2016-03-07 22:15 UTC (permalink / raw)
  To: git; +Cc: Nguyễn Thái Ngọc Duy

I came across an interesting shallow-fetch case, and I suspect git is
doing something very sub-optimal.

You can reproduce with:

	git clone --bare git://github.com/CocoaPods/Specs
	cd Specs.git

	time git pack-objects --revs --thin --stdout \
	  --delta-base-offset --include-tag <<\EOF | wc -c
	d7a6d9295d718c6015be496880f1a293bdd89185
	--not
	067f265bb512c95b22b83ccd121b9facbddcf6b1
	EOF

	time git pack-objects --revs --thin --stdout --shallow \
	  --delta-base-offset --include-tag <<\EOF | wc -c
	--shallow ecd7ea6033fe8a05d5c21f3a54355fded6942659
	d7a6d9295d718c6015be496880f1a293bdd89185
	--not
	067f265bb512c95b22b83ccd121b9facbddcf6b1
	EOF

The first is a non-shallow clone; it takes a few hundred milliseconds to
generate the pack, and the result is a few hundred kilobytes. The second
is a shallow clone (logged from a real-world request); it sends 200
times as many objects, totaling 270MB, and takes almost a minute of CPU.

I'm trying to figure out why that is, and whether git can do better.

I think what's happening here is that the history looks like this:

  F ... ecd7ea6 ... 067f265 ... M ... d7a6d929
   \                           /
    X ....................... Y

That is, we are asking for 067f265 to d7a6d929, but that includes some
merge M which _crosses_ our grafted shallow-point ecd7ea6. That pulls in
essentially all of the history for the entire (missing only the commits
from the fork point F up to ecd7ea6).

So I _think_ that pack-objects is doing the best it can with the
information it was given. But presumably in a shallow repo the user
would prefer to have a segmented history rather than pull in all of
those old commits.

I don't know how the client invoked git, but we can guess what happened
and simulate with:

  git tag shallow ecd7ea6033fe8a05d5c21f3a54355fded6942659
  git tag old 067f265bb512c95b22b83ccd121b9facbddcf6b1
  git tag new d7a6d9295d718c6015be496880f1a293bdd89185

  git clone --no-local --bare --branch=shallow --depth=1 . clone.git
  cd clone.git
  git fetch origin old:refs/tags/old
  git fetch origin new:refs/tags/new

Of the two follow-up fetches in the clone, the first is reasonably fast
(it just grabs a few new commits on top of the shallow base), but the
second is expensive (it grabs the merge which pulls in the whole
history). If we add "--depth=1" to each of those fetches, everything
remains fast.

Is this user error to call "git fetch" without "--depth" in the
subsequent cases? Or should git remember that we are in a shallow repo,
and presume that the user by default wants to keep things shallow?

-Peff

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [BUG?] fetch into shallow sends a large number of objects
  2016-03-07 22:15 [BUG?] fetch into shallow sends a large number of objects Jeff King
@ 2016-03-07 23:47 ` Junio C Hamano
  2016-03-08  0:53   ` Duy Nguyen
  2016-03-08 12:14   ` Jeff King
  0 siblings, 2 replies; 16+ messages in thread
From: Junio C Hamano @ 2016-03-07 23:47 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Nguyễn Thái Ngọc Duy

Jeff King <peff@peff.net> writes:

> I don't know how the client invoked git, but we can guess what happened
> and simulate with:
>
>   git tag shallow ecd7ea6033fe8a05d5c21f3a54355fded6942659
>   git tag old 067f265bb512c95b22b83ccd121b9facbddcf6b1
>   git tag new d7a6d9295d718c6015be496880f1a293bdd89185
>
>   git clone --no-local --bare --branch=shallow --depth=1 . clone.git
>   cd clone.git
>   git fetch origin old:refs/tags/old
>   git fetch origin new:refs/tags/new
>
> Of the two follow-up fetches in the clone, the first is reasonably fast
> (it just grabs a few new commits on top of the shallow base), but the
> second is expensive (it grabs the merge which pulls in the whole
> history). If we add "--depth=1" to each of those fetches, everything
> remains fast.
>
> Is this user error to call "git fetch" without "--depth" in the
> subsequent cases? Or should git remember that we are in a shallow repo,
> and presume that the user by default wants to keep things shallow?

Hmph, you shouldn't, and I somehow thought that you do not, have to
explicitly say things like "--deepen" to break the original
shallowness, but your example illustrates that the logic to do so is
not well thought out.  A new side branch will prevent you from
hitting an already-known shallow cut-off and traverse down to the
root.

Giving a random "depth" in subsequent fetch would however not work
very well, I suspect, as that is very prone to make the part of the
history the user originally obtained, and presumably used to build
her own history, into an island that is unconnected to the updated
tip of the history.  

I also do not offhand think of a good way to use the topology or
timestamp to figure out the best "depth" to truncate the side branch
at.  The server side may be able to figure out that things before 'F'
in your picture is not relevant for a client that has the shallow
cut-off at 067f265, but the side branch can be forked arbitrarily
long in the past, or it may not even share the ancient part of the
history and has its own root commit.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [BUG?] fetch into shallow sends a large number of objects
  2016-03-07 23:47 ` Junio C Hamano
@ 2016-03-08  0:53   ` Duy Nguyen
  2016-03-08 12:21     ` Jeff King
  2016-03-08 12:14   ` Jeff King
  1 sibling, 1 reply; 16+ messages in thread
From: Duy Nguyen @ 2016-03-08  0:53 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, Git Mailing List

On Tue, Mar 8, 2016 at 6:47 AM, Junio C Hamano <gitster@pobox.com> wrote:
> Jeff King <peff@peff.net> writes:
>
>> I don't know how the client invoked git, but we can guess what happened
>> and simulate with:
>>
>>   git tag shallow ecd7ea6033fe8a05d5c21f3a54355fded6942659
>>   git tag old 067f265bb512c95b22b83ccd121b9facbddcf6b1
>>   git tag new d7a6d9295d718c6015be496880f1a293bdd89185
>>
>>   git clone --no-local --bare --branch=shallow --depth=1 . clone.git
>>   cd clone.git
>>   git fetch origin old:refs/tags/old
>>   git fetch origin new:refs/tags/new
>>
>> Of the two follow-up fetches in the clone, the first is reasonably fast
>> (it just grabs a few new commits on top of the shallow base), but the
>> second is expensive (it grabs the merge which pulls in the whole
>> history). If we add "--depth=1" to each of those fetches, everything
>> remains fast.
>>
>> Is this user error to call "git fetch" without "--depth" in the
>> subsequent cases? Or should git remember that we are in a shallow repo,
>> and presume that the user by default wants to keep things shallow?
>
> Hmph, you shouldn't, and I somehow thought that you do not, have to
> explicitly say things like "--deepen" to break the original
> shallowness, but your example illustrates that the logic to do so is
> not well thought out.  A new side branch will prevent you from
> hitting an already-known shallow cut-off and traverse down to the
> root.

Yep. It "works" by design.

> Giving a random "depth" in subsequent fetch would however not work
> very well, I suspect, as that is very prone to make the part of the
> history the user originally obtained, and presumably used to build
> her own history, into an island that is unconnected to the updated
> tip of the history.

The new --deepen, --shallow-since and --shallow-exclude should be
better in this aspect and we can send them all the time without
affecting original cut points. Well, deepen can't be used here because
it needs shallow cut points as anchor in the first place.

> I also do not offhand think of a good way to use the topology or
> timestamp to figure out the best "depth" to truncate the side branch
> at.  The server side may be able to figure out that things before 'F'
> in your picture is not relevant for a client that has the shallow
> cut-off at 067f265, but the side branch can be forked arbitrarily
> long in the past, or it may not even share the ancient part of the
> history and has its own root commit.

If a shallow point can reach root without seeing another shallow
point, we can mark all reachable commits from it shallow. If it sees
another shallow point, maybe we can mark at the merge point of them..
We can also send "here is --depth=10, but only apply it on new refs".
That should mitigate the problem a bit. But I'm not sure if I can
solve it completely.
-- 
Duy

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [BUG?] fetch into shallow sends a large number of objects
  2016-03-07 23:47 ` Junio C Hamano
  2016-03-08  0:53   ` Duy Nguyen
@ 2016-03-08 12:14   ` Jeff King
  2016-03-08 12:33     ` Duy Nguyen
  1 sibling, 1 reply; 16+ messages in thread
From: Jeff King @ 2016-03-08 12:14 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Nguyễn Thái Ngọc Duy

On Mon, Mar 07, 2016 at 03:47:53PM -0800, Junio C Hamano wrote:

> > Is this user error to call "git fetch" without "--depth" in the
> > subsequent cases? Or should git remember that we are in a shallow repo,
> > and presume that the user by default wants to keep things shallow?
> 
> Hmph, you shouldn't, and I somehow thought that you do not, have to
> explicitly say things like "--deepen" to break the original
> shallowness, but your example illustrates that the logic to do so is
> not well thought out.  A new side branch will prevent you from
> hitting an already-known shallow cut-off and traverse down to the
> root.
> 
> Giving a random "depth" in subsequent fetch would however not work
> very well, I suspect, as that is very prone to make the part of the
> history the user originally obtained, and presumably used to build
> her own history, into an island that is unconnected to the updated
> tip of the history.  

Yeah, if we just recorded the original "--depth=1", each fetch would
make its own little island. Eventually your "shallow" file would bloat.
I'm not sure what that would do for the performance of subsequent
operations.

> I also do not offhand think of a good way to use the topology or
> timestamp to figure out the best "depth" to truncate the side branch
> at.  The server side may be able to figure out that things before 'F'
> in your picture is not relevant for a client that has the shallow
> cut-off at 067f265, but the side branch can be forked arbitrarily
> long in the past, or it may not even share the ancient part of the
> history and has its own root commit.

Right. I think conceptually that we want a graph cut, across all
branches. But you can't compute that cut at the time of the original
shallow clone, as we might not even have the side branches yet. So it
has to be computed at the time of the subsequent fetch.

You could make an arbitrary cut across all branches based on the
timestamp of the --shallow bottom. That would create new shallow bottoms
on any parallel lines of development that are brought in, from "around
the same time". But I'm uncomfortable relying on timestamps, and I
suspect there are some pretty bad corner cases with skew.

What if we instead marked the parent of a shallow commit UNINTERESTING?
Something like:

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index abed871..fc7a469 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2780,6 +2780,9 @@ static void get_object_list(int ac, const char **av)
 				unsigned char sha1[20];
 				if (get_sha1_hex(line + 10, sha1))
 					die("not an SHA-1 '%s'", line + 10);
+				handle_revision_arg(xstrfmt("^%s^", line + 10),
+						    &revs, flags,
+						    REVARG_CANNOT_BE_FILENAME);
 				register_shallow(sha1);
 				use_bitmap_index = 0;
 				continue;

In this case it means we pull in all of that side branch, down to the
fork point. If it went back to ancient history (or even had its own root
commit), then so be it; that's what the user asked for.  But I think
this would cover the common case of people shallow-cloning and fetching
a normal workflow.

There are two problems, though:

  1. We use UNINTERESTING for two things in pack-objects: to avoid
     sending the commits, but also to assume that the user has the
     objects, for making thin-pack deltas.

     We want only the former, but not the latter (if you apply my patch
     and try a fetch, you'll see that the client is missing bases from
     the resulting pack).

  2. We've effectively re-shallowed the client at the fork point. We
     have to tell them that somehow, or they're going to complain that
     we didn't send them all of the objects.

So I think the solution to both is that we need to do a _separate_
traversal with all of the positive tips we're going to send, and the
parents of any shallow commits the client has, to find their fork points
(i.e., merge bases). And then we add those fork points to the shallow
list (grafting out their parents), and communicate them to the client to
add to its shallow setup.

-Peff

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [BUG?] fetch into shallow sends a large number of objects
  2016-03-08  0:53   ` Duy Nguyen
@ 2016-03-08 12:21     ` Jeff King
  0 siblings, 0 replies; 16+ messages in thread
From: Jeff King @ 2016-03-08 12:21 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Junio C Hamano, Git Mailing List

On Tue, Mar 08, 2016 at 07:53:35AM +0700, Duy Nguyen wrote:

> > I also do not offhand think of a good way to use the topology or
> > timestamp to figure out the best "depth" to truncate the side branch
> > at.  The server side may be able to figure out that things before 'F'
> > in your picture is not relevant for a client that has the shallow
> > cut-off at 067f265, but the side branch can be forked arbitrarily
> > long in the past, or it may not even share the ancient part of the
> > history and has its own root commit.
> 
> If a shallow point can reach root without seeing another shallow
> point, we can mark all reachable commits from it shallow. If it sees
> another shallow point, maybe we can mark at the merge point of them..

Hmph, I read your email before sending my other response, but somehow I
didn't quite understand what you were saying. Now after having written
my long-winded other one, I think I just re-invented the same thing you
are proposing here. ;)

> We can also send "here is --depth=10, but only apply it on new refs".
> That should mitigate the problem a bit. But I'm not sure if I can
> solve it completely.

I think "new refs" isn't something we can rely on. For example, in this
case the old history may have been merged in and the ref deleted before
the fetcher shows up.

-Peff

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [BUG?] fetch into shallow sends a large number of objects
  2016-03-08 12:14   ` Jeff King
@ 2016-03-08 12:33     ` Duy Nguyen
  2016-03-08 13:25       ` Jeff King
  0 siblings, 1 reply; 16+ messages in thread
From: Duy Nguyen @ 2016-03-08 12:33 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Git Mailing List

On Tue, Mar 8, 2016 at 7:14 PM, Jeff King <peff@peff.net> wrote:
> ...
>
> So I think the solution to both is that we need to do a _separate_
> traversal with all of the positive tips we're going to send, and the
> parents of any shallow commits the client has, to find their fork points
> (i.e., merge bases). And then we add those fork points to the shallow
> list (grafting out their parents), and communicate them to the client to
> add to its shallow setup.

Good news. We have the mechanism in place, I think.
get_shallow_commits_by_rev_list() (from 'pu') will produce the right
shallow points for sending back to the client if you pass "--not
<current shallow points>" to it. It's meant to be used for
--shallow-exclude and --shallow-since, but if neither is given (nor
--depth) I guess we can run it with current shallow points. I wonder
if we can detect some common cases and avoid commit traversing this
way though.
-- 
Duy

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [BUG?] fetch into shallow sends a large number of objects
  2016-03-08 12:33     ` Duy Nguyen
@ 2016-03-08 13:25       ` Jeff King
  2016-03-08 13:30         ` Jeff King
  2016-03-10 12:20         ` Duy Nguyen
  0 siblings, 2 replies; 16+ messages in thread
From: Jeff King @ 2016-03-08 13:25 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Junio C Hamano, Git Mailing List

On Tue, Mar 08, 2016 at 07:33:43PM +0700, Duy Nguyen wrote:

> On Tue, Mar 8, 2016 at 7:14 PM, Jeff King <peff@peff.net> wrote:
> > ...
> >
> > So I think the solution to both is that we need to do a _separate_
> > traversal with all of the positive tips we're going to send, and the
> > parents of any shallow commits the client has, to find their fork points
> > (i.e., merge bases). And then we add those fork points to the shallow
> > list (grafting out their parents), and communicate them to the client to
> > add to its shallow setup.
> 
> Good news. We have the mechanism in place, I think.
> get_shallow_commits_by_rev_list() (from 'pu') will produce the right
> shallow points for sending back to the client if you pass "--not
> <current shallow points>" to it. It's meant to be used for
> --shallow-exclude and --shallow-since, but if neither is given (nor
> --depth) I guess we can run it with current shallow points. I wonder
> if we can detect some common cases and avoid commit traversing this
> way though.

I tried that, but I couldn't quite get it to work. I don't think we need
any special rev-list, though; we can just find the boundary points of
that traversal and mark them as new shallows.

I think this patch does roughly the right thing:

diff --git a/upload-pack.c b/upload-pack.c
index 4859535..da76f70 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -833,12 +833,41 @@ static void receive_needs(void)
 		deepen_by_rev_list(av.argc, av.argv, &shallows);
 		argv_array_clear(&av);
 	}
-	else
-		if (shallows.nr > 0) {
-			int i;
-			for (i = 0; i < shallows.nr; i++)
-				register_shallow(shallows.objects[i].item->oid.hash);
+	else if (shallows.nr > 0) {
+		struct rev_info revs;
+		struct argv_array av = ARGV_ARRAY_INIT;
+		struct commit *c;
+		int i;
+
+		argv_array_push(&av, "rev-list");
+		argv_array_push(&av, "--boundary");
+		for (i = 0; i < want_obj.nr; i++) {
+			struct object *o = want_obj.objects[i].item;
+			argv_array_push(&av, oid_to_hex(&o->oid));
 		}
+		for (i = 0; i < shallows.nr; i++) {
+			struct object *o = shallows.objects[i].item;
+			argv_array_pushf(&av, "^%s", oid_to_hex(&o->oid));
+		}
+
+		init_revisions(&revs, NULL);
+		setup_revisions(av.argc, av.argv, &revs, NULL);
+		if (prepare_revision_walk(&revs))
+			die("revision walk setup failed");
+
+		while ((c = get_revision(&revs))) {
+			if (!(c->object.flags & BOUNDARY))
+				continue;
+			register_shallow(c->object.oid.hash);
+			packet_write(1, "shallow %s",
+				     oid_to_hex(&c->object.oid));
+		}
+		packet_flush(1);
+		argv_array_clear(&av);
+
+		for (i = 0; i < shallows.nr; i++)
+			register_shallow(shallows.objects[i].item->oid.hash);
+	}
 
 	shallow_nr += shallows.nr;
 	free(shallows.objects);

Though I think perhaps we should also be adding those BOUNDARY commits
to the "shallows" object array? This works because the "--shallow" we
pass to pack-objects comes by reading the commit-graft list manipulated
by register_shallow(), so I'm not sure if it matters.

_But_, the client is not prepared to handle this. We send "shallow"
lines that it is not expecting, since it did not ask for any depth. So I
think this logic would have to kick in only when the client tells us to
do so.

I hacked around it with:

diff --git a/fetch-pack.c b/fetch-pack.c
index e8ae6d1..988c808 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -830,6 +830,7 @@ static struct ref *do_fetch_pack(struct fetch_pack_args *args,
 
 	sort_ref_list(&ref, ref_compare_name);
 	qsort(sought, nr_sought, sizeof(*sought), cmp_ref_by_name);
+	args->deepen = 1;
 
 	if ((args->depth > 0 || is_repository_shallow()) && !server_supports("shallow"))
 		die("Server does not support shallow clients");

and confirmed that the resulting "git fetch origin new" from my earlier
example does a sane thing.

So what next? I think there's some protocol work here, and I think the
overall design of that needs to be considered alongside the other
"deepen" options your topic in pu adds (and of which I'm largely
ignorant). Does this sufficiently interest you to pick up and roll into
your other shallow work?

-Peff

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [BUG?] fetch into shallow sends a large number of objects
  2016-03-08 13:25       ` Jeff King
@ 2016-03-08 13:30         ` Jeff King
  2016-03-08 23:02           ` Duy Nguyen
  2016-03-10 12:20         ` Duy Nguyen
  1 sibling, 1 reply; 16+ messages in thread
From: Jeff King @ 2016-03-08 13:30 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Junio C Hamano, Git Mailing List

On Tue, Mar 08, 2016 at 08:25:24AM -0500, Jeff King wrote:

> > Good news. We have the mechanism in place, I think.
> > get_shallow_commits_by_rev_list() (from 'pu') will produce the right
> > shallow points for sending back to the client if you pass "--not
> > <current shallow points>" to it. It's meant to be used for
> > --shallow-exclude and --shallow-since, but if neither is given (nor
> > --depth) I guess we can run it with current shallow points. I wonder
> > if we can detect some common cases and avoid commit traversing this
> > way though.
> 
> I tried that, but I couldn't quite get it to work. I don't think we need
> any special rev-list, though; we can just find the boundary points of
> that traversal and mark them as new shallows.

By the way, I found a bug during my initial attempts with
get_shallow_commits_by_rev_list(). One of the loops uses "p" to traverse
a linked list, and then re-uses "p" again for another list traversal
inside the body of the loop. When the inner loop finishes, "p" is
left as NULL, and then the outer loop tries to access "p->next", which
segfaults.

I _think_ this is just a mistaken re-use of the variable, and can be
fixed with a new iteration variable for the inner loop, like:

diff --git a/shallow.c b/shallow.c
index 6ceb3f8..d600947 100644
--- a/shallow.c
+++ b/shallow.c
@@ -188,13 +188,14 @@ struct commit_list *get_shallow_commits_by_rev_list(int ac, const char **av,
 	 */
 	for (p = not_shallow_list; p; p = p->next) {
 		struct commit *c = p->item;
+		struct commit_list *parent;
 
 		if (parse_commit(c))
 			die("unable to parse commit %s",
 			    oid_to_hex(&c->object.oid));
 
-		for (p = c->parents; p; p = p->next)
-			if (!(p->item->object.flags & not_shallow_flag)) {
+		for (parent = c->parents; parent; parent = parent->next)
+			if (!(parent->item->object.flags & not_shallow_flag)) {
 				c->object.flags |= shallow_flag;
 				commit_list_insert(c, &result);
 				break;

As I said, I didn't end up using this function either way, but you
probably want the fix above for the rest of your series. :)

-Peff

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [BUG?] fetch into shallow sends a large number of objects
  2016-03-08 13:30         ` Jeff King
@ 2016-03-08 23:02           ` Duy Nguyen
  0 siblings, 0 replies; 16+ messages in thread
From: Duy Nguyen @ 2016-03-08 23:02 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Git Mailing List

On Tue, Mar 8, 2016 at 8:30 PM, Jeff King <peff@peff.net> wrote:
> On Tue, Mar 08, 2016 at 08:25:24AM -0500, Jeff King wrote:
>
>> > Good news. We have the mechanism in place, I think.
>> > get_shallow_commits_by_rev_list() (from 'pu') will produce the right
>> > shallow points for sending back to the client if you pass "--not
>> > <current shallow points>" to it. It's meant to be used for
>> > --shallow-exclude and --shallow-since, but if neither is given (nor
>> > --depth) I guess we can run it with current shallow points. I wonder
>> > if we can detect some common cases and avoid commit traversing this
>> > way though.
>>
>> I tried that, but I couldn't quite get it to work. I don't think we need
>> any special rev-list, though; we can just find the boundary points of
>> that traversal and mark them as new shallows.
>
> By the way, I found a bug during my initial attempts with
> get_shallow_commits_by_rev_list().

Hehe. Thanks. Will check why tests missed it and reroll with Ramsay's
fix once -rc period is over.
-- 
Duy

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [BUG?] fetch into shallow sends a large number of objects
  2016-03-08 13:25       ` Jeff King
  2016-03-08 13:30         ` Jeff King
@ 2016-03-10 12:20         ` Duy Nguyen
  2016-03-10 21:10           ` Jeff King
  1 sibling, 1 reply; 16+ messages in thread
From: Duy Nguyen @ 2016-03-10 12:20 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Git Mailing List

On Tue, Mar 08, 2016 at 08:25:24AM -0500, Jeff King wrote:
> I think this patch does roughly the right thing:
> 
> diff --git a/upload-pack.c b/upload-pack.c
> index 4859535..da76f70 100644
> --- a/upload-pack.c
> +++ b/upload-pack.c
> @@ -833,12 +833,41 @@ static void receive_needs(void)
>  		deepen_by_rev_list(av.argc, av.argv, &shallows);
>  		argv_array_clear(&av);
>  	}
> -	else
> -		if (shallows.nr > 0) {
> -			int i;
> -			for (i = 0; i < shallows.nr; i++)
> -				register_shallow(shallows.objects[i].item->oid.hash);
> +	else if (shallows.nr > 0) {
> +		struct rev_info revs;
> +		struct argv_array av = ARGV_ARRAY_INIT;
> +		struct commit *c;
> +		int i;
> +
> +		argv_array_push(&av, "rev-list");
> +		argv_array_push(&av, "--boundary");

Nice. I didn't know about --boundary. But will it work correctly in
this case?

       --- B ---- C ---- F
          /      /
     --- D ---- E ---- G

C and D will be current shallow cut points. People "want" F and G.
"rev-list --boundary F G ^C ^D" would mark E as boundary/shallow too,
correct? If so the history from G will be one depth short on a normal
fetch.

> +		for (i = 0; i < want_obj.nr; i++) {
> +			struct object *o = want_obj.objects[i].item;
> +			argv_array_push(&av, oid_to_hex(&o->oid));
>  		}
> +		for (i = 0; i < shallows.nr; i++) {
> +			struct object *o = shallows.objects[i].item;
> +			argv_array_pushf(&av, "^%s", oid_to_hex(&o->oid));
> +		}
> +
> +		init_revisions(&revs, NULL);
> +		setup_revisions(av.argc, av.argv, &revs, NULL);
> +		if (prepare_revision_walk(&revs))
> +			die("revision walk setup failed");
> +
> +		while ((c = get_revision(&revs))) {
> +			if (!(c->object.flags & BOUNDARY))
> +				continue;
> +			register_shallow(c->object.oid.hash);
> +			packet_write(1, "shallow %s",
> +				     oid_to_hex(&c->object.oid));
> +		}

>  ...
> _But_, the client is not prepared to handle this. We send "shallow"
> lines that it is not expecting, since it did not ask for any depth. So I
> think this logic would have to kick in only when the client tells us to
> do so.

Urgh.. not good. Perhaps a new extension to let the server know the
client can handle spontaneous "deepen" commands and only activate new
mode when the extension is present?

> So what next? I think there's some protocol work here, and I think the
> overall design of that needs to be considered alongside the other
> "deepen" options your topic in pu adds (and of which I'm largely
> ignorant). Does this sufficiently interest you to pick up and roll into
> your other shallow work?

I can pick it up if you are busy with other stuff. But I'm also having
a couple other topics at the moment, so it may not progress very fast.
--
Duy

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [BUG?] fetch into shallow sends a large number of objects
  2016-03-10 12:20         ` Duy Nguyen
@ 2016-03-10 21:10           ` Jeff King
  2016-03-10 21:26             ` Junio C Hamano
  0 siblings, 1 reply; 16+ messages in thread
From: Jeff King @ 2016-03-10 21:10 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Junio C Hamano, Git Mailing List

On Thu, Mar 10, 2016 at 07:20:20PM +0700, Duy Nguyen wrote:

> > +	else if (shallows.nr > 0) {
> > +		struct rev_info revs;
> > +		struct argv_array av = ARGV_ARRAY_INIT;
> > +		struct commit *c;
> > +		int i;
> > +
> > +		argv_array_push(&av, "rev-list");
> > +		argv_array_push(&av, "--boundary");
> 
> Nice. I didn't know about --boundary. But will it work correctly in
> this case?
> 
>        --- B ---- C ---- F
>           /      /
>      --- D ---- E ---- G
> 
> C and D will be current shallow cut points. People "want" F and G.
> "rev-list --boundary F G ^C ^D" would mark E as boundary/shallow too,
> correct? If so the history from G will be one depth short on a normal
> fetch.

IMHO, that is the right thing. They asked for "C" as a shallow cut-off
point, so anything that is a parent of "C" should be omitted as shallow,
too. It has nothing to do with the numeric depth, which was just the
starting point for generating the shallow cutoffs.

That's just my mental model, though. I admit I don't actually use
shallow clones myself, and maybe people would expect something else.

> > _But_, the client is not prepared to handle this. We send "shallow"
> > lines that it is not expecting, since it did not ask for any depth. So I
> > think this logic would have to kick in only when the client tells us to
> > do so.
> 
> Urgh.. not good. Perhaps a new extension to let the server know the
> client can handle spontaneous "deepen" commands and only activate new
> mode when the extension is present?

Yeah, we definitely need an extension. I'm not sure if the extension
should be "I know about spontaneous shallow/deepen responses; it's OK to
send them to me" or "I want you to include the shallow points I send as
boundary cutoffs for further shallow-ing of newly fetched history".

They amount to the same thing when implementing _this_ feature, but the
latter leaves us room in the future for a client to say "sure, I
understand your spontaneous responses, but I explicitly _don't_ want you
to do the boundary computation". I don't know if that is useful or not,
but it might not hurt to have later on (and by adding it now, it "just
works" later on with older servers/clients).

> > So what next? I think there's some protocol work here, and I think the
> > overall design of that needs to be considered alongside the other
> > "deepen" options your topic in pu adds (and of which I'm largely
> > ignorant). Does this sufficiently interest you to pick up and roll into
> > your other shallow work?
> 
> I can pick it up if you are busy with other stuff. But I'm also having
> a couple other topics at the moment, so it may not progress very fast.

Thanks. I don't think it is too urgent; it has been that way for a
while. I certainly have plenty of other things to work on, but mostly I
just feel a bit out of my depth on the shallow stuff. I haven't given it
any real thought, and you obviously have.

-Peff

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [BUG?] fetch into shallow sends a large number of objects
  2016-03-10 21:10           ` Jeff King
@ 2016-03-10 21:26             ` Junio C Hamano
  2016-03-10 21:40               ` Jeff King
  0 siblings, 1 reply; 16+ messages in thread
From: Junio C Hamano @ 2016-03-10 21:26 UTC (permalink / raw)
  To: Jeff King; +Cc: Duy Nguyen, Git Mailing List

Jeff King <peff@peff.net> writes:

> On Thu, Mar 10, 2016 at 07:20:20PM +0700, Duy Nguyen wrote:
>
>> > +	else if (shallows.nr > 0) {
>> > +		struct rev_info revs;
>> > +		struct argv_array av = ARGV_ARRAY_INIT;
>> > +		struct commit *c;
>> > +		int i;
>> > +
>> > +		argv_array_push(&av, "rev-list");
>> > +		argv_array_push(&av, "--boundary");
>> 
>> Nice. I didn't know about --boundary. But will it work correctly in
>> this case?
>> 
>>        --- B ---- C ---- F
>>           /      /
>>      --- D ---- E ---- G
>> 
>> C and D will be current shallow cut points. People "want" F and G.
>> "rev-list --boundary F G ^C ^D" would mark E as boundary/shallow too,
>> correct? If so the history from G will be one depth short on a normal
>> fetch.
>
> IMHO, that is the right thing. They asked for "C" as a shallow cut-off
> point, so anything that is a parent of "C" should be omitted as shallow,
> too. It has nothing to do with the numeric depth, which was just the
> starting point for generating the shallow cutoffs.

I think that is the right mental model.  The statement that "C and D
are current cut points" does not make much sense.  As you cannot
rewrite parents of commits after the fact, you cannot construct a
case like "when the shallow clone originally was made, two histories
were forked long time before B and D, and the cloner ended up with C
and D as the cutoff point, but now that we have the ancestry linkage
between B and D (and C and E), we need to make E a new cutoff".  The
original "shallow" implementation does not store "starting point +
number of depth" and instead translates that to the cut-off point
for this exact reason.

> Yeah, we definitely need an extension. I'm not sure if the extension
> should be "I know about spontaneous shallow/deepen responses; it's OK to
> send them to me" or "I want you to include the shallow points I send as
> boundary cutoffs for further shallow-ing of newly fetched history".
>
> They amount to the same thing when implementing _this_ feature, but the
> latter leaves us room in the future for a client to say "sure, I
> understand your spontaneous responses, but I explicitly _don't_ want you
> to do the boundary computation". I don't know if that is useful or not,
> but it might not hurt to have later on (and by adding it now, it "just
> works" later on with older servers/clients).

I am not sure what distinction you are worried about.  An updated
client that is capable of saying "you may give shallow/deepen
responses to me" can optionally be told not to say it to the server,
and that is equivalent to saying "I don't want you to send them", no?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [BUG?] fetch into shallow sends a large number of objects
  2016-03-10 21:26             ` Junio C Hamano
@ 2016-03-10 21:40               ` Jeff King
  2016-03-11  0:47                 ` Duy Nguyen
  0 siblings, 1 reply; 16+ messages in thread
From: Jeff King @ 2016-03-10 21:40 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Duy Nguyen, Git Mailing List

On Thu, Mar 10, 2016 at 01:26:08PM -0800, Junio C Hamano wrote:

> > IMHO, that is the right thing. They asked for "C" as a shallow cut-off
> > point, so anything that is a parent of "C" should be omitted as shallow,
> > too. It has nothing to do with the numeric depth, which was just the
> > starting point for generating the shallow cutoffs.
> 
> I think that is the right mental model.  The statement that "C and D
> are current cut points" does not make much sense.  As you cannot
> rewrite parents of commits after the fact, you cannot construct a
> case like "when the shallow clone originally was made, two histories
> were forked long time before B and D, and the cloner ended up with C
> and D as the cutoff point, but now that we have the ancestry linkage
> between B and D (and C and E), we need to make E a new cutoff".  The
> original "shallow" implementation does not store "starting point +
> number of depth" and instead translates that to the cut-off point
> for this exact reason.

OK, good. Now there are at least two of us who view it that way. :)

> > Yeah, we definitely need an extension. I'm not sure if the extension
> > should be "I know about spontaneous shallow/deepen responses; it's OK to
> > send them to me" or "I want you to include the shallow points I send as
> > boundary cutoffs for further shallow-ing of newly fetched history".
> >
> > They amount to the same thing when implementing _this_ feature, but the
> > latter leaves us room in the future for a client to say "sure, I
> > understand your spontaneous responses, but I explicitly _don't_ want you
> > to do the boundary computation". I don't know if that is useful or not,
> > but it might not hurt to have later on (and by adding it now, it "just
> > works" later on with older servers/clients).
> 
> I am not sure what distinction you are worried about.  An updated
> client that is capable of saying "you may give shallow/deepen
> responses to me" can optionally be told not to say it to the server,
> and that is equivalent to saying "I don't want you to send them", no?

Mostly, I wondered if we would need to send spontaneous shallow lines
for any other cases, and the client would not be able to say "I
understand them and want them in case A, but not in case B".

I do not have any case A in mind; it was just a general sense of "let's
make feature flags as specific as possible to avoid painting ourselves
into a corner". I'm OK with implementing it either way.

-Peff

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [BUG?] fetch into shallow sends a large number of objects
  2016-03-10 21:40               ` Jeff King
@ 2016-03-11  0:47                 ` Duy Nguyen
  2016-03-11 16:53                   ` Junio C Hamano
  2016-03-11 18:16                   ` Jeff King
  0 siblings, 2 replies; 16+ messages in thread
From: Duy Nguyen @ 2016-03-11  0:47 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Git Mailing List

On Fri, Mar 11, 2016 at 4:40 AM, Jeff King <peff@peff.net> wrote:
> On Thu, Mar 10, 2016 at 01:26:08PM -0800, Junio C Hamano wrote:
>
>> > IMHO, that is the right thing. They asked for "C" as a shallow cut-off
>> > point, so anything that is a parent of "C" should be omitted as shallow,
>> > too. It has nothing to do with the numeric depth, which was just the
>> > starting point for generating the shallow cutoffs.
>>
>> I think that is the right mental model.  The statement that "C and D
>> are current cut points" does not make much sense.  As you cannot
>> rewrite parents of commits after the fact, you cannot construct a
>> case like "when the shallow clone originally was made, two histories
>> were forked long time before B and D, and the cloner ended up with C
>> and D as the cutoff point, but now that we have the ancestry linkage
>> between B and D (and C and E), we need to make E a new cutoff".  The
>> original "shallow" implementation does not store "starting point +
>> number of depth" and instead translates that to the cut-off point
>> for this exact reason.

Well, assume again that F and G are ref heads, and their respective
distance to C and D are the same (like the below graph), then "fetch
--deptch=<distance>" can mark C and D as shallow cut points because
--depth traverses from refs until the distance is met, it does not do
total exclusion ^C like rev-list.

       --- B ---- C ---- H ---- F
          /      /
     --- D ---- E ---- G


> OK, good. Now there are at least two of us who view it that way. :)

But going that direction just gives me more headache. If you two are
ok with this and nobody else complains, I'm ok too :-D I guess it's a
corner case anyway that's probably hard to occur in practice.

>> > So what next? I think there's some protocol work here, and I think the
>> > overall design of that needs to be considered alongside the other
>> > "deepen" options your topic in pu adds (and of which I'm largely
>> > ignorant). Does this sufficiently interest you to pick up and roll into
>> > your other shallow work?
>>
>> I can pick it up if you are busy with other stuff. But I'm also having
>> a couple other topics at the moment, so it may not progress very fast.
>
> Thanks. I don't think it is too urgent; it has been that way for a
> while. I certainly have plenty of other things to work on, but mostly I
> just feel a bit out of my depth on the shallow stuff. I haven't given it
> any real thought, and you obviously have.

More people understanding this code is always better though. But don't
worry I'll take care of this.

>> > Yeah, we definitely need an extension. I'm not sure if the extension
>> > should be "I know about spontaneous shallow/deepen responses; it's OK to
>> > send them to me" or "I want you to include the shallow points I send as
>> > boundary cutoffs for further shallow-ing of newly fetched history".
>> >
>> > They amount to the same thing when implementing _this_ feature, but the
>> > latter leaves us room in the future for a client to say "sure, I
>> > understand your spontaneous responses, but I explicitly _don't_ want you
>> > to do the boundary computation". I don't know if that is useful or not,
>> > but it might not hurt to have later on (and by adding it now, it "just
>> > works" later on with older servers/clients).
>>
>> I am not sure what distinction you are worried about.  An updated
>> client that is capable of saying "you may give shallow/deepen
>> responses to me" can optionally be told not to say it to the server,
>> and that is equivalent to saying "I don't want you to send them", no?
>
> Mostly, I wondered if we would need to send spontaneous shallow lines
> for any other cases, and the client would not be able to say "I
> understand them and want them in case A, but not in case B".
>
> I do not have any case A in mind; it was just a general sense of "let's
> make feature flags as specific as possible to avoid painting ourselves
> into a corner". I'm OK with implementing it either way.

Another note to myself is, we do not want these spontaneous cutoff
requests to cut client's history even shorter (possibly by accident
because of buggy servers). As long as they are about sealing commits
that lead to non-existing objects, i think it's ok to make it a
general feature.
-- 
Duy

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [BUG?] fetch into shallow sends a large number of objects
  2016-03-11  0:47                 ` Duy Nguyen
@ 2016-03-11 16:53                   ` Junio C Hamano
  2016-03-11 18:16                   ` Jeff King
  1 sibling, 0 replies; 16+ messages in thread
From: Junio C Hamano @ 2016-03-11 16:53 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Jeff King, Git Mailing List

Duy Nguyen <pclouds@gmail.com> writes:

> Well, assume again that F and G are ref heads, and their respective
> distance to C and D are the same (like the below graph), then "fetch
> --deptch=<distance>" can mark C and D as shallow cut points because
> --depth traverses from refs until the distance is met, it does not do
> total exclusion ^C like rev-list.
>
>        --- B ---- C ---- H ---- F
>           /      /
>      --- D ---- E ---- G

Hmph, so we do not realize that D is reachable from C because we do
not even look at the parents of the latter.  Is there a remification
that breaks the proposed mental view coming from this?

Let me think aloud.  Later when we fetch again (without asking to
change the depth but without refusing an unsolicited depth change),
a fetch of history that lead to descendants of F or G will still not
realize that C and D are related, because the traversal down to C
would stop there without noticing that E is reachable from C.  That
won't break.  If the tips of the history have been rewound and now
points at the tips of a history that is forked before B and D (let's
say the new tip to replace G is Z), would that be a problem?  The
side that does have the full history would notice that the fork
point X is an ancestors of cutoff C and D, and can tell that the
side-branch proper, Y and Z, are interesting but X is where their
truncated history must stop.

        --- B ---- C ---- H ---- F
           /      /
 -- X --- D ---- E ---- G
     \
      Y --- Z

While doing so the side that originally had C and D as the cutoff
would be told that now X is also a cutoff.  In general, because the
side that has the full history and gets fetched would not have to
know about _all_ the refs and objects the fetching side has, so if
it didn't learn about the shallow side still having G, it wouldn't
be able to say that D is no longer an interesting cutoff, but would
that be a problem?  The side with full history should also be able
to notice that E can also be added as a cutoff.  Would that be a
good thing to do?  If so, perhaps we should have done that when the
original history was shallowing cloned with depth=2 in your picture,
perhaps?  After all, even though the shallow side may not know C and
D are related, the side that supplied the truncated history knows,
so after depth traversal finds C and D, it perhaps can postprocess
it to realize C and E is the set that should be sent as the cutoff?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [BUG?] fetch into shallow sends a large number of objects
  2016-03-11  0:47                 ` Duy Nguyen
  2016-03-11 16:53                   ` Junio C Hamano
@ 2016-03-11 18:16                   ` Jeff King
  1 sibling, 0 replies; 16+ messages in thread
From: Jeff King @ 2016-03-11 18:16 UTC (permalink / raw)
  To: Duy Nguyen; +Cc: Junio C Hamano, Git Mailing List

On Fri, Mar 11, 2016 at 07:47:34AM +0700, Duy Nguyen wrote:

> Well, assume again that F and G are ref heads, and their respective
> distance to C and D are the same (like the below graph), then "fetch
> --deptch=<distance>" can mark C and D as shallow cut points because
> --depth traverses from refs until the distance is met, it does not do
> total exclusion ^C like rev-list.
> 
>        --- B ---- C ---- H ---- F
>           /      /
>      --- D ---- E ---- G

Right, so I think we would only apply the "use existing cutoffs as
bounds" thing when we were not otherwise given a --depth. Because it can
definitely cause us to override the depth (and there's no need to; the
point is to avoid linking in all of history, and --depth already solves
that). So we probably do want the client to ask "I'm not giving you a
depth, but please use my existing shallows as cutoffs".

I think a more interesting case here is when we have C as a cutoff, and
somebody fetches "E" directly. They are part of the truncated history.
So we should exclude them from a fetch of "G", but if the user asked for
them directly, that probably needs to override the existing shallow
cutoff.

We probably want to compute the --boundary of "E ^C", but then omit from
that any items that are directly in want_obj. IOW, it is OK to truncate
at depth=1 due to an existing cutoff, but not at depth=0. :)

That does mean we would then fetch all of the history leading up to E,
but I think that's OK. If the user didn't specify --depth and they are
fetching from behind the shallow-cutoff point, we don't have any real
way of knowing how much they wanted (though I guess it would also be
sensible to just apply depth=1 in such a case).

-Peff

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2016-03-11 18:24 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-07 22:15 [BUG?] fetch into shallow sends a large number of objects Jeff King
2016-03-07 23:47 ` Junio C Hamano
2016-03-08  0:53   ` Duy Nguyen
2016-03-08 12:21     ` Jeff King
2016-03-08 12:14   ` Jeff King
2016-03-08 12:33     ` Duy Nguyen
2016-03-08 13:25       ` Jeff King
2016-03-08 13:30         ` Jeff King
2016-03-08 23:02           ` Duy Nguyen
2016-03-10 12:20         ` Duy Nguyen
2016-03-10 21:10           ` Jeff King
2016-03-10 21:26             ` Junio C Hamano
2016-03-10 21:40               ` Jeff King
2016-03-11  0:47                 ` Duy Nguyen
2016-03-11 16:53                   ` Junio C Hamano
2016-03-11 18:16                   ` Jeff King

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).