git@vger.kernel.org mailing list mirror (one of many)
* [PATCH] upload-pack: do not lazy-fetch "have" objects
@ 2020-07-15 22:31 Jonathan Tan
  2020-07-15 22:55 ` Junio C Hamano
  0 siblings, 1 reply; 6+ messages in thread
From: Jonathan Tan @ 2020-07-15 22:31 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan

When upload-pack receives a request containing "have" hashes, it (among
other things) checks if the served repository has the corresponding
objects. However, it does not do so with the
OBJECT_INFO_SKIP_FETCH_OBJECT flag, so if serving a partial clone, a
lazy fetch will be triggered first.

This was discovered at $DAYJOB when a user fetched from a partial clone
(into another partial clone - although this would also happen if the
repo to be fetched into is not a partial clone).

Therefore, whenever "have" hashes are checked for existence, pass the
OBJECT_INFO_SKIP_FETCH_OBJECT flag.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
There is also the greater issue that if a lazy fetch fails, the fetch is
usually fatal (and possibly always fatal - I haven't checked all the
code paths) when the calling code could just as easily continue without
the object (which is the case for upload-pack when checking "have"s),
but I haven't addressed that here.
---
 t/t5616-partial-clone.sh | 38 ++++++++++++++++++++++++++++++++++++++
 upload-pack.c            |  5 +++--
 2 files changed, 41 insertions(+), 2 deletions(-)

diff --git a/t/t5616-partial-clone.sh b/t/t5616-partial-clone.sh
index 8a27452a51..37de0afb02 100755
--- a/t/t5616-partial-clone.sh
+++ b/t/t5616-partial-clone.sh
@@ -422,6 +422,44 @@ test_expect_success 'single-branch tag following respects partial clone' '
 	test_must_fail git -C single rev-parse --verify refs/tags/C
 '
 
+test_expect_success 'fetch from a partial clone, protocol v0' '
+	rm -rf server client trace &&
+
+	# Pretend that the server is a partial clone
+	git init server &&
+	git -C server remote add a_remote "file://$(pwd)/" &&
+	test_config -C server core.repositoryformatversion 1 &&
+	test_config -C server extensions.partialclone a_remote &&
+	test_config -C server protocol.version 0 &&
+	test_commit -C server foo &&
+
+	# Fetch from the server
+	git init client &&
+	test_config -C client protocol.version 0 &&
+	test_commit -C client bar &&
+	GIT_TRACE_PACKET="$(pwd)/trace" git -C client fetch "file://$(pwd)/server" &&
+	! grep "version 2" trace
+'
+
+test_expect_success 'fetch from a partial clone, protocol v2' '
+	rm -rf server client trace &&
+
+	# Pretend that the server is a partial clone
+	git init server &&
+	git -C server remote add a_remote "file://$(pwd)/" &&
+	test_config -C server core.repositoryformatversion 1 &&
+	test_config -C server extensions.partialclone a_remote &&
+	test_config -C server protocol.version 2 &&
+	test_commit -C server foo &&
+
+	# Fetch from the server
+	git init client &&
+	test_config -C client protocol.version 2 &&
+	test_commit -C client bar &&
+	GIT_TRACE_PACKET="$(pwd)/trace" git -C client fetch "file://$(pwd)/server" &&
+	grep "version 2" trace
+'
+
 . "$TEST_DIRECTORY"/lib-httpd.sh
 start_httpd
 
diff --git a/upload-pack.c b/upload-pack.c
index 951a2b23aa..af9d621755 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -482,7 +482,7 @@ static int got_oid(struct upload_pack_data *data,
 {
 	if (get_oid_hex(hex, oid))
 		die("git upload-pack: expected SHA1 object, got '%s'", hex);
-	if (!has_object_file(oid))
+	if (!has_object_file_with_flags(oid, OBJECT_INFO_SKIP_FETCH_OBJECT))
 		return -1;
 	return do_got_oid(data, oid);
 }
@@ -1423,7 +1423,8 @@ static int process_haves(struct upload_pack_data *data, struct oid_array *common
 	for (i = 0; i < data->haves.nr; i++) {
 		const struct object_id *oid = &data->haves.oid[i];
 
-		if (!has_object_file(oid))
+		if (!has_object_file_with_flags(oid,
+						OBJECT_INFO_SKIP_FETCH_OBJECT))
 			continue;
 
 		oid_array_append(common, oid);
-- 
2.27.0.389.gc38d7665816-goog



* Re: [PATCH] upload-pack: do not lazy-fetch "have" objects
  2020-07-15 22:31 [PATCH] upload-pack: do not lazy-fetch "have" objects Jonathan Tan
@ 2020-07-15 22:55 ` Junio C Hamano
  2020-07-16 10:41   ` Jeff King
  0 siblings, 1 reply; 6+ messages in thread
From: Junio C Hamano @ 2020-07-15 22:55 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git

Jonathan Tan <jonathantanmy@google.com> writes:

> When upload-pack receives a request containing "have" hashes, it (among
> other things) checks if the served repository has the corresponding
> objects. However, it does not do so with the
> OBJECT_INFO_SKIP_FETCH_OBJECT flag, so if serving a partial clone, a
> lazy fetch will be triggered first.

OK.  

Fixing issues hit by real users reactively is a necessary and good
thing, but this is not the first time we patch callers of
has_object_file() for this kind of "we are merely trying to
determine the boundary of what we have, so that we know what we need
to add to this repository" queries, I am afraid.

Perhaps it is a good idea to sweep all the hits from "git grep -e
has_object_file \*.c" and audit the codebase to see if there are
other problematic ones?

For example, list-objects.c::process_blob() tries to see if the object
exists when --exclude-promisor-objects is in effect so that it can
return early if the object is missing and it is a promisor object.
I would imagine that we would not want to lazy-fetch the object in
this case.

Thanks.  Will queue.



* Re: [PATCH] upload-pack: do not lazy-fetch "have" objects
  2020-07-15 22:55 ` Junio C Hamano
@ 2020-07-16 10:41   ` Jeff King
  2020-07-16 17:36     ` Junio C Hamano
  0 siblings, 1 reply; 6+ messages in thread
From: Jeff King @ 2020-07-16 10:41 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jonathan Tan, git

On Wed, Jul 15, 2020 at 03:55:18PM -0700, Junio C Hamano wrote:

> Fixing issues hit by real users reactively is a necessary and good
> thing, but this is not the first time we patch callers of
> has_object_file() for this kind of "we are merely trying to
> determine the boundary of what we have, so that we know what we need
> to add to this repository" queries, I am afraid.
> 
> Perhaps it is a good idea to sweep all the hits from "git grep -e
> has_object_file \*.c" and audit the codebase to see if there are
> other problematic ones?

Interestingly, the case fixed here was mentioned after the three-dashes
in this patch:

  https://lore.kernel.org/git/20191127123211.GG22221@sigill.intra.peff.net/

That thread was about OBJECT_INFO_QUICK, but I think any "we are finding
the boundary" hunt should consider whether that's appropriate, too. In
that case they are really the same issue: we do not want to expend
effort A on the unlikely chance P that we will find the object and save
ourselves effort B. The magnitude of A and B, and the probability P may
be different for the two optimizations, though.
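
Read as a rough cost model (an interpretation of the paragraph above,
not something stated formally anywhere in this thread), skipping the
extra work pays off when its cost exceeds the expected saving:

```latex
\[
  A > P \cdot B
\]
```

Here A is the effort spent looking harder (a lazy fetch, or a re-scan
of the pack directory), P is the probability that the extra effort
actually finds the object, and B is the effort saved when it does (for
upload-pack, roughly the cost of sending a larger pack than necessary).
The two flags plug different values of A, P and B into the same
inequality.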

I think the instance fixed here _probably_ could be converted to QUICK
as well. It saves extra refreshes of the pack directory when a client
asks for an object we don't have. That's usually not too expensive, but
can be in some cases (many alternates, slow NFS filesystems, client has
many objects we don't). In the worst case, using QUICK would mean we can
get fooled by a simultaneous repack into thinking we don't have an
object we _do_ have, and end up negotiating a worse pack for the client.

> Thanks.  Will queue.

That said, yeah, I think Jonathan's patch is a strict improvement by
itself.

-Peff


* Re: [PATCH] upload-pack: do not lazy-fetch "have" objects
  2020-07-16 10:41   ` Jeff King
@ 2020-07-16 17:36     ` Junio C Hamano
  2020-07-16 18:09       ` [PATCH v2] " Jonathan Tan
  0 siblings, 1 reply; 6+ messages in thread
From: Junio C Hamano @ 2020-07-16 17:36 UTC (permalink / raw)
  To: Jeff King; +Cc: Jonathan Tan, git

Jeff King <peff@peff.net> writes:

> I think the instance fixed here _probably_ could be converted to QUICK
> as well. It saves extra refreshes of the pack directory when a client
> asks for an object we don't have. That's usually not too expensive, but
> can be in some cases (many alternates, slow NFS filesystems, client has
> many objects we don't). In the worst case, using QUICK would mean we can
> get fooled by a simultaneous repack into thinking we don't have an
> object we _do_ have, and end up negotiating a worse pack for the client.
>
>> Thanks.  Will queue.
>
> That said, yeah, I think Jonathan's patch is a strict improvement by
> itself.

I'll see what Jonathan says, hoping that he'd consider whether QUICK
is appropriate thoroughly enough to result in an updated patch that
either has QUICK in the code or explains in the log message why QUICK
is not used.

Thanks.


* [PATCH v2] upload-pack: do not lazy-fetch "have" objects
  2020-07-16 17:36     ` Junio C Hamano
@ 2020-07-16 18:09       ` Jonathan Tan
  2020-07-20 17:42         ` Jeff King
  0 siblings, 1 reply; 6+ messages in thread
From: Jonathan Tan @ 2020-07-16 18:09 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan

When upload-pack receives a request containing "have" hashes, it (among
other things) checks if the served repository has the corresponding
objects. However, it does not do so with the
OBJECT_INFO_SKIP_FETCH_OBJECT flag, so if serving a partial clone, a
lazy fetch will be triggered first.

This was discovered at $DAYJOB when a user fetched from a partial clone
(into another partial clone - although this would also happen if the
repo to be fetched into is not a partial clone).

Therefore, whenever "have" hashes are checked for existence, pass the
OBJECT_INFO_SKIP_FETCH_OBJECT flag. Also add the OBJECT_INFO_QUICK flag
to improve performance, as it is typical that such objects do not exist
in the serving repo, and the consequences of a false negative are minor
(usually, a slightly larger pack sent).

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
Differences from v1: OBJECT_INFO_QUICK is also used wherever
OBJECT_INFO_SKIP_FETCH_OBJECT is added.

Using OBJECT_INFO_QUICK makes sense to me, so here's an updated patch.
---
 t/t5616-partial-clone.sh | 38 ++++++++++++++++++++++++++++++++++++++
 upload-pack.c            |  6 ++++--
 2 files changed, 42 insertions(+), 2 deletions(-)

diff --git a/t/t5616-partial-clone.sh b/t/t5616-partial-clone.sh
index 8a27452a51..37de0afb02 100755
--- a/t/t5616-partial-clone.sh
+++ b/t/t5616-partial-clone.sh
@@ -422,6 +422,44 @@ test_expect_success 'single-branch tag following respects partial clone' '
 	test_must_fail git -C single rev-parse --verify refs/tags/C
 '
 
+test_expect_success 'fetch from a partial clone, protocol v0' '
+	rm -rf server client trace &&
+
+	# Pretend that the server is a partial clone
+	git init server &&
+	git -C server remote add a_remote "file://$(pwd)/" &&
+	test_config -C server core.repositoryformatversion 1 &&
+	test_config -C server extensions.partialclone a_remote &&
+	test_config -C server protocol.version 0 &&
+	test_commit -C server foo &&
+
+	# Fetch from the server
+	git init client &&
+	test_config -C client protocol.version 0 &&
+	test_commit -C client bar &&
+	GIT_TRACE_PACKET="$(pwd)/trace" git -C client fetch "file://$(pwd)/server" &&
+	! grep "version 2" trace
+'
+
+test_expect_success 'fetch from a partial clone, protocol v2' '
+	rm -rf server client trace &&
+
+	# Pretend that the server is a partial clone
+	git init server &&
+	git -C server remote add a_remote "file://$(pwd)/" &&
+	test_config -C server core.repositoryformatversion 1 &&
+	test_config -C server extensions.partialclone a_remote &&
+	test_config -C server protocol.version 2 &&
+	test_commit -C server foo &&
+
+	# Fetch from the server
+	git init client &&
+	test_config -C client protocol.version 2 &&
+	test_commit -C client bar &&
+	GIT_TRACE_PACKET="$(pwd)/trace" git -C client fetch "file://$(pwd)/server" &&
+	grep "version 2" trace
+'
+
 . "$TEST_DIRECTORY"/lib-httpd.sh
 start_httpd
 
diff --git a/upload-pack.c b/upload-pack.c
index 951a2b23aa..8673741070 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -482,7 +482,8 @@ static int got_oid(struct upload_pack_data *data,
 {
 	if (get_oid_hex(hex, oid))
 		die("git upload-pack: expected SHA1 object, got '%s'", hex);
-	if (!has_object_file(oid))
+	if (!has_object_file_with_flags(oid,
+					OBJECT_INFO_QUICK | OBJECT_INFO_SKIP_FETCH_OBJECT))
 		return -1;
 	return do_got_oid(data, oid);
 }
@@ -1423,7 +1424,8 @@ static int process_haves(struct upload_pack_data *data, struct oid_array *common
 	for (i = 0; i < data->haves.nr; i++) {
 		const struct object_id *oid = &data->haves.oid[i];
 
-		if (!has_object_file(oid))
+		if (!has_object_file_with_flags(oid,
+						OBJECT_INFO_QUICK | OBJECT_INFO_SKIP_FETCH_OBJECT))
 			continue;
 
 		oid_array_append(common, oid);
-- 
2.27.0.389.gc38d7665816-goog



* Re: [PATCH v2] upload-pack: do not lazy-fetch "have" objects
  2020-07-16 18:09       ` [PATCH v2] " Jonathan Tan
@ 2020-07-20 17:42         ` Jeff King
  0 siblings, 0 replies; 6+ messages in thread
From: Jeff King @ 2020-07-20 17:42 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git

On Thu, Jul 16, 2020 at 11:09:50AM -0700, Jonathan Tan wrote:

> Differences from v1: OBJECT_INFO_QUICK is also used wherever
> OBJECT_INFO_SKIP_FETCH_OBJECT is added.
> 
> Using OBJECT_INFO_QUICK makes sense to me, so here's an updated patch.

Yeah, I think this is reasonable, and the patch looks good to me.

-Peff

