From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Jonathan Tan <jonathantanmy@google.com>
Cc: git@vger.kernel.org, gitster@pobox.com
Subject: Re: [PATCH v2 6/9] Documentation: add Packfile URIs design doc
Date: Tue, 01 Dec 2020 13:48:59 +0100
Message-ID: <87tut5vghw.fsf@evledraar.gmail.com>
In-Reply-To: <20201125190957.1113461-1-jonathantanmy@google.com>


On Wed, Nov 25 2020, Jonathan Tan wrote:

>> On Wed, Jun 10 2020, Jonathan Tan wrote:
>> 
>> > +This is the implementation: a feature, marked experimental, that allows the
>> > +server to be configured by one or more `uploadpack.blobPackfileUri=<sha1>
>> > +<uri>` entries. Whenever the list of objects to be sent is assembled, all such
>> > +blobs are excluded, replaced with URIs. The client will download those URIs,
>> > +expecting them to each point to packfiles containing single blobs.
>> 
>> I was poking at this recently to see whether I could change it into the
>> more dumb method I noted in
>> https://public-inbox.org/git/87k1hv6eel.fsf@evledraar.gmail.com/
>> 
>> As an aside on a protocol level could that be supported with this
>> current verb by having the client say "packfile-uris=early" or something
>> like that instead of "packfile-uris"? 
>
> Hmm...would the advantage of this be that the client can subsequently
> report any OIDs it finds as "want"s?

Yes, as a means-to-an-end of allowing the server CDN part to be dumb and
possibly out-of-date.

I.e. the current "here's some blobs and then a complementary pack"
requires very close CDN/server coordination.
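
E.g. with the current design the server operator needs to keep config
like the following (made-up OIDs/URL, using the three-field form the
t5702 tests below use) exactly in sync with what's actually on the CDN:

    git config --add uploadpack.blobPackfileUri \
        "<blob-oid> <pack-hash> https://some-cdn.example/repo/<pack-hash>.pack"

Regenerate the CDN pack without updating the <pack-hash> part and
clients will hard die on the mismatch.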

Whereas if you can say early in the dialog "here's a pack that should
have most of what you need, then do a want/fetch to get up-to-date",
you can just make that pack in a cronjob; the pack could even be
corrupted etc. and we'd just optimistically fall back on a full clone.
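
Something like this rough sketch (untested, paths/URL made up) is all
it would take to produce that pack:

    # nightly cronjob on the server
    cd /srv/git/repo.git &&
    pack=$(git pack-objects --all objects/pack/cdn-snapshot </dev/null) &&
    cp "objects/pack/cdn-snapshot-$pack.pack" /srv/cdn/repo/
    # ...and have https://some-cdn.example/repo/ serve /srv/cdn/repo/

No coordination with the live repo needed beyond "this was a valid pack
at some point".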

It wouldn't be reporting any OID it finds (too noisy). I haven't
experimented, but one dumb way to do it is for the server to serve up a
pack with "start negotiating with this SHA", which would just be a
recent revision of HEAD guaranteed to be in the served pack.
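
On the client side that could boil down to roughly the equivalent of
(illustration only; --negotiation-tip is just the closest existing
knob, and <that-sha> is whatever tip the server advertised alongside
the CDN pack):

    # after unpacking the CDN pack, negotiate from the advertised tip
    git fetch --negotiation-tip=<that-sha> origin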

Or the client could scour the pack and e.g. find all commit tips in it
by running the equivalent of commit-graph on it first. More expensive,
but expensive on the client-side, and allows e.g. re-using that codepath
to clone on the basis of a GIT_ALTERNATE_OBJECT_DIRECTORIES containing
objects the <remote url> might have already.
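
By "scour" I mean something like (untested sketch; finding the actual
*tips* among those commits would need a bit more than this):

    # index the downloaded pack, then list every commit it contains
    git index-pack some-big-recent-pack.pack &&
    git verify-pack -v some-big-recent-pack.idx |
        awk '$2 == "commit" { print $1 }'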

> I guess the protocol could be extended to support "packfile-uris" at any
> negotiation step.

*nod*

>> The server advertising the same,
>> and the client then just requesting packfile-uris before ls-refs or
>> whatever? The server would need to be stateful about what's requested
>> when and serve up something different than the current
>> one-blob-per-pack. 
>
> Statefulness will be difficult. Right now, protocol v2 is stateless,
> and updating it to be stateful will be difficult, I believe - at least
> for HTTP, the statelessness design has been long there and other
> implementations of Git or systems that use Git might have already made
> that assumption (it is stateless both for protocol v0 and v2).

Sorry, on second thought what I'm suggesting doesn't really require
state, just that the server get/read/understand the "client supports
this" flags, which it already does.

> As for serving more than one blob per pack, the current protocol and
> implementation already allows this. You can see a demonstration by
> cloning the following repository, which supports it on the server side:
>
>   GIT_TRACE_PACKET=1 git -c fetch.uriprotocols=https clone \
>     https://chromium.googlesource.com/chromium/src/base

So cloning that produces two packs (the packfile-uri one, and the
negotiated one), the packfile-uri one being:
    
    $ git show-index <objects/pack/pack-d917d9803d3acb03da7ea9e8ebb8e643364ba051.idx
    12 3ea853619c232488e683139217585f520ec636e0 (8e33883c)
    9371 83d1358e4979bf9aff879cb4150276cd3894463a (0ea17153)
    13640 d0f8c611c044b36b8ac57b7a3af18a8d88e4dde2 (af8be7a0)
    572 da9ca8b36d5029bd9b18addc054ba9c0b016e6bc (d57fdbac)

So 3ea853619 is a commit whose tree is da9ca8b36; that tree has a
pointer to win/ via 83d1358e4, which in turn has the blob
win/.clang-tidy at d0f8c611c.
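
For reference, the way to eyeball that chain in the resulting clone is
something like:

    git rev-parse 3ea853619^{tree}   # -> da9ca8b36...
    git cat-file -p da9ca8b36        # has a "tree 83d1358e4..." entry for win
    git cat-file -p 83d1358e4        # has a "blob d0f8c611c..." entry for .clang-tidy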

So this is intended & you're already using it like that. So shouldn't
the docs be updated from:

    [...]one or more `uploadpack.blobPackfileUri=<sha1> <uri>`
    entries. Whenever the list of objects to be sent is assembled, all
    such blobs are excluded, replaced with URIs. The client will
    download those URIs, expecting them to each point to packfiles
    containing single blobs.

To something like:

    [...]one or more `uploadpack.postWantPackfileUri=<sha1> <uri>`
    entries. When the server sends the subsequent full pack, any objects
    already sent out-of-band may be excluded from it. The client will
    download those URIs, expecting each to contain some number of
    objects. The union of the packs found at these URIs and the
    server's own returned packfile is expected to yield a valid
    repository that'll pass an fsck.

Aside from the hassle of renaming the "uploadpack.blobPackfileUri"
variable to some more accurate "this clearly has nothing to do with
blobs per se, really" name, that re-done paragraph seems to more
accurately reflect what this is doing & intended to do.

Also, why is it necessary to make it an error if the expected hash for
the packfile doesn't match, since we're about to do a full fsck
connectivity check anyway? The POC patch at the end of this mail[1]
shows that we mostly support it not matching now.

I suppose you might want to point to an evil CDN, but since it
currently needs to pass fsck the potential for that evilness seems
rather limited. Requiring the exact hash isn't a hassle in the current
closely coupled server/CDN implementation, but it is one if you e.g.
want a server dumbly pointing at
https://some-cdn.example/REPONAME/some-big-recent-pack.pack, so having
it at least optionally accept the NULL_SHA would be ideal for that. So
I'm curious what you think it adds to the overall security of the
feature.
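
I.e. so an operator could get away with a single static entry along the
lines of (hand-waving: made-up URL, an all-zeros hash standing in for
"don't check", and the first field making even less sense here, per the
renaming gripe above):

    git config --add uploadpack.blobPackfileUri \
        "<some-oid> 0000000000000000000000000000000000000000 https://some-cdn.example/REPONAME/some-big-recent-pack.pack"

That entry could stay put while a cronjob keeps refreshing the pack
behind the URL.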

>> Any pointers to where/how to implement that would be
>> welcome, I got lost in the non-linearity of the
>> connect.c/fetch-pack.c/upload-pack.c code yesterday.
>
> upload_pack_v2() in upload-pack.c and do_fetch_pack_v2() in fetch-pack.c
> have the state machines of the server and client side respectively - I
> think those would be the first places to look.

Thanks.

>> But I'm mainly replying here to ask if it's intentional that clients are
>> tolerant of the server sending whatever it pleases in the supposedly
>> "single blob" packs. As demonstrated by the tests passing with this
>> patch:
>
> [snip test]
>
> Yes, it has the same tolerance w.r.t. the packfile URI packs as w.r.t.
> the inline packfile that gets sent.
>
>> As you may guess from the "shattered" I was trying to find if the
>> particulars around the partial fsck allowed me to exploit this somehow,
>> I haven't found a way to do that, just be annoying by sending the client
>> more than they asked for, but I could also do that with the normal
>> dialog. Just wondering if the client should be opening the pack and
>> barfing if it has more than one object, or not care.
>
> Ah yes, as you said, it's the same as with the normal dialog.

1.:

diff --git a/fetch-pack.c b/fetch-pack.c
index b10c432315..87d948b023 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -1661,9 +1661,9 @@ static struct ref *do_fetch_pack_v2(struct fetch_pack_args *args,
 
 		if (memcmp(packfile_uris.items[i].string, packname,
 			   the_hash_algo->hexsz))
-			die("fetch-pack: pack downloaded from %s does not match expected hash %.*s",
-			    uri, (int) the_hash_algo->hexsz,
-			    packfile_uris.items[i].string);
+			warning("fetch-pack: pack downloaded from %s does not match expected hash %.*s",
+				uri, (int) the_hash_algo->hexsz,
+				packfile_uris.items[i].string);
 
 		string_list_append_nodup(pack_lockfiles,
 					 xstrfmt("%s/pack/pack-%s.keep",
diff --git a/t/t5702-protocol-v2.sh b/t/t5702-protocol-v2.sh
index 7d5b17909b..bc0fc4d8e3 100755
--- a/t/t5702-protocol-v2.sh
+++ b/t/t5702-protocol-v2.sh
@@ -798,9 +798,13 @@ test_expect_success 'when server does not send "ready", expect FLUSH' '
 configure_exclusion () {
 	git -C "$1" hash-object "$2" >objh &&
 	git -C "$1" pack-objects "$HTTPD_DOCUMENT_ROOT_PATH/mypack" <objh >packh &&
+	fake_sha=$(git hash-object --stdin <packh) &&
+	mv \
+		"$HTTPD_DOCUMENT_ROOT_PATH/mypack-$(cat packh).pack" \
+		"$HTTPD_DOCUMENT_ROOT_PATH/mypack-$fake_sha.pack" &&
 	git -C "$1" config --add \
 		"uploadpack.blobpackfileuri" \
-		"$(cat objh) $(cat packh) $HTTPD_URL/dumb/mypack-$(cat packh).pack" &&
+		"$(cat objh) $fake_sha $HTTPD_URL/dumb/mypack-$fake_sha.pack" &&
 	cat objh
 }
 
@@ -852,7 +856,7 @@ test_expect_success 'part of packfile response provided as URI' '
 	test_line_count = 6 filelist
 '
 
-test_expect_success 'fetching with valid packfile URI but invalid hash fails' '
+test_expect_success 'fetching with valid packfile URI but invalid hash warns' '
 	P="$HTTPD_DOCUMENT_ROOT_PATH/http_parent" &&
 	rm -rf "$P" http_child log &&
 
@@ -870,13 +874,12 @@ test_expect_success 'fetching with valid packfile URI but invalid hash fails' '
 	# the hash of the packfile, since the hash does not matter for this
 	# test as long as it is not the hash of the pack, and it is of the
 	# expected length.
-	git -C "$P" hash-object other-blob >objh &&
 	git -C "$P" pack-objects "$HTTPD_DOCUMENT_ROOT_PATH/mypack" <objh >packh &&
 	git -C "$P" config --add \
 		"uploadpack.blobpackfileuri" \
 		"$(cat objh) $(cat objh) $HTTPD_URL/dumb/mypack-$(cat packh).pack" &&
 
-	test_must_fail env GIT_TEST_SIDEBAND_ALL=1 \
+	env GIT_TEST_SIDEBAND_ALL=1 \
 		git -c protocol.version=2 \
 		-c fetch.uriprotocols=http,https \
 		clone "$HTTPD_URL/smart/http_parent" http_child 2>err &&
