* [WIP 0/7] CDN offloading of fetch response
@ 2019-02-23 23:38 Jonathan Tan
  2019-02-23 23:38 ` [WIP 1/7] http: use --stdin and --keep when downloading pack Jonathan Tan
                   ` (8 more replies)
  0 siblings, 9 replies; 45+ messages in thread
From: Jonathan Tan @ 2019-02-23 23:38 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster, peff, christian.couder, avarab

It's been a while, so here is an updated version of what I previously
sent [1]. The main difference is that fetch-pack now actually downloads
whatever the server tells it to. The second main difference is that
we no longer buffer progress messages and suspend keepalives - we
no longer need to, because the sideband-all patches have been merged.

I think I've addressed all the technical comments in [1], except for one
comment of Junio's [2] that I still don't understand:

> And the "fix" I was alluding to was to update that "else" clause to
> make it a no-op that keeps os->used non-zero, which would not call
> send-client-data.
>
> When that fix happens, the part that early in the function this
> patch added "now we know we will call send-client-data, so let's say
> 'here comes packdata' unless we have already said that" is making
> the decision too early.  Depending on the value of os->used when we
> enter the code and the number of bytes xread() reads from the
> upstream, we might not call send-client-data yet (namely, when we
> have no buffered data and we happened to get only one byte).

With or without this fix, I don't think there is ever a time when we
say "here comes packdata" without calling send-client-data - we say
"here comes packdata" only when we see the string "PACK", which forms
part of the packfile, and thus we should have no problem sending any
client data. Having said that, maybe this is a moot point - Junio says
that this only happens when the fix is implemented, and the fix is not
implemented.

There are probably some more design discussions to be had:

 - Requirement that all pointed-to CDNs support byte ranges for
   resumability, and guarantee that the packfiles will be there
   permanently. After some thought, I agree it is a good idea for CDNs
   to support that, but I think that we should also support CDNs that
   can give only temporal guarantees (e.g. if/when we start implementing
   resumption, we could read the cache headers). I didn't add any
   mention of this issue to the documentation.

 - Client-side whitelist of protocols and hostnames. I've implemented
   the protocol whitelist, but not the hostname whitelist (see the
   configuration sketch after this list).

 - Any sort of follow-up fetch - for example, if the download from the
   CDN fails, or if we allow the server to tell us of best-effort
   packfiles (where the client must still check and formulate the
   correct request to the server to fetch the rest). This protocol seems
   like a prerequisite to all of those, and is independently useful, so
   maybe all of those can be future work.
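
As a concrete illustration, here is a sketch of the client-side protocol
whitelist as implemented in the last patch of this series (there is
currently no corresponding knob for whitelisting hostnames):

    # Allow the server to offload parts of the fetch response only over
    # http(s). Leaving fetch.uriprotocols unset means the client never
    # sends the packfile-uris argument, i.e. it opts out entirely.
    git config fetch.uriprotocols https,http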

Please take a look. Feel free to comment on anything, but I prefer
comments on the major things first (e.g. my use of a separate process
(http-fetch) to fetch packfiles, since as far as I know, Git doesn't
link to libcurl; any of the design decisions I described above). I know
that there are some implementation details that could be improved (e.g.
parallelization of the CDN downloads, starting CDN downloads *after*
closing the first HTTP request, holding on to the .keep locks until
after the refs are set), but I will work on those once the overall
design is more or less finalized.

Note that the first patch is exactly the same as one I've previously
sent [3].

[1] https://public-inbox.org/git/cover.1543879256.git.jonathantanmy@google.com/
[2] https://public-inbox.org/git/xmqqmupi89ub.fsf@gitster-ct.c.googlers.com/
[3] https://public-inbox.org/git/20190221001447.124088-1-jonathantanmy@google.com/

Jonathan Tan (7):
  http: use --stdin and --keep when downloading pack
  http: improve documentation of http_pack_request
  http-fetch: support fetching packfiles by URL
  Documentation: order protocol v2 sections
  Documentation: add Packfile URIs design doc
  upload-pack: refactor reading of pack-objects out
  upload-pack: send part of packfile response as uri

 Documentation/git-http-fetch.txt         |   7 +-
 Documentation/technical/packfile-uri.txt |  79 ++++++++++++
 Documentation/technical/protocol-v2.txt  |  22 ++--
 builtin/pack-objects.c                   |  63 +++++++++
 fetch-pack.c                             |  58 +++++++++
 http-fetch.c                             |  65 ++++++++--
 http-push.c                              |   7 +-
 http-walker.c                            |   5 +-
 http.c                                   |  83 +++++++-----
 http.h                                   |  26 +++-
 t/t5550-http-fetch-dumb.sh               |  18 +++
 t/t5702-protocol-v2.sh                   |  54 ++++++++
 upload-pack.c                            | 155 +++++++++++++++++------
 13 files changed, 542 insertions(+), 100 deletions(-)
 create mode 100644 Documentation/technical/packfile-uri.txt

-- 
2.19.0.271.gfe8321ec05.dirty



* [WIP 1/7] http: use --stdin and --keep when downloading pack
  2019-02-23 23:38 [WIP 0/7] CDN offloading of fetch response Jonathan Tan
@ 2019-02-23 23:38 ` Jonathan Tan
  2019-02-23 23:38 ` [WIP 2/7] http: improve documentation of http_pack_request Jonathan Tan
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 45+ messages in thread
From: Jonathan Tan @ 2019-02-23 23:38 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster, peff, christian.couder, avarab

When Git fetches a pack using dumb HTTP, it does at least two things
differently from when it fetches using fetch-pack or receive-pack: (1)
it reuses the server's name (which incorporates a hash) for the
packfile, and (2) it does not create a .keep file to avoid race
conditions with repack.

A subsequent patch will allow downloading packs over HTTP(S) as part of
a fetch. These downloads will not necessarily be from a Git repository,
and thus may not have a hash as part of their names. Also, generating a
.keep file will be necessary to avoid race conditions with repack (until
the fetch has successfully written the new refs).

Thus, teach http to pass --stdin and --keep to index-pack: the former so
that we no longer rely on the server's name for the packfile, and the
latter so that we get the .keep file.
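
For illustration, the child process spawned by finish_http_pack_request()
now corresponds roughly to the following (a sketch; the temporary pack
path and the pid are illustrative placeholders):

    # previously (roughly):
    #   git index-pack -o <tmp>.idx.temp <tmp>.pack.temp
    # now, with the temporary pack fed over stdin instead:
    git index-pack --stdin --keep="git <pid>" \
        <.git/objects/pack/pack-<hash>.pack.temp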

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 http-push.c   |  7 ++++++-
 http-walker.c |  5 ++++-
 http.c        | 42 ++++++++++++++++++++----------------------
 http.h        |  2 +-
 4 files changed, 31 insertions(+), 25 deletions(-)

diff --git a/http-push.c b/http-push.c
index b22c7caea0..409b266b0c 100644
--- a/http-push.c
+++ b/http-push.c
@@ -586,11 +586,16 @@ static void finish_request(struct transfer_request *request)
 			fprintf(stderr, "Unable to get pack file %s\n%s",
 				request->url, curl_errorstr);
 		} else {
+			char *lockfile;
+
 			preq = (struct http_pack_request *)request->userData;
 
 			if (preq) {
-				if (finish_http_pack_request(preq) == 0)
+				if (finish_http_pack_request(preq,
+							     &lockfile) == 0) {
+					unlink(lockfile);
 					fail = 0;
+				}
 				release_http_pack_request(preq);
 			}
 		}
diff --git a/http-walker.c b/http-walker.c
index 8ae5d76c6a..804dc82304 100644
--- a/http-walker.c
+++ b/http-walker.c
@@ -425,6 +425,7 @@ static int http_fetch_pack(struct walker *walker, struct alt_base *repo, unsigne
 	int ret;
 	struct slot_results results;
 	struct http_pack_request *preq;
+	char *lockfile;
 
 	if (fetch_indices(walker, repo))
 		return -1;
@@ -457,7 +458,9 @@ static int http_fetch_pack(struct walker *walker, struct alt_base *repo, unsigne
 		goto abort;
 	}
 
-	ret = finish_http_pack_request(preq);
+	ret = finish_http_pack_request(preq, &lockfile);
+	if (!ret)
+		unlink(lockfile);
 	release_http_pack_request(preq);
 	if (ret)
 		return ret;
diff --git a/http.c b/http.c
index a32ad36ddf..5f8e602cd2 100644
--- a/http.c
+++ b/http.c
@@ -2200,13 +2200,13 @@ void release_http_pack_request(struct http_pack_request *preq)
 	free(preq);
 }
 
-int finish_http_pack_request(struct http_pack_request *preq)
+int finish_http_pack_request(struct http_pack_request *preq, char **lockfile)
 {
 	struct packed_git **lst;
 	struct packed_git *p = preq->target;
-	char *tmp_idx;
-	size_t len;
 	struct child_process ip = CHILD_PROCESS_INIT;
+	int tmpfile_fd;
+	int ret = 0;
 
 	close_pack_index(p);
 
@@ -2218,35 +2218,33 @@ int finish_http_pack_request(struct http_pack_request *preq)
 		lst = &((*lst)->next);
 	*lst = (*lst)->next;
 
-	if (!strip_suffix(preq->tmpfile.buf, ".pack.temp", &len))
-		BUG("pack tmpfile does not end in .pack.temp?");
-	tmp_idx = xstrfmt("%.*s.idx.temp", (int)len, preq->tmpfile.buf);
+	tmpfile_fd = xopen(preq->tmpfile.buf, O_RDONLY);
 
 	argv_array_push(&ip.args, "index-pack");
-	argv_array_pushl(&ip.args, "-o", tmp_idx, NULL);
-	argv_array_push(&ip.args, preq->tmpfile.buf);
+	argv_array_push(&ip.args, "--stdin");
+	argv_array_pushf(&ip.args, "--keep=git %"PRIuMAX, (uintmax_t)getpid());
 	ip.git_cmd = 1;
-	ip.no_stdin = 1;
-	ip.no_stdout = 1;
+	ip.in = tmpfile_fd;
+	ip.out = -1;
 
-	if (run_command(&ip)) {
-		unlink(preq->tmpfile.buf);
-		unlink(tmp_idx);
-		free(tmp_idx);
-		return -1;
+	if (start_command(&ip)) {
+		ret = -1;
+		goto cleanup;
 	}
 
-	unlink(sha1_pack_index_name(p->sha1));
+	*lockfile = index_pack_lockfile(ip.out);
+	close(ip.out);
 
-	if (finalize_object_file(preq->tmpfile.buf, sha1_pack_name(p->sha1))
-	 || finalize_object_file(tmp_idx, sha1_pack_index_name(p->sha1))) {
-		free(tmp_idx);
-		return -1;
+	if (finish_command(&ip)) {
+		ret = -1;
+		goto cleanup;
 	}
 
 	install_packed_git(the_repository, p);
-	free(tmp_idx);
-	return 0;
+cleanup:
+	close(tmpfile_fd);
+	unlink(preq->tmpfile.buf);
+	return ret;
 }
 
 struct http_pack_request *new_http_pack_request(
diff --git a/http.h b/http.h
index 4eb4e808e5..20d1c85d0b 100644
--- a/http.h
+++ b/http.h
@@ -212,7 +212,7 @@ struct http_pack_request {
 
 extern struct http_pack_request *new_http_pack_request(
 	struct packed_git *target, const char *base_url);
-extern int finish_http_pack_request(struct http_pack_request *preq);
+int finish_http_pack_request(struct http_pack_request *preq, char **lockfile);
 extern void release_http_pack_request(struct http_pack_request *preq);
 
 /* Helpers for fetching object */
-- 
2.19.0.271.gfe8321ec05.dirty



* [WIP 2/7] http: improve documentation of http_pack_request
  2019-02-23 23:38 [WIP 0/7] CDN offloading of fetch response Jonathan Tan
  2019-02-23 23:38 ` [WIP 1/7] http: use --stdin and --keep when downloading pack Jonathan Tan
@ 2019-02-23 23:38 ` Jonathan Tan
  2019-02-23 23:38 ` [WIP 3/7] http-fetch: support fetching packfiles by URL Jonathan Tan
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 45+ messages in thread
From: Jonathan Tan @ 2019-02-23 23:38 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster, peff, christian.couder, avarab

struct http_pack_request and the functions that use it will be modified
in a subsequent patch. Using it is complicated (the caller must invoke
the initialization function and then set some, but not all, fields in
the returned struct), so add some documentation to help future users.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 http.h | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/http.h b/http.h
index 20d1c85d0b..1aa556257e 100644
--- a/http.h
+++ b/http.h
@@ -202,14 +202,31 @@ extern int http_get_info_packs(const char *base_url,
 	struct packed_git **packs_head);
 
 struct http_pack_request {
+	/*
+	 * Initialized by new_http_pack_request().
+	 */
 	char *url;
 	struct packed_git *target;
+	struct active_request_slot *slot;
+
+	/*
+	 * After calling new_http_pack_request(), point lst to the head of the
+	 * pack list that target is in. finish_http_pack_request() will remove
+	 * target from lst and call install_packed_git() on target.
+	 */
 	struct packed_git **lst;
+
+	/*
+	 * State managed by functions in http.c.
+	 */
 	FILE *packfile;
 	struct strbuf tmpfile;
-	struct active_request_slot *slot;
 };
 
+/*
+ * target must be an element in a pack list obtained from
+ * http_get_info_packs().
+ */
 extern struct http_pack_request *new_http_pack_request(
 	struct packed_git *target, const char *base_url);
 int finish_http_pack_request(struct http_pack_request *preq, char **lockfile);
-- 
2.19.0.271.gfe8321ec05.dirty



* [WIP 3/7] http-fetch: support fetching packfiles by URL
  2019-02-23 23:38 [WIP 0/7] CDN offloading of fetch response Jonathan Tan
  2019-02-23 23:38 ` [WIP 1/7] http: use --stdin and --keep when downloading pack Jonathan Tan
  2019-02-23 23:38 ` [WIP 2/7] http: improve documentation of http_pack_request Jonathan Tan
@ 2019-02-23 23:38 ` Jonathan Tan
  2019-02-23 23:38 ` [WIP 4/7] Documentation: order protocol v2 sections Jonathan Tan
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 45+ messages in thread
From: Jonathan Tan @ 2019-02-23 23:38 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster, peff, christian.couder, avarab

Teach http-fetch the ability to download packfiles directly, given a
URL, and to verify them.

The http_pack_request suite of functions has been modified to support a
NULL target. When target is NULL, the given URL is downloaded directly
instead of being treated as the root of a repository.
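
A sketch of the new invocation (the URL is illustrative; it must point
directly at a packfile rather than at the root of a repository):

    git -C <repo> http-fetch --packfile \
        "https://cdn.example.com/path/to/pack-<hash>.pack"

The downloaded pack is fed to "index-pack --stdin --keep" as in the
earlier patch, so the corresponding .idx file is generated locally.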

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 Documentation/git-http-fetch.txt |  7 +++-
 http-fetch.c                     | 65 +++++++++++++++++++++++++-------
 http.c                           | 41 ++++++++++++++------
 http.h                           | 11 ++++--
 t/t5550-http-fetch-dumb.sh       | 18 +++++++++
 5 files changed, 113 insertions(+), 29 deletions(-)

diff --git a/Documentation/git-http-fetch.txt b/Documentation/git-http-fetch.txt
index 666b042679..e667544bb1 100644
--- a/Documentation/git-http-fetch.txt
+++ b/Documentation/git-http-fetch.txt
@@ -9,7 +9,7 @@ git-http-fetch - Download from a remote Git repository via HTTP
 SYNOPSIS
 --------
 [verse]
-'git http-fetch' [-c] [-t] [-a] [-d] [-v] [-w filename] [--recover] [--stdin] <commit> <url>
+'git http-fetch' [-c] [-t] [-a] [-d] [-v] [-w filename] [--recover] [--stdin | --packfile | <commit>] <url>
 
 DESCRIPTION
 -----------
@@ -40,6 +40,11 @@ commit-id::
 
 		<commit-id>['\t'<filename-as-in--w>]
 
+--packfile::
+	Instead of a commit id on the command line (which is not expected in
+	this case), 'git http-fetch' fetches the packfile directly at the given
+	URL and generates the corresponding .idx file.
+
 --recover::
 	Verify that everything reachable from target is fetched.  Used after
 	an earlier fetch is interrupted.
diff --git a/http-fetch.c b/http-fetch.c
index a32ac118d9..d283ce83a5 100644
--- a/http-fetch.c
+++ b/http-fetch.c
@@ -5,7 +5,7 @@
 #include "walker.h"
 
 static const char http_fetch_usage[] = "git http-fetch "
-"[-c] [-t] [-a] [-v] [--recover] [-w ref] [--stdin] commit-id url";
+"[-c] [-t] [-a] [-v] [--recover] [-w ref] [--stdin | --packfile | commit-id] url";
 
 int cmd_main(int argc, const char **argv)
 {
@@ -19,6 +19,7 @@ int cmd_main(int argc, const char **argv)
 	int rc = 0;
 	int get_verbosely = 0;
 	int get_recover = 0;
+	int packfile = 0;
 
 	while (arg < argc && argv[arg][0] == '-') {
 		if (argv[arg][1] == 't') {
@@ -35,43 +36,81 @@ int cmd_main(int argc, const char **argv)
 			get_recover = 1;
 		} else if (!strcmp(argv[arg], "--stdin")) {
 			commits_on_stdin = 1;
+		} else if (!strcmp(argv[arg], "--packfile")) {
+			packfile = 1;
 		}
 		arg++;
 	}
-	if (argc != arg + 2 - commits_on_stdin)
+	if (argc != arg + 2 - (commits_on_stdin || packfile))
 		usage(http_fetch_usage);
 	if (commits_on_stdin) {
 		commits = walker_targets_stdin(&commit_id, &write_ref);
+	} else if (packfile) {
+		/* URL will be set later */
 	} else {
 		commit_id = (char **) &argv[arg++];
 		commits = 1;
 	}
 
-	if (argv[arg])
-		str_end_url_with_slash(argv[arg], &url);
+	if (packfile) {
+		url = xstrdup(argv[arg]);
+	} else {
+		if (argv[arg])
+			str_end_url_with_slash(argv[arg], &url);
+	}
 
 	setup_git_directory();
 
 	git_config(git_default_config, NULL);
 
 	http_init(NULL, url, 0);
-	walker = get_http_walker(url);
-	walker->get_verbosely = get_verbosely;
-	walker->get_recover = get_recover;
 
-	rc = walker_fetch(walker, commits, commit_id, write_ref, url);
+	if (packfile) {
+		struct http_pack_request *preq;
+		struct slot_results results;
+		int ret;
+		char *lockfile;
+
+		preq = new_http_pack_request(NULL, url);
+		if (preq == NULL)
+			die("couldn't create http pack request");
+		preq->slot->results = &results;
+
+		if (start_active_slot(preq->slot)) {
+			run_active_slot(preq->slot);
+			if (results.curl_result != CURLE_OK) {
+				die("Unable to get pack file %s\n%s", preq->url,
+				    curl_errorstr);
+			}
+		} else {
+			die("Unable to start request");
+		}
+
+		if ((ret = finish_http_pack_request(preq, &lockfile)))
+			die("finish_http_pack_request gave result %d", ret);
+		unlink(lockfile);
+		release_http_pack_request(preq);
+		rc = 0;
+	} else {
+		walker = get_http_walker(url);
+		walker->get_verbosely = get_verbosely;
+		walker->get_recover = get_recover;
+
+		rc = walker_fetch(walker, commits, commit_id, write_ref, url);
 
-	if (commits_on_stdin)
-		walker_targets_free(commits, commit_id, write_ref);
+		if (commits_on_stdin)
+			walker_targets_free(commits, commit_id, write_ref);
 
-	if (walker->corrupt_object_found) {
-		fprintf(stderr,
+		if (walker->corrupt_object_found) {
+			fprintf(stderr,
 "Some loose object were found to be corrupt, but they might be just\n"
 "a false '404 Not Found' error message sent with incorrect HTTP\n"
 "status code.  Suggest running 'git fsck'.\n");
+		}
+
+		walker_free(walker);
 	}
 
-	walker_free(walker);
 	http_cleanup();
 
 	free(url);
diff --git a/http.c b/http.c
index 5f8e602cd2..73c3e6295b 100644
--- a/http.c
+++ b/http.c
@@ -2208,15 +2208,18 @@ int finish_http_pack_request(struct http_pack_request *preq, char **lockfile)
 	int tmpfile_fd;
 	int ret = 0;
 
-	close_pack_index(p);
+	if (p)
+		close_pack_index(p);
 
 	fclose(preq->packfile);
 	preq->packfile = NULL;
 
-	lst = preq->lst;
-	while (*lst != p)
-		lst = &((*lst)->next);
-	*lst = (*lst)->next;
+	if (p) {
+		lst = preq->lst;
+		while (*lst != p)
+			lst = &((*lst)->next);
+		*lst = (*lst)->next;
+	}
 
 	tmpfile_fd = xopen(preq->tmpfile.buf, O_RDONLY);
 
@@ -2240,7 +2243,8 @@ int finish_http_pack_request(struct http_pack_request *preq, char **lockfile)
 		goto cleanup;
 	}
 
-	install_packed_git(the_repository, p);
+	if (p)
+		install_packed_git(the_repository, p);
 cleanup:
 	close(tmpfile_fd);
 	unlink(preq->tmpfile.buf);
@@ -2258,12 +2262,24 @@ struct http_pack_request *new_http_pack_request(
 	strbuf_init(&preq->tmpfile, 0);
 	preq->target = target;
 
-	end_url_with_slash(&buf, base_url);
-	strbuf_addf(&buf, "objects/pack/pack-%s.pack",
-		sha1_to_hex(target->sha1));
-	preq->url = strbuf_detach(&buf, NULL);
+	if (target) {
+		end_url_with_slash(&buf, base_url);
+		strbuf_addf(&buf, "objects/pack/pack-%s.pack",
+			sha1_to_hex(target->sha1));
+		preq->url = strbuf_detach(&buf, NULL);
+	} else {
+		preq->url = xstrdup(base_url);
+	}
+
+	if (target) {
+		strbuf_addf(&preq->tmpfile, "%s.temp",
+			    sha1_pack_name(target->sha1));
+	} else {
+		strbuf_addf(&preq->tmpfile, "%s/pack/pack-", get_object_directory());
+		strbuf_addstr_urlencode(&preq->tmpfile, base_url, 1);
+		strbuf_addstr(&preq->tmpfile, ".temp");
+	}
 
-	strbuf_addf(&preq->tmpfile, "%s.temp", sha1_pack_name(target->sha1));
 	preq->packfile = fopen(preq->tmpfile.buf, "a");
 	if (!preq->packfile) {
 		error("Unable to open local file %s for pack",
@@ -2287,7 +2303,8 @@ struct http_pack_request *new_http_pack_request(
 		if (http_is_verbose)
 			fprintf(stderr,
 				"Resuming fetch of pack %s at byte %"PRIuMAX"\n",
-				sha1_to_hex(target->sha1), (uintmax_t)prev_posn);
+				target ? sha1_to_hex(target->sha1) : base_url,
+				(uintmax_t)prev_posn);
 		http_opt_request_remainder(preq->slot->curl, prev_posn);
 	}
 
diff --git a/http.h b/http.h
index 1aa556257e..a33f2aa4b9 100644
--- a/http.h
+++ b/http.h
@@ -210,7 +210,8 @@ struct http_pack_request {
 	struct active_request_slot *slot;
 
 	/*
-	 * After calling new_http_pack_request(), point lst to the head of the
+	 * After calling new_http_pack_request(), if fetching a pack that
+	 * http_get_info_packs() told us about, point lst to the head of the
 	 * pack list that target is in. finish_http_pack_request() will remove
 	 * target from lst and call install_packed_git() on target.
 	 */
@@ -224,8 +225,12 @@ struct http_pack_request {
 };
 
 /*
- * target must be an element in a pack list obtained from
- * http_get_info_packs().
+ * If fetching a pack that http_get_info_packs() told us about, set target to
+ * an element in a pack list obtained from http_get_info_packs(). The actual
+ * URL fetched will be base_url followed by a suffix with the hash of the pack.
+ *
+ * Otherwise, set target to NULL. The actual URL fetched will be base_url
+ * itself.
  */
 extern struct http_pack_request *new_http_pack_request(
 	struct packed_git *target, const char *base_url);
diff --git a/t/t5550-http-fetch-dumb.sh b/t/t5550-http-fetch-dumb.sh
index 6d7d88ccc9..be81d33d68 100755
--- a/t/t5550-http-fetch-dumb.sh
+++ b/t/t5550-http-fetch-dumb.sh
@@ -199,6 +199,16 @@ test_expect_success 'fetch packed objects' '
 	git clone $HTTPD_URL/dumb/repo_pack.git
 '
 
+test_expect_success 'http-fetch --packfile' '
+	git init packfileclient &&
+	p=$(cd "$HTTPD_DOCUMENT_ROOT_PATH"/repo_pack.git && ls objects/pack/pack-*.pack) &&
+	git -C packfileclient http-fetch --packfile "$HTTPD_URL"/dumb/repo_pack.git/$p &&
+
+	# Ensure that it has the HEAD of repo_pack, at least
+	HASH=$(git -C "$HTTPD_DOCUMENT_ROOT_PATH"/repo_pack.git rev-parse HEAD) &&
+	git -C packfileclient cat-file -e "$HASH"
+'
+
 test_expect_success 'fetch notices corrupt pack' '
 	cp -R "$HTTPD_DOCUMENT_ROOT_PATH"/repo_pack.git "$HTTPD_DOCUMENT_ROOT_PATH"/repo_bad1.git &&
 	(cd "$HTTPD_DOCUMENT_ROOT_PATH"/repo_bad1.git &&
@@ -214,6 +224,14 @@ test_expect_success 'fetch notices corrupt pack' '
 	)
 '
 
+test_expect_success 'http-fetch --packfile with corrupt pack' '
+	rm -rf packfileclient &&
+	git init packfileclient &&
+	p=$(cd "$HTTPD_DOCUMENT_ROOT_PATH"/repo_bad1.git && ls objects/pack/pack-*.pack) &&
+	test_must_fail git -C packfileclient http-fetch --packfile \
+		"$HTTPD_URL"/dumb/repo_bad1.git/$p
+'
+
 test_expect_success 'fetch notices corrupt idx' '
 	cp -R "$HTTPD_DOCUMENT_ROOT_PATH"/repo_pack.git "$HTTPD_DOCUMENT_ROOT_PATH"/repo_bad2.git &&
 	(cd "$HTTPD_DOCUMENT_ROOT_PATH"/repo_bad2.git &&
-- 
2.19.0.271.gfe8321ec05.dirty



* [WIP 4/7] Documentation: order protocol v2 sections
  2019-02-23 23:38 [WIP 0/7] CDN offloading of fetch response Jonathan Tan
                   ` (2 preceding siblings ...)
  2019-02-23 23:38 ` [WIP 3/7] http-fetch: support fetching packfiles by URL Jonathan Tan
@ 2019-02-23 23:38 ` Jonathan Tan
  2019-02-23 23:38 ` [WIP 5/7] Documentation: add Packfile URIs design doc Jonathan Tan
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 45+ messages in thread
From: Jonathan Tan @ 2019-02-23 23:38 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster, peff, christian.couder, avarab

The current C Git implementation expects Git servers to follow a
specific order of sections when transmitting protocol v2 responses, but
this is not explicit in the documentation. Make the order explicit.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 Documentation/technical/protocol-v2.txt | 18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/Documentation/technical/protocol-v2.txt b/Documentation/technical/protocol-v2.txt
index ead85ce35c..36239ec7e9 100644
--- a/Documentation/technical/protocol-v2.txt
+++ b/Documentation/technical/protocol-v2.txt
@@ -325,11 +325,11 @@ included in the client's request:
 
 The response of `fetch` is broken into a number of sections separated by
 delimiter packets (0001), with each section beginning with its section
-header.
+header. Most sections are sent only when the packfile is sent.
 
-    output = *section
-    section = (acknowledgments | shallow-info | wanted-refs | packfile)
-	      (flush-pkt | delim-pkt)
+    output = acknowledgements flush-pkt |
+	     [acknowledgments delim-pkt] [shallow-info delim-pkt]
+	     [wanted-refs delim-pkt] packfile flush-pkt
 
     acknowledgments = PKT-LINE("acknowledgments" LF)
 		      (nak | *ack)
@@ -351,9 +351,10 @@ header.
 	       *PKT-LINE(%x01-03 *%x00-ff)
 
     acknowledgments section
-	* If the client determines that it is finished with negotiations
-	  by sending a "done" line, the acknowledgments sections MUST be
-	  omitted from the server's response.
+	* If the client determines that it is finished with negotiations by
+	  sending a "done" line (thus requiring the server to send a packfile),
+	  the acknowledgments sections MUST be omitted from the server's
+	  response.
 
 	* Always begins with the section header "acknowledgments"
 
@@ -404,9 +405,6 @@ header.
 	  which the client has not indicated was shallow as a part of
 	  its request.
 
-	* This section is only included if a packfile section is also
-	  included in the response.
-
     wanted-refs section
 	* This section is only included if the client has requested a
 	  ref using a 'want-ref' line and if a packfile section is also
-- 
2.19.0.271.gfe8321ec05.dirty



* [WIP 5/7] Documentation: add Packfile URIs design doc
  2019-02-23 23:38 [WIP 0/7] CDN offloading of fetch response Jonathan Tan
                   ` (3 preceding siblings ...)
  2019-02-23 23:38 ` [WIP 4/7] Documentation: order protocol v2 sections Jonathan Tan
@ 2019-02-23 23:38 ` Jonathan Tan
  2019-02-23 23:39 ` [WIP 6/7] upload-pack: refactor reading of pack-objects out Jonathan Tan
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 45+ messages in thread
From: Jonathan Tan @ 2019-02-23 23:38 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster, peff, christian.couder, avarab

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 Documentation/technical/packfile-uri.txt | 79 ++++++++++++++++++++++++
 Documentation/technical/protocol-v2.txt  |  6 +-
 2 files changed, 84 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/technical/packfile-uri.txt

diff --git a/Documentation/technical/packfile-uri.txt b/Documentation/technical/packfile-uri.txt
new file mode 100644
index 0000000000..97047dd1d2
--- /dev/null
+++ b/Documentation/technical/packfile-uri.txt
@@ -0,0 +1,79 @@
+Packfile URIs
+=============
+
+This feature allows servers to serve part of their packfile response as URIs.
+This allows server designs that improve scalability in bandwidth and CPU usage
+(for example, by serving some data through a CDN), and (in the future) provides
+some measure of resumability to clients.
+
+This feature is available only in protocol version 2.
+
+Protocol
+--------
+
+The server advertises `packfile-uris`.
+
+If the client then communicates which protocols (HTTPS, etc.) it supports with
+a `packfile-uris` argument, the server MAY send a `packfile-uris` section
+directly before the `packfile` section (right after `wanted-refs` if it is
+sent) containing URIs of any of the given protocols. The URIs point to
+packfiles that use only features that the client has declared that it supports
+(e.g. ofs-delta and thin-pack). See protocol-v2.txt for the documentation of
+this section.
+
+Clients then should understand that the returned packfile could be incomplete,
+and that it needs to download all the given URIs before the fetch or clone is
+complete.
+
+Server design
+-------------
+
+The server can be trivially made compatible with the proposed protocol by
+having it advertise `packfile-uris`, tolerating the client sending
+`packfile-uris`, and never sending any `packfile-uris` section. But we should
+include some sort of non-trivial implementation in the Minimum Viable Product,
+at least so that we can test the client.
+
+This is the implementation: a feature, marked experimental, that allows the
+server to be configured by one or more `uploadpack.blobPackfileUri=<sha1>
+<uri>` entries. Whenever the list of objects to be sent is assembled, a blob
+with the given sha1 can be replaced by the given URI. This allows, for example,
+servers to delegate serving of large blobs to CDNs.
+
+Client design
+-------------
+
+While fetching, the client needs to remember the list of URIs and cannot
+declare that the fetch is complete until all URIs have been downloaded as
+packfiles.
+
+The division of work (initial fetch + additional URIs) introduces convenient
+points for resumption of an interrupted clone - such resumption can be done
+after the Minimum Viable Product (see "Future work").
+
+The client can inhibit this feature (i.e. refrain from sending the
+`packfile-urls` parameter) by passing --no-packfile-urls to `git fetch`.
+
+Future work
+-----------
+
+The protocol design allows some evolution of the server and client without any
+need for protocol changes, so only a small-scoped design is included here to
+form the MVP. For example, the following can be done:
+
+ * On the server, a long-running process that takes in entire requests and
+   outputs a list of URIs and the corresponding inclusion and exclusion sets of
+   objects. This allows, e.g., signed URIs to be used and packfiles for common
+   requests to be cached.
+ * On the client, resumption of clone. If a clone is interrupted, information
+   could be recorded in the repository's config and a "clone-resume" command
+   can resume the clone in progress. (Resumption of subsequent fetches is more
+   difficult because that must deal with the user wanting to use the repository
+   even after the fetch was interrupted.)
+
+There are some possible features that will require a change in protocol:
+
+ * Additional HTTP headers (e.g. authentication)
+ * Byte range support
+ * Different file formats referenced by URIs (e.g. raw object)
+
diff --git a/Documentation/technical/protocol-v2.txt b/Documentation/technical/protocol-v2.txt
index 36239ec7e9..edb85c059b 100644
--- a/Documentation/technical/protocol-v2.txt
+++ b/Documentation/technical/protocol-v2.txt
@@ -329,7 +329,8 @@ header. Most sections are sent only when the packfile is sent.
 
     output = acknowledgements flush-pkt |
 	     [acknowledgments delim-pkt] [shallow-info delim-pkt]
-	     [wanted-refs delim-pkt] packfile flush-pkt
+	     [wanted-refs delim-pkt] [packfile-uris delim-pkt]
+	     packfile flush-pkt
 
     acknowledgments = PKT-LINE("acknowledgments" LF)
 		      (nak | *ack)
@@ -347,6 +348,9 @@ header. Most sections are sent only when the packfile is sent.
 		  *PKT-LINE(wanted-ref LF)
     wanted-ref = obj-id SP refname
 
+    packfile-uris = PKT-LINE("packfile-uris" LF) *packfile-uri
+    packfile-uri = PKT-LINE("uri" SP *%x20-ff LF)
+
     packfile = PKT-LINE("packfile" LF)
 	       *PKT-LINE(%x01-03 *%x00-ff)
 
-- 
2.19.0.271.gfe8321ec05.dirty



* [WIP 6/7] upload-pack: refactor reading of pack-objects out
  2019-02-23 23:38 [WIP 0/7] CDN offloading of fetch response Jonathan Tan
                   ` (4 preceding siblings ...)
  2019-02-23 23:38 ` [WIP 5/7] Documentation: add Packfile URIs design doc Jonathan Tan
@ 2019-02-23 23:39 ` Jonathan Tan
  2019-02-23 23:39 ` [WIP 7/7] upload-pack: send part of packfile response as uri Jonathan Tan
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 45+ messages in thread
From: Jonathan Tan @ 2019-02-23 23:39 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster, peff, christian.couder, avarab

Subsequent patches will change how the output of pack-objects is
processed, so extract that processing into its own function.

Currently, at most one byte can be buffered (in the "buffered" local
variable). One of those subsequent patches will require a larger buffer,
so replace that "buffered" local variable with a buffer array.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 upload-pack.c | 81 ++++++++++++++++++++++++++++++---------------------
 1 file changed, 47 insertions(+), 34 deletions(-)

diff --git a/upload-pack.c b/upload-pack.c
index d098ef5982..987d2e139b 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -102,14 +102,52 @@ static int write_one_shallow(const struct commit_graft *graft, void *cb_data)
 	return 0;
 }
 
+struct output_state {
+	char buffer[8193];
+	int used;
+};
+
+static int relay_pack_data(int pack_objects_out, struct output_state *os)
+{
+	/*
+	 * We keep the last byte to ourselves
+	 * in case we detect broken rev-list, so that we
+	 * can leave the stream corrupted.  This is
+	 * unfortunate -- unpack-objects would happily
+	 * accept a valid packdata with trailing garbage,
+	 * so appending garbage after we pass all the
+	 * pack data is not good enough to signal
+	 * breakage to downstream.
+	 */
+	ssize_t readsz;
+
+	readsz = xread(pack_objects_out, os->buffer + os->used,
+		       sizeof(os->buffer) - os->used);
+	if (readsz < 0) {
+		return readsz;
+	}
+	os->used += readsz;
+
+	if (os->used > 1) {
+		send_client_data(1, os->buffer, os->used - 1);
+		os->buffer[0] = os->buffer[os->used - 1];
+		os->used = 1;
+	} else {
+		send_client_data(1, os->buffer, os->used);
+		os->used = 0;
+	}
+
+	return readsz;
+}
+
 static void create_pack_file(const struct object_array *have_obj,
 			     const struct object_array *want_obj)
 {
 	struct child_process pack_objects = CHILD_PROCESS_INIT;
-	char data[8193], progress[128];
+	struct output_state output_state = {0};
+	char progress[128];
 	char abort_msg[] = "aborting due to possible repository "
 		"corruption on the remote side.";
-	int buffered = -1;
 	ssize_t sz;
 	int i;
 	FILE *pipe_fd;
@@ -239,39 +277,15 @@ static void create_pack_file(const struct object_array *have_obj,
 			continue;
 		}
 		if (0 <= pu && (pfd[pu].revents & (POLLIN|POLLHUP))) {
-			/* Data ready; we keep the last byte to ourselves
-			 * in case we detect broken rev-list, so that we
-			 * can leave the stream corrupted.  This is
-			 * unfortunate -- unpack-objects would happily
-			 * accept a valid packdata with trailing garbage,
-			 * so appending garbage after we pass all the
-			 * pack data is not good enough to signal
-			 * breakage to downstream.
-			 */
-			char *cp = data;
-			ssize_t outsz = 0;
-			if (0 <= buffered) {
-				*cp++ = buffered;
-				outsz++;
-			}
-			sz = xread(pack_objects.out, cp,
-				  sizeof(data) - outsz);
-			if (0 < sz)
-				;
-			else if (sz == 0) {
+			int result = relay_pack_data(pack_objects.out,
+						     &output_state);
+
+			if (result == 0) {
 				close(pack_objects.out);
 				pack_objects.out = -1;
-			}
-			else
+			} else if (result < 0) {
 				goto fail;
-			sz += outsz;
-			if (1 < sz) {
-				buffered = data[sz-1] & 0xFF;
-				sz--;
 			}
-			else
-				buffered = -1;
-			send_client_data(1, data, sz);
 		}
 
 		/*
@@ -296,9 +310,8 @@ static void create_pack_file(const struct object_array *have_obj,
 	}
 
 	/* flush the data */
-	if (0 <= buffered) {
-		data[0] = buffered;
-		send_client_data(1, data, 1);
+	if (output_state.used > 0) {
+		send_client_data(1, output_state.buffer, output_state.used);
 		fprintf(stderr, "flushed.\n");
 	}
 	if (use_sideband)
-- 
2.19.0.271.gfe8321ec05.dirty



* [WIP 7/7] upload-pack: send part of packfile response as uri
  2019-02-23 23:38 [WIP 0/7] CDN offloading of fetch response Jonathan Tan
                   ` (5 preceding siblings ...)
  2019-02-23 23:39 ` [WIP 6/7] upload-pack: refactor reading of pack-objects out Jonathan Tan
@ 2019-02-23 23:39 ` Jonathan Tan
  2019-02-24 15:54   ` Junio C Hamano
                     ` (2 more replies)
  2019-02-25 21:30 ` [WIP 0/7] CDN offloading of fetch response Christian Couder
  2019-03-08 21:55 ` [PATCH v2 0/8] " Jonathan Tan
  8 siblings, 3 replies; 45+ messages in thread
From: Jonathan Tan @ 2019-02-23 23:39 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster, peff, christian.couder, avarab

Teach upload-pack to send part of its packfile response as URIs.

An administrator may configure a repository with one or more
"uploadpack.blobpackfileuri" lines, each line containing an OID and a
URI. A client may configure fetch.uriprotocols to be a comma-separated
list of protocols that it is willing to use to fetch additional
packfiles - this list will be sent to the server. Whenever an object
with one of those OIDs would appear in the packfile transmitted by
upload-pack, the server may exclude that object, and instead send the
URI. The client will then download the packs referred to by those URIs
before performing the connectivity check.
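
A sketch of the configuration, with illustrative object names and URLs
(the packfile at each URI must contain the corresponding blob, as in the
test added below):

    # on the server: one entry per blob that may be offloaded
    git -C server.git config --add uploadpack.blobpackfileuri \
        "<blob-sha1> https://cdn.example.com/<blob-sha1>.pack"

    # on the client: protocols acceptable for packfile URIs
    git -C client config fetch.uriprotocols https,http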

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 builtin/pack-objects.c | 63 ++++++++++++++++++++++++++++++++++
 fetch-pack.c           | 58 +++++++++++++++++++++++++++++++
 t/t5702-protocol-v2.sh | 54 +++++++++++++++++++++++++++++
 upload-pack.c          | 78 ++++++++++++++++++++++++++++++++++++++----
 4 files changed, 247 insertions(+), 6 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index a9fac7c128..73309821e2 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -110,6 +110,8 @@ static unsigned long window_memory_limit = 0;
 
 static struct list_objects_filter_options filter_options;
 
+static struct string_list uri_protocols = STRING_LIST_INIT_NODUP;
+
 enum missing_action {
 	MA_ERROR = 0,      /* fail if any missing objects are encountered */
 	MA_ALLOW_ANY,      /* silently allow ALL missing objects */
@@ -118,6 +120,14 @@ enum missing_action {
 static enum missing_action arg_missing_action;
 static show_object_fn fn_show_object;
 
+struct configured_exclusion {
+	struct oidmap_entry e;
+	char *uri;
+};
+static struct oidmap configured_exclusions;
+
+static struct oidset excluded_by_config;
+
 /*
  * stats
  */
@@ -832,6 +842,23 @@ static off_t write_reused_pack(struct hashfile *f)
 	return reuse_packfile_offset - sizeof(struct pack_header);
 }
 
+static void write_excluded_by_configs(void)
+{
+	struct oidset_iter iter;
+	const struct object_id *oid;
+
+	oidset_iter_init(&excluded_by_config, &iter);
+	while ((oid = oidset_iter_next(&iter))) {
+		struct configured_exclusion *ex =
+			oidmap_get(&configured_exclusions, oid);
+
+		if (!ex)
+			BUG("configured exclusion wasn't configured");
+		write_in_full(1, ex->uri, strlen(ex->uri));
+		write_in_full(1, "\n", 1);
+	}
+}
+
 static const char no_split_warning[] = N_(
 "disabling bitmap writing, packs are split due to pack.packSizeLimit"
 );
@@ -1125,6 +1152,25 @@ static int want_object_in_pack(const struct object_id *oid,
 		}
 	}
 
+	if (uri_protocols.nr) {
+		struct configured_exclusion *ex =
+			oidmap_get(&configured_exclusions, oid);
+		int i;
+		const char *p;
+
+		if (ex) {
+			for (i = 0; i < uri_protocols.nr; i++) {
+				if (skip_prefix(ex->uri,
+						uri_protocols.items[i].string,
+						&p) &&
+				    *p == ':') {
+					oidset_insert(&excluded_by_config, oid);
+					return 0;
+				}
+			}
+		}
+	}
+
 	return 1;
 }
 
@@ -2726,6 +2772,19 @@ static int git_pack_config(const char *k, const char *v, void *cb)
 			    pack_idx_opts.version);
 		return 0;
 	}
+	if (!strcmp(k, "uploadpack.blobpackfileuri")) {
+		struct configured_exclusion *ex = xmalloc(sizeof(*ex));
+		const char *end;
+
+		if (parse_oid_hex(v, &ex->e.oid, &end) || *end != ' ')
+			die(_("value of uploadpack.blobpackfileuri must be "
+			      "of the form '<sha-1> <uri>' (got '%s')"), v);
+		if (oidmap_get(&configured_exclusions, &ex->e.oid))
+			die(_("object already configured in another "
+			      "uploadpack.blobpackfileuri (got '%s')"), v);
+		ex->uri = xstrdup(end + 1);
+		oidmap_put(&configured_exclusions, ex);
+	}
 	return git_default_config(k, v, cb);
 }
 
@@ -3318,6 +3377,9 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 			 N_("do not pack objects in promisor packfiles")),
 		OPT_BOOL(0, "delta-islands", &use_delta_islands,
 			 N_("respect islands during delta compression")),
+		OPT_STRING_LIST(0, "uri-protocol", &uri_protocols,
+				N_("protocol"),
+				N_("exclude any configured uploadpack.blobpackfileuri with this protocol")),
 		OPT_END(),
 	};
 
@@ -3492,6 +3554,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		return 0;
 	if (nr_result)
 		prepare_pack(window, depth);
+	write_excluded_by_configs();
 	write_pack_file();
 	if (progress)
 		fprintf_ln(stderr,
diff --git a/fetch-pack.c b/fetch-pack.c
index 812be15d7e..edbe4b3ec3 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -38,6 +38,7 @@ static struct lock_file shallow_lock;
 static const char *alternate_shallow_file;
 static char *negotiation_algorithm;
 static struct strbuf fsck_msg_types = STRBUF_INIT;
+static struct string_list uri_protocols = STRING_LIST_INIT_DUP;
 
 /* Remember to update object flag allocation in object.h */
 #define COMPLETE	(1U << 0)
@@ -1147,6 +1148,26 @@ static int send_fetch_request(struct fetch_negotiator *negotiator, int fd_out,
 		warning("filtering not recognized by server, ignoring");
 	}
 
+	if (server_supports_feature("fetch", "packfile-uris", 0)) {
+		int i;
+		struct strbuf to_send = STRBUF_INIT;
+
+		for (i = 0; i < uri_protocols.nr; i++) {
+			const char *s = uri_protocols.items[i].string;
+
+			if (!strcmp(s, "https") || !strcmp(s, "http")) {
+				if (to_send.len)
+					strbuf_addch(&to_send, ',');
+				strbuf_addstr(&to_send, s);
+			}
+		}
+		if (to_send.len) {
+			packet_buf_write(&req_buf, "packfile-uris %s",
+					 to_send.buf);
+			strbuf_release(&to_send);
+		}
+	}
+
 	/* add wants */
 	add_wants(args->no_dependents, wants, &req_buf);
 
@@ -1322,6 +1343,32 @@ static void receive_wanted_refs(struct packet_reader *reader,
 		die(_("error processing wanted refs: %d"), reader->status);
 }
 
+static void receive_packfile_uris(struct packet_reader *reader)
+{
+	process_section_header(reader, "packfile-uris", 0);
+	while (packet_reader_read(reader) == PACKET_READ_NORMAL) {
+		const char *p;
+		struct child_process cmd = CHILD_PROCESS_INIT;
+
+
+		if (!skip_prefix(reader->line, "uri ", &p))
+			die("expected 'uri <uri>', got: %s\n", reader->line);
+
+		argv_array_push(&cmd.args, "http-fetch");
+		argv_array_push(&cmd.args, "--packfile");
+		argv_array_push(&cmd.args, p);
+		cmd.git_cmd = 1;
+		cmd.no_stdin = 1;
+		cmd.no_stdout = 1;
+		if (start_command(&cmd))
+			die("fetch-pack: unable to spawn");
+		if (finish_command(&cmd))
+			die("fetch-pack: unable to finish");
+	}
+	if (reader->status != PACKET_READ_DELIM)
+		die("expected DELIM");
+}
+
 enum fetch_state {
 	FETCH_CHECK_LOCAL = 0,
 	FETCH_SEND_REQUEST,
@@ -1414,6 +1461,9 @@ static struct ref *do_fetch_pack_v2(struct fetch_pack_args *args,
 				receive_wanted_refs(&reader, sought, nr_sought);
 
 			/* get the pack */
+			if (process_section_header(&reader, "packfile-uris", 1)) {
+				receive_packfile_uris(&reader);
+			}
 			process_section_header(&reader, "packfile", 0);
 			if (get_pack(args, fd, pack_lockfile))
 				die(_("git fetch-pack: fetch failed."));
@@ -1464,6 +1514,14 @@ static void fetch_pack_config(void)
 	git_config_get_bool("transfer.fsckobjects", &transfer_fsck_objects);
 	git_config_get_string("fetch.negotiationalgorithm",
 			      &negotiation_algorithm);
+	if (!uri_protocols.nr) {
+		char *str;
+
+		if (!git_config_get_string("fetch.uriprotocols", &str) && str) {
+			string_list_split(&uri_protocols, str, ',', -1);
+			free(str);
+		}
+	}
 
 	git_config(fetch_pack_config_cb, NULL);
 }
diff --git a/t/t5702-protocol-v2.sh b/t/t5702-protocol-v2.sh
index db4ae09f2f..6dbe2e9584 100755
--- a/t/t5702-protocol-v2.sh
+++ b/t/t5702-protocol-v2.sh
@@ -656,6 +656,60 @@ test_expect_success 'when server does not send "ready", expect FLUSH' '
 	test_i18ngrep "expected no other sections to be sent after no .ready." err
 '
 
+test_expect_success 'part of packfile response provided as URI' '
+	P="$HTTPD_DOCUMENT_ROOT_PATH/http_parent" &&
+	rm -rf "$P" http_child log &&
+
+	git init "$P" &&
+	git -C "$P" config "uploadpack.allowsidebandall" "true" &&
+
+	echo my-blob >"$P/my-blob" &&
+	git -C "$P" add my-blob &&
+	echo other-blob >"$P/other-blob" &&
+	git -C "$P" add other-blob &&
+	git -C "$P" commit -m x &&
+
+	# Create a packfile for my-blob and configure it for exclusion.
+	git -C "$P" hash-object my-blob >h &&
+	git -C "$P" pack-objects --stdout <h \
+		>"$HTTPD_DOCUMENT_ROOT_PATH/one.pack" &&
+	git -C "$P" config \
+		"uploadpack.blobpackfileuri" \
+		"$(cat h) $HTTPD_URL/dumb/one.pack" &&
+
+	# Do the same for other-blob.
+	git -C "$P" hash-object other-blob >h2 &&
+	git -C "$P" pack-objects --stdout <h2 \
+		>"$HTTPD_DOCUMENT_ROOT_PATH/two.pack" &&
+	git -C "$P" config --add \
+		"uploadpack.blobpackfileuri" \
+		"$(cat h2) $HTTPD_URL/dumb/two.pack" &&
+
+	GIT_TRACE_PACKET="$(pwd)/log" GIT_TEST_SIDEBAND_ALL=1 \
+		git -c protocol.version=2 \
+		-c fetch.uriprotocols=http,https \
+		clone  "$HTTPD_URL/smart/http_parent" http_child &&
+
+	# Ensure that my-blob and other-blob are in separate packfiles.
+	for idx in http_child/.git/objects/pack/*.idx
+	do
+		git verify-pack --verbose $idx >out &&
+		if test "$(grep "^[0-9a-f]\{40\} " out | wc -l)" = 1
+		then
+			if grep $(cat h) out
+			then
+				>hfound
+			fi &&
+			if grep $(cat h2) out
+			then
+				>h2found
+			fi
+		fi
+	done &&
+	test -f hfound &&
+	test -f h2found
+'
+
 stop_httpd
 
 test_done
diff --git a/upload-pack.c b/upload-pack.c
index 987d2e139b..2365b707bc 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -105,9 +105,12 @@ static int write_one_shallow(const struct commit_graft *graft, void *cb_data)
 struct output_state {
 	char buffer[8193];
 	int used;
+	unsigned packfile_uris_started : 1;
+	unsigned packfile_started : 1;
 };
 
-static int relay_pack_data(int pack_objects_out, struct output_state *os)
+static int relay_pack_data(int pack_objects_out, struct output_state *os,
+			   int write_packfile_line)
 {
 	/*
 	 * We keep the last byte to ourselves
@@ -128,6 +131,37 @@ static int relay_pack_data(int pack_objects_out, struct output_state *os)
 	}
 	os->used += readsz;
 
+	while (!os->packfile_started) {
+		char *p;
+		if (os->used >= 4 && !memcmp(os->buffer, "PACK", 4)) {
+			os->packfile_started = 1;
+			if (write_packfile_line) {
+				if (os->packfile_uris_started)
+					packet_delim(1);
+				packet_write_fmt(1, "\1packfile\n");
+			}
+			break;
+		}
+		if ((p = memchr(os->buffer, '\n', os->used))) {
+			if (!os->packfile_uris_started) {
+				os->packfile_uris_started = 1;
+				if (!write_packfile_line)
+					BUG("packfile_uris requires sideband-all");
+				packet_write_fmt(1, "\1packfile-uris\n");
+			}
+			*p = '\0';
+			packet_write_fmt(1, "\1uri %s\n", os->buffer);
+
+			os->used -= p - os->buffer + 1;
+			memmove(os->buffer, p + 1, os->used);
+		} else {
+			/*
+			 * Incomplete line.
+			 */
+			return readsz;
+		}
+	}
+
 	if (os->used > 1) {
 		send_client_data(1, os->buffer, os->used - 1);
 		os->buffer[0] = os->buffer[os->used - 1];
@@ -141,7 +175,8 @@ static int relay_pack_data(int pack_objects_out, struct output_state *os)
 }
 
 static void create_pack_file(const struct object_array *have_obj,
-			     const struct object_array *want_obj)
+			     const struct object_array *want_obj,
+			     const struct string_list *uri_protocols)
 {
 	struct child_process pack_objects = CHILD_PROCESS_INIT;
 	struct output_state output_state = {0};
@@ -192,6 +227,11 @@ static void create_pack_file(const struct object_array *have_obj,
 					 expanded_filter_spec.buf);
 		}
 	}
+	if (uri_protocols) {
+		for (i = 0; i < uri_protocols->nr; i++)
+			argv_array_pushf(&pack_objects.args, "--uri-protocol=%s",
+					 uri_protocols->items[0].string);
+	}
 
 	pack_objects.in = -1;
 	pack_objects.out = -1;
@@ -278,7 +318,8 @@ static void create_pack_file(const struct object_array *have_obj,
 		}
 		if (0 <= pu && (pfd[pu].revents & (POLLIN|POLLHUP))) {
 			int result = relay_pack_data(pack_objects.out,
-						     &output_state);
+						     &output_state,
+						     !!uri_protocols);
 
 			if (result == 0) {
 				close(pack_objects.out);
@@ -1123,7 +1164,7 @@ void upload_pack(struct upload_pack_options *options)
 	if (want_obj.nr) {
 		struct object_array have_obj = OBJECT_ARRAY_INIT;
 		get_common_commits(&reader, &have_obj, &want_obj);
-		create_pack_file(&have_obj, &want_obj);
+		create_pack_file(&have_obj, &want_obj, 0);
 	}
 }
 
@@ -1138,6 +1179,7 @@ struct upload_pack_data {
 	timestamp_t deepen_since;
 	int deepen_rev_list;
 	int deepen_relative;
+	struct string_list uri_protocols;
 
 	struct packet_writer writer;
 
@@ -1157,6 +1199,7 @@ static void upload_pack_data_init(struct upload_pack_data *data)
 	struct oid_array haves = OID_ARRAY_INIT;
 	struct object_array shallows = OBJECT_ARRAY_INIT;
 	struct string_list deepen_not = STRING_LIST_INIT_DUP;
+	struct string_list uri_protocols = STRING_LIST_INIT_DUP;
 
 	memset(data, 0, sizeof(*data));
 	data->wants = wants;
@@ -1164,6 +1207,7 @@ static void upload_pack_data_init(struct upload_pack_data *data)
 	data->haves = haves;
 	data->shallows = shallows;
 	data->deepen_not = deepen_not;
+	data->uri_protocols = uri_protocols;
 	packet_writer_init(&data->writer, 1);
 }
 
@@ -1322,9 +1366,17 @@ static void process_args(struct packet_reader *request,
 			continue;
 		}
 
+		if (skip_prefix(arg, "packfile-uris ", &p)) {
+			string_list_split(&data->uri_protocols, p, ',', -1);
+			continue;
+		}
+
 		/* ignore unknown lines maybe? */
 		die("unexpected line: '%s'", arg);
 	}
+
+	if (data->uri_protocols.nr && !data->writer.use_sideband)
+		string_list_clear(&data->uri_protocols, 0);
 }
 
 static int process_haves(struct oid_array *haves, struct oid_array *common,
@@ -1514,8 +1566,13 @@ int upload_pack_v2(struct repository *r, struct argv_array *keys,
 			send_wanted_ref_info(&data);
 			send_shallow_info(&data, &want_obj);
 
-			packet_writer_write(&data.writer, "packfile\n");
-			create_pack_file(&have_obj, &want_obj);
+			if (data.uri_protocols.nr) {
+				create_pack_file(&have_obj, &want_obj,
+						 &data.uri_protocols);
+			} else {
+				packet_write_fmt(1, "packfile\n");
+				create_pack_file(&have_obj, &want_obj, NULL);
+			}
 			state = FETCH_DONE;
 			break;
 		case FETCH_DONE:
@@ -1536,6 +1593,7 @@ int upload_pack_advertise(struct repository *r,
 		int allow_filter_value;
 		int allow_ref_in_want;
 		int allow_sideband_all_value;
+		char *str = NULL;
 
 		strbuf_addstr(value, "shallow");
 
@@ -1557,6 +1615,14 @@ int upload_pack_advertise(struct repository *r,
 					   &allow_sideband_all_value) &&
 		     allow_sideband_all_value))
 			strbuf_addstr(value, " sideband-all");
+
+		if (!repo_config_get_string(the_repository,
+					    "uploadpack.blobpackfileuri",
+					    &str) &&
+		    str) {
+			strbuf_addstr(value, " packfile-uris");
+			free(str);
+		}
 	}
 
 	return 1;
-- 
2.19.0.271.gfe8321ec05.dirty



* Re: [WIP 7/7] upload-pack: send part of packfile response as uri
  2019-02-23 23:39 ` [WIP 7/7] upload-pack: send part of packfile response as uri Jonathan Tan
@ 2019-02-24 15:54   ` Junio C Hamano
  2019-02-25 21:04   ` Christian Couder
  2019-03-01  0:09   ` Josh Steadmon
  2 siblings, 0 replies; 45+ messages in thread
From: Junio C Hamano @ 2019-02-24 15:54 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, peff, christian.couder, avarab

Jonathan Tan <jonathantanmy@google.com> writes:

> diff --git a/t/t5702-protocol-v2.sh b/t/t5702-protocol-v2.sh
> index db4ae09f2f..6dbe2e9584 100755
> --- a/t/t5702-protocol-v2.sh
> +++ b/t/t5702-protocol-v2.sh
> @@ -656,6 +656,60 @@ test_expect_success 'when server does not send "ready", expect FLUSH' '
>  	test_i18ngrep "expected no other sections to be sent after no .ready." err
>  '
>  
> +test_expect_success 'part of packfile response provided as URI' '
> ...
> +		if test "$(grep "^[0-9a-f]\{40\} " out | wc -l)" = 1
> +		then

Against this, test-lint-shell-syntax barks.  You'd have seen it if
you did "make test".

I am not sure hard-coding 40 here is a good idea for the longer term,
as we are *not* testing that the output from "verify-pack --verbose"
shows the full object name as a SHA-1 hash.

Perhaps squash something like this in ("16" is of course negotiable)?

 t/t5702-protocol-v2.sh | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/t/t5702-protocol-v2.sh b/t/t5702-protocol-v2.sh
index 6dbe2e9584..e9950f0853 100755
--- a/t/t5702-protocol-v2.sh
+++ b/t/t5702-protocol-v2.sh
@@ -694,7 +694,10 @@ test_expect_success 'part of packfile response provided as URI' '
 	for idx in http_child/.git/objects/pack/*.idx
 	do
 		git verify-pack --verbose $idx >out &&
-		if test "$(grep "^[0-9a-f]\{40\} " out | wc -l)" = 1
+		{
+			grep "^[0-9a-f]\{16,\} " out || :
+		} >out.objectlist &&
+		if test_line_count = 1 out.objectlist
 		then
 			if grep $(cat h) out
 			then


* Re: [WIP 7/7] upload-pack: send part of packfile response as uri
  2019-02-23 23:39 ` [WIP 7/7] upload-pack: send part of packfile response as uri Jonathan Tan
  2019-02-24 15:54   ` Junio C Hamano
@ 2019-02-25 21:04   ` Christian Couder
  2019-02-26  1:53     ` Jonathan Nieder
  2019-03-01  0:09   ` Josh Steadmon
  2 siblings, 1 reply; 45+ messages in thread
From: Christian Couder @ 2019-02-25 21:04 UTC (permalink / raw)
  To: Jonathan Tan
  Cc: git, Junio C Hamano, Jeff King,
	Ævar Arnfjörð Bjarmason

On Sun, Feb 24, 2019 at 12:39 AM Jonathan Tan <jonathantanmy@google.com> wrote:
>
> Teach upload-pack to send part of its packfile response as URIs.
>
> An administrator may configure a repository with one or more
> "uploadpack.blobpackfileuri" lines, each line containing an OID and a
> URI. A client may configure fetch.uriprotocols to be a comma-separated
> list of protocols that it is willing to use to fetch additional
> packfiles - this list will be sent to the server.

So only packfiles will be fetched by the client using those protocols.

> Whenever an object
> with one of those OIDs would appear in the packfile transmitted by
> upload-pack, the server may exclude that object, and instead send the
> URI.

OK, so each URI sent in the packfile-uris section corresponds to exactly
one object. And when the client fetches one such URI, it gets a packfile
that contains only the corresponding object. Or is there something I
have misunderstood?

> The client will then download the packs referred to by those URIs
> before performing the connectivity check.


* Re: [WIP 0/7] CDN offloading of fetch response
  2019-02-23 23:38 [WIP 0/7] CDN offloading of fetch response Jonathan Tan
                   ` (6 preceding siblings ...)
  2019-02-23 23:39 ` [WIP 7/7] upload-pack: send part of packfile response as uri Jonathan Tan
@ 2019-02-25 21:30 ` Christian Couder
  2019-02-25 23:45   ` Jonathan Nieder
  2019-03-08 21:55 ` [PATCH v2 0/8] " Jonathan Tan
  8 siblings, 1 reply; 45+ messages in thread
From: Christian Couder @ 2019-02-25 21:30 UTC (permalink / raw)
  To: Jonathan Tan
  Cc: git, Junio C Hamano, Jeff King,
	Ævar Arnfjörð Bjarmason

On Sun, Feb 24, 2019 at 12:39 AM Jonathan Tan <jonathantanmy@google.com> wrote:

> There are probably some more design discussions to be had:

[...]

>  - Client-side whitelist of protocol and hostnames. I've implemented
>    whitelist of protocol, but not hostnames.

I would appreciate a more complete answer to my comments in:

https://public-inbox.org/git/CAP8UFD16fvtu_dg3S_J9BjGpxAYvgp8SXscdh=TJB5jvAbzi4A@mail.gmail.com/

In particular, I'd like to know what the client should do if they find
out that, for example, a repo that contains a lot of large files is
configured so that the large files should be fetched from a CDN that
the client cannot use. Is the client forced to find or set up another
repo, configured differently, if the client still wants to use CDN
offloading?

Wouldn't it be better if the client could use the same repo and just
select a CDN configuration among many?


* Re: [WIP 0/7] CDN offloading of fetch response
  2019-02-25 21:30 ` [WIP 0/7] CDN offloading of fetch response Christian Couder
@ 2019-02-25 23:45   ` Jonathan Nieder
  2019-02-26  8:30     ` Christian Couder
  0 siblings, 1 reply; 45+ messages in thread
From: Jonathan Nieder @ 2019-02-25 23:45 UTC (permalink / raw)
  To: Christian Couder
  Cc: Jonathan Tan, git, Junio C Hamano, Jeff King,
	Ævar Arnfjörð Bjarmason

Hi,

Christian Couder wrote:
> On Sun, Feb 24, 2019 at 12:39 AM Jonathan Tan <jonathantanmy@google.com> wrote:

>> There are probably some more design discussions to be had:
>
> [...]
>
>>  - Client-side whitelist of protocol and hostnames. I've implemented
>>    whitelist of protocol, but not hostnames.
>
> I would appreciate a more complete answer to my comments in:
>
> https://public-inbox.org/git/CAP8UFD16fvtu_dg3S_J9BjGpxAYvgp8SXscdh=TJB5jvAbzi4A@mail.gmail.com/
>
> Especially I'd like to know what should the client do if they find out
> that for example a repo that contains a lot of large files is
> configured so that the large files should be fetched from a CDN that
> the client cannot use? Is the client forced to find or setup another
> repo configured differently if the client still wants to use CDN
> offloading?

The example from that message:

  For example I think the Great Firewall of China lets people in China
  use GitHub.com but not Google.com. So if people start configuring
  their repos on GitHub so that they send packs that contain Google.com
  CDN URLs (or actually anything that the Firewall blocks), it might
  create many problems for users in China if they don't have a way to
  opt out of receiving packs with those kind of URLs.

But the same thing can happen with redirects, with embedded assets in
web pages, and so on.  I think in this situation the user would likely
(and rightly) blame the host (github.com) for requiring access to a
separate inaccessible site, and the problem could be resolved with
them.

The beauty of this is that it's transparent to the client: the fact
that packfile transfer was offloaded to a CDN is an implementation
detail, and the server takes full responsibility for it.

This doesn't stop a hosting provider from using e.g. server options to
allow the client more control over how their response is served, just
like can be done for other features of how the transfer works (how
often to send progress updates, whether to prioritize latency or
throughput, etc).

What the client *can* do is turn off support for packfile URLs in a
request completely.  This is required for backward compatibility and
allows working around a host that has configured the feature
incorrectly.
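
In terms of these patches, turning it off is simply a matter of not
setting fetch.uriprotocols: the packfile-uris parameter is only sent
when that list is non-empty, so e.g. a plain

        git -c protocol.version=2 clone https://example.com/repo.git

(URL made up) still gets a single self-contained packfile from the
server.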

Thanks for an interesting example,
Jonathan

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [WIP 7/7] upload-pack: send part of packfile response as uri
  2019-02-25 21:04   ` Christian Couder
@ 2019-02-26  1:53     ` Jonathan Nieder
  2019-02-26  7:08       ` Christian Couder
  0 siblings, 1 reply; 45+ messages in thread
From: Jonathan Nieder @ 2019-02-26  1:53 UTC (permalink / raw)
  To: Christian Couder
  Cc: Jonathan Tan, git, Junio C Hamano, Jeff King,
	Ævar Arnfjörð Bjarmason

Hi,

Christian Couder wrote:
> On Sun, Feb 24, 2019 at 12:39 AM Jonathan Tan <jonathantanmy@google.com> wrote:

>> Teach upload-pack to send part of its packfile response as URIs.
>>
>> An administrator may configure a repository with one or more
>> "uploadpack.blobpackfileuri" lines, each line containing an OID and a
>> URI. A client may configure fetch.uriprotocols to be a comma-separated
>> list of protocols that it is willing to use to fetch additional
>> packfiles - this list will be sent to the server.
>
> So only packfiles will be fetched by the client using those protocols.

Yes.

>> Whenever an object
>> with one of those OIDs would appear in the packfile transmitted by
>> upload-pack, the server may exclude that object, and instead send the
>> URI.
>
> Ok, so each URI sent in the packfile corresponds to exactly one
> object. And when the client fetches one such URI it gets a packfile
> that contains only the corresponding object. Or is there something I
> misunderstood?

I think it's worth separating the protocol and the server
implementation:

The protocol allows arbitrary packfiles --- they do not have to be
single-object packfiles.

This patch for a server implementation only uses it to serve
single-object packfiles.
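
Concretely, the test in this patch sets things up roughly like this
(path and URL made up here) - the administrator builds one pack per
excluded blob and points the config at it:

        oid=$(git -C "$repo" hash-object my-blob) &&
        echo "$oid" | git -C "$repo" pack-objects --stdout \
                >/var/www/one.pack &&
        git -C "$repo" config --add uploadpack.blobpackfileuri \
                "$oid https://cdn.example.com/one.pack"

Nothing in the protocol itself requires that granularity, though.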

Thanks,
Jonathan

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [WIP 7/7] upload-pack: send part of packfile response as uri
  2019-02-26  1:53     ` Jonathan Nieder
@ 2019-02-26  7:08       ` Christian Couder
  0 siblings, 0 replies; 45+ messages in thread
From: Christian Couder @ 2019-02-26  7:08 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Jonathan Tan, git, Junio C Hamano, Jeff King,
	Ævar Arnfjörð Bjarmason

Hi,

On Tue, Feb 26, 2019 at 2:53 AM Jonathan Nieder <jrnieder@gmail.com> wrote:
>
> Christian Couder wrote:
> > On Sun, Feb 24, 2019 at 12:39 AM Jonathan Tan <jonathantanmy@google.com> wrote:
>
> > Ok, so each URI sent in the packfile corresponds to exactly one
> > object. And when the client fetches one such URI it gets a packfile
> > that contains only the corresponding object. Or is there something I
> > misunderstood?
>
> I think it's worth separating the protocol and the server
> implementation:
>
> The protocol allows arbitrary packfiles --- they do not have to be
> single-object packfiles.
>
> This patch for a server implementation only uses it to serve
> single-object packfiles.

Ok, this explains things much better.

Thanks,
Christian.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [WIP 0/7] CDN offloading of fetch response
  2019-02-25 23:45   ` Jonathan Nieder
@ 2019-02-26  8:30     ` Christian Couder
  2019-02-26  9:12       ` Ævar Arnfjörð Bjarmason
  2019-02-28 23:21       ` Jonathan Nieder
  0 siblings, 2 replies; 45+ messages in thread
From: Christian Couder @ 2019-02-26  8:30 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Jonathan Tan, git, Junio C Hamano, Jeff King,
	Ævar Arnfjörð Bjarmason

Hi,

On Tue, Feb 26, 2019 at 12:45 AM Jonathan Nieder <jrnieder@gmail.com> wrote:
>
> Christian Couder wrote:
> > On Sun, Feb 24, 2019 at 12:39 AM Jonathan Tan <jonathantanmy@google.com> wrote:
>
> > Especially I'd like to know what should the client do if they find out
> > that for example a repo that contains a lot of large files is
> > configured so that the large files should be fetched from a CDN that
> > the client cannot use? Is the client forced to find or setup another
> > repo configured differently if the client still wants to use CDN
> > offloading?
>
> The example from that message:
>
>   For example I think the Great Firewall of China lets people in China
>   use GitHub.com but not Google.com. So if people start configuring
>   their repos on GitHub so that they send packs that contain Google.com
>   CDN URLs (or actually anything that the Firewall blocks), it might
>   create many problems for users in China if they don't have a way to
>   opt out of receiving packs with those kind of URLs.
>
> But the same thing can happen with redirects, with embedded assets in
> web pages, and so on.

I don't think it's the same situation, because the CDN offloading is
likely to be used for large objects that some hosting sites like
GitHub, GitLab and BitBucket might not be ok with having them stored for
free on their machines. (I think the current limitations are around
10GB or 20GB, everything included, for each user.)

So it's likely that users will want a way to host on such sites
incomplete repos using CDN offloading to a CDN on another site. And
then if the CDN is not accessible for some reason, things will
completely break when users clone.

You could say that it's the same issue as when a video is not
available on a web page, but the web browser can still render the page
when a video is not available. So I don't think it's the same kind of
issue.

And by the way that's a reason why I think it's important to think
about this in relation to promisor/partial clone remotes. Because with
them it's less of a big deal if the CDN is unavailable, temporarily or
not, for some reason.

> I think in this situation the user would likely
> (and rightly) blame the host (github.com) for requiring access to a
> separate inaccessible site, and the problem could be resolved with
> them.

The host will say that it's repo admins' responsibility to use a CDN
that works for the repo users (or to pay for more space on the host).
Then repo admins will say that they use this CDN because it's simpler
for them or the only thing they can afford or deal with. (For example
I don't think it would be easy for westerners to use a Chinese CDN.)
Then users will likely blame Git for not supporting a way to use a
different CDN than the one configured in each repo.

> The beauty of this is that it's transparent to the client: the fact
> that packfile transfer was offloaded to a CDN is an implementation
> detail, and the server takes full responsibility for it.

Who is "the server" in real life? Are you sure they would be ok to
take full responsibility?

And yes, I agree that transparency for the client is nice. And if it's
really nice, then why not have it for promisor/partial clone remotes
too? But then do we really need duplicated functionality between
promisor remotes and CDN offloading?

And also I just think that in real life there needs to be an easy way
to override this transparency, and we already have that with promisor
remotes.

> This doesn't stop a hosting provider from using e.g. server options to
> allow the client more control over how their response is served, just
> like can be done for other features of how the transfer works (how
> often to send progress updates, whether to prioritize latency or
> throughput, etc).

Could you give a more concrete example of what could be done?

> What the client *can* do is turn off support for packfile URLs in a
> request completely.  This is required for backward compatibility and
> allows working around a host that has configured the feature
> incorrectly.

If the full content of a repo is really large, the size of a full pack
file sent by an initial clone could be really big and many client
machines could not have enough memory to deal with that. And this
supposes that repo hosting providers would be ok to host very large
repos in the first place.

With promisor remotes, it's less of a problem if for example:

- a repo hosting provider is not ok with very large repos,
- a CDN is unavailable,
- a repo admin has not configured some repos very well.

Thanks for your answer,
Christian.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [WIP 0/7] CDN offloading of fetch response
  2019-02-26  8:30     ` Christian Couder
@ 2019-02-26  9:12       ` Ævar Arnfjörð Bjarmason
  2019-03-04  8:24         ` Christian Couder
  2019-02-28 23:21       ` Jonathan Nieder
  1 sibling, 1 reply; 45+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-02-26  9:12 UTC (permalink / raw)
  To: Christian Couder
  Cc: Jonathan Nieder, Jonathan Tan, git, Junio C Hamano, Jeff King


On Tue, Feb 26 2019, Christian Couder wrote:

> Hi,
>
> On Tue, Feb 26, 2019 at 12:45 AM Jonathan Nieder <jrnieder@gmail.com> wrote:
>>
>> Christian Couder wrote:
>> > On Sun, Feb 24, 2019 at 12:39 AM Jonathan Tan <jonathantanmy@google.com> wrote:
>>
>> > Especially I'd like to know what should the client do if they find out
>> > that for example a repo that contains a lot of large files is
>> > configured so that the large files should be fetched from a CDN that
>> > the client cannot use? Is the client forced to find or setup another
>> > repo configured differently if the client still wants to use CDN
>> > offloading?
>>
>> The example from that message:
>>
>>   For example I think the Great Firewall of China lets people in China
>>   use GitHub.com but not Google.com. So if people start configuring
>>   their repos on GitHub so that they send packs that contain Google.com
>>   CDN URLs (or actually anything that the Firewall blocks), it might
>>   create many problems for users in China if they don't have a way to
>>   opt out of receiving packs with those kind of URLs.
>>
>> But the same thing can happen with redirects, with embedded assets in
>> web pages, and so on.
>
> I don't think it's the same situation, because the CDN offloading is
> likely to be used for large objects that some hosting sites like
> GitHub, GitLab and BitBucket might not be ok with having them stored for
> free on their machines. (I think the current limitations are around
> 10GB or 20GB, everything included, for each user.)
>
> So it's likely that users will want a way to host on such sites
> incomplete repos using CDN offloading to a CDN on another site. And
> then if the CDN is not accessible for some reason, things will
> completely break when users clone.
>
> You could say that it's the same issue as when a video is not
> available on a web page, but the web browser can still render the page
> when a video is not available. So I don't think it's the same kind of
> issue.
>
> And by the way that's a reason why I think it's important to think
> about this in relation to promisor/partial clone remotes. Because with
> them it's less of a big deal if the CDN is unavailable, temporarily or
> not, for some reason.

I think all of that's correct. E.g. you can imagine a CDN where the CDN
serves literally one blob (not a pack), and the server the rest of the
trees/commits/blobs.

But for the purposes of reviewing this I think it's better to say that
we're going to have a limited initial introduction of CDN where those
more complex cases don't need to be handled.

That can always be added later; as far as I can tell from the protocol
alteration in the RFC, nothing's closing the door on that, and we could
always add another capability / protocol extension.
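
E.g. with this series the server's "fetch" capability advertisement
just grows an extra token, something like (the exact list depends on
the server's config)

        fetch=shallow filter sideband-all packfile-uris

and a smarter CDN mode later on could advertise yet another token next
to it without breaking existing clients.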

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [WIP 0/7] CDN offloading of fetch response
  2019-02-26  8:30     ` Christian Couder
  2019-02-26  9:12       ` Ævar Arnfjörð Bjarmason
@ 2019-02-28 23:21       ` Jonathan Nieder
  2019-03-04  8:54         ` Christian Couder
  1 sibling, 1 reply; 45+ messages in thread
From: Jonathan Nieder @ 2019-02-28 23:21 UTC (permalink / raw)
  To: Christian Couder
  Cc: Jonathan Tan, git, Junio C Hamano, Jeff King,
	Ævar Arnfjörð Bjarmason

Hi,

Sorry for the slow followup.  Thanks for probing into the design ---
this should be useful for getting the docs to be clear.

Christian Couder wrote:

> So it's likely that users will want a way to host on such sites
> incomplete repos using CDN offloading to a CDN on another site. And
> then if the CDN is not accessible for some reason, things will
> completely break when users clone.

I think this would be a broken setup --- we can make it clear in the
protocol and server docs that you should only point to a CDN for which
you control the contents, to avoid breaking clients.

That doesn't prevent adding additional features in the future e.g. for
"server suggested alternates" --- it's just that I consider that a
separate feature.

Using CDN offloading requires cooperation of the hosting provider.
It's a way to optimize how fetches work, not a way to have a partial
repository on the server side.

[...]
> On Tue, Feb 26, 2019 at 12:45 AM Jonathan Nieder <jrnieder@gmail.com> wrote:

>> This doesn't stop a hosting provider from using e.g. server options to
>> allow the client more control over how their response is served, just
>> like can be done for other features of how the transfer works (how
>> often to send progress updates, whether to prioritize latency or
>> throughput, etc).
>
> Could you give a more concrete example of what could be done?

What I mean is passing server options using "git fetch --server-option".
For example:

	git fetch -o priority=BATCH origin master

or

	git fetch -o avoid-cdn=badcdn.example.com origin master

The interpretation of server options is up to the server.

>> What the client *can* do is turn off support for packfile URLs in a
>> request completely.  This is required for backward compatibility and
>> allows working around a host that has configured the feature
>> incorrectly.
>
> If the full content of a repo is really large, the size of a full pack
> file sent by an initial clone could be really big and many client
> machines could not have enough memory to deal with that. And this
> supposes that repo hosting providers would be ok to host very large
> repos in the first place.

Do we require the packfile to fit in memory?  If so, we should fix
that (to use streaming instead).

Thanks,
Jonathan

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [WIP 7/7] upload-pack: send part of packfile response as uri
  2019-02-23 23:39 ` [WIP 7/7] upload-pack: send part of packfile response as uri Jonathan Tan
  2019-02-24 15:54   ` Junio C Hamano
  2019-02-25 21:04   ` Christian Couder
@ 2019-03-01  0:09   ` Josh Steadmon
  2019-03-01  0:17     ` Jonathan Tan
  2 siblings, 1 reply; 45+ messages in thread
From: Josh Steadmon @ 2019-03-01  0:09 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, gitster, peff, christian.couder, avarab

On 2019.02.23 15:39, Jonathan Tan wrote:
> Teach upload-pack to send part of its packfile response as URIs.
> 
> An administrator may configure a repository with one or more
> "uploadpack.blobpackfileuri" lines, each line containing an OID and a
> URI. A client may configure fetch.uriprotocols to be a comma-separated
> list of protocols that it is willing to use to fetch additional
> packfiles - this list will be sent to the server. Whenever an object
> with one of those OIDs would appear in the packfile transmitted by
> upload-pack, the server may exclude that object, and instead send the
> URI. The client will then download the packs referred to by those URIs
> before performing the connectivity check.
> 
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
>  builtin/pack-objects.c | 63 ++++++++++++++++++++++++++++++++++
>  fetch-pack.c           | 58 +++++++++++++++++++++++++++++++
>  t/t5702-protocol-v2.sh | 54 +++++++++++++++++++++++++++++
>  upload-pack.c          | 78 ++++++++++++++++++++++++++++++++++++++----
>  4 files changed, 247 insertions(+), 6 deletions(-)
> 
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index a9fac7c128..73309821e2 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -110,6 +110,8 @@ static unsigned long window_memory_limit = 0;
>  
>  static struct list_objects_filter_options filter_options;
>  
> +static struct string_list uri_protocols = STRING_LIST_INIT_NODUP;
> +
>  enum missing_action {
>  	MA_ERROR = 0,      /* fail if any missing objects are encountered */
>  	MA_ALLOW_ANY,      /* silently allow ALL missing objects */
> @@ -118,6 +120,14 @@ enum missing_action {
>  static enum missing_action arg_missing_action;
>  static show_object_fn fn_show_object;
>  
> +struct configured_exclusion {
> +	struct oidmap_entry e;
> +	char *uri;
> +};
> +static struct oidmap configured_exclusions;
> +
> +static struct oidset excluded_by_config;
> +
>  /*
>   * stats
>   */
> @@ -832,6 +842,23 @@ static off_t write_reused_pack(struct hashfile *f)
>  	return reuse_packfile_offset - sizeof(struct pack_header);
>  }
>  
> +static void write_excluded_by_configs(void)
> +{
> +	struct oidset_iter iter;
> +	const struct object_id *oid;
> +
> +	oidset_iter_init(&excluded_by_config, &iter);
> +	while ((oid = oidset_iter_next(&iter))) {
> +		struct configured_exclusion *ex =
> +			oidmap_get(&configured_exclusions, oid);
> +
> +		if (!ex)
> +			BUG("configured exclusion wasn't configured");
> +		write_in_full(1, ex->uri, strlen(ex->uri));
> +		write_in_full(1, "\n", 1);
> +	}
> +}
> +
>  static const char no_split_warning[] = N_(
>  "disabling bitmap writing, packs are split due to pack.packSizeLimit"
>  );
> @@ -1125,6 +1152,25 @@ static int want_object_in_pack(const struct object_id *oid,
>  		}
>  	}
>  
> +	if (uri_protocols.nr) {
> +		struct configured_exclusion *ex =
> +			oidmap_get(&configured_exclusions, oid);
> +		int i;
> +		const char *p;
> +
> +		if (ex) {
> +			for (i = 0; i < uri_protocols.nr; i++) {
> +				if (skip_prefix(ex->uri,
> +						uri_protocols.items[i].string,
> +						&p) &&
> +				    *p == ':') {
> +					oidset_insert(&excluded_by_config, oid);
> +					return 0;
> +				}
> +			}
> +		}
> +	}
> +
>  	return 1;
>  }
>  
> @@ -2726,6 +2772,19 @@ static int git_pack_config(const char *k, const char *v, void *cb)
>  			    pack_idx_opts.version);
>  		return 0;
>  	}
> +	if (!strcmp(k, "uploadpack.blobpackfileuri")) {
> +		struct configured_exclusion *ex = xmalloc(sizeof(*ex));
> +		const char *end;
> +
> +		if (parse_oid_hex(v, &ex->e.oid, &end) || *end != ' ')
> +			die(_("value of uploadpack.blobpackfileuri must be "
> +			      "of the form '<sha-1> <uri>' (got '%s')"), v);
> +		if (oidmap_get(&configured_exclusions, &ex->e.oid))
> +			die(_("object already configured in another "
> +			      "uploadpack.blobpackfileuri (got '%s')"), v);
> +		ex->uri = xstrdup(end + 1);
> +		oidmap_put(&configured_exclusions, ex);
> +	}
>  	return git_default_config(k, v, cb);
>  }
>  
> @@ -3318,6 +3377,9 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
>  			 N_("do not pack objects in promisor packfiles")),
>  		OPT_BOOL(0, "delta-islands", &use_delta_islands,
>  			 N_("respect islands during delta compression")),
> +		OPT_STRING_LIST(0, "uri-protocol", &uri_protocols,
> +				N_("protocol"),
> +				N_("exclude any configured uploadpack.blobpackfileuri with this protocol")),
>  		OPT_END(),
>  	};
>  
> @@ -3492,6 +3554,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
>  		return 0;
>  	if (nr_result)
>  		prepare_pack(window, depth);
> +	write_excluded_by_configs();
>  	write_pack_file();
>  	if (progress)
>  		fprintf_ln(stderr,
> diff --git a/fetch-pack.c b/fetch-pack.c
> index 812be15d7e..edbe4b3ec3 100644
> --- a/fetch-pack.c
> +++ b/fetch-pack.c
> @@ -38,6 +38,7 @@ static struct lock_file shallow_lock;
>  static const char *alternate_shallow_file;
>  static char *negotiation_algorithm;
>  static struct strbuf fsck_msg_types = STRBUF_INIT;
> +static struct string_list uri_protocols = STRING_LIST_INIT_DUP;
>  
>  /* Remember to update object flag allocation in object.h */
>  #define COMPLETE	(1U << 0)
> @@ -1147,6 +1148,26 @@ static int send_fetch_request(struct fetch_negotiator *negotiator, int fd_out,
>  		warning("filtering not recognized by server, ignoring");
>  	}
>  
> +	if (server_supports_feature("fetch", "packfile-uris", 0)) {
> +		int i;
> +		struct strbuf to_send = STRBUF_INIT;
> +
> +		for (i = 0; i < uri_protocols.nr; i++) {
> +			const char *s = uri_protocols.items[i].string;
> +
> +			if (!strcmp(s, "https") || !strcmp(s, "http")) {
> +				if (to_send.len)
> +					strbuf_addch(&to_send, ',');
> +				strbuf_addstr(&to_send, s);
> +			}
> +		}
> +		if (to_send.len) {
> +			packet_buf_write(&req_buf, "packfile-uris %s",
> +					 to_send.buf);
> +			strbuf_release(&to_send);
> +		}
> +	}
> +
>  	/* add wants */
>  	add_wants(args->no_dependents, wants, &req_buf);
>  
> @@ -1322,6 +1343,32 @@ static void receive_wanted_refs(struct packet_reader *reader,
>  		die(_("error processing wanted refs: %d"), reader->status);
>  }
>  
> +static void receive_packfile_uris(struct packet_reader *reader)
> +{
> +	process_section_header(reader, "packfile-uris", 0);
> +	while (packet_reader_read(reader) == PACKET_READ_NORMAL) {
> +		const char *p;
> +		struct child_process cmd = CHILD_PROCESS_INIT;
> +
> +
> +		if (!skip_prefix(reader->line, "uri ", &p))
> +			die("expected 'uri <uri>', got: %s\n", reader->line);
> +
> +		argv_array_push(&cmd.args, "http-fetch");
> +		argv_array_push(&cmd.args, "--packfile");
> +		argv_array_push(&cmd.args, p);
> +		cmd.git_cmd = 1;
> +		cmd.no_stdin = 1;
> +		cmd.no_stdout = 1;
> +		if (start_command(&cmd))
> +			die("fetch-pack: unable to spawn");
> +		if (finish_command(&cmd))
> +			die("fetch-pack: unable to finish");
> +	}
> +	if (reader->status != PACKET_READ_DELIM)
> +		die("expected DELIM");
> +}
> +

So we process the packfile URIs one by one as we receive them from the
server? If we expect these packfiles to be large (otherwise why are we
bothering to offload them to the CDN), is there a risk that the
connection to the server might time out while we're downloading from the
CDN?

>  enum fetch_state {
>  	FETCH_CHECK_LOCAL = 0,
>  	FETCH_SEND_REQUEST,
> @@ -1414,6 +1461,9 @@ static struct ref *do_fetch_pack_v2(struct fetch_pack_args *args,
>  				receive_wanted_refs(&reader, sought, nr_sought);
>  
>  			/* get the pack */
> +			if (process_section_header(&reader, "packfile-uris", 1)) {
> +				receive_packfile_uris(&reader);
> +			}
>  			process_section_header(&reader, "packfile", 0);
>  			if (get_pack(args, fd, pack_lockfile))
>  				die(_("git fetch-pack: fetch failed."));
> @@ -1464,6 +1514,14 @@ static void fetch_pack_config(void)
>  	git_config_get_bool("transfer.fsckobjects", &transfer_fsck_objects);
>  	git_config_get_string("fetch.negotiationalgorithm",
>  			      &negotiation_algorithm);
> +	if (!uri_protocols.nr) {
> +		char *str;
> +
> +		if (!git_config_get_string("fetch.uriprotocols", &str) && str) {
> +			string_list_split(&uri_protocols, str, ',', -1);
> +			free(str);
> +		}
> +	}
>  
>  	git_config(fetch_pack_config_cb, NULL);
>  }
> diff --git a/t/t5702-protocol-v2.sh b/t/t5702-protocol-v2.sh
> index db4ae09f2f..6dbe2e9584 100755
> --- a/t/t5702-protocol-v2.sh
> +++ b/t/t5702-protocol-v2.sh
> @@ -656,6 +656,60 @@ test_expect_success 'when server does not send "ready", expect FLUSH' '
>  	test_i18ngrep "expected no other sections to be sent after no .ready." err
>  '
>  
> +test_expect_success 'part of packfile response provided as URI' '
> +	P="$HTTPD_DOCUMENT_ROOT_PATH/http_parent" &&
> +	rm -rf "$P" http_child log &&
> +
> +	git init "$P" &&
> +	git -C "$P" config "uploadpack.allowsidebandall" "true" &&
> +
> +	echo my-blob >"$P/my-blob" &&
> +	git -C "$P" add my-blob &&
> +	echo other-blob >"$P/other-blob" &&
> +	git -C "$P" add other-blob &&
> +	git -C "$P" commit -m x &&
> +
> +	# Create a packfile for my-blob and configure it for exclusion.
> +	git -C "$P" hash-object my-blob >h &&
> +	git -C "$P" pack-objects --stdout <h \
> +		>"$HTTPD_DOCUMENT_ROOT_PATH/one.pack" &&
> +	git -C "$P" config \
> +		"uploadpack.blobpackfileuri" \
> +		"$(cat h) $HTTPD_URL/dumb/one.pack" &&
> +
> +	# Do the same for other-blob.
> +	git -C "$P" hash-object other-blob >h2 &&
> +	git -C "$P" pack-objects --stdout <h2 \
> +		>"$HTTPD_DOCUMENT_ROOT_PATH/two.pack" &&
> +	git -C "$P" config --add \
> +		"uploadpack.blobpackfileuri" \
> +		"$(cat h2) $HTTPD_URL/dumb/two.pack" &&
> +
> +	GIT_TRACE_PACKET="$(pwd)/log" GIT_TEST_SIDEBAND_ALL=1 \
> +		git -c protocol.version=2 \
> +		-c fetch.uriprotocols=http,https \
> +		clone  "$HTTPD_URL/smart/http_parent" http_child &&
> +
> +	# Ensure that my-blob and other-blob are in separate packfiles.
> +	for idx in http_child/.git/objects/pack/*.idx
> +	do
> +		git verify-pack --verbose $idx >out &&
> +		if test "$(grep "^[0-9a-f]\{40\} " out | wc -l)" = 1
> +		then
> +			if grep $(cat h) out
> +			then
> +				>hfound
> +			fi &&
> +			if grep $(cat h2) out
> +			then
> +				>h2found
> +			fi
> +		fi
> +	done &&
> +	test -f hfound &&
> +	test -f h2found
> +'
> +
>  stop_httpd
>  
>  test_done
> diff --git a/upload-pack.c b/upload-pack.c
> index 987d2e139b..2365b707bc 100644
> --- a/upload-pack.c
> +++ b/upload-pack.c
> @@ -105,9 +105,12 @@ static int write_one_shallow(const struct commit_graft *graft, void *cb_data)
>  struct output_state {
>  	char buffer[8193];
>  	int used;
> +	unsigned packfile_uris_started : 1;
> +	unsigned packfile_started : 1;
>  };
>  
> -static int relay_pack_data(int pack_objects_out, struct output_state *os)
> +static int relay_pack_data(int pack_objects_out, struct output_state *os,
> +			   int write_packfile_line)
>  {
>  	/*
>  	 * We keep the last byte to ourselves
> @@ -128,6 +131,37 @@ static int relay_pack_data(int pack_objects_out, struct output_state *os)
>  	}
>  	os->used += readsz;
>  
> +	while (!os->packfile_started) {
> +		char *p;
> +		if (os->used >= 4 && !memcmp(os->buffer, "PACK", 4)) {
> +			os->packfile_started = 1;
> +			if (write_packfile_line) {
> +				if (os->packfile_uris_started)
> +					packet_delim(1);
> +				packet_write_fmt(1, "\1packfile\n");
> +			}
> +			break;
> +		}
> +		if ((p = memchr(os->buffer, '\n', os->used))) {
> +			if (!os->packfile_uris_started) {
> +				os->packfile_uris_started = 1;
> +				if (!write_packfile_line)
> +					BUG("packfile_uris requires sideband-all");
> +				packet_write_fmt(1, "\1packfile-uris\n");
> +			}
> +			*p = '\0';
> +			packet_write_fmt(1, "\1uri %s\n", os->buffer);
> +
> +			os->used -= p - os->buffer + 1;
> +			memmove(os->buffer, p + 1, os->used);
> +		} else {
> +			/*
> +			 * Incomplete line.
> +			 */
> +			return readsz;
> +		}
> +	}
> +
>  	if (os->used > 1) {
>  		send_client_data(1, os->buffer, os->used - 1);
>  		os->buffer[0] = os->buffer[os->used - 1];
> @@ -141,7 +175,8 @@ static int relay_pack_data(int pack_objects_out, struct output_state *os)
>  }
>  
>  static void create_pack_file(const struct object_array *have_obj,
> -			     const struct object_array *want_obj)
> +			     const struct object_array *want_obj,
> +			     const struct string_list *uri_protocols)
>  {
>  	struct child_process pack_objects = CHILD_PROCESS_INIT;
>  	struct output_state output_state = {0};
> @@ -192,6 +227,11 @@ static void create_pack_file(const struct object_array *have_obj,
>  					 expanded_filter_spec.buf);
>  		}
>  	}
> +	if (uri_protocols) {
> +		for (i = 0; i < uri_protocols->nr; i++)
> +			argv_array_pushf(&pack_objects.args, "--uri-protocol=%s",
> +					 uri_protocols->items[0].string);
> +	}
>  
>  	pack_objects.in = -1;
>  	pack_objects.out = -1;
> @@ -278,7 +318,8 @@ static void create_pack_file(const struct object_array *have_obj,
>  		}
>  		if (0 <= pu && (pfd[pu].revents & (POLLIN|POLLHUP))) {
>  			int result = relay_pack_data(pack_objects.out,
> -						     &output_state);
> +						     &output_state,
> +						     !!uri_protocols);
>  
>  			if (result == 0) {
>  				close(pack_objects.out);
> @@ -1123,7 +1164,7 @@ void upload_pack(struct upload_pack_options *options)
>  	if (want_obj.nr) {
>  		struct object_array have_obj = OBJECT_ARRAY_INIT;
>  		get_common_commits(&reader, &have_obj, &want_obj);
> -		create_pack_file(&have_obj, &want_obj);
> +		create_pack_file(&have_obj, &want_obj, 0);
>  	}
>  }
>  
> @@ -1138,6 +1179,7 @@ struct upload_pack_data {
>  	timestamp_t deepen_since;
>  	int deepen_rev_list;
>  	int deepen_relative;
> +	struct string_list uri_protocols;
>  
>  	struct packet_writer writer;
>  
> @@ -1157,6 +1199,7 @@ static void upload_pack_data_init(struct upload_pack_data *data)
>  	struct oid_array haves = OID_ARRAY_INIT;
>  	struct object_array shallows = OBJECT_ARRAY_INIT;
>  	struct string_list deepen_not = STRING_LIST_INIT_DUP;
> +	struct string_list uri_protocols = STRING_LIST_INIT_DUP;
>  
>  	memset(data, 0, sizeof(*data));
>  	data->wants = wants;
> @@ -1164,6 +1207,7 @@ static void upload_pack_data_init(struct upload_pack_data *data)
>  	data->haves = haves;
>  	data->shallows = shallows;
>  	data->deepen_not = deepen_not;
> +	data->uri_protocols = uri_protocols;
>  	packet_writer_init(&data->writer, 1);
>  }
>  
> @@ -1322,9 +1366,17 @@ static void process_args(struct packet_reader *request,
>  			continue;
>  		}
>  
> +		if (skip_prefix(arg, "packfile-uris ", &p)) {
> +			string_list_split(&data->uri_protocols, p, ',', -1);
> +			continue;
> +		}
> +
>  		/* ignore unknown lines maybe? */
>  		die("unexpected line: '%s'", arg);
>  	}
> +
> +	if (data->uri_protocols.nr && !data->writer.use_sideband)
> +		string_list_clear(&data->uri_protocols, 0);
>  }
>  
>  static int process_haves(struct oid_array *haves, struct oid_array *common,
> @@ -1514,8 +1566,13 @@ int upload_pack_v2(struct repository *r, struct argv_array *keys,
>  			send_wanted_ref_info(&data);
>  			send_shallow_info(&data, &want_obj);
>  
> -			packet_writer_write(&data.writer, "packfile\n");
> -			create_pack_file(&have_obj, &want_obj);
> +			if (data.uri_protocols.nr) {
> +				create_pack_file(&have_obj, &want_obj,
> +						 &data.uri_protocols);
> +			} else {
> +				packet_write_fmt(1, "packfile\n");
> +				create_pack_file(&have_obj, &want_obj, NULL);
> +			}
>  			state = FETCH_DONE;
>  			break;
>  		case FETCH_DONE:
> @@ -1536,6 +1593,7 @@ int upload_pack_advertise(struct repository *r,
>  		int allow_filter_value;
>  		int allow_ref_in_want;
>  		int allow_sideband_all_value;
> +		char *str = NULL;
>  
>  		strbuf_addstr(value, "shallow");
>  
> @@ -1557,6 +1615,14 @@ int upload_pack_advertise(struct repository *r,
>  					   &allow_sideband_all_value) &&
>  		     allow_sideband_all_value))
>  			strbuf_addstr(value, " sideband-all");
> +
> +		if (!repo_config_get_string(the_repository,
> +					    "uploadpack.blobpackfileuri",
> +					    &str) &&
> +		    str) {
> +			strbuf_addstr(value, " packfile-uris");
> +			free(str);
> +		}
>  	}
>  
>  	return 1;
> -- 
> 2.19.0.271.gfe8321ec05.dirty
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [WIP 7/7] upload-pack: send part of packfile response as uri
  2019-03-01  0:09   ` Josh Steadmon
@ 2019-03-01  0:17     ` Jonathan Tan
  0 siblings, 0 replies; 45+ messages in thread
From: Jonathan Tan @ 2019-03-01  0:17 UTC (permalink / raw)
  To: steadmon; +Cc: jonathantanmy, git, gitster, peff, christian.couder, avarab

> So we process the packfile URIs one by one as we receive them from the
> server? If we expect these packfiles to be large (otherwise why are we
> bothering to offload them to the CDN), is there a risk that the
> connection to the server might time out while we're downloading from the
> CDN?

You're right that this is undesirable - this is one of the things that I
will fix, as I mention in the cover letter ("starting CDN downloads...")
[1].

> Please take a look. Feel free to comment on anything, but I prefer
> comments on the major things first (e.g. my usage of a separate process
> (http-fetch) to fetch packfiles, since as far as I know, Git doesn't
> link to libcurl; any of the design decisions I described above). I know
> that there are some implementation details that could be improved (e.g.
> parallelization of the CDN downloads, starting CDN downloads *after*
> closing the first HTTP request, holding on to the .keep locks until
> after the refs are set), but will work on those once the overall design
> is more or less finalized.

[1] https://public-inbox.org/git/20190301000954.GA47591@google.com/

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [WIP 0/7] CDN offloading of fetch response
  2019-02-26  9:12       ` Ævar Arnfjörð Bjarmason
@ 2019-03-04  8:24         ` Christian Couder
  0 siblings, 0 replies; 45+ messages in thread
From: Christian Couder @ 2019-03-04  8:24 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Jonathan Nieder, Jonathan Tan, git, Junio C Hamano, Jeff King

On Tue, Feb 26, 2019 at 10:12 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> On Tue, Feb 26 2019, Christian Couder wrote:
>
> > On Tue, Feb 26, 2019 at 12:45 AM Jonathan Nieder <jrnieder@gmail.com> wrote:

> >> But the same thing can happen with redirects, with embedded assets in
> >> web pages, and so on.
> >
> > I don't think it's the same situation, because the CDN offloading is
> > likely to be used for large objects that some hosting sites like
> > GitHub, GitLab and BitBucket might not be ok with having them stored for
> > free on their machines. (I think the current limitations are around
> > 10GB or 20GB, everything included, for each user.)
> >
> > So it's likely that users will want a way to host on such sites
> > incomplete repos using CDN offloading to a CDN on another site. And
> > then if the CDN is not accessible for some reason, things will
> > completely break when users clone.
> >
> > You could say that it's the same issue as when a video is not
> > available on a web page, but the web browser can still render the page
> > when a video is not available. So I don't think it's the same kind of
> > issue.
> >
> > And by the way that's a reason why I think it's important to think
> > about this in relation to promisor/partial clone remotes. Because with
> > them it's less of a big deal if the CDN is unavailable, temporarily or
> > not, for some reason.
>
> I think all of that's correct. E.g. you can imagine a CDN where the CDN
> serves literally one blob (not a pack), and the server the rest of the
> trees/commits/blobs.
>
> But for the purposes of reviewing this I think it's better to say that
> we're going to have a limited initial introduction of CDN where those
> more complex cases don't need to be handled.
>
> That can always be added later; as far as I can tell from the protocol
> alteration in the RFC, nothing's closing the door on that, and we could
> always add another capability / protocol extension.

Yeah, it doesn't close the door on further improvements. The issue
though is that it doesn't seem to have many benefits over implementing
things in many promisor remotes. The main benefit seems to be that the
secondary server locations are automatically configured. But when
looking at what can happen in the real world, this benefit seems more
like a drawback to me as it potentially creates a lot of problems.

A solution, many promisor remotes, where:

- first secondary server URLs are manually specified on the client
side, and then
- some kind of negotiation, so that they can be automatically
selected, is implemented

seems better to me than a solution, CDN offloading, where:

- first the main server decides the secondary server URLs, and then
- we work around the cases where this creates problems

In the case of CDN offloading it is likely that early client and
server implementations will create problems for many people as long as
most of the workarounds aren't implemented, while in the case of many
promisor remotes there is always the manual solution as long as the
automated selection doesn't work well enough.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [WIP 0/7] CDN offloading of fetch response
  2019-02-28 23:21       ` Jonathan Nieder
@ 2019-03-04  8:54         ` Christian Couder
  0 siblings, 0 replies; 45+ messages in thread
From: Christian Couder @ 2019-03-04  8:54 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Jonathan Tan, git, Junio C Hamano, Jeff King,
	Ævar Arnfjörð Bjarmason

Hi,

On Fri, Mar 1, 2019 at 12:21 AM Jonathan Nieder <jrnieder@gmail.com> wrote:
>
> Sorry for the slow followup.  Thanks for probing into the design ---
> this should be useful for getting the docs to be clear.
>
> Christian Couder wrote:
>
> > So it's likely that users will want a way to host on such sites
> > incomplete repos using CDN offloading to a CDN on another site. And
> > then if the CDN is not accessible for some reason, things will
> > completely break when users clone.
>
> I think this would be a broken setup --- we can make it clear in the
> protocol and server docs that you should only point to a CDN for which
> you control the contents, to avoid breaking clients.

We can say whatever we want in the docs, but in real life if it's
simpler/cheaper for repo admins to use a CDN for example on Google and
a repo on GitHub, they are likely to do it anyway.

> That doesn't prevent adding additional features in the future e.g. for
> "server suggested alternates" --- it's just that I consider that a
> separate feature.
>
> Using CDN offloading requires cooperation of the hosting provider.
> It's a way to optimize how fetches work, not a way to have a partial
> repository on the server side.

We can say whatever we want about what it is for. Users are likely to
use it anyway in the way they think it will benefit them the most.

> > On Tue, Feb 26, 2019 at 12:45 AM Jonathan Nieder <jrnieder@gmail.com> wrote:
>
> >> This doesn't stop a hosting provider from using e.g. server options to
> >> allow the client more control over how their response is served, just
> >> like can be done for other features of how the transfer works (how
> >> often to send progress updates, whether to prioritize latency or
> >> throughput, etc).
> >
> > Could you give a more concrete example of what could be done?
>
> What I mean is passing server options using "git fetch --server-option".
> For example:
>
>         git fetch -o priority=BATCH origin master
>
> or
>
>         git fetch -o avoid-cdn=badcdn.example.com origin master
>
> The interpretation of server options is up to the server.

If you often have to pass things like "-o
avoid-cdn=badcdn.example.com", then how is it better than just
specifying "-o usecdn=goodcdn.example.com" or even better using the
remote mechanism to configure a remote for goodcdn.example.com and
then configuring this remote to be used along the origin remote (which
is what many promisor remotes is about)?
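
With the partial clone / promisor remote machinery (some of it still in
flight) that would look roughly like this, names and URL made up:

        git remote add cdn https://goodcdn.example.com/repo.git &&
        git config remote.cdn.promisor true

so that objects missing from the origin fetch (for example filtered-out
large blobs) could later be fetched from "cdn" on demand.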

> >> What the client *can* do is turn off support for packfile URLs in a
> >> request completely.  This is required for backward compatibility and
> >> allows working around a host that has configured the feature
> >> incorrectly.
> >
> > If the full content of a repo is really large, the size of a full pack
> > file sent by an initial clone could be really big and many client
> > machines could not have enough memory to deal with that. And this
> > supposes that repo hosting providers would be ok to host very large
> > repos in the first place.
>
> Do we require the packfile to fit in memory?  If so, we should fix
> that (to use streaming instead).

Even if we stream the packfile to write it, at one point we have to use it.
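
(For what it's worth, the pack downloaded by http-fetch is fed to "git
index-pack --stdin", something like

        git index-pack --stdin --keep <downloaded.pack

so writing and indexing it are already streamed.)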

And I could be wrong but I think that mmap doesn't work on Windows, so
I think we will just try to read the whole thing into memory. Even on
Linux I don't think it's a good idea to mmap a very large file and
then use some big parts of it which I think we will have to do when
checking out the large files from inside the packfile.

Yeah, we can improve that part of Git too. I think though that it
means yet another thing (and not an easy one) that needs to be
improved before CDN offloading can work well in the real world.

I think that the Git "development philosophy" since the beginning has
been more about adding things that work well in the real world first
even if they are small and a bit manual, and then improving on top of
those early things, rather than adding a big thing that doesn't quite
work well in the real world but is automated and then improving on
that.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v2 0/8] CDN offloading of fetch response
  2019-02-23 23:38 [WIP 0/7] CDN offloading of fetch response Jonathan Tan
                   ` (7 preceding siblings ...)
  2019-02-25 21:30 ` [WIP 0/7] CDN offloading of fetch response Christian Couder
@ 2019-03-08 21:55 ` Jonathan Tan
  2019-03-08 21:55   ` [PATCH v2 1/8] http: use --stdin when getting dumb HTTP pack Jonathan Tan
                     ` (10 more replies)
  8 siblings, 11 replies; 45+ messages in thread
From: Jonathan Tan @ 2019-03-08 21:55 UTC (permalink / raw)
  To: avarab, git; +Cc: Jonathan Tan

Here's my current progress - the only thing that is lacking is more
tests, maybe, so I think it's ready for review.

I wrote in version 1:

> I know
> that there are some implementation details that could be improved (e.g.
> parallelization of the CDN downloads, starting CDN downloads *after*
> closing the first HTTP request, holding on to the .keep locks until
> after the refs are set), but will work on those once the overall design
> is more or less finalized.

This code now starts CDN downloads after closing the first HTTP request,
and it holds on to the .keep files until after the refs are set. I'll
leave parallelization of the CDN downloads for later work.

One relatively significant change: someone pointed out that the issue fixed by 
50d3413740 ("http: make redirects more obvious", 2016-12-06) could also
occur here, so I have changed it so that the server is required to send
the packfile's hash along with the URI.
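
Concretely, the config value now carries the pack hash too, e.g.
(hashes and URL below are placeholders):

        git config --add uploadpack.blobpackfileuri \
                "<blob-oid> <pack-hash> https://cdn.example.com/<pack-hash>.pack"

and each packfile-uris line in the response is "<pack-hash> <uri>"
rather than just the URI.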

This does mean that Ævar's workflow described in [1] would not work.
Quoting Ævar:

> More concretely, I'd like to have a setup where a server can just dumbly
> point to some URL that probably has most of the data, without having any
> idea what OIDs are in it. So that e.g. some machine entirely
> disconnected from the server (and with just a regular clone) can
> continually generate an up-to-date-enough packfile.

With 50d3413740, it seems to me that it's important for the server to
know details about the URIs that it points to, so such a disconnection
would not work.

[1] https://public-inbox.org/git/87k1hv6eel.fsf@evledraar.gmail.com/

Jonathan Tan (8):
  http: use --stdin when getting dumb HTTP pack
  http: improve documentation of http_pack_request
  http-fetch: support fetching packfiles by URL
  Documentation: order protocol v2 sections
  Documentation: add Packfile URIs design doc
  upload-pack: refactor reading of pack-objects out
  fetch-pack: support more than one pack lockfile
  upload-pack: send part of packfile response as uri

 Documentation/git-http-fetch.txt         |   8 +-
 Documentation/technical/packfile-uri.txt |  78 ++++++++++++
 Documentation/technical/protocol-v2.txt  |  44 +++++--
 builtin/fetch-pack.c                     |  17 ++-
 builtin/pack-objects.c                   |  76 +++++++++++
 connected.c                              |   8 +-
 fetch-pack.c                             | 129 +++++++++++++++++--
 fetch-pack.h                             |   2 +-
 http-fetch.c                             |  64 ++++++++--
 http.c                                   |  82 +++++++-----
 http.h                                   |  32 ++++-
 t/t5550-http-fetch-dumb.sh               |  25 ++++
 t/t5702-protocol-v2.sh                   |  57 +++++++++
 transport-helper.c                       |   5 +-
 transport.c                              |  14 +-
 transport.h                              |   6 +-
 upload-pack.c                            | 155 +++++++++++++++++------
 17 files changed, 670 insertions(+), 132 deletions(-)
 create mode 100644 Documentation/technical/packfile-uri.txt

Range-diff against v1:
1:  987c9b686b < -:  ---------- http: use --stdin and --keep when downloading pack
-:  ---------- > 1:  87173d0ad1 http: use --stdin when getting dumb HTTP pack
2:  4b53a6f48c ! 2:  66d31720a0 http: improve documentation of http_pack_request
    @@ -45,4 +45,4 @@
     + */
      extern struct http_pack_request *new_http_pack_request(
      	struct packed_git *target, const char *base_url);
    - int finish_http_pack_request(struct http_pack_request *preq, char **lockfile);
    + extern int finish_http_pack_request(struct http_pack_request *preq);
3:  afe73a282d ! 3:  02373f6afe http-fetch: support fetching packfiles by URL
    @@ -31,7 +31,8 @@
     +--packfile::
     +	Instead of a commit id on the command line (which is not expected in
     +	this case), 'git http-fetch' fetches the packfile directly at the given
    -+	URL and generates the corresponding .idx file.
    ++	URL and uses index-pack to generate corresponding .idx and .keep files.
    ++	The output of index-pack is printed to stdout.
     +
      --recover::
      	Verify that everything reachable from target is fetched.  Used after
    @@ -101,12 +102,12 @@
     +		struct http_pack_request *preq;
     +		struct slot_results results;
     +		int ret;
    -+		char *lockfile;
     +
     +		preq = new_http_pack_request(NULL, url);
     +		if (preq == NULL)
     +			die("couldn't create http pack request");
     +		preq->slot->results = &results;
    ++		preq->generate_keep = 1;
     +
     +		if (start_active_slot(preq->slot)) {
     +			run_active_slot(preq->slot);
    @@ -118,9 +119,8 @@
     +			die("Unable to start request");
     +		}
     +
    -+		if ((ret = finish_http_pack_request(preq, &lockfile)))
    ++		if ((ret = finish_http_pack_request(preq)))
     +			die("finish_http_pack_request gave result %d", ret);
    -+		unlink(lockfile);
     +		release_http_pack_request(preq);
     +		rc = 0;
     +	} else {
    @@ -180,6 +180,20 @@
      	tmpfile_fd = xopen(preq->tmpfile.buf, O_RDONLY);
      
     @@
    + 	argv_array_push(&ip.args, "--stdin");
    + 	ip.git_cmd = 1;
    + 	ip.in = tmpfile_fd;
    +-	ip.no_stdout = 1;
    ++	if (preq->generate_keep) {
    ++		argv_array_pushf(&ip.args, "--keep=git %"PRIuMAX,
    ++				 (uintmax_t)getpid());
    ++		ip.out = 0;
    ++	} else {
    ++		ip.no_stdout = 1;
    ++	}
    + 
    + 	if (run_command(&ip)) {
    + 		ret = -1;
      		goto cleanup;
      	}
      
    @@ -243,6 +257,19 @@
      	 * pack list that target is in. finish_http_pack_request() will remove
      	 * target from lst and call install_packed_git() on target.
      	 */
    + 	struct packed_git **lst;
    + 
    ++	/*
    ++	 * If this is true, finish_http_pack_request() will pass "--keep" to
    ++	 * index-pack, resulting in the creation of a keep file, and will not
    ++	 * suppress its stdout (that is, the "keep\t<hash>\n" line will be
    ++	 * printed to stdout).
    ++	 */
    ++	unsigned generate_keep : 1;
    ++
    + 	/*
    + 	 * State managed by functions in http.c.
    + 	 */
     @@
      };
      
    @@ -269,7 +296,14 @@
     +test_expect_success 'http-fetch --packfile' '
     +	git init packfileclient &&
     +	p=$(cd "$HTTPD_DOCUMENT_ROOT_PATH"/repo_pack.git && ls objects/pack/pack-*.pack) &&
    -+	git -C packfileclient http-fetch --packfile "$HTTPD_URL"/dumb/repo_pack.git/$p &&
    ++	git -C packfileclient http-fetch --packfile "$HTTPD_URL"/dumb/repo_pack.git/$p >out &&
    ++
    ++	# Ensure that the expected files are generated
    ++	grep "^keep.[0-9a-f]\{16,\}$" out &&
    ++	cut -c6- out >packhash &&
    ++	test -e "packfileclient/.git/objects/pack/pack-$(cat packhash).pack" &&
    ++	test -e "packfileclient/.git/objects/pack/pack-$(cat packhash).idx" &&
    ++	test -e "packfileclient/.git/objects/pack/pack-$(cat packhash).keep" &&
     +
     +	# Ensure that it has the HEAD of repo_pack, at least
     +	HASH=$(git -C "$HTTPD_DOCUMENT_ROOT_PATH"/repo_pack.git rev-parse HEAD) &&
4:  15a00cd3e1 = 4:  2e0f2daba0 Documentation: order protocol v2 sections
5:  9df7f6e9a4 ! 5:  5ce56844d3 Documentation: add Packfile URIs design doc
    @@ -64,7 +64,7 @@
     +after the Minimum Viable Product (see "Future work").
     +
     +The client can inhibit this feature (i.e. refrain from sending the
    -+`packfile-urls` parameter) by passing --no-packfile-urls to `git fetch`.
    ++`packfile-uris` parameter) by passing --no-packfile-uris to `git fetch`.
     +
     +Future work
     +-----------
    @@ -93,6 +93,24 @@
      --- a/Documentation/technical/protocol-v2.txt
      +++ b/Documentation/technical/protocol-v2.txt
     @@
    + 	indicating its sideband (1, 2, or 3), and the server may send "0005\2"
    + 	(a PKT-LINE of sideband 2 with no payload) as a keepalive packet.
    + 
    ++If the 'packfile-uris' feature is advertised, the following argument
    ++can be included in the client's request as well as the potential
    ++addition of the 'packfile-uris' section in the server's response as
    ++explained below.
    ++
    ++    packfile-uris <comma-separated list of protocols>
    ++	Indicates to the server that the client is willing to receive
    ++	URIs of any of the given protocols in place of objects in the
    ++	sent packfile. Before performing the connectivity check, the
    ++	client should download from all given URIs. Currently, the
    ++	protocols supported are "http" and "https".
    ++
    + The response of `fetch` is broken into a number of sections separated by
    + delimiter packets (0001), with each section beginning with its section
    + header. Most sections are sent only when the packfile is sent.
      
          output = acknowledgements flush-pkt |
      	     [acknowledgments delim-pkt] [shallow-info delim-pkt]
    @@ -107,8 +125,25 @@
          wanted-ref = obj-id SP refname
      
     +    packfile-uris = PKT-LINE("packfile-uris" LF) *packfile-uri
    -+    packfile-uri = PKT-LINE("uri" SP *%x20-ff LF)
    ++    packfile-uri = PKT-LINE(40*(HEXDIGIT) SP *%x20-ff LF)
     +
          packfile = PKT-LINE("packfile" LF)
      	       *PKT-LINE(%x01-03 *%x00-ff)
      
    +@@
    + 	* The server MUST NOT send any refs which were not requested
    + 	  using 'want-ref' lines.
    + 
    ++    packfile-uris section
    ++	* This section is only included if the client sent
    ++	  'packfile-uris' and the server has at least one such URI to
    ++	  send.
    ++
    ++	* Always begins with the section header "packfile-uris".
    ++
    ++	* For each URI the server sends, it sends a hash of the pack's
    ++	  contents (as output by git index-pack) followed by the URI.
    ++
    +     packfile section
    + 	* This section is only included if the client has sent 'want'
    + 	  lines in its request and either requested that no more
6:  639235562e = 6:  d487f46b0f upload-pack: refactor reading of pack-objects out
-:  ---------- > 7:  f46504a166 fetch-pack: support more than one pack lockfile
7:  0e821b4427 ! 8:  c7546868cf upload-pack: send part of packfile response as uri
    @@ -5,12 +5,12 @@
         Teach upload-pack to send part of its packfile response as URIs.
     
         An administrator may configure a repository with one or more
    -    "uploadpack.blobpackfileuri" lines, each line containing an OID and a
    -    URI. A client may configure fetch.uriprotocols to be a comma-separated
    -    list of protocols that it is willing to use to fetch additional
    -    packfiles - this list will be sent to the server. Whenever an object
    -    with one of those OIDs would appear in the packfile transmitted by
    -    upload-pack, the server may exclude that object, and instead send the
    +    "uploadpack.blobpackfileuri" lines, each line containing an OID, a pack
    +    hash, and a URI. A client may configure fetch.uriprotocols to be a
    +    comma-separated list of protocols that it is willing to use to fetch
    +    additional packfiles - this list will be sent to the server. Whenever an
    +    object with one of those OIDs would appear in the packfile transmitted
    +    by upload-pack, the server may exclude that object, and instead send the
         URI. The client will then download the packs referred to by those URIs
         before performing the connectivity check.
     
    @@ -35,6 +35,7 @@
      
     +struct configured_exclusion {
     +	struct oidmap_entry e;
    ++	char *pack_hash_hex;
     +	char *uri;
     +};
     +static struct oidmap configured_exclusions;
    @@ -60,6 +61,8 @@
     +
     +		if (!ex)
     +			BUG("configured exclusion wasn't configured");
    ++		write_in_full(1, ex->pack_hash_hex, strlen(ex->pack_hash_hex));
    ++		write_in_full(1, " ", 1);
     +		write_in_full(1, ex->uri, strlen(ex->uri));
     +		write_in_full(1, "\n", 1);
     +	}
    @@ -100,15 +103,25 @@
      	}
     +	if (!strcmp(k, "uploadpack.blobpackfileuri")) {
     +		struct configured_exclusion *ex = xmalloc(sizeof(*ex));
    -+		const char *end;
    ++		const char *oid_end, *pack_end;
    ++		/*
    ++		 * Stores the pack hash. This is not a true object ID, but is
    ++		 * of the same form.
    ++		 */
    ++		struct object_id pack_hash;
     +
    -+		if (parse_oid_hex(v, &ex->e.oid, &end) || *end != ' ')
    ++		if (parse_oid_hex(v, &ex->e.oid, &oid_end) ||
    ++		    *oid_end != ' ' ||
    ++		    parse_oid_hex(oid_end + 1, &pack_hash, &pack_end) ||
    ++		    *pack_end != ' ')
     +			die(_("value of uploadpack.blobpackfileuri must be "
    -+			      "of the form '<sha-1> <uri>' (got '%s')"), v);
    ++			      "of the form '<object-hash> <pack-hash> <uri>' (got '%s')"), v);
     +		if (oidmap_get(&configured_exclusions, &ex->e.oid))
     +			die(_("object already configured in another "
     +			      "uploadpack.blobpackfileuri (got '%s')"), v);
    -+		ex->uri = xstrdup(end + 1);
    ++		ex->pack_hash_hex = xcalloc(1, pack_end - oid_end);
    ++		memcpy(ex->pack_hash_hex, oid_end + 1, pack_end - oid_end - 1);
    ++		ex->uri = xstrdup(pack_end + 1);
     +		oidmap_put(&configured_exclusions, ex);
     +	}
      	return git_default_config(k, v, cb);
    @@ -144,6 +157,42 @@
      
      /* Remember to update object flag allocation in object.h */
      #define COMPLETE	(1U << 0)
    +@@
    + }
    + 
    + static int get_pack(struct fetch_pack_args *args,
    +-		    int xd[2], struct string_list *pack_lockfiles)
    ++		    int xd[2], struct string_list *pack_lockfiles,
    ++		    int only_packfile)
    + {
    + 	struct async demux;
    + 	int do_keep = args->keep_pack;
    +@@
    + 					"--keep=fetch-pack %"PRIuMAX " on %s",
    + 					(uintmax_t)getpid(), hostname);
    + 		}
    +-		if (args->check_self_contained_and_connected)
    ++		if (only_packfile && args->check_self_contained_and_connected)
    + 			argv_array_push(&cmd.args, "--check-self-contained-and-connected");
    ++		else
    ++			/*
    ++			 * We cannot perform any connectivity checks because
    ++			 * not all packs have been downloaded; let the caller
    ++			 * have this responsibility.
    ++			 */
    ++			args->check_self_contained_and_connected = 0;
    + 		if (args->from_promisor)
    + 			argv_array_push(&cmd.args, "--promisor");
    + 	}
    +@@
    + 		alternate_shallow_file = setup_temporary_shallow(si->shallow);
    + 	else
    + 		alternate_shallow_file = NULL;
    +-	if (get_pack(args, fd, pack_lockfiles))
    ++	if (get_pack(args, fd, pack_lockfiles, 1))
    + 		die(_("git fetch-pack: fetch failed."));
    + 
    +  all_done:
     @@
      		warning("filtering not recognized by server, ignoring");
      	}
    @@ -175,27 +224,16 @@
      		die(_("error processing wanted refs: %d"), reader->status);
      }
      
    -+static void receive_packfile_uris(struct packet_reader *reader)
    ++static void receive_packfile_uris(struct packet_reader *reader,
    ++				  struct string_list *uris)
     +{
     +	process_section_header(reader, "packfile-uris", 0);
     +	while (packet_reader_read(reader) == PACKET_READ_NORMAL) {
    -+		const char *p;
    -+		struct child_process cmd = CHILD_PROCESS_INIT;
    -+
    -+
    -+		if (!skip_prefix(reader->line, "uri ", &p))
    -+			die("expected 'uri <uri>', got: %s\n", reader->line);
    ++		if (reader->pktlen < the_hash_algo->hexsz ||
    ++		    reader->line[the_hash_algo->hexsz] != ' ')
    ++			die("expected '<hash> <uri>', got: %s\n", reader->line);
     +
    -+		argv_array_push(&cmd.args, "http-fetch");
    -+		argv_array_push(&cmd.args, "--packfile");
    -+		argv_array_push(&cmd.args, p);
    -+		cmd.git_cmd = 1;
    -+		cmd.no_stdin = 1;
    -+		cmd.no_stdout = 1;
    -+		if (start_command(&cmd))
    -+			die("fetch-pack: unable to spawn");
    -+		if (finish_command(&cmd))
    -+			die("fetch-pack: unable to finish");
    ++		string_list_append(uris, reader->line);
     +	}
     +	if (reader->status != PACKET_READ_DELIM)
     +		die("expected DELIM");
    @@ -205,15 +243,81 @@
      	FETCH_CHECK_LOCAL = 0,
      	FETCH_SEND_REQUEST,
     @@
    + 	int in_vain = 0;
    + 	int haves_to_send = INITIAL_FLUSH;
    + 	struct fetch_negotiator negotiator;
    ++	struct string_list packfile_uris = STRING_LIST_INIT_DUP;
    ++	int i;
    ++
    + 	fetch_negotiator_init(&negotiator, negotiation_algorithm);
    + 	packet_reader_init(&reader, fd[0], NULL, 0,
    + 			   PACKET_READ_CHOMP_NEWLINE |
    +@@
    + 			if (process_section_header(&reader, "wanted-refs", 1))
      				receive_wanted_refs(&reader, sought, nr_sought);
      
    - 			/* get the pack */
    -+			if (process_section_header(&reader, "packfile-uris", 1)) {
    -+				receive_packfile_uris(&reader);
    -+			}
    +-			/* get the pack */
    ++			/* get the pack(s) */
    ++			if (process_section_header(&reader, "packfile-uris", 1))
    ++				receive_packfile_uris(&reader, &packfile_uris);
      			process_section_header(&reader, "packfile", 0);
    - 			if (get_pack(args, fd, pack_lockfile))
    +-			if (get_pack(args, fd, pack_lockfiles))
    ++			if (get_pack(args, fd, pack_lockfiles,
    ++				     !packfile_uris.nr))
      				die(_("git fetch-pack: fetch failed."));
    + 
    + 			state = FETCH_DONE;
    +@@
    + 		}
    + 	}
    + 
    ++	for (i = 0; i < packfile_uris.nr; i++) {
    ++		struct child_process cmd = CHILD_PROCESS_INIT;
    ++		char packname[GIT_MAX_HEXSZ + 1];
    ++		const char *uri = packfile_uris.items[i].string +
    ++			the_hash_algo->hexsz + 1;
    ++
    ++		argv_array_push(&cmd.args, "http-fetch");
    ++		argv_array_push(&cmd.args, "--packfile");
    ++		argv_array_push(&cmd.args, uri);
    ++		cmd.git_cmd = 1;
    ++		cmd.no_stdin = 1;
    ++		cmd.out = -1;
    ++		if (start_command(&cmd))
    ++			die("fetch-pack: unable to spawn http-fetch");
    ++
    ++		if (read_in_full(cmd.out, packname, 5) < 0 ||
    ++		    memcmp(packname, "keep\t", 5))
    ++			die("fetch-pack: expected keep then TAB at start of http-fetch output");
    ++
    ++		if (read_in_full(cmd.out, packname,
    ++				 the_hash_algo->hexsz + 1) < 0 ||
    ++		    packname[the_hash_algo->hexsz] != '\n')
    ++			die("fetch-pack: expected hash then LF at end of http-fetch output");
    ++
    ++		packname[the_hash_algo->hexsz] = '\0';
    ++
    ++		close(cmd.out);
    ++
    ++		if (finish_command(&cmd))
    ++			die("fetch-pack: unable to finish http-fetch");
    ++
    ++		if (memcmp(packfile_uris.items[i].string, packname,
    ++			   the_hash_algo->hexsz))
    ++			die("fetch-pack: pack downloaded from %s does not match expected hash %.*s",
    ++			    uri, (int) the_hash_algo->hexsz,
    ++			    packfile_uris.items[i].string);
    ++
    ++		string_list_append_nodup(pack_lockfiles,
    ++					 xstrfmt("%s/pack/pack-%s.keep",
    ++						 get_object_directory(),
    ++						 packname));
    ++	}
    ++	string_list_clear(&packfile_uris, 0);
    ++
    + 	negotiator.release(&negotiator);
    + 	oidset_clear(&common);
    + 	return ref;
     @@
      	git_config_get_bool("transfer.fsckobjects", &transfer_fsck_objects);
      	git_config_get_string("fetch.negotiationalgorithm",
    @@ -237,6 +341,15 @@
      	test_i18ngrep "expected no other sections to be sent after no .ready." err
      '
      
    ++configure_exclusion () {
    ++	git -C "$1" hash-object "$2" >objh &&
    ++	git -C "$1" pack-objects "$HTTPD_DOCUMENT_ROOT_PATH/mypack" <objh >packh &&
    ++	git -C "$1" config --add \
    ++		"uploadpack.blobpackfileuri" \
    ++		"$(cat objh) $(cat packh) $HTTPD_URL/dumb/mypack-$(cat packh).pack" &&
    ++	cat objh
    ++}
    ++
     +test_expect_success 'part of packfile response provided as URI' '
     +	P="$HTTPD_DOCUMENT_ROOT_PATH/http_parent" &&
     +	rm -rf "$P" http_child log &&
    @@ -250,32 +363,22 @@
     +	git -C "$P" add other-blob &&
     +	git -C "$P" commit -m x &&
     +
    -+	# Create a packfile for my-blob and configure it for exclusion.
    -+	git -C "$P" hash-object my-blob >h &&
    -+	git -C "$P" pack-objects --stdout <h \
    -+		>"$HTTPD_DOCUMENT_ROOT_PATH/one.pack" &&
    -+	git -C "$P" config \
    -+		"uploadpack.blobpackfileuri" \
    -+		"$(cat h) $HTTPD_URL/dumb/one.pack" &&
    -+
    -+	# Do the same for other-blob.
    -+	git -C "$P" hash-object other-blob >h2 &&
    -+	git -C "$P" pack-objects --stdout <h2 \
    -+		>"$HTTPD_DOCUMENT_ROOT_PATH/two.pack" &&
    -+	git -C "$P" config --add \
    -+		"uploadpack.blobpackfileuri" \
    -+		"$(cat h2) $HTTPD_URL/dumb/two.pack" &&
    ++	configure_exclusion "$P" my-blob >h &&
    ++	configure_exclusion "$P" other-blob >h2 &&
     +
    -+	GIT_TRACE_PACKET="$(pwd)/log" GIT_TEST_SIDEBAND_ALL=1 \
    -+		git -c protocol.version=2 \
    ++	GIT_TRACE=1 GIT_TRACE_PACKET="$(pwd)/log" GIT_TEST_SIDEBAND_ALL=1 \
    ++	git -c protocol.version=2 \
     +		-c fetch.uriprotocols=http,https \
    -+		clone  "$HTTPD_URL/smart/http_parent" http_child &&
    ++		clone "$HTTPD_URL/smart/http_parent" http_child &&
     +
     +	# Ensure that my-blob and other-blob are in separate packfiles.
     +	for idx in http_child/.git/objects/pack/*.idx
     +	do
     +		git verify-pack --verbose $idx >out &&
    -+		if test "$(grep "^[0-9a-f]\{40\} " out | wc -l)" = 1
    ++		{
    ++			grep "^[0-9a-f]\{16,\} " out || :
    ++		} >out.objectlist &&
    ++		if test_line_count = 1 out.objectlist
     +		then
     +			if grep $(cat h) out
     +			then
    @@ -288,7 +391,11 @@
     +		fi
     +	done &&
     +	test -f hfound &&
    -+	test -f h2found
    ++	test -f h2found &&
    ++
    ++	# Ensure that there are exactly 6 files (3 .pack and 3 .idx).
    ++	ls http_child/.git/objects/pack/* >filelist &&
    ++	test_line_count = 6 filelist
     +'
     +
      stop_httpd
    @@ -335,7 +442,7 @@
     +				packet_write_fmt(1, "\1packfile-uris\n");
     +			}
     +			*p = '\0';
    -+			packet_write_fmt(1, "\1uri %s\n", os->buffer);
    ++			packet_write_fmt(1, "\1%s\n", os->buffer);
     +
     +			os->used -= p - os->buffer + 1;
     +			memmove(os->buffer, p + 1, os->used);
8:  2c31d71166 < -:  ---------- SQUASH???
-- 
2.19.0.271.gfe8321ec05.dirty


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v2 1/8] http: use --stdin when getting dumb HTTP pack
  2019-03-08 21:55 ` [PATCH v2 0/8] " Jonathan Tan
@ 2019-03-08 21:55   ` Jonathan Tan
  2019-03-08 21:55   ` [PATCH v2 2/8] http: improve documentation of http_pack_request Jonathan Tan
                     ` (9 subsequent siblings)
  10 siblings, 0 replies; 45+ messages in thread
From: Jonathan Tan @ 2019-03-08 21:55 UTC (permalink / raw)
  To: avarab, git; +Cc: Jonathan Tan, Junio C Hamano

When Git fetches a pack using dumb HTTP, it reuses the server's name for
the packfile (which incorporates a hash); this differs from the behavior
of fetch-pack and receive-pack.

A subsequent patch will allow downloading packs over HTTP(S) as part of
a fetch. These downloads will not necessarily be from a Git repository,
and thus may not have a hash as part of their names.

Thus, teach http to pass --stdin to index-pack, so that we have no
reliance on the server's name for the packfile.
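
For illustration (not part of this patch, and the filename below is
hypothetical), the dumb-HTTP code path now does the moral equivalent of
running the following inside the repository:

    # Feed the downloaded pack to index-pack on stdin; index-pack writes
    # pack-<hash>.pack and pack-<hash>.idx under .git/objects/pack/,
    # deriving <hash> from the pack contents rather than trusting the
    # server-supplied filename.
    git index-pack --stdin <downloaded.pack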

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 http.c | 33 +++++++++++----------------------
 1 file changed, 11 insertions(+), 22 deletions(-)

diff --git a/http.c b/http.c
index a32ad36ddf..34f82af87c 100644
--- a/http.c
+++ b/http.c
@@ -2204,9 +2204,9 @@ int finish_http_pack_request(struct http_pack_request *preq)
 {
 	struct packed_git **lst;
 	struct packed_git *p = preq->target;
-	char *tmp_idx;
-	size_t len;
 	struct child_process ip = CHILD_PROCESS_INIT;
+	int tmpfile_fd;
+	int ret = 0;
 
 	close_pack_index(p);
 
@@ -2218,35 +2218,24 @@ int finish_http_pack_request(struct http_pack_request *preq)
 		lst = &((*lst)->next);
 	*lst = (*lst)->next;
 
-	if (!strip_suffix(preq->tmpfile.buf, ".pack.temp", &len))
-		BUG("pack tmpfile does not end in .pack.temp?");
-	tmp_idx = xstrfmt("%.*s.idx.temp", (int)len, preq->tmpfile.buf);
+	tmpfile_fd = xopen(preq->tmpfile.buf, O_RDONLY);
 
 	argv_array_push(&ip.args, "index-pack");
-	argv_array_pushl(&ip.args, "-o", tmp_idx, NULL);
-	argv_array_push(&ip.args, preq->tmpfile.buf);
+	argv_array_push(&ip.args, "--stdin");
 	ip.git_cmd = 1;
-	ip.no_stdin = 1;
+	ip.in = tmpfile_fd;
 	ip.no_stdout = 1;
 
 	if (run_command(&ip)) {
-		unlink(preq->tmpfile.buf);
-		unlink(tmp_idx);
-		free(tmp_idx);
-		return -1;
-	}
-
-	unlink(sha1_pack_index_name(p->sha1));
-
-	if (finalize_object_file(preq->tmpfile.buf, sha1_pack_name(p->sha1))
-	 || finalize_object_file(tmp_idx, sha1_pack_index_name(p->sha1))) {
-		free(tmp_idx);
-		return -1;
+		ret = -1;
+		goto cleanup;
 	}
 
 	install_packed_git(the_repository, p);
-	free(tmp_idx);
-	return 0;
+cleanup:
+	close(tmpfile_fd);
+	unlink(preq->tmpfile.buf);
+	return ret;
 }
 
 struct http_pack_request *new_http_pack_request(
-- 
2.19.0.271.gfe8321ec05.dirty


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v2 2/8] http: improve documentation of http_pack_request
  2019-03-08 21:55 ` [PATCH v2 0/8] " Jonathan Tan
  2019-03-08 21:55   ` [PATCH v2 1/8] http: use --stdin when getting dumb HTTP pack Jonathan Tan
@ 2019-03-08 21:55   ` Jonathan Tan
  2019-03-08 21:55   ` [PATCH v2 3/8] http-fetch: support fetching packfiles by URL Jonathan Tan
                     ` (8 subsequent siblings)
  10 siblings, 0 replies; 45+ messages in thread
From: Jonathan Tan @ 2019-03-08 21:55 UTC (permalink / raw)
  To: avarab, git; +Cc: Jonathan Tan, Junio C Hamano

struct http_pack_request and the functions that use it will be modified
in a subsequent patch. Using it is complicated: callers must invoke the
initialization function and then set some, but not all, fields in the
returned struct. Add some documentation to help future users.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 http.h | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/http.h b/http.h
index 4eb4e808e5..ded1edcca4 100644
--- a/http.h
+++ b/http.h
@@ -202,14 +202,31 @@ extern int http_get_info_packs(const char *base_url,
 	struct packed_git **packs_head);
 
 struct http_pack_request {
+	/*
+	 * Initialized by new_http_pack_request().
+	 */
 	char *url;
 	struct packed_git *target;
+	struct active_request_slot *slot;
+
+	/*
+	 * After calling new_http_pack_request(), point lst to the head of the
+	 * pack list that target is in. finish_http_pack_request() will remove
+	 * target from lst and call install_packed_git() on target.
+	 */
 	struct packed_git **lst;
+
+	/*
+	 * State managed by functions in http.c.
+	 */
 	FILE *packfile;
 	struct strbuf tmpfile;
-	struct active_request_slot *slot;
 };
 
+/*
+ * target must be an element in a pack list obtained from
+ * http_get_info_packs().
+ */
 extern struct http_pack_request *new_http_pack_request(
 	struct packed_git *target, const char *base_url);
 extern int finish_http_pack_request(struct http_pack_request *preq);
-- 
2.19.0.271.gfe8321ec05.dirty


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v2 3/8] http-fetch: support fetching packfiles by URL
  2019-03-08 21:55 ` [PATCH v2 0/8] " Jonathan Tan
  2019-03-08 21:55   ` [PATCH v2 1/8] http: use --stdin when getting dumb HTTP pack Jonathan Tan
  2019-03-08 21:55   ` [PATCH v2 2/8] http: improve documentation of http_pack_request Jonathan Tan
@ 2019-03-08 21:55   ` Jonathan Tan
  2019-03-08 21:55   ` [PATCH v2 4/8] Documentation: order protocol v2 sections Jonathan Tan
                     ` (7 subsequent siblings)
  10 siblings, 0 replies; 45+ messages in thread
From: Jonathan Tan @ 2019-03-08 21:55 UTC (permalink / raw)
  To: avarab, git; +Cc: Jonathan Tan, Junio C Hamano

Teach http-fetch the ability to download packfiles directly, given a
URL, and to verify them.

The http_pack_request suite of functions has been modified to support a
NULL target. When target is NULL, the given URL is downloaded directly
instead of being treated as the root of a repository.
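
As a sketch of the new mode (the URL below is hypothetical):

    git init client &&
    git -C client http-fetch --packfile \
        "https://example.com/packs/pack-1234abcd.pack" >out
    # index-pack's output is passed through to stdout (e.g. "keep\t<hash>"),
    # and pack-<hash>.pack, .idx and .keep appear under
    # client/.git/objects/pack/.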

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 Documentation/git-http-fetch.txt |  8 +++-
 http-fetch.c                     | 64 +++++++++++++++++++++++++-------
 http.c                           | 49 +++++++++++++++++-------
 http.h                           | 19 ++++++++--
 t/t5550-http-fetch-dumb.sh       | 25 +++++++++++++
 5 files changed, 135 insertions(+), 30 deletions(-)

diff --git a/Documentation/git-http-fetch.txt b/Documentation/git-http-fetch.txt
index 666b042679..8357359a9b 100644
--- a/Documentation/git-http-fetch.txt
+++ b/Documentation/git-http-fetch.txt
@@ -9,7 +9,7 @@ git-http-fetch - Download from a remote Git repository via HTTP
 SYNOPSIS
 --------
 [verse]
-'git http-fetch' [-c] [-t] [-a] [-d] [-v] [-w filename] [--recover] [--stdin] <commit> <url>
+'git http-fetch' [-c] [-t] [-a] [-d] [-v] [-w filename] [--recover] [--stdin | --packfile | <commit>] <url>
 
 DESCRIPTION
 -----------
@@ -40,6 +40,12 @@ commit-id::
 
 		<commit-id>['\t'<filename-as-in--w>]
 
+--packfile::
+	Instead of a commit id on the command line (which is not expected in
+	this case), 'git http-fetch' fetches the packfile directly at the given
+	URL and uses index-pack to generate corresponding .idx and .keep files.
+	The output of index-pack is printed to stdout.
+
 --recover::
 	Verify that everything reachable from target is fetched.  Used after
 	an earlier fetch is interrupted.
diff --git a/http-fetch.c b/http-fetch.c
index a32ac118d9..a9764d6f96 100644
--- a/http-fetch.c
+++ b/http-fetch.c
@@ -5,7 +5,7 @@
 #include "walker.h"
 
 static const char http_fetch_usage[] = "git http-fetch "
-"[-c] [-t] [-a] [-v] [--recover] [-w ref] [--stdin] commit-id url";
+"[-c] [-t] [-a] [-v] [--recover] [-w ref] [--stdin | --packfile | commit-id] url";
 
 int cmd_main(int argc, const char **argv)
 {
@@ -19,6 +19,7 @@ int cmd_main(int argc, const char **argv)
 	int rc = 0;
 	int get_verbosely = 0;
 	int get_recover = 0;
+	int packfile = 0;
 
 	while (arg < argc && argv[arg][0] == '-') {
 		if (argv[arg][1] == 't') {
@@ -35,43 +36,80 @@ int cmd_main(int argc, const char **argv)
 			get_recover = 1;
 		} else if (!strcmp(argv[arg], "--stdin")) {
 			commits_on_stdin = 1;
+		} else if (!strcmp(argv[arg], "--packfile")) {
+			packfile = 1;
 		}
 		arg++;
 	}
-	if (argc != arg + 2 - commits_on_stdin)
+	if (argc != arg + 2 - (commits_on_stdin || packfile))
 		usage(http_fetch_usage);
 	if (commits_on_stdin) {
 		commits = walker_targets_stdin(&commit_id, &write_ref);
+	} else if (packfile) {
+		/* URL will be set later */
 	} else {
 		commit_id = (char **) &argv[arg++];
 		commits = 1;
 	}
 
-	if (argv[arg])
-		str_end_url_with_slash(argv[arg], &url);
+	if (packfile) {
+		url = xstrdup(argv[arg]);
+	} else {
+		if (argv[arg])
+			str_end_url_with_slash(argv[arg], &url);
+	}
 
 	setup_git_directory();
 
 	git_config(git_default_config, NULL);
 
 	http_init(NULL, url, 0);
-	walker = get_http_walker(url);
-	walker->get_verbosely = get_verbosely;
-	walker->get_recover = get_recover;
 
-	rc = walker_fetch(walker, commits, commit_id, write_ref, url);
+	if (packfile) {
+		struct http_pack_request *preq;
+		struct slot_results results;
+		int ret;
+
+		preq = new_http_pack_request(NULL, url);
+		if (preq == NULL)
+			die("couldn't create http pack request");
+		preq->slot->results = &results;
+		preq->generate_keep = 1;
+
+		if (start_active_slot(preq->slot)) {
+			run_active_slot(preq->slot);
+			if (results.curl_result != CURLE_OK) {
+				die("Unable to get pack file %s\n%s", preq->url,
+				    curl_errorstr);
+			}
+		} else {
+			die("Unable to start request");
+		}
+
+		if ((ret = finish_http_pack_request(preq)))
+			die("finish_http_pack_request gave result %d", ret);
+		release_http_pack_request(preq);
+		rc = 0;
+	} else {
+		walker = get_http_walker(url);
+		walker->get_verbosely = get_verbosely;
+		walker->get_recover = get_recover;
+
+		rc = walker_fetch(walker, commits, commit_id, write_ref, url);
 
-	if (commits_on_stdin)
-		walker_targets_free(commits, commit_id, write_ref);
+		if (commits_on_stdin)
+			walker_targets_free(commits, commit_id, write_ref);
 
-	if (walker->corrupt_object_found) {
-		fprintf(stderr,
+		if (walker->corrupt_object_found) {
+			fprintf(stderr,
 "Some loose object were found to be corrupt, but they might be just\n"
 "a false '404 Not Found' error message sent with incorrect HTTP\n"
 "status code.  Suggest running 'git fsck'.\n");
+		}
+
+		walker_free(walker);
 	}
 
-	walker_free(walker);
 	http_cleanup();
 
 	free(url);
diff --git a/http.c b/http.c
index 34f82af87c..b372e61520 100644
--- a/http.c
+++ b/http.c
@@ -2208,15 +2208,18 @@ int finish_http_pack_request(struct http_pack_request *preq)
 	int tmpfile_fd;
 	int ret = 0;
 
-	close_pack_index(p);
+	if (p)
+		close_pack_index(p);
 
 	fclose(preq->packfile);
 	preq->packfile = NULL;
 
-	lst = preq->lst;
-	while (*lst != p)
-		lst = &((*lst)->next);
-	*lst = (*lst)->next;
+	if (p) {
+		lst = preq->lst;
+		while (*lst != p)
+			lst = &((*lst)->next);
+		*lst = (*lst)->next;
+	}
 
 	tmpfile_fd = xopen(preq->tmpfile.buf, O_RDONLY);
 
@@ -2224,14 +2227,21 @@ int finish_http_pack_request(struct http_pack_request *preq)
 	argv_array_push(&ip.args, "--stdin");
 	ip.git_cmd = 1;
 	ip.in = tmpfile_fd;
-	ip.no_stdout = 1;
+	if (preq->generate_keep) {
+		argv_array_pushf(&ip.args, "--keep=git %"PRIuMAX,
+				 (uintmax_t)getpid());
+		ip.out = 0;
+	} else {
+		ip.no_stdout = 1;
+	}
 
 	if (run_command(&ip)) {
 		ret = -1;
 		goto cleanup;
 	}
 
-	install_packed_git(the_repository, p);
+	if (p)
+		install_packed_git(the_repository, p);
 cleanup:
 	close(tmpfile_fd);
 	unlink(preq->tmpfile.buf);
@@ -2249,12 +2259,24 @@ struct http_pack_request *new_http_pack_request(
 	strbuf_init(&preq->tmpfile, 0);
 	preq->target = target;
 
-	end_url_with_slash(&buf, base_url);
-	strbuf_addf(&buf, "objects/pack/pack-%s.pack",
-		sha1_to_hex(target->sha1));
-	preq->url = strbuf_detach(&buf, NULL);
+	if (target) {
+		end_url_with_slash(&buf, base_url);
+		strbuf_addf(&buf, "objects/pack/pack-%s.pack",
+			sha1_to_hex(target->sha1));
+		preq->url = strbuf_detach(&buf, NULL);
+	} else {
+		preq->url = xstrdup(base_url);
+	}
+
+	if (target) {
+		strbuf_addf(&preq->tmpfile, "%s.temp",
+			    sha1_pack_name(target->sha1));
+	} else {
+		strbuf_addf(&preq->tmpfile, "%s/pack/pack-", get_object_directory());
+		strbuf_addstr_urlencode(&preq->tmpfile, base_url, 1);
+		strbuf_addstr(&preq->tmpfile, ".temp");
+	}
 
-	strbuf_addf(&preq->tmpfile, "%s.temp", sha1_pack_name(target->sha1));
 	preq->packfile = fopen(preq->tmpfile.buf, "a");
 	if (!preq->packfile) {
 		error("Unable to open local file %s for pack",
@@ -2278,7 +2300,8 @@ struct http_pack_request *new_http_pack_request(
 		if (http_is_verbose)
 			fprintf(stderr,
 				"Resuming fetch of pack %s at byte %"PRIuMAX"\n",
-				sha1_to_hex(target->sha1), (uintmax_t)prev_posn);
+				target ? sha1_to_hex(target->sha1) : base_url,
+				(uintmax_t)prev_posn);
 		http_opt_request_remainder(preq->slot->curl, prev_posn);
 	}
 
diff --git a/http.h b/http.h
index ded1edcca4..8f63c76e09 100644
--- a/http.h
+++ b/http.h
@@ -210,12 +210,21 @@ struct http_pack_request {
 	struct active_request_slot *slot;
 
 	/*
-	 * After calling new_http_pack_request(), point lst to the head of the
+	 * After calling new_http_pack_request(), if fetching a pack that
+	 * http_get_info_packs() told us about, point lst to the head of the
 	 * pack list that target is in. finish_http_pack_request() will remove
 	 * target from lst and call install_packed_git() on target.
 	 */
 	struct packed_git **lst;
 
+	/*
+	 * If this is true, finish_http_pack_request() will pass "--keep" to
+	 * index-pack, resulting in the creation of a keep file, and will not
+	 * suppress its stdout (that is, the "keep\t<hash>\n" line will be
+	 * printed to stdout).
+	 */
+	unsigned generate_keep : 1;
+
 	/*
 	 * State managed by functions in http.c.
 	 */
@@ -224,8 +233,12 @@ struct http_pack_request {
 };
 
 /*
- * target must be an element in a pack list obtained from
- * http_get_info_packs().
+ * If fetching a pack that http_get_info_packs() told us about, set target to
+ * an element in a pack list obtained from http_get_info_packs(). The actual
+ * URL fetched will be base_url followed by a suffix with the hash of the pack.
+ *
+ * Otherwise, set target to NULL. The actual URL fetched will be base_url
+ * itself.
  */
 extern struct http_pack_request *new_http_pack_request(
 	struct packed_git *target, const char *base_url);
diff --git a/t/t5550-http-fetch-dumb.sh b/t/t5550-http-fetch-dumb.sh
index 6d7d88ccc9..84235fd6d4 100755
--- a/t/t5550-http-fetch-dumb.sh
+++ b/t/t5550-http-fetch-dumb.sh
@@ -199,6 +199,23 @@ test_expect_success 'fetch packed objects' '
 	git clone $HTTPD_URL/dumb/repo_pack.git
 '
 
+test_expect_success 'http-fetch --packfile' '
+	git init packfileclient &&
+	p=$(cd "$HTTPD_DOCUMENT_ROOT_PATH"/repo_pack.git && ls objects/pack/pack-*.pack) &&
+	git -C packfileclient http-fetch --packfile "$HTTPD_URL"/dumb/repo_pack.git/$p >out &&
+
+	# Ensure that the expected files are generated
+	grep "^keep.[0-9a-f]\{16,\}$" out &&
+	cut -c6- out >packhash &&
+	test -e "packfileclient/.git/objects/pack/pack-$(cat packhash).pack" &&
+	test -e "packfileclient/.git/objects/pack/pack-$(cat packhash).idx" &&
+	test -e "packfileclient/.git/objects/pack/pack-$(cat packhash).keep" &&
+
+	# Ensure that it has the HEAD of repo_pack, at least
+	HASH=$(git -C "$HTTPD_DOCUMENT_ROOT_PATH"/repo_pack.git rev-parse HEAD) &&
+	git -C packfileclient cat-file -e "$HASH"
+'
+
 test_expect_success 'fetch notices corrupt pack' '
 	cp -R "$HTTPD_DOCUMENT_ROOT_PATH"/repo_pack.git "$HTTPD_DOCUMENT_ROOT_PATH"/repo_bad1.git &&
 	(cd "$HTTPD_DOCUMENT_ROOT_PATH"/repo_bad1.git &&
@@ -214,6 +231,14 @@ test_expect_success 'fetch notices corrupt pack' '
 	)
 '
 
+test_expect_success 'http-fetch --packfile with corrupt pack' '
+	rm -rf packfileclient &&
+	git init packfileclient &&
+	p=$(cd "$HTTPD_DOCUMENT_ROOT_PATH"/repo_bad1.git && ls objects/pack/pack-*.pack) &&
+	test_must_fail git -C packfileclient http-fetch --packfile \
+		"$HTTPD_URL"/dumb/repo_bad1.git/$p
+'
+
 test_expect_success 'fetch notices corrupt idx' '
 	cp -R "$HTTPD_DOCUMENT_ROOT_PATH"/repo_pack.git "$HTTPD_DOCUMENT_ROOT_PATH"/repo_bad2.git &&
 	(cd "$HTTPD_DOCUMENT_ROOT_PATH"/repo_bad2.git &&
-- 
2.19.0.271.gfe8321ec05.dirty


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v2 4/8] Documentation: order protocol v2 sections
  2019-03-08 21:55 ` [PATCH v2 0/8] " Jonathan Tan
                     ` (2 preceding siblings ...)
  2019-03-08 21:55   ` [PATCH v2 3/8] http-fetch: support fetching packfiles by URL Jonathan Tan
@ 2019-03-08 21:55   ` Jonathan Tan
  2019-03-08 21:55   ` [PATCH v2 5/8] Documentation: add Packfile URIs design doc Jonathan Tan
                     ` (6 subsequent siblings)
  10 siblings, 0 replies; 45+ messages in thread
From: Jonathan Tan @ 2019-03-08 21:55 UTC (permalink / raw)
  To: avarab, git; +Cc: Jonathan Tan, Junio C Hamano

The current C Git implementation expects Git servers to follow a
specific order of sections when transmitting protocol v2 responses, but
this is not explicit in the documentation. Make the order explicit.
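
One way to observe this ordering against a live server (a sketch,
assuming a protocol v2 capable remote named "origin") is to trace the
pkt-lines of a fetch:

    GIT_TRACE_PACKET=1 git -c protocol.version=2 fetch origin 2>trace &&
    grep -E "acknowledgments|shallow-info|wanted-refs|packfile" trace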

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 Documentation/technical/protocol-v2.txt | 18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/Documentation/technical/protocol-v2.txt b/Documentation/technical/protocol-v2.txt
index ead85ce35c..36239ec7e9 100644
--- a/Documentation/technical/protocol-v2.txt
+++ b/Documentation/technical/protocol-v2.txt
@@ -325,11 +325,11 @@ included in the client's request:
 
 The response of `fetch` is broken into a number of sections separated by
 delimiter packets (0001), with each section beginning with its section
-header.
+header. Most sections are sent only when the packfile is sent.
 
-    output = *section
-    section = (acknowledgments | shallow-info | wanted-refs | packfile)
-	      (flush-pkt | delim-pkt)
+    output = acknowledgements flush-pkt |
+	     [acknowledgments delim-pkt] [shallow-info delim-pkt]
+	     [wanted-refs delim-pkt] packfile flush-pkt
 
     acknowledgments = PKT-LINE("acknowledgments" LF)
 		      (nak | *ack)
@@ -351,9 +351,10 @@ header.
 	       *PKT-LINE(%x01-03 *%x00-ff)
 
     acknowledgments section
-	* If the client determines that it is finished with negotiations
-	  by sending a "done" line, the acknowledgments sections MUST be
-	  omitted from the server's response.
+	* If the client determines that it is finished with negotiations by
+	  sending a "done" line (thus requiring the server to send a packfile),
+	  the acknowledgments sections MUST be omitted from the server's
+	  response.
 
 	* Always begins with the section header "acknowledgments"
 
@@ -404,9 +405,6 @@ header.
 	  which the client has not indicated was shallow as a part of
 	  its request.
 
-	* This section is only included if a packfile section is also
-	  included in the response.
-
     wanted-refs section
 	* This section is only included if the client has requested a
 	  ref using a 'want-ref' line and if a packfile section is also
-- 
2.19.0.271.gfe8321ec05.dirty


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v2 5/8] Documentation: add Packfile URIs design doc
  2019-03-08 21:55 ` [PATCH v2 0/8] " Jonathan Tan
                     ` (3 preceding siblings ...)
  2019-03-08 21:55   ` [PATCH v2 4/8] Documentation: order protocol v2 sections Jonathan Tan
@ 2019-03-08 21:55   ` Jonathan Tan
  2019-04-23  5:31     ` Jeff King
  2019-03-08 21:55   ` [PATCH v2 6/8] upload-pack: refactor reading of pack-objects out Jonathan Tan
                     ` (5 subsequent siblings)
  10 siblings, 1 reply; 45+ messages in thread
From: Jonathan Tan @ 2019-03-08 21:55 UTC (permalink / raw)
  To: avarab, git; +Cc: Jonathan Tan, Junio C Hamano

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 Documentation/technical/packfile-uri.txt | 78 ++++++++++++++++++++++++
 Documentation/technical/protocol-v2.txt  | 28 ++++++++-
 2 files changed, 105 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/technical/packfile-uri.txt

diff --git a/Documentation/technical/packfile-uri.txt b/Documentation/technical/packfile-uri.txt
new file mode 100644
index 0000000000..6a5a6440d5
--- /dev/null
+++ b/Documentation/technical/packfile-uri.txt
@@ -0,0 +1,78 @@
+Packfile URIs
+=============
+
+This feature allows servers to serve part of their packfile response as URIs.
+This allows server designs that improve scalability in bandwidth and CPU usage
+(for example, by serving some data through a CDN), and (in the future) provides
+some measure of resumability to clients.
+
+This feature is available only in protocol version 2.
+
+Protocol
+--------
+
+The server advertises `packfile-uris`.
+
+If the client then communicates which protocols (HTTPS, etc.) it supports with
+a `packfile-uris` argument, the server MAY send a `packfile-uris` section
+directly before the `packfile` section (right after `wanted-refs` if it is
+sent) containing URIs of any of the given protocols. The URIs point to
+packfiles that use only features that the client has declared that it supports
+(e.g. ofs-delta and thin-pack). See protocol-v2.txt for the documentation of
+this section.
+
+Clients then should understand that the returned packfile could be incomplete,
+and that it needs to download all the given URIs before the fetch or clone is
+complete.
+
+Server design
+-------------
+
+The server can be trivially made compatible with the proposed protocol by
+having it advertise `packfile-uris`, tolerating the client sending
+`packfile-uris`, and never sending any `packfile-uris` section. But we should
+include some sort of non-trivial implementation in the Minimum Viable Product,
+at least so that we can test the client.
+
+This is the implementation: a feature, marked experimental, that allows the
+server to be configured by one or more `uploadpack.blobPackfileUri=<sha1>
+<uri>` entries. Whenever the list of objects to be sent is assembled, a blob
+with the given sha1 can be replaced by the given URI. This allows, for example,
+servers to delegate serving of large blobs to CDNs.
+
+Client design
+-------------
+
+While fetching, the client needs to remember the list of URIs and cannot
+declare that the fetch is complete until all URIs have been downloaded as
+packfiles.
+
+The division of work (initial fetch + additional URIs) introduces convenient
+points for resumption of an interrupted clone - such resumption can be done
+after the Minimum Viable Product (see "Future work").
+
+The client can inhibit this feature (i.e. refrain from sending the
+`packfile-uris` parameter) by passing --no-packfile-uris to `git fetch`.
+
+Future work
+-----------
+
+The protocol design allows some evolution of the server and client without any
+need for protocol changes, so only a small-scoped design is included here to
+form the MVP. For example, the following can be done:
+
+ * On the server, a long-running process that takes in entire requests and
+   outputs a list of URIs and the corresponding inclusion and exclusion sets of
+   objects. This allows, e.g., signed URIs to be used and packfiles for common
+   requests to be cached.
+ * On the client, resumption of clone. If a clone is interrupted, information
+   could be recorded in the repository's config and a "clone-resume" command
+   can resume the clone in progress. (Resumption of subsequent fetches is more
+   difficult because that must deal with the user wanting to use the repository
+   even after the fetch was interrupted.)
+
+There are some possible features that will require a change in protocol:
+
+ * Additional HTTP headers (e.g. authentication)
+ * Byte range support
+ * Different file formats referenced by URIs (e.g. raw object)
diff --git a/Documentation/technical/protocol-v2.txt b/Documentation/technical/protocol-v2.txt
index 36239ec7e9..7b63c26ecd 100644
--- a/Documentation/technical/protocol-v2.txt
+++ b/Documentation/technical/protocol-v2.txt
@@ -323,13 +323,26 @@ included in the client's request:
 	indicating its sideband (1, 2, or 3), and the server may send "0005\2"
 	(a PKT-LINE of sideband 2 with no payload) as a keepalive packet.
 
+If the 'packfile-uris' feature is advertised, the following argument
+can be included in the client's request as well as the potential
+addition of the 'packfile-uris' section in the server's response as
+explained below.
+
+    packfile-uris <comma-separated list of protocols>
+	Indicates to the server that the client is willing to receive
+	URIs of any of the given protocols in place of objects in the
+	sent packfile. Before performing the connectivity check, the
+	client should download from all given URIs. Currently, the
+	protocols supported are "http" and "https".
+
 The response of `fetch` is broken into a number of sections separated by
 delimiter packets (0001), with each section beginning with its section
 header. Most sections are sent only when the packfile is sent.
 
     output = acknowledgements flush-pkt |
 	     [acknowledgments delim-pkt] [shallow-info delim-pkt]
-	     [wanted-refs delim-pkt] packfile flush-pkt
+	     [wanted-refs delim-pkt] [packfile-uris delim-pkt]
+	     packfile flush-pkt
 
     acknowledgments = PKT-LINE("acknowledgments" LF)
 		      (nak | *ack)
@@ -347,6 +360,9 @@ header. Most sections are sent only when the packfile is sent.
 		  *PKT-LINE(wanted-ref LF)
     wanted-ref = obj-id SP refname
 
+    packfile-uris = PKT-LINE("packfile-uris" LF) *packfile-uri
+    packfile-uri = PKT-LINE(40*(HEXDIGIT) SP *%x20-ff LF)
+
     packfile = PKT-LINE("packfile" LF)
 	       *PKT-LINE(%x01-03 *%x00-ff)
 
@@ -418,6 +434,16 @@ header. Most sections are sent only when the packfile is sent.
 	* The server MUST NOT send any refs which were not requested
 	  using 'want-ref' lines.
 
+    packfile-uris section
+	* This section is only included if the client sent
+	  'packfile-uris' and the server has at least one such URI to
+	  send.
+
+	* Always begins with the section header "packfile-uris".
+
+	* For each URI the server sends, it sends a hash of the pack's
+	  contents (as output by git index-pack) followed by the URI.
+
     packfile section
 	* This section is only included if the client has sent 'want'
 	  lines in its request and either requested that no more
-- 
2.19.0.271.gfe8321ec05.dirty


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v2 6/8] upload-pack: refactor reading of pack-objects out
  2019-03-08 21:55 ` [PATCH v2 0/8] " Jonathan Tan
                     ` (4 preceding siblings ...)
  2019-03-08 21:55   ` [PATCH v2 5/8] Documentation: add Packfile URIs design doc Jonathan Tan
@ 2019-03-08 21:55   ` Jonathan Tan
  2019-03-08 21:55   ` [PATCH v2 7/8] fetch-pack: support more than one pack lockfile Jonathan Tan
                     ` (4 subsequent siblings)
  10 siblings, 0 replies; 45+ messages in thread
From: Jonathan Tan @ 2019-03-08 21:55 UTC (permalink / raw)
  To: avarab, git; +Cc: Jonathan Tan, Junio C Hamano

Subsequent patches will change how the output of pack-objects is
processed, so extract that processing into its own function.

Currently, at most 1 character can be buffered (in the "buffered" local
variable). One of those patches will require a larger buffer, so replace
that "buffered" local variable with a buffer array.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 upload-pack.c | 81 ++++++++++++++++++++++++++++++---------------------
 1 file changed, 47 insertions(+), 34 deletions(-)

diff --git a/upload-pack.c b/upload-pack.c
index d098ef5982..987d2e139b 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -102,14 +102,52 @@ static int write_one_shallow(const struct commit_graft *graft, void *cb_data)
 	return 0;
 }
 
+struct output_state {
+	char buffer[8193];
+	int used;
+};
+
+static int relay_pack_data(int pack_objects_out, struct output_state *os)
+{
+	/*
+	 * We keep the last byte to ourselves
+	 * in case we detect broken rev-list, so that we
+	 * can leave the stream corrupted.  This is
+	 * unfortunate -- unpack-objects would happily
+	 * accept a valid packdata with trailing garbage,
+	 * so appending garbage after we pass all the
+	 * pack data is not good enough to signal
+	 * breakage to downstream.
+	 */
+	ssize_t readsz;
+
+	readsz = xread(pack_objects_out, os->buffer + os->used,
+		       sizeof(os->buffer) - os->used);
+	if (readsz < 0) {
+		return readsz;
+	}
+	os->used += readsz;
+
+	if (os->used > 1) {
+		send_client_data(1, os->buffer, os->used - 1);
+		os->buffer[0] = os->buffer[os->used - 1];
+		os->used = 1;
+	} else {
+		send_client_data(1, os->buffer, os->used);
+		os->used = 0;
+	}
+
+	return readsz;
+}
+
 static void create_pack_file(const struct object_array *have_obj,
 			     const struct object_array *want_obj)
 {
 	struct child_process pack_objects = CHILD_PROCESS_INIT;
-	char data[8193], progress[128];
+	struct output_state output_state = {0};
+	char progress[128];
 	char abort_msg[] = "aborting due to possible repository "
 		"corruption on the remote side.";
-	int buffered = -1;
 	ssize_t sz;
 	int i;
 	FILE *pipe_fd;
@@ -239,39 +277,15 @@ static void create_pack_file(const struct object_array *have_obj,
 			continue;
 		}
 		if (0 <= pu && (pfd[pu].revents & (POLLIN|POLLHUP))) {
-			/* Data ready; we keep the last byte to ourselves
-			 * in case we detect broken rev-list, so that we
-			 * can leave the stream corrupted.  This is
-			 * unfortunate -- unpack-objects would happily
-			 * accept a valid packdata with trailing garbage,
-			 * so appending garbage after we pass all the
-			 * pack data is not good enough to signal
-			 * breakage to downstream.
-			 */
-			char *cp = data;
-			ssize_t outsz = 0;
-			if (0 <= buffered) {
-				*cp++ = buffered;
-				outsz++;
-			}
-			sz = xread(pack_objects.out, cp,
-				  sizeof(data) - outsz);
-			if (0 < sz)
-				;
-			else if (sz == 0) {
+			int result = relay_pack_data(pack_objects.out,
+						     &output_state);
+
+			if (result == 0) {
 				close(pack_objects.out);
 				pack_objects.out = -1;
-			}
-			else
+			} else if (result < 0) {
 				goto fail;
-			sz += outsz;
-			if (1 < sz) {
-				buffered = data[sz-1] & 0xFF;
-				sz--;
 			}
-			else
-				buffered = -1;
-			send_client_data(1, data, sz);
 		}
 
 		/*
@@ -296,9 +310,8 @@ static void create_pack_file(const struct object_array *have_obj,
 	}
 
 	/* flush the data */
-	if (0 <= buffered) {
-		data[0] = buffered;
-		send_client_data(1, data, 1);
+	if (output_state.used > 0) {
+		send_client_data(1, output_state.buffer, output_state.used);
 		fprintf(stderr, "flushed.\n");
 	}
 	if (use_sideband)
-- 
2.19.0.271.gfe8321ec05.dirty


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v2 7/8] fetch-pack: support more than one pack lockfile
  2019-03-08 21:55 ` [PATCH v2 0/8] " Jonathan Tan
                     ` (5 preceding siblings ...)
  2019-03-08 21:55   ` [PATCH v2 6/8] upload-pack: refactor reading of pack-objects out Jonathan Tan
@ 2019-03-08 21:55   ` Jonathan Tan
  2019-03-08 21:55   ` [PATCH v2 8/8] upload-pack: send part of packfile response as uri Jonathan Tan
                     ` (3 subsequent siblings)
  10 siblings, 0 replies; 45+ messages in thread
From: Jonathan Tan @ 2019-03-08 21:55 UTC (permalink / raw)
  To: avarab, git; +Cc: Jonathan Tan

Whenever a fetch results in a packfile being downloaded, a .keep file is
generated, so that the packfile can be preserved (from, say, a running
"git repack") until refs are written referring to the contents of the
packfile.

In a subsequent patch, a successful fetch using protocol v2 may result
in more than one .keep file being generated. Therefore, teach
fetch_pack() and the transport mechanism to support multiple .keep
files.

Implementation notes:

 - builtin/fetch-pack.c normally does not generate .keep files, and thus
   is unaffected by this or future changes. However, it has an
   undocumented "--lock-pack" feature, used by remote-curl.c when
   implementing the "fetch" remote helper command. In keeping with the
   remote helper protocol, only one "lock" line will ever be written;
   the rest will result in warnings to stderr. However, in practice,
   warnings will never be written because the remote-curl.c "fetch" is
   only used for protocol v0/v1 (which will not generate multiple .keep
   files). (Protocol v2 uses the "stateless-connect" command, not the
   "fetch" command.)

 - connected.c has an optimization in that connectivity checks on a ref
   need not be done if the target object is in a pack known to be
   self-contained and connected. If there are multiple packfiles, this
   optimization can no longer be done.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 builtin/fetch-pack.c | 17 +++++++++++------
 connected.c          |  8 +++++---
 fetch-pack.c         | 23 ++++++++++++-----------
 fetch-pack.h         |  2 +-
 transport-helper.c   |  5 +++--
 transport.c          | 14 ++++++++------
 transport.h          |  6 +++---
 7 files changed, 43 insertions(+), 32 deletions(-)

diff --git a/builtin/fetch-pack.c b/builtin/fetch-pack.c
index 153a2bd282..709fc4c44b 100644
--- a/builtin/fetch-pack.c
+++ b/builtin/fetch-pack.c
@@ -48,8 +48,8 @@ int cmd_fetch_pack(int argc, const char **argv, const char *prefix)
 	struct ref **sought = NULL;
 	int nr_sought = 0, alloc_sought = 0;
 	int fd[2];
-	char *pack_lockfile = NULL;
-	char **pack_lockfile_ptr = NULL;
+	struct string_list pack_lockfiles = STRING_LIST_INIT_DUP;
+	struct string_list *pack_lockfiles_ptr = NULL;
 	struct child_process *conn;
 	struct fetch_pack_args args;
 	struct oid_array shallow = OID_ARRAY_INIT;
@@ -134,7 +134,7 @@ int cmd_fetch_pack(int argc, const char **argv, const char *prefix)
 		}
 		if (!strcmp("--lock-pack", arg)) {
 			args.lock_pack = 1;
-			pack_lockfile_ptr = &pack_lockfile;
+			pack_lockfiles_ptr = &pack_lockfiles;
 			continue;
 		}
 		if (!strcmp("--check-self-contained-and-connected", arg)) {
@@ -235,10 +235,15 @@ int cmd_fetch_pack(int argc, const char **argv, const char *prefix)
 	}
 
 	ref = fetch_pack(&args, fd, conn, ref, dest, sought, nr_sought,
-			 &shallow, pack_lockfile_ptr, version);
-	if (pack_lockfile) {
-		printf("lock %s\n", pack_lockfile);
+			 &shallow, pack_lockfiles_ptr, version);
+	if (pack_lockfiles.nr) {
+		int i;
+
+		printf("lock %s\n", pack_lockfiles.items[0].string);
 		fflush(stdout);
+		for (i = 1; i < pack_lockfiles.nr; i++)
+			warning(_("Lockfile created but not reported: %s"),
+				pack_lockfiles.items[i].string);
 	}
 	if (args.check_self_contained_and_connected &&
 	    args.self_contained_and_connected) {
diff --git a/connected.c b/connected.c
index 1bba888eff..a866fcf9e8 100644
--- a/connected.c
+++ b/connected.c
@@ -40,10 +40,12 @@ int check_connected(oid_iterate_fn fn, void *cb_data,
 
 	if (transport && transport->smart_options &&
 	    transport->smart_options->self_contained_and_connected &&
-	    transport->pack_lockfile &&
-	    strip_suffix(transport->pack_lockfile, ".keep", &base_len)) {
+	    transport->pack_lockfiles.nr == 1 &&
+	    strip_suffix(transport->pack_lockfiles.items[0].string,
+			 ".keep", &base_len)) {
 		struct strbuf idx_file = STRBUF_INIT;
-		strbuf_add(&idx_file, transport->pack_lockfile, base_len);
+		strbuf_add(&idx_file, transport->pack_lockfiles.items[0].string,
+			   base_len);
 		strbuf_addstr(&idx_file, ".idx");
 		new_pack = add_packed_git(idx_file.buf, idx_file.len, 1);
 		strbuf_release(&idx_file);
diff --git a/fetch-pack.c b/fetch-pack.c
index 812be15d7e..cf89af21d9 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -755,7 +755,7 @@ static int sideband_demux(int in, int out, void *data)
 }
 
 static int get_pack(struct fetch_pack_args *args,
-		    int xd[2], char **pack_lockfile)
+		    int xd[2], struct string_list *pack_lockfiles)
 {
 	struct async demux;
 	int do_keep = args->keep_pack;
@@ -798,7 +798,7 @@ static int get_pack(struct fetch_pack_args *args,
 	}
 
 	if (do_keep || args->from_promisor) {
-		if (pack_lockfile)
+		if (pack_lockfiles)
 			cmd.out = -1;
 		cmd_name = "index-pack";
 		argv_array_push(&cmd.args, cmd_name);
@@ -853,8 +853,9 @@ static int get_pack(struct fetch_pack_args *args,
 	cmd.git_cmd = 1;
 	if (start_command(&cmd))
 		die(_("fetch-pack: unable to fork off %s"), cmd_name);
-	if (do_keep && pack_lockfile) {
-		*pack_lockfile = index_pack_lockfile(cmd.out);
+	if (do_keep && pack_lockfiles) {
+		string_list_append_nodup(pack_lockfiles,
+					 index_pack_lockfile(cmd.out));
 		close(cmd.out);
 	}
 
@@ -886,7 +887,7 @@ static struct ref *do_fetch_pack(struct fetch_pack_args *args,
 				 const struct ref *orig_ref,
 				 struct ref **sought, int nr_sought,
 				 struct shallow_info *si,
-				 char **pack_lockfile)
+				 struct string_list *pack_lockfiles)
 {
 	struct ref *ref = copy_ref_list(orig_ref);
 	struct object_id oid;
@@ -992,7 +993,7 @@ static struct ref *do_fetch_pack(struct fetch_pack_args *args,
 		alternate_shallow_file = setup_temporary_shallow(si->shallow);
 	else
 		alternate_shallow_file = NULL;
-	if (get_pack(args, fd, pack_lockfile))
+	if (get_pack(args, fd, pack_lockfiles))
 		die(_("git fetch-pack: fetch failed."));
 
  all_done:
@@ -1334,7 +1335,7 @@ static struct ref *do_fetch_pack_v2(struct fetch_pack_args *args,
 				    int fd[2],
 				    const struct ref *orig_ref,
 				    struct ref **sought, int nr_sought,
-				    char **pack_lockfile)
+				    struct string_list *pack_lockfiles)
 {
 	struct ref *ref = copy_ref_list(orig_ref);
 	enum fetch_state state = FETCH_CHECK_LOCAL;
@@ -1415,7 +1416,7 @@ static struct ref *do_fetch_pack_v2(struct fetch_pack_args *args,
 
 			/* get the pack */
 			process_section_header(&reader, "packfile", 0);
-			if (get_pack(args, fd, pack_lockfile))
+			if (get_pack(args, fd, pack_lockfiles))
 				die(_("git fetch-pack: fetch failed."));
 
 			state = FETCH_DONE;
@@ -1617,7 +1618,7 @@ struct ref *fetch_pack(struct fetch_pack_args *args,
 		       const char *dest,
 		       struct ref **sought, int nr_sought,
 		       struct oid_array *shallow,
-		       char **pack_lockfile,
+		       struct string_list *pack_lockfiles,
 		       enum protocol_version version)
 {
 	struct ref *ref_cpy;
@@ -1648,10 +1649,10 @@ struct ref *fetch_pack(struct fetch_pack_args *args,
 	prepare_shallow_info(&si, shallow);
 	if (version == protocol_v2)
 		ref_cpy = do_fetch_pack_v2(args, fd, ref, sought, nr_sought,
-					   pack_lockfile);
+					   pack_lockfiles);
 	else
 		ref_cpy = do_fetch_pack(args, fd, ref, sought, nr_sought,
-					&si, pack_lockfile);
+					&si, pack_lockfiles);
 	reprepare_packed_git(the_repository);
 
 	if (!args->cloning && args->deepen) {
diff --git a/fetch-pack.h b/fetch-pack.h
index 43ec344d95..47af3c1d1e 100644
--- a/fetch-pack.h
+++ b/fetch-pack.h
@@ -84,7 +84,7 @@ struct ref *fetch_pack(struct fetch_pack_args *args,
 		       struct ref **sought,
 		       int nr_sought,
 		       struct oid_array *shallow,
-		       char **pack_lockfile,
+		       struct string_list *pack_lockfiles,
 		       enum protocol_version version);
 
 /*
diff --git a/transport-helper.c b/transport-helper.c
index 1f52c95fd8..94c0e37778 100644
--- a/transport-helper.c
+++ b/transport-helper.c
@@ -397,10 +397,11 @@ static int fetch_with_fetch(struct transport *transport,
 
 		if (starts_with(buf.buf, "lock ")) {
 			const char *name = buf.buf + 5;
-			if (transport->pack_lockfile)
+			if (transport->pack_lockfiles.nr)
 				warning(_("%s also locked %s"), data->name, name);
 			else
-				transport->pack_lockfile = xstrdup(name);
+				string_list_append(&transport->pack_lockfiles,
+						   name);
 		}
 		else if (data->check_connectivity &&
 			 data->transport_options.check_self_contained_and_connected &&
diff --git a/transport.c b/transport.c
index e078812897..80c7903fa7 100644
--- a/transport.c
+++ b/transport.c
@@ -359,14 +359,14 @@ static int fetch_refs_via_pack(struct transport *transport,
 		refs = fetch_pack(&args, data->fd, data->conn,
 				  refs_tmp ? refs_tmp : transport->remote_refs,
 				  dest, to_fetch, nr_heads, &data->shallow,
-				  &transport->pack_lockfile, data->version);
+				  &transport->pack_lockfiles, data->version);
 		break;
 	case protocol_v1:
 	case protocol_v0:
 		refs = fetch_pack(&args, data->fd, data->conn,
 				  refs_tmp ? refs_tmp : transport->remote_refs,
 				  dest, to_fetch, nr_heads, &data->shallow,
-				  &transport->pack_lockfile, data->version);
+				  &transport->pack_lockfiles, data->version);
 		break;
 	case protocol_unknown_version:
 		BUG("unknown protocol version");
@@ -909,6 +909,7 @@ struct transport *transport_get(struct remote *remote, const char *url)
 	struct transport *ret = xcalloc(1, sizeof(*ret));
 
 	ret->progress = isatty(2);
+	string_list_init(&ret->pack_lockfiles, 1);
 
 	if (!remote)
 		BUG("No remote provided to transport_get()");
@@ -1302,10 +1303,11 @@ int transport_fetch_refs(struct transport *transport, struct ref *refs)
 
 void transport_unlock_pack(struct transport *transport)
 {
-	if (transport->pack_lockfile) {
-		unlink_or_warn(transport->pack_lockfile);
-		FREE_AND_NULL(transport->pack_lockfile);
-	}
+	int i;
+
+	for (i = 0; i < transport->pack_lockfiles.nr; i++)
+		unlink_or_warn(transport->pack_lockfiles.items[i].string);
+	string_list_clear(&transport->pack_lockfiles, 0);
 }
 
 int transport_connect(struct transport *transport, const char *name,
diff --git a/transport.h b/transport.h
index f2ee7c4f49..e430677217 100644
--- a/transport.h
+++ b/transport.h
@@ -5,8 +5,7 @@
 #include "run-command.h"
 #include "remote.h"
 #include "list-objects-filter-options.h"
-
-struct string_list;
+#include "string-list.h"
 
 struct git_transport_options {
 	unsigned thin : 1;
@@ -98,7 +97,8 @@ struct transport {
 	 */
 	const struct string_list *server_options;
 
-	char *pack_lockfile;
+	struct string_list pack_lockfiles;
+
 	signed verbose : 3;
 	/**
 	 * Transports should not set this directly, and should use this
-- 
2.19.0.271.gfe8321ec05.dirty


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v2 8/8] upload-pack: send part of packfile response as uri
  2019-03-08 21:55 ` [PATCH v2 0/8] " Jonathan Tan
                     ` (6 preceding siblings ...)
  2019-03-08 21:55   ` [PATCH v2 7/8] fetch-pack: support more than one pack lockfile Jonathan Tan
@ 2019-03-08 21:55   ` Jonathan Tan
  2019-03-19 20:48   ` [PATCH v2 0/8] CDN offloading of fetch response Josh Steadmon
                     ` (2 subsequent siblings)
  10 siblings, 0 replies; 45+ messages in thread
From: Jonathan Tan @ 2019-03-08 21:55 UTC (permalink / raw)
  To: avarab, git; +Cc: Jonathan Tan, Junio C Hamano

Teach upload-pack to send part of its packfile response as URIs.

An administrator may configure a repository with one or more
"uploadpack.blobpackfileuri" lines, each line containing an OID, a pack
hash, and a URI. A client may configure fetch.uriprotocols to be a
comma-separated list of protocols that it is willing to use to fetch
additional packfiles - this list will be sent to the server. Whenever an
object with one of those OIDs would appear in the packfile transmitted
by upload-pack, the server may exclude that object, and instead send the
URI. The client will then download the packs referred to by those URIs
before performing the connectivity check.
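
A minimal end-to-end sketch of such a configuration (all hashes, hosts
and paths below are placeholders):

    # On the server: offload one blob to a pre-generated pack hosted
    # elsewhere.
    git -C server.git config --add uploadpack.blobpackfileuri \
        "<blob-oid> <pack-hash> https://cdn.example.com/pack-<pack-hash>.pack"

    # On the client: declare which protocols may be used for packfile URIs.
    git -c protocol.version=2 -c fetch.uriprotocols=http,https \
        clone https://example.com/server.git client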

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 builtin/pack-objects.c |  76 ++++++++++++++++++++++++++++
 fetch-pack.c           | 112 +++++++++++++++++++++++++++++++++++++++--
 t/t5702-protocol-v2.sh |  57 +++++++++++++++++++++
 upload-pack.c          |  78 +++++++++++++++++++++++++---
 4 files changed, 312 insertions(+), 11 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index a9fac7c128..2fa962c87d 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -110,6 +110,8 @@ static unsigned long window_memory_limit = 0;
 
 static struct list_objects_filter_options filter_options;
 
+static struct string_list uri_protocols = STRING_LIST_INIT_NODUP;
+
 enum missing_action {
 	MA_ERROR = 0,      /* fail if any missing objects are encountered */
 	MA_ALLOW_ANY,      /* silently allow ALL missing objects */
@@ -118,6 +120,15 @@ enum missing_action {
 static enum missing_action arg_missing_action;
 static show_object_fn fn_show_object;
 
+struct configured_exclusion {
+	struct oidmap_entry e;
+	char *pack_hash_hex;
+	char *uri;
+};
+static struct oidmap configured_exclusions;
+
+static struct oidset excluded_by_config;
+
 /*
  * stats
  */
@@ -832,6 +843,25 @@ static off_t write_reused_pack(struct hashfile *f)
 	return reuse_packfile_offset - sizeof(struct pack_header);
 }
 
+static void write_excluded_by_configs(void)
+{
+	struct oidset_iter iter;
+	const struct object_id *oid;
+
+	oidset_iter_init(&excluded_by_config, &iter);
+	while ((oid = oidset_iter_next(&iter))) {
+		struct configured_exclusion *ex =
+			oidmap_get(&configured_exclusions, oid);
+
+		if (!ex)
+			BUG("configured exclusion wasn't configured");
+		write_in_full(1, ex->pack_hash_hex, strlen(ex->pack_hash_hex));
+		write_in_full(1, " ", 1);
+		write_in_full(1, ex->uri, strlen(ex->uri));
+		write_in_full(1, "\n", 1);
+	}
+}
+
 static const char no_split_warning[] = N_(
 "disabling bitmap writing, packs are split due to pack.packSizeLimit"
 );
@@ -1125,6 +1155,25 @@ static int want_object_in_pack(const struct object_id *oid,
 		}
 	}
 
+	if (uri_protocols.nr) {
+		struct configured_exclusion *ex =
+			oidmap_get(&configured_exclusions, oid);
+		int i;
+		const char *p;
+
+		if (ex) {
+			for (i = 0; i < uri_protocols.nr; i++) {
+				if (skip_prefix(ex->uri,
+						uri_protocols.items[i].string,
+						&p) &&
+				    *p == ':') {
+					oidset_insert(&excluded_by_config, oid);
+					return 0;
+				}
+			}
+		}
+	}
+
 	return 1;
 }
 
@@ -2726,6 +2775,29 @@ static int git_pack_config(const char *k, const char *v, void *cb)
 			    pack_idx_opts.version);
 		return 0;
 	}
+	if (!strcmp(k, "uploadpack.blobpackfileuri")) {
+		struct configured_exclusion *ex = xmalloc(sizeof(*ex));
+		const char *oid_end, *pack_end;
+		/*
+		 * Stores the pack hash. This is not a true object ID, but is
+		 * of the same form.
+		 */
+		struct object_id pack_hash;
+
+		if (parse_oid_hex(v, &ex->e.oid, &oid_end) ||
+		    *oid_end != ' ' ||
+		    parse_oid_hex(oid_end + 1, &pack_hash, &pack_end) ||
+		    *pack_end != ' ')
+			die(_("value of uploadpack.blobpackfileuri must be "
+			      "of the form '<object-hash> <pack-hash> <uri>' (got '%s')"), v);
+		if (oidmap_get(&configured_exclusions, &ex->e.oid))
+			die(_("object already configured in another "
+			      "uploadpack.blobpackfileuri (got '%s')"), v);
+		ex->pack_hash_hex = xcalloc(1, pack_end - oid_end);
+		memcpy(ex->pack_hash_hex, oid_end + 1, pack_end - oid_end - 1);
+		ex->uri = xstrdup(pack_end + 1);
+		oidmap_put(&configured_exclusions, ex);
+	}
 	return git_default_config(k, v, cb);
 }
 
@@ -3318,6 +3390,9 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 			 N_("do not pack objects in promisor packfiles")),
 		OPT_BOOL(0, "delta-islands", &use_delta_islands,
 			 N_("respect islands during delta compression")),
+		OPT_STRING_LIST(0, "uri-protocol", &uri_protocols,
+				N_("protocol"),
+				N_("exclude any configured uploadpack.blobpackfileuri with this protocol")),
 		OPT_END(),
 	};
 
@@ -3492,6 +3567,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		return 0;
 	if (nr_result)
 		prepare_pack(window, depth);
+	write_excluded_by_configs();
 	write_pack_file();
 	if (progress)
 		fprintf_ln(stderr,
diff --git a/fetch-pack.c b/fetch-pack.c
index cf89af21d9..32695f5243 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -38,6 +38,7 @@ static struct lock_file shallow_lock;
 static const char *alternate_shallow_file;
 static char *negotiation_algorithm;
 static struct strbuf fsck_msg_types = STRBUF_INIT;
+static struct string_list uri_protocols = STRING_LIST_INIT_DUP;
 
 /* Remember to update object flag allocation in object.h */
 #define COMPLETE	(1U << 0)
@@ -755,7 +756,8 @@ static int sideband_demux(int in, int out, void *data)
 }
 
 static int get_pack(struct fetch_pack_args *args,
-		    int xd[2], struct string_list *pack_lockfiles)
+		    int xd[2], struct string_list *pack_lockfiles,
+		    int only_packfile)
 {
 	struct async demux;
 	int do_keep = args->keep_pack;
@@ -815,8 +817,15 @@ static int get_pack(struct fetch_pack_args *args,
 					"--keep=fetch-pack %"PRIuMAX " on %s",
 					(uintmax_t)getpid(), hostname);
 		}
-		if (args->check_self_contained_and_connected)
+		if (only_packfile && args->check_self_contained_and_connected)
 			argv_array_push(&cmd.args, "--check-self-contained-and-connected");
+		else
+			/*
+			 * We cannot perform any connectivity checks because
+			 * not all packs have been downloaded; let the caller
+			 * have this responsibility.
+			 */
+			args->check_self_contained_and_connected = 0;
 		if (args->from_promisor)
 			argv_array_push(&cmd.args, "--promisor");
 	}
@@ -993,7 +1002,7 @@ static struct ref *do_fetch_pack(struct fetch_pack_args *args,
 		alternate_shallow_file = setup_temporary_shallow(si->shallow);
 	else
 		alternate_shallow_file = NULL;
-	if (get_pack(args, fd, pack_lockfiles))
+	if (get_pack(args, fd, pack_lockfiles, 1))
 		die(_("git fetch-pack: fetch failed."));
 
  all_done:
@@ -1148,6 +1157,26 @@ static int send_fetch_request(struct fetch_negotiator *negotiator, int fd_out,
 		warning("filtering not recognized by server, ignoring");
 	}
 
+	if (server_supports_feature("fetch", "packfile-uris", 0)) {
+		int i;
+		struct strbuf to_send = STRBUF_INIT;
+
+		for (i = 0; i < uri_protocols.nr; i++) {
+			const char *s = uri_protocols.items[i].string;
+
+			if (!strcmp(s, "https") || !strcmp(s, "http")) {
+				if (to_send.len)
+					strbuf_addch(&to_send, ',');
+				strbuf_addstr(&to_send, s);
+			}
+		}
+		if (to_send.len) {
+			packet_buf_write(&req_buf, "packfile-uris %s",
+					 to_send.buf);
+			strbuf_release(&to_send);
+		}
+	}
+
 	/* add wants */
 	add_wants(args->no_dependents, wants, &req_buf);
 
@@ -1323,6 +1352,21 @@ static void receive_wanted_refs(struct packet_reader *reader,
 		die(_("error processing wanted refs: %d"), reader->status);
 }
 
+static void receive_packfile_uris(struct packet_reader *reader,
+				  struct string_list *uris)
+{
+	process_section_header(reader, "packfile-uris", 0);
+	while (packet_reader_read(reader) == PACKET_READ_NORMAL) {
+		if (reader->pktlen < the_hash_algo->hexsz ||
+		    reader->line[the_hash_algo->hexsz] != ' ')
+			die("expected '<hash> <uri>', got: %s", reader->line);
+
+		string_list_append(uris, reader->line);
+	}
+	if (reader->status != PACKET_READ_DELIM)
+		die("expected DELIM");
+}
+
 enum fetch_state {
 	FETCH_CHECK_LOCAL = 0,
 	FETCH_SEND_REQUEST,
@@ -1344,6 +1388,9 @@ static struct ref *do_fetch_pack_v2(struct fetch_pack_args *args,
 	int in_vain = 0;
 	int haves_to_send = INITIAL_FLUSH;
 	struct fetch_negotiator negotiator;
+	struct string_list packfile_uris = STRING_LIST_INIT_DUP;
+	int i;
+
 	fetch_negotiator_init(&negotiator, negotiation_algorithm);
 	packet_reader_init(&reader, fd[0], NULL, 0,
 			   PACKET_READ_CHOMP_NEWLINE |
@@ -1414,9 +1461,12 @@ static struct ref *do_fetch_pack_v2(struct fetch_pack_args *args,
 			if (process_section_header(&reader, "wanted-refs", 1))
 				receive_wanted_refs(&reader, sought, nr_sought);
 
-			/* get the pack */
+			/* get the pack(s) */
+			if (process_section_header(&reader, "packfile-uris", 1))
+				receive_packfile_uris(&reader, &packfile_uris);
 			process_section_header(&reader, "packfile", 0);
-			if (get_pack(args, fd, pack_lockfiles))
+			if (get_pack(args, fd, pack_lockfiles,
+				     !packfile_uris.nr))
 				die(_("git fetch-pack: fetch failed."));
 
 			state = FETCH_DONE;
@@ -1426,6 +1476,50 @@ static struct ref *do_fetch_pack_v2(struct fetch_pack_args *args,
 		}
 	}
 
+	for (i = 0; i < packfile_uris.nr; i++) {
+		struct child_process cmd = CHILD_PROCESS_INIT;
+		char packname[GIT_MAX_HEXSZ + 1];
+		const char *uri = packfile_uris.items[i].string +
+			the_hash_algo->hexsz + 1;
+
+		argv_array_push(&cmd.args, "http-fetch");
+		argv_array_push(&cmd.args, "--packfile");
+		argv_array_push(&cmd.args, uri);
+		cmd.git_cmd = 1;
+		cmd.no_stdin = 1;
+		cmd.out = -1;
+		if (start_command(&cmd))
+			die("fetch-pack: unable to spawn http-fetch");
+
+		if (read_in_full(cmd.out, packname, 5) < 0 ||
+		    memcmp(packname, "keep\t", 5))
+			die("fetch-pack: expected keep then TAB at start of http-fetch output");
+
+		if (read_in_full(cmd.out, packname,
+				 the_hash_algo->hexsz + 1) < 0 ||
+		    packname[the_hash_algo->hexsz] != '\n')
+			die("fetch-pack: expected hash then LF at end of http-fetch output");
+
+		packname[the_hash_algo->hexsz] = '\0';
+
+		close(cmd.out);
+
+		if (finish_command(&cmd))
+			die("fetch-pack: unable to finish http-fetch");
+
+		if (memcmp(packfile_uris.items[i].string, packname,
+			   the_hash_algo->hexsz))
+			die("fetch-pack: pack downloaded from %s does not match expected hash %.*s",
+			    uri, (int) the_hash_algo->hexsz,
+			    packfile_uris.items[i].string);
+
+		string_list_append_nodup(pack_lockfiles,
+					 xstrfmt("%s/pack/pack-%s.keep",
+						 get_object_directory(),
+						 packname));
+	}
+	string_list_clear(&packfile_uris, 0);
+
 	negotiator.release(&negotiator);
 	oidset_clear(&common);
 	return ref;
@@ -1465,6 +1559,14 @@ static void fetch_pack_config(void)
 	git_config_get_bool("transfer.fsckobjects", &transfer_fsck_objects);
 	git_config_get_string("fetch.negotiationalgorithm",
 			      &negotiation_algorithm);
+	if (!uri_protocols.nr) {
+		char *str;
+
+		if (!git_config_get_string("fetch.uriprotocols", &str) && str) {
+			string_list_split(&uri_protocols, str, ',', -1);
+			free(str);
+		}
+	}
 
 	git_config(fetch_pack_config_cb, NULL);
 }
diff --git a/t/t5702-protocol-v2.sh b/t/t5702-protocol-v2.sh
index db4ae09f2f..6784dfe279 100755
--- a/t/t5702-protocol-v2.sh
+++ b/t/t5702-protocol-v2.sh
@@ -656,6 +656,63 @@ test_expect_success 'when server does not send "ready", expect FLUSH' '
 	test_i18ngrep "expected no other sections to be sent after no .ready." err
 '
 
+configure_exclusion () {
+	git -C "$1" hash-object "$2" >objh &&
+	git -C "$1" pack-objects "$HTTPD_DOCUMENT_ROOT_PATH/mypack" <objh >packh &&
+	git -C "$1" config --add \
+		"uploadpack.blobpackfileuri" \
+		"$(cat objh) $(cat packh) $HTTPD_URL/dumb/mypack-$(cat packh).pack" &&
+	cat objh
+}
+
+test_expect_success 'part of packfile response provided as URI' '
+	P="$HTTPD_DOCUMENT_ROOT_PATH/http_parent" &&
+	rm -rf "$P" http_child log &&
+
+	git init "$P" &&
+	git -C "$P" config "uploadpack.allowsidebandall" "true" &&
+
+	echo my-blob >"$P/my-blob" &&
+	git -C "$P" add my-blob &&
+	echo other-blob >"$P/other-blob" &&
+	git -C "$P" add other-blob &&
+	git -C "$P" commit -m x &&
+
+	configure_exclusion "$P" my-blob >h &&
+	configure_exclusion "$P" other-blob >h2 &&
+
+	GIT_TRACE=1 GIT_TRACE_PACKET="$(pwd)/log" GIT_TEST_SIDEBAND_ALL=1 \
+	git -c protocol.version=2 \
+		-c fetch.uriprotocols=http,https \
+		clone "$HTTPD_URL/smart/http_parent" http_child &&
+
+	# Ensure that my-blob and other-blob are in separate packfiles.
+	for idx in http_child/.git/objects/pack/*.idx
+	do
+		git verify-pack --verbose $idx >out &&
+		{
+			grep "^[0-9a-f]\{16,\} " out || :
+		} >out.objectlist &&
+		if test_line_count = 1 out.objectlist
+		then
+			if grep $(cat h) out
+			then
+				>hfound
+			fi &&
+			if grep $(cat h2) out
+			then
+				>h2found
+			fi
+		fi
+	done &&
+	test -f hfound &&
+	test -f h2found &&
+
+	# Ensure that there are exactly 6 files (3 .pack and 3 .idx).
+	ls http_child/.git/objects/pack/* >filelist &&
+	test_line_count = 6 filelist
+'
+
 stop_httpd
 
 test_done
diff --git a/upload-pack.c b/upload-pack.c
index 987d2e139b..d36e1fc06a 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -105,9 +105,12 @@ static int write_one_shallow(const struct commit_graft *graft, void *cb_data)
 struct output_state {
 	char buffer[8193];
 	int used;
+	unsigned packfile_uris_started : 1;
+	unsigned packfile_started : 1;
 };
 
-static int relay_pack_data(int pack_objects_out, struct output_state *os)
+static int relay_pack_data(int pack_objects_out, struct output_state *os,
+			   int write_packfile_line)
 {
 	/*
 	 * We keep the last byte to ourselves
@@ -128,6 +131,37 @@ static int relay_pack_data(int pack_objects_out, struct output_state *os)
 	}
 	os->used += readsz;
 
+	while (!os->packfile_started) {
+		char *p;
+		if (os->used >= 4 && !memcmp(os->buffer, "PACK", 4)) {
+			os->packfile_started = 1;
+			if (write_packfile_line) {
+				if (os->packfile_uris_started)
+					packet_delim(1);
+				packet_write_fmt(1, "\1packfile\n");
+			}
+			break;
+		}
+		if ((p = memchr(os->buffer, '\n', os->used))) {
+			if (!os->packfile_uris_started) {
+				os->packfile_uris_started = 1;
+				if (!write_packfile_line)
+					BUG("packfile_uris requires sideband-all");
+				packet_write_fmt(1, "\1packfile-uris\n");
+			}
+			*p = '\0';
+			packet_write_fmt(1, "\1%s\n", os->buffer);
+
+			os->used -= p - os->buffer + 1;
+			memmove(os->buffer, p + 1, os->used);
+		} else {
+			/*
+			 * Incomplete line.
+			 */
+			return readsz;
+		}
+	}
+
 	if (os->used > 1) {
 		send_client_data(1, os->buffer, os->used - 1);
 		os->buffer[0] = os->buffer[os->used - 1];
@@ -141,7 +175,8 @@ static int relay_pack_data(int pack_objects_out, struct output_state *os)
 }
 
 static void create_pack_file(const struct object_array *have_obj,
-			     const struct object_array *want_obj)
+			     const struct object_array *want_obj,
+			     const struct string_list *uri_protocols)
 {
 	struct child_process pack_objects = CHILD_PROCESS_INIT;
 	struct output_state output_state = {0};
@@ -192,6 +227,11 @@ static void create_pack_file(const struct object_array *have_obj,
 					 expanded_filter_spec.buf);
 		}
 	}
+	if (uri_protocols) {
+		for (i = 0; i < uri_protocols->nr; i++)
+			argv_array_pushf(&pack_objects.args, "--uri-protocol=%s",
+					 uri_protocols->items[i].string);
+	}
 
 	pack_objects.in = -1;
 	pack_objects.out = -1;
@@ -278,7 +318,8 @@ static void create_pack_file(const struct object_array *have_obj,
 		}
 		if (0 <= pu && (pfd[pu].revents & (POLLIN|POLLHUP))) {
 			int result = relay_pack_data(pack_objects.out,
-						     &output_state);
+						     &output_state,
+						     !!uri_protocols);
 
 			if (result == 0) {
 				close(pack_objects.out);
@@ -1123,7 +1164,7 @@ void upload_pack(struct upload_pack_options *options)
 	if (want_obj.nr) {
 		struct object_array have_obj = OBJECT_ARRAY_INIT;
 		get_common_commits(&reader, &have_obj, &want_obj);
-		create_pack_file(&have_obj, &want_obj);
+		create_pack_file(&have_obj, &want_obj, NULL);
 	}
 }
 
@@ -1138,6 +1179,7 @@ struct upload_pack_data {
 	timestamp_t deepen_since;
 	int deepen_rev_list;
 	int deepen_relative;
+	struct string_list uri_protocols;
 
 	struct packet_writer writer;
 
@@ -1157,6 +1199,7 @@ static void upload_pack_data_init(struct upload_pack_data *data)
 	struct oid_array haves = OID_ARRAY_INIT;
 	struct object_array shallows = OBJECT_ARRAY_INIT;
 	struct string_list deepen_not = STRING_LIST_INIT_DUP;
+	struct string_list uri_protocols = STRING_LIST_INIT_DUP;
 
 	memset(data, 0, sizeof(*data));
 	data->wants = wants;
@@ -1164,6 +1207,7 @@ static void upload_pack_data_init(struct upload_pack_data *data)
 	data->haves = haves;
 	data->shallows = shallows;
 	data->deepen_not = deepen_not;
+	data->uri_protocols = uri_protocols;
 	packet_writer_init(&data->writer, 1);
 }
 
@@ -1322,9 +1366,17 @@ static void process_args(struct packet_reader *request,
 			continue;
 		}
 
+		if (skip_prefix(arg, "packfile-uris ", &p)) {
+			string_list_split(&data->uri_protocols, p, ',', -1);
+			continue;
+		}
+
 		/* ignore unknown lines maybe? */
 		die("unexpected line: '%s'", arg);
 	}
+
+	if (data->uri_protocols.nr && !data->writer.use_sideband)
+		string_list_clear(&data->uri_protocols, 0);
 }
 
 static int process_haves(struct oid_array *haves, struct oid_array *common,
@@ -1514,8 +1566,13 @@ int upload_pack_v2(struct repository *r, struct argv_array *keys,
 			send_wanted_ref_info(&data);
 			send_shallow_info(&data, &want_obj);
 
-			packet_writer_write(&data.writer, "packfile\n");
-			create_pack_file(&have_obj, &want_obj);
+			if (data.uri_protocols.nr) {
+				create_pack_file(&have_obj, &want_obj,
+						 &data.uri_protocols);
+			} else {
+				packet_write_fmt(1, "packfile\n");
+				create_pack_file(&have_obj, &want_obj, NULL);
+			}
 			state = FETCH_DONE;
 			break;
 		case FETCH_DONE:
@@ -1536,6 +1593,7 @@ int upload_pack_advertise(struct repository *r,
 		int allow_filter_value;
 		int allow_ref_in_want;
 		int allow_sideband_all_value;
+		char *str = NULL;
 
 		strbuf_addstr(value, "shallow");
 
@@ -1557,6 +1615,14 @@ int upload_pack_advertise(struct repository *r,
 					   &allow_sideband_all_value) &&
 		     allow_sideband_all_value))
 			strbuf_addstr(value, " sideband-all");
+
+		if (!repo_config_get_string(the_repository,
+					    "uploadpack.blobpackfileuri",
+					    &str) &&
+		    str) {
+			strbuf_addstr(value, " packfile-uris");
+			free(str);
+		}
 	}
 
 	return 1;
-- 
2.19.0.271.gfe8321ec05.dirty


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 0/8] CDN offloading of fetch response
  2019-03-08 21:55 ` [PATCH v2 0/8] " Jonathan Tan
                     ` (7 preceding siblings ...)
  2019-03-08 21:55   ` [PATCH v2 8/8] upload-pack: send part of packfile response as uri Jonathan Tan
@ 2019-03-19 20:48   ` Josh Steadmon
  2019-04-23  5:21   ` Jeff King
  2019-04-24  9:09   ` Ævar Arnfjörð Bjarmason
  10 siblings, 0 replies; 45+ messages in thread
From: Josh Steadmon @ 2019-03-19 20:48 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: avarab, git

On 2019.03.08 13:55, Jonathan Tan wrote:
> Here's my current progress - the only thing that is lacking is more
> tests, maybe, so I think it's ready for review.

This series looks good to me.

Reviewed-by: Josh Steadmon <steadmon@google.com>

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 0/8] CDN offloading of fetch response
  2019-03-08 21:55 ` [PATCH v2 0/8] " Jonathan Tan
                     ` (8 preceding siblings ...)
  2019-03-19 20:48   ` [PATCH v2 0/8] CDN offloading of fetch response Josh Steadmon
@ 2019-04-23  5:21   ` Jeff King
  2019-04-23 19:23     ` Jonathan Tan
  2019-04-24  9:09   ` Ævar Arnfjörð Bjarmason
  10 siblings, 1 reply; 45+ messages in thread
From: Jeff King @ 2019-04-23  5:21 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: avarab, git

On Fri, Mar 08, 2019 at 01:55:12PM -0800, Jonathan Tan wrote:

> Here's my current progress - the only thing that is lacking is more
> tests, maybe, so I think it's ready for review.

A bit belated, but here are some overall thoughts. A lot of my thinking
comes from comparing this to previous work I had done on teaching the
client to fetch first from a bundle and then do a follow-up fetch from
the server.

> This code now starts CDN downloads after closing the first HTTP request,
> and it holds on to the .keep files until after the refs are set. I'll
> leave parallelization of the CDN downloads for later work.

Good. One of the problems with the bundle+followup approach was either
having to hold open or re-establish the connection to the original
server, since that fetch had to be put on hold.

I agree that parallelizing can come later. I do wonder if it's worth
making a new tool rather than trying to reuse git-http-fetch. Yes, it
basically does what we want already, but there's quite a lot of cruft in
that dumb-http code base. Using http_get_file(), or even curl more
directly might be a bit easier.

One problem in particular I'd worry about is that the http code
generally expects authentication to happen through the initial
ref-discovery, and then be set for all subsequent requests. So I doubt
an http_pack_request actually handles auth at all, and retro-fitting it
may be tricky.
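
To be concrete about the shape I have in mind, here is an untested
sketch. fetch_one_packfile_uri() is a made-up name, and I'm assuming
http.h's http_init()/http_get_file()/http_cleanup() and HTTP_OK as
they exist today; whether those actually pick up credentials without
a prior ref advertisement is exactly the open question above:

#include "cache.h"
#include "http.h"

/*
 * Sketch only; not part of this series.  Downloads a single
 * advertised packfile URI straight to a local file, reusing http.c's
 * helpers rather than the dumb-http walker machinery.
 */
static int fetch_one_packfile_uri(const char *uri, const char *dest)
{
	struct http_get_options opts = {0};
	int ret;

	http_init(NULL, uri, 0); /* no struct remote; auth may be the gap */
	ret = http_get_file(uri, dest, &opts);
	http_cleanup();

	if (ret != HTTP_OK)
		return error("unable to download pack from %s", uri);
	return 0;
}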

> One relatively significant change: someone pointed out that the issue fixed by 
> 50d3413740 ("http: make redirects more obvious", 2016-12-06) could also
> occur here, so I have changed it so that the server is required to send
> the packfile's hash along with the URI.

I have mixed feelings on that. I like how it clearly makes the server
the source of authority. You don't have to care about CDN tampering,
because you know you are getting the bytes the server wanted you to.

I think even without that this is still _mostly_ true, though. You're
going to compute the hash of every object the CDN sends you, and those
objects must fit into the missing gaps in what the server sends you. So
the worst case is that a malicious CDN could send you some extra
objects. That's probably not the end of the world, but I do like the
extra guarantee of knowing that you got byte-for-byte what the server
wanted you to.

> This does mean that Ævar's workflow described in [1] would not work.
> Quoting Ævar:
> 
> > More concretely, I'd like to have a setup where a server can just dumbly
> > point to some URL that probably has most of the data, without having any
> > idea what OIDs are in it. So that e.g. some machine entirely
> > disconnected from the server (and with just a regular clone) can
> > continually generating an up-to-date-enough packfile.
> 
> With 50d3413740, it seems to me that it's important for the server to
> know details about the URIs that it points to, so such a disconnection
> would not work.

I think even without 50d3413740, this is important for your scheme. One
of the weaknesses (and strengths, I suppose) of the bundle+followup
scheme was that the initial bundle generally had to meet the git
repository guarantee. After you fetched the bundle, you'd tell the
server "OK, now I have commit X" just like you were a regular client.

But I really like that your proposal does away with that, and asks for
tighter cooperation between the server and the CDN. It means the CDN can
serve some random subset of the objects. But once we do that, now the
server _has_ to know what was in those URLs it pointed to, because our
protocol has no good way of communicating a random subset of objects (if
they're just blobs, we could enumerate all of them to the server as
haves; yuck.  But as soon as you get a tree or commit from the CDN, you
have to have all of the reachable objects to be able to claim it).

So I think this is pretty inherent to your proposal, but it's the
right tradeoff to be making here (I think there could still be room for
the other thing, too, but it's just a different feature).

I have a few other comments on the design, but I'll send them in
response to patch 5.

-Peff

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 5/8] Documentation: add Packfile URIs design doc
  2019-03-08 21:55   ` [PATCH v2 5/8] Documentation: add Packfile URIs design doc Jonathan Tan
@ 2019-04-23  5:31     ` Jeff King
  2019-04-23 20:38       ` Jonathan Tan
                         ` (2 more replies)
  0 siblings, 3 replies; 45+ messages in thread
From: Jeff King @ 2019-04-23  5:31 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: avarab, git, Junio C Hamano

On Fri, Mar 08, 2019 at 01:55:17PM -0800, Jonathan Tan wrote:

> +If the 'packfile-uris' feature is advertised, the following argument
> +can be included in the client's request as well as the potential
> +addition of the 'packfile-uris' section in the server's response as
> +explained below.
> +
> +    packfile-uris <comma-separated list of protocols>
> +	Indicates to the server that the client is willing to receive
> +	URIs of any of the given protocols in place of objects in the
> +	sent packfile. Before performing the connectivity check, the
> +	client should download from all given URIs. Currently, the
> +	protocols supported are "http" and "https".

This negotiation seems backwards to me, because it puts too much power
in the hands of the server.

The server says "OK, I support this feature". Then the client says "I
support it, too. But only if you like these protocols". And then the
server dumps a bunch of URIs and expects the client to respect them.

The problem I see is that the client doesn't get to vet the list of
URIs; it only gets to specify a protocol match. But there are many other
reasons it might want to reject a URI: we don't like the protocol, the
domain name is on a blacklist (or not on a whitelist), the domain name
can't resolve, we can't make a TCP connection to the server, we can't
successfully fetch the pack.

You'll note that those rise in complexity and time as you go down the
list. I'm not sure where on that spectrum we'd want our clients to stop
vetting (and it may even depend on config). But I think we ought to
design the protocol to put the decision in the hands of the client so
that it _can_ make those choices itself.

I.e., I think the conversation ought to be more like:

  Server: I support packfile-uris X, Y, Z.

  Client: Great. I'll use URIs X and Z.

  Server: OK, here's your pack, minus any objects I know are in X and Z.
          I'll send you the objects from Y as normal.

And then the client is free to pick and choose. The initial server uri
list can come in the capabilities list, or it can be a separate request
once the client sees the server supports packfile-uris and wants to ask
about them. We may need some way for the server to group the uris so
that the client knows which ones are alternates of each other (and which
ones are needed to make a complete set).
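
Just to illustrate what client-side vetting could mean, here is a
hypothetical sketch; packfile_uri_is_acceptable() and both of its
list arguments are made up, and only string_list, skip_prefix() and
starts_with() are existing helpers. Resolving the host or actually
probing the URI would slot in at the end, at increasing cost:

#include "cache.h"
#include "string-list.h"

/*
 * Hypothetical: accept an offered URI only if its scheme is on a
 * whitelist and its host does not match a blacklist entry.
 */
static int packfile_uri_is_acceptable(const char *uri,
				      const struct string_list *ok_schemes,
				      const struct string_list *bad_hosts)
{
	const char *rest = NULL;
	int i;

	for (i = 0; i < ok_schemes->nr; i++)
		if (skip_prefix(uri, ok_schemes->items[i].string, &rest) &&
		    skip_prefix(rest, "://", &rest))
			break;
	if (i == ok_schemes->nr)
		return 0;	/* scheme not on the whitelist */

	for (i = 0; i < bad_hosts->nr; i++)
		if (starts_with(rest, bad_hosts->items[i].string))
			return 0;	/* host on the blacklist */

	return 1;
}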

-Peff

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 0/8] CDN offloading of fetch response
  2019-04-23  5:21   ` Jeff King
@ 2019-04-23 19:23     ` Jonathan Tan
  0 siblings, 0 replies; 45+ messages in thread
From: Jonathan Tan @ 2019-04-23 19:23 UTC (permalink / raw)
  To: peff; +Cc: jonathantanmy, avarab, git

> On Fri, Mar 08, 2019 at 01:55:12PM -0800, Jonathan Tan wrote:
> 
> > Here's my current progress - the only thing that is lacking is more
> > tests, maybe, so I think it's ready for review.
> 
> A bit belated, but here are some overall thoughts. A lot of my thinking
> comes from comparing this to previous work I had done on teaching the
> client to fetch first from a bundle and then do a follow-up fetch from
> the server.

Thanks for your thoughts.

> > This code now starts CDN downloads after closing the first HTTP request,
> > and it holds on to the .keep files until after the refs are set. I'll
> > leave parallelization of the CDN downloads for later work.
> 
> Good. One of the problems with the bundle+followup approach was either
> having to hold open or re-establish the connection to the original
> server, since that fetch had to be put on hold.
> 
> I agree that parallelizing can come later. I do wonder if it's worth
> making a new tool rather than trying to reuse git-http-fetch. Yes, it
> basically does what we want already, but there's quite a lot of cruft in
> that dumb-http code base. Using http_get_file(), or even curl more
> directly might be a bit easier.
> 
> One problem in particular I'd worry about is that the http code
> generally expects authentication to happen through the initial
> ref-discovery, and then be set for all subsequent requests. So I doubt
> an http_pack_request actually handles auth at all, and retro-fitting it
> may be tricky.

Thanks - I'll keep that in mind. If auth isn't handled, then we'll
definitely need something new.

> > One relatively significant change: someone pointed out that the issue fixed by 
> > 50d3413740 ("http: make redirects more obvious", 2016-12-06) could also
> > occur here, so I have changed it so that the server is required to send
> > the packfile's hash along with the URI.
> 
> I have mixed feelings on that. I like how it clearly makes the server
> the source of authority. You don't have to care about CDN tampering,
> because you know you are getting the bytes the server wanted you to.

I was thinking not of CDN tampering but of a malicious server. Suppose a
server knew of (1) the hash of a target commit, and (2) the URL of a
packfile containing that target commit. The server can craft a commit
whose parent is that target commit, advertise it, and serve both
that commit as a packfile and the URL as a packfile-uri. When the user
subsequently pushes, they will probably inadvertently push everything
including the target commit.

This is similar to the 50d3413740 case except that now the server needs
an additional datum (2). The packfile's hash is yet another additional
datum. I admit that it may not be that useful, though.

> I think even without that this is still _mostly_ true, though. You're
> going to compute the hash of every object the CDN sends you, and those
> objects must fit into the missing gaps in what the server sends you. So
> the worst case is that a malicious CDN could send you some extra
> objects. That's probably not the end of the world, but I do like the
> extra guarantee of knowing that you got byte-for-byte what the server
> wanted you to.

Yes, the guarantee is nice.

> > This does mean that Ævar's workflow described in [1] would not work.
> > Quoting Ævar:
> > 
> > > More concretely, I'd like to have a setup where a server can just dumbly
> > > point to some URL that probably has most of the data, without having any
> > > idea what OIDs are in it. So that e.g. some machine entirely
> > > disconnected from the server (and with just a regular clone) can
> > > continually generating an up-to-date-enough packfile.
> > 
> > With 50d3413740, it seems to me that it's important for the server to
> > know details about the URIs that it points to, so such a disconnection
> > would not work.
> 
> I think even without 50d3413740, this is important for your scheme. One
> of the weaknesses (and strengths, I suppose) of the bundle+followup
> scheme was that the initial bundle generally had to meet the git
> repository guarantee. After you fetched the bundle, you'd tell the
> server "OK, now I have commit X" just like you were a regular client.
> 
> But I really like that your proposal does away with that, and asks for
> tighter cooperation between the server and the CDN. It means the CDN can
> serve some random subset of the objects. But once we do that, now the
> server _has_ to know what was in those URLs it pointed to, because our
> protocol has no good way of communicating a random subset of objects (if
> they're just blobs, we could enumerate all of them to the server as
> haves; yuck.  But as soon as you get a tree or commit from the CDN, you
> have to have all of the reachable objects to be able to claim it).

Ah...you're right that the server still needs some inkling of what the
CDN's packfile has.

Without the packfile hash, the knowledge needed is slightly less: the
server just has to know a subset of the objects instead of the exact
list of objects, but I can't think of how this can benefit us other than
that we don't require so much consistency (it's OK if the server's
knowledge of the CDN is updated significantly later than the CDN is
updated, if the CDN reuses the same URL for each version of the
packfile).

> So I think this is pretty inherent to your proposal, but it's the
> right tradeoff to be making here (I think there could still be room for
> the other thing, too, but it's just a different feature).

Thanks.

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 5/8] Documentation: add Packfile URIs design doc
  2019-04-23  5:31     ` Jeff King
@ 2019-04-23 20:38       ` Jonathan Tan
  2019-04-23 22:18         ` Ævar Arnfjörð Bjarmason
  2019-04-23 22:11       ` Jonathan Nieder
  2019-04-24  3:01       ` Junio C Hamano
  2 siblings, 1 reply; 45+ messages in thread
From: Jonathan Tan @ 2019-04-23 20:38 UTC (permalink / raw)
  To: peff; +Cc: jonathantanmy, avarab, git, gitster

> The problem I see is that the client doesn't get to vet the list of
> URIs; it only gets to specify a protocol match. But there are many other
> reasons it might want to reject a URI: we don't like the protocol, the
> domain name is on a blacklist (or not on a whitelist), the domain name
> can't resolve, we can't make a TCP connection to the server, we can't
> successfully fetch the pack.
> 
> You'll note that those rise in complexity and time as you go down the
> list. I'm not sure where on that spectrum we'd want our clients to stop
> vetting (and it may even depend on config). But I think we ought to
> design the protocol to put the decision in the hands of the client so
> that it _can_ make those choices itself.
> 
> I.e., I think the conversation ought to be more like:
> 
>   Server: I support packfile-uris X, Y, Z.
> 
>   Client: Great. I'll use URIs X and Z.
> 
>   Server: OK, here's your pack, minus any objects I know are in X and Z.
>           I'll send you the objects from Y as normal.
> 
> And then the client is free to pick and choose.

One drawback I see is that the server needs to compute objects to be
sent twice - once to generate the URIs and once after the client has
informed the server which URIs it wants.

If we expect some packfile-uris to not be usable sometimes (for any of
the reasons you listed), this would be nice. The protocol in my patches
supports a rudimentary version of this (if any of the URIs don't work for
any reason, just fetch again without advertising that we support
packfile URIs) but the packfile received during the first fetch is
wasted.

So the tradeoff is: in the good case, your suggestion means that we make
another fetch request, increasing the load on the server and taking more
time. In the bad case, your suggestion means that we avoid sending a
useless packfile upon the first fetch, and if the server is smart
enough, even the second packfile will be smaller. It depends on how
often we think the bad case occurs, if servers will typically send more
than one packfile-uri, and how smart we think servers will typically be.

> The initial server uri
> list can come in the capabilities list, or it can be a separate request
> once the client sees the server supports packfile-uris and wants to ask
> about them.

I don't think this is possible - the URI list is dependent on the wants
and haves.

> We may need some way for the server to group the uris so
> that the client knows which ones are alternates of each other (and which
> ones are needed to make a complete set).

My initial design didn't have this feature (in fact, it seems to me that
each thing should have one canonical URL, which means that there is no
need for an alternate) - do you think we should be thinking about this
at this stage?

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 5/8] Documentation: add Packfile URIs design doc
  2019-04-23  5:31     ` Jeff King
  2019-04-23 20:38       ` Jonathan Tan
@ 2019-04-23 22:11       ` Jonathan Nieder
  2019-04-23 22:25         ` Ævar Arnfjörð Bjarmason
  2019-04-24  3:01       ` Junio C Hamano
  2 siblings, 1 reply; 45+ messages in thread
From: Jonathan Nieder @ 2019-04-23 22:11 UTC (permalink / raw)
  To: Jeff King; +Cc: Jonathan Tan, avarab, git, Junio C Hamano

Hi,

Jeff King wrote:
> On Fri, Mar 08, 2019 at 01:55:17PM -0800, Jonathan Tan wrote:

>> +If the 'packfile-uris' feature is advertised, the following argument
>> +can be included in the client's request as well as the potential
>> +addition of the 'packfile-uris' section in the server's response as
>> +explained below.
>> +
>> +    packfile-uris <comma-separated list of protocols>
>> +	Indicates to the server that the client is willing to receive
>> +	URIs of any of the given protocols in place of objects in the
>> +	sent packfile. Before performing the connectivity check, the
>> +	client should download from all given URIs. Currently, the
>> +	protocols supported are "http" and "https".
>
> This negotiation seems backwards to me, because it puts too much power
> in the hands of the server.

Thanks.  Forgive me if this was covered earlier in the conversation, but
why do we need more than one protocol at all here?  Can we restrict this
to only-https, all the time?

[...]
> The problem I see is that the client doesn't get to vet the list of
> URIs; it only gets to specify a protocol match. But there are many other
> reasons it might want to reject a URI: we don't like the protocol, the
> domain name is on a blacklist (or not on a whitelist), the domain name
> can't resolve, we can't make a TCP connection to the server, we can't
> successfully fetch the pack.

Christian mentioned this desire to vet URIs before, and I'll admit I
found it hard to imagine a use case.  Why can't it work like e.g.
<frame> on the web, where if you don't like that domain, then you
don't get to access the page?  From a server operator's point of view,
if you want to support a second URI that more clients support, why
wouldn't you just always use that second URI instead of making clients
choose?

Thanks and hope that helps,
Jonathan

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 5/8] Documentation: add Packfile URIs design doc
  2019-04-23 20:38       ` Jonathan Tan
@ 2019-04-23 22:18         ` Ævar Arnfjörð Bjarmason
  2019-04-23 22:22           ` Jonathan Nieder
  0 siblings, 1 reply; 45+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-04-23 22:18 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: peff, git, gitster


On Tue, Apr 23 2019, Jonathan Tan wrote:

>> The problem I see is that the client doesn't get to vet the list of
>> URIs; it only gets to specify a protocol match. But there are many other
>> reasons it might want to reject a URI: we don't like the protocol, the
>> domain name is on a blacklist (or not on a whitelist), the domain name
>> can't resolve, we can't make a TCP connection to the server, we can't
>> successfully fetch the pack.
>>
>> You'll note that those rise in complexity and time as you go down the
>> list. I'm not sure where on that spectrum we'd want our clients to stop
>> vetting (and it may even depend on config). But I think we ought to
>> design the protocol to put the decision in the hands of the client so
>> that it _can_ make those choices itself.
>>
>> I.e., I think the conversation ought to be more like:
>>
>>   Server: I support packfile-uris X, Y, Z.
>>
>>   Client: Great. I'll use URIs X and Z.
>>
>>   Server: OK, here's your pack, minus any objects I know are in X and Z.
>>           I'll send you the objects from Y as normal.
>>
>> And then the client is free to pick and choose.
>
> One drawback I see is that the server needs to compute objects to be
> sent twice - once to generate the URIs and once after the client has
> informed the server which URIs it wants.
>
> If we expect some packfile-uris to not be usable sometimes (for any of
> the reasons you listed), this would be nice. The protocol in my patches
> supports a rudimentary version of this (if any of the URIs don't work for
> any reason, just fetch again without advertising that we support
> packfile URIs) but the packfile received during the first fetch is
> wasted.

This is really orthogonal to this series, but wouldn't a better
resumption strategy here be to walk the pack we just downloaded, run the
equivalent of 'commit-graph write' on it to figure out likely "tip"
commits, and use those in "have" lines to negotiate with the server the
next time around?

I've sometimes wished we optionally had that sort of fetch algorithm; in
particular, setting "alternates" on a freshly init-ed repo currently does
the full initial fetch even though we have most/all of the objects in
the alternates already.

> So the tradeoff is: in the good case, your suggestion means that we make
> another fetch request, increasing the load on the server and taking more
> time. In the bad case, your suggestion means that we avoid sending a
> useless packfile upon the first fetch, and if the server is smart
> enough, even the second packfile will be smaller. It depends on how
> often we think the bad case occurs, if servers will typically send more
> than one packfile-uri, and how smart we think servers will typically be.
>
>> The initial server uri
>> list can come in the capabilities list, or it can be a separate request
>> once the client sees the server supports packfile-uris and wants to ask
>> about them.
>
> I don't think this is possible - the URI list is dependent on the wants
> and haves.
>
>> We may need some way for the server to group the uris so
>> that the client knows which ones are alternates of each other (and which
>> ones are needed to make a complete set).
>
> My initial design didn't have this feature (in fact, it seems to me that
> each thing should have one canonical URL, which means that there is no
> need for an alternate) - do you think we should be thinking about this
> at this stage?

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 5/8] Documentation: add Packfile URIs design doc
  2019-04-23 22:18         ` Ævar Arnfjörð Bjarmason
@ 2019-04-23 22:22           ` Jonathan Nieder
  2019-04-23 22:30             ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 45+ messages in thread
From: Jonathan Nieder @ 2019-04-23 22:22 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Jonathan Tan, peff, git, gitster

Hi,

Ævar Arnfjörð Bjarmason wrote:

> This is really orthagonal to this series, but wouldn't a better
> resumption strategy here be to walk the pack we just downloaded, run the
> equivalent of 'commit-graph write' on it to figure out likely "tip"
> commits, and use those in "have" lines to negotiate with the server the
> next time around?

Do you mean this for when a pack is self-contained and contains all
objects reachable from those "tip" commits?

What would you do when a pack is not self-contained in that way?

Thanks,
Jonathan

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 5/8] Documentation: add Packfile URIs design doc
  2019-04-23 22:11       ` Jonathan Nieder
@ 2019-04-23 22:25         ` Ævar Arnfjörð Bjarmason
  2019-04-23 22:48           ` Jonathan Nieder
  0 siblings, 1 reply; 45+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-04-23 22:25 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Jeff King, Jonathan Tan, git, Junio C Hamano


On Wed, Apr 24 2019, Jonathan Nieder wrote:

> Hi,
>
> Jeff King wrote:
>> On Fri, Mar 08, 2019 at 01:55:17PM -0800, Jonathan Tan wrote:
>
>>> +If the 'packfile-uris' feature is advertised, the following argument
>>> +can be included in the client's request as well as the potential
>>> +addition of the 'packfile-uris' section in the server's response as
>>> +explained below.
>>> +
>>> +    packfile-uris <comma-separated list of protocols>
>>> +	Indicates to the server that the client is willing to receive
>>> +	URIs of any of the given protocols in place of objects in the
>>> +	sent packfile. Before performing the connectivity check, the
>>> +	client should download from all given URIs. Currently, the
>>> +	protocols supported are "http" and "https".
>>
>> This negotiation seems backwards to me, because it puts too much power
>> in the hands of the server.
>
> Thanks.  Forgive me if this was covered earlier in the conversation, but
> why do we need more than one protocol at all here?  Can we restrict this
> to only-https, all the time?

There was this in an earlier discussion about this:
https://public-inbox.org/git/877eds5fpl.fsf@evledraar.gmail.com/

It seems arbitrary to break it for new features if we support http in
general, especially with a design as it is now where the checksum of the
pack is transmitted out-of-band.

> [...]
>> The problem I see is that the client doesn't get to vet the list of
>> URIs; it only gets to specify a protocol match. But there are many other
>> reasons it might want to reject a URI: we don't like the protocol, the
>> domain name is on a blacklist (or not on a whitelist), the domain name
>> can't resolve, we can't make a TCP connection to the server, we can't
>> successfully fetch the pack.
>
> Christian mentioned this desire to vet URIs before, and I'll admit I
> found it hard to imagine a use case.  Why can't it work like e.g.
> <frame> on the web, where if you don't like that domain, then you
> don't get to access the page?  From a server operator's point of view,
> if you want to support a second URI that more clients support, why
> wouldn't you just always use that second URI instead of making clients
> choose?
>
> Thanks and hope that helps,
> Jonathan

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 5/8] Documentation: add Packfile URIs design doc
  2019-04-23 22:22           ` Jonathan Nieder
@ 2019-04-23 22:30             ` Ævar Arnfjörð Bjarmason
  2019-04-23 22:51               ` Jonathan Nieder
  0 siblings, 1 reply; 45+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-04-23 22:30 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Jonathan Tan, peff, git, gitster


On Wed, Apr 24 2019, Jonathan Nieder wrote:

> Hi,
>
> Ævar Arnfjörð Bjarmason wrote:
>
>> This is really orthagonal to this series, but wouldn't a better
>> resumption strategy here be to walk the pack we just downloaded, run the
>> equivalent of 'commit-graph write' on it to figure out likely "tip"
>> commits, and use those in "have" lines to negotiate with the server the
>> next time around?
>
> Do you mean this for when a pack is self-contained and contains all
> objects reachable from those "tip" commits?
>
> What would you do when a pack is not self-contained in that way?

Indeed, it had been a while since I read the first version of this. I
was assuming a "base pack" use-case, but it seems it's narrowly isolated
to just "N packs each containing one big blob", right?

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 5/8] Documentation: add Packfile URIs design doc
  2019-04-23 22:25         ` Ævar Arnfjörð Bjarmason
@ 2019-04-23 22:48           ` Jonathan Nieder
  2019-04-24  7:48             ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 45+ messages in thread
From: Jonathan Nieder @ 2019-04-23 22:48 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Jeff King, Jonathan Tan, git, Junio C Hamano

Ævar Arnfjörð Bjarmason wrote:
> On Wed, Apr 24 2019, Jonathan Nieder wrote:
>> Jeff King wrote:
>>> On Fri, Mar 08, 2019 at 01:55:17PM -0800, Jonathan Tan wrote:

>>>> +If the 'packfile-uris' feature is advertised, the following argument
>>>> +can be included in the client's request as well as the potential
>>>> +addition of the 'packfile-uris' section in the server's response as
>>>> +explained below.
>>>> +
>>>> +    packfile-uris <comma-separated list of protocols>
>>>> +	Indicates to the server that the client is willing to receive
>>>> +	URIs of any of the given protocols in place of objects in the
>>>> +	sent packfile. Before performing the connectivity check, the
>>>> +	client should download from all given URIs. Currently, the
>>>> +	protocols supported are "http" and "https".
>>>
>>> This negotiation seems backwards to me, because it puts too much power
>>> in the hands of the server.
>>
>> Thanks.  Forgive me if this was covered earlier in the conversation, but
>> why do we need more than one protocol at all here?  Can we restrict this
>> to only-https, all the time?
>
> There was this in an earlier discussion about this:
> https://public-inbox.org/git/877eds5fpl.fsf@evledraar.gmail.com/
>
> It seems arbitrary to break it for new features if we support http in
> general, especially with a design as it is now where the checksum of the
> pack is transmitted out-of-band.

Thanks for the pointer.  TLS provides privacy, too, but I can see why
in today's world it might not always be easy to set it up, and given
that we have integrity protection via that checksum, I can see why
some people might have a legitimate need for using plain "http" here.

We may also want to support packfile-uris using SSH protocol in the
future.  Might as well figure out how the protocol negotiation works
now.  So let's delve more into it:

Peff mentioned that it feels backwards for the client to specify what
protocols they support in the request, instead of the server
specifying them upfront in the capability advertisement.  I'm inclined
to agree: it's probably reasonable to put this in server capabilities
instead.  That would even allow the client to do something like

	This server only supports HTTP without TLS, which you have
	indicated is a condition in which you want to be prompted.
	Proceed?

	[Use HTTP packfiles]  [Use slower but safer inline packs]

Peff also asked whether protocol scheme is the right granularity:
should the server list what domains they can serve packfiles from
instead?  In other words, once you're doing it for protocol schemes,
why not do it for whole URIs too?  I'm grateful for the question since
it's a way to probe at design assumptions.

- protocol schemes are likely to be low in number because each has its
  own code path to handle it.  By comparison, domains or URIs may be
  too numerous to be something we want to jam into the capability
  advertisement.  (Or the server operator could always use the same
  domain as the Git repo, and then use a 302 to redirect to the CDN.
  I suspect this is likely to be a common setup anyway: it allows the
  Git server to generate a short-lived signed URL that it uses as the
  target of a 302.  But in this case, what is the point of a domain
  whitelist?)

- relatedly, because the list of protocol schemes is small, it is
  feasible to test client behavior with each subset of protocol
  schemes enabled.  Finer-grained filtering would mean more esoteric
  client configurations for server operators to support and debug.

- supported protocol schemes do not vary per request.  The actual
  packfile URI is dynamic and varies per request

- separately from questions of preference or security policy,
  clients may have support for a limited subset of protocol schemes.
  For example, imagine a stripped-down client without SSH support.
  So we need a way to agree about this capability anyway.

So I suspect that, at least to start, protocol scheme negotiation
should be enough and we don't need full URI negotiation.

There are a few escape valves:

- affected clients can complain to the server operator, who will then
  reconfigure the server to use more appropriate packfile URIs

- if there is a need for different clients to use different packfile
  URIs, clients can pass a flag, using --server-option, to the server
  to help it choose.

- a client can disable support for packfile URIs on a particular
  request and fall back to inline packs.

- if and when an affected client materializes, they can help us
  improve the protocol to handle their needs.

Sensible?

Thanks,
Jonathan

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 5/8] Documentation: add Packfile URIs design doc
  2019-04-23 22:30             ` Ævar Arnfjörð Bjarmason
@ 2019-04-23 22:51               ` Jonathan Nieder
  0 siblings, 0 replies; 45+ messages in thread
From: Jonathan Nieder @ 2019-04-23 22:51 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Jonathan Tan, peff, git, gitster

Ævar Arnfjörð Bjarmason wrote:
> On Wed, Apr 24 2019, Jonathan Nieder wrote:

>> Do you mean this for when a pack is self-contained and contains all
>> objects reachable from those "tip" commits?
>>
>> What would you do when a pack is not self-contained in that way?
>
> Indeed, it had been a while since I read the first version of this. I
> was assuming a "base pack" use-case, but it seems it's narrowly isolated
> to just "N packs each containing one big blob", right?

The demo in this patch series covers the single isolated blob case.
The protocol supports the "base pack" use case but many others as
well:

* daily "catch-up fetch" packs
* "base pack without blobs"
* ... etc ...

Thanks,
Jonathan

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 5/8] Documentation: add Packfile URIs design doc
  2019-04-23  5:31     ` Jeff King
  2019-04-23 20:38       ` Jonathan Tan
  2019-04-23 22:11       ` Jonathan Nieder
@ 2019-04-24  3:01       ` Junio C Hamano
  2 siblings, 0 replies; 45+ messages in thread
From: Junio C Hamano @ 2019-04-24  3:01 UTC (permalink / raw)
  To: Jeff King; +Cc: Jonathan Tan, avarab, git

Jeff King <peff@peff.net> writes:

> I.e., I think the conversation ought to be more like:
>
>   Server: I support packfile-uris X, Y, Z.
>
>   Client: Great. I'll use URIs X and Z.
>
>   Server: OK, here's your pack, minus any objects I know are in X and Z.
>           I'll send you the objects from Y as normal.

I agree with the overall direction, but I am afraid that the server
advertisement may have to become a lot more complex.

For example, are you assuming in the above example that X, Y and Z
are not overlapping?

Or perhaps X is the whole repository that is quick to access only
from European clients, while Y and Z combined have the same set of
objects that are meant for our Asian friends.  IOW, the server may
expect that the client would say one of three "there is nothing I
can use then", "I am happy with X and won't use Y and Z" or "I'll
use Y and Z then, without X", and no other combinations like "I'll
use X and Z" may make sense.

> And then the client is free to pick and choose. The initial server uri
> list can come in the capabilities list, or it can be a separate request
> once the client sees the server supports packfile-uris and wants to ask
> about them. We may need some way for the server to group the uris so
> that the client knows which ones are alternates of each other (and which
> ones are needed to make a complete set).

I guess what I am saying is that it is not so clear how we can
present the server-offered URIs to the client in such a way that the
client (either mechanically, or by consulting the human) can make a
useful choice.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 5/8] Documentation: add Packfile URIs design doc
  2019-04-23 22:48           ` Jonathan Nieder
@ 2019-04-24  7:48             ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 45+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-04-24  7:48 UTC (permalink / raw)
  To: Jonathan Nieder; +Cc: Jeff King, Jonathan Tan, git, Junio C Hamano


On Wed, Apr 24 2019, Jonathan Nieder wrote:

> Ævar Arnfjörð Bjarmason wrote:
>> On Wed, Apr 24 2019, Jonathan Nieder wrote:
>>> Jeff King wrote:
>>>> On Fri, Mar 08, 2019 at 01:55:17PM -0800, Jonathan Tan wrote:
>
>>>>> +If the 'packfile-uris' feature is advertised, the following argument
>>>>> +can be included in the client's request as well as the potential
>>>>> +addition of the 'packfile-uris' section in the server's response as
>>>>> +explained below.
>>>>> +
>>>>> +    packfile-uris <comma-separated list of protocols>
>>>>> +	Indicates to the server that the client is willing to receive
>>>>> +	URIs of any of the given protocols in place of objects in the
>>>>> +	sent packfile. Before performing the connectivity check, the
>>>>> +	client should download from all given URIs. Currently, the
>>>>> +	protocols supported are "http" and "https".
>>>>
>>>> This negotiation seems backwards to me, because it puts too much power
>>>> in the hands of the server.
>>>
>>> Thanks.  Forgive me if this was covered earlier in the conversation, but
>>> why do we need more than one protocol at all here?  Can we restrict this
>>> to only-https, all the time?
>>
>> There was this in an earlier discussion about this:
>> https://public-inbox.org/git/877eds5fpl.fsf@evledraar.gmail.com/
>>
>> It seems arbitrary to break it for new features if we support http in
>> general, especially with a design as it is now where the checksum of the
>> pack is transmitted out-of-band.
>
> Thanks for the pointer.  TLS provides privacy, too, but I can see why
> in today's world it might not always be easy to set it up, and given
> that we have integrity protection via that checksum, I can see why
> some people might have a legitimate need for using plain "http" here.
>
> We may also want to support packfile-uris using SSH protocol in the
> future.  Might as well figure out how the protocol negotiation works
> now.  So let's delve more into it:
>
> Peff mentioned that it feels backwards for the client to specify what
> protocols they support in the request, instead of the server
> specifying them upfront in the capability advertisement.  I'm inclined
> to agree: it's probably reasonable to put this in server capabilities
> instead.  That would even allow the client to do something like
>
> 	This server only supports HTTP without TLS, which you have
> 	indicated is a condition in which you want to be prompted.
> 	Proceed?
>
> 	[Use HTTP packfiles]  [Use slower but safer inline packs]
>
> Peff also asked whether protocol scheme is the right granularity:
> should the server list what domains they can serve packfiles from
> instead?  In other words, once you're doing it for protocol schemes,
> why not do it for whole URIs too?  I'm grateful for the question since
> it's a way to probe at design assumptions.
>
> - protocol schemes are likely to be low in number because each has its
>   own code path to handle it.  By comparison, domains or URIs may be
>   too numerous to be something we want to jam into the capability
>   advertisement.  (Or the server operator could always use the same
>   domain as the Git repo, and then use a 302 to redirect to the CDN.
>   I suspect this is likely to be a common setup anyway: it allows the
>   Git server to generate a short-lived signed URL that it uses as the
>   target of a 302.  But in this case, what is the point of a domain
>   whitelist?)
>
> - relatedly, because the list of protocol schemes is small, it is
>   feasible to test client behavior with each subset of protocol
>   schemes enabled.  Finer-grained filtering would mean more esoteric
>   client configurations for server operators to support and debug.
>
> - supported protocol schemes do not vary per request.  The actual
>   packfile URI is dynamic and varies per request
>
> - separately from questions of preference or security policy,
>   clients may have support for a limited subset of protocol schemes.
>   For example, imagine a stripped-down client without SSH support.
>   So we need a way to agree about this capability anyway.
>
> So I suspect that, at least to start, protocol scheme negotiation
> should be enough and we don't need full URI negotiation.
>
> There are a few escape valves:
>
> - affected clients can complain to the server operator, who will then
>   reconfigure the server to use more appropriate packfile URIs
>
> - if there is a need for different clients to use different packfile
>   URIs, clients can pass a flag, using --server-option, to the server
>   to help it choose.
>
> - a client can disable support for packfile URIs on a particular
>   request and fall back to inline packs.
>
> - if and when an affected client materializes, they can help us
>   improve the protocol to handle their needs.
>
> Sensible?

Food for thought: would we consider ssh->https a "downgrade"? I think
"maybe". We're going from whatever custom setting the user has
(e.g. manually approve new hosts) to the CA system.

But I think it would be fine to just whitelist only ssh->https and ban
everything else behind a very scary config option or something. We could
always flesh out the semantics of upgrade/downgrade/switching later,
and it would IMO suck less than outright banning a protocol we otherwise
support in the design, and which (unlike git://) is something people are
still finding uses for in the wild for non-legacy reasons.
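
Something like this is all I mean (hypothetical sketch; neither the
function nor the opt-in flag exists anywhere, and the exact policy is
obviously up for debate):

#include "cache.h"

/*
 * Hypothetical: treat https:// packfile URIs as always acceptable
 * (including when offered over an ssh:// transport), and require an
 * explicit, loudly-named opt-in for any other scheme.
 */
static void check_packfile_uri_scheme(const char *remote_url,
				      const char *packfile_uri,
				      int allow_insecure_scheme_change)
{
	if (starts_with(packfile_uri, "https://"))
		return;
	if (!allow_insecure_scheme_change)
		die(_("refusing packfile URI '%s' offered by '%s'"),
		    packfile_uri, remote_url);
}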

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v2 0/8] CDN offloading of fetch response
  2019-03-08 21:55 ` [PATCH v2 0/8] " Jonathan Tan
                     ` (9 preceding siblings ...)
  2019-04-23  5:21   ` Jeff King
@ 2019-04-24  9:09   ` Ævar Arnfjörð Bjarmason
  10 siblings, 0 replies; 45+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-04-24  9:09 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git


On Fri, Mar 08 2019, Jonathan Tan wrote:

> One relatively significant change: someone pointed out that the issue fixed by
> 50d3413740 ("http: make redirects more obvious", 2016-12-06) could also
> occur here, so I have changed it so that the server is required to send
> the packfile's hash along with the URI.
>
> This does mean that Ævar's workflow described in
> https://public-inbox.org/git/87k1hv6eel.fsf@evledraar.gmail.com/ would
> not work.  Quoting Ævar:
>
>> More concretely, I'd like to have a setup where a server can just dumbly
>> point to some URL that probably has most of the data, without having any
>> idea what OIDs are in it. So that e.g. some machine entirely
>> disconnected from the server (and with just a regular clone) can
>> continually generate an up-to-date-enough packfile.
>
> With 50d3413740, it seems to me that it's important for the server to
> know details about the URIs that it points to, so such a disconnection
> would not work.

For what it's worth I'm fine with that, and as is obvious in some of the
previous threads I hadn't thought about that threat model. This "I know
XYZ has a pack that should have most of it, check what's in it and get
back to me with HAVEs" was a nice-to-have, but not a must.

And you can still get most of the way there with this proposal and the
techniques described in the follow-up to
https://public-inbox.org/git/87bmhiykvw.fsf@evledraar.gmail.com/ if I'm
reading this series correctly.

I.e. the server would know the CDNs are going to be generating a pack
from the "master" history with the tip being the Nth commit, and since
both agree on the pack algorithm they can generate the same OID/pack
hash pair without further coordination; the server would then
optimistically advertise that pair. This would give you the sort of
"lazy" CDN setup I was describing, with slightly more work than just
"here's some random latest pack".
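
Roughly, and assuming the server and the CDN box run the same git
version with the same pack configuration so that pack-objects is
deterministic (the tag and the "cdn-base" name below are made up):

    # Run independently on the server and on the CDN box:
    echo v2.21.0 |
    git pack-objects --revs --threads=1 --delta-base-offset cdn-base

    # Both sides end up with bit-identical cdn-base-<hash>.pack files,
    # so whatever hash of that pack the protocol ends up requiring
    # matches on both ends without any coordination between them.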

But isn't it the case that if a malicious server ever got hold of a
pack SHA-1 it knows to be on a private CDN *and* a commit that's in it,
we'd have the same problem? Shouldn't this security model be prominently
documented in Documentation/technical/packfile-uri.txt? I.e. "THOU MUST
NEVER LEAK A COMBINATION OF THE PACK SHA-1 OF A PRIVATE CDN AND A
PRIVATE COMMIT SHA-1!".

It seems likely that once we have git CDN support the first thing CDNs
are going to do is to serve up such packs via a URL that includes the
pack SHA-1. Once the combination of that & a commit in it leaks we're
back to square one, no? *My* default approach in setting up such
infrastructure without knowing about 50d3413740 would be not to care
about the privacy of the SHA-1s, even in the case of private data.

Isn't there also overlap here with non-CDN shallow/narrow fetching,
where a server can detect such a client, rely on it to get the objects
from elsewhere (e.g. by adding an unrelated remote), and then get a
push that gives it secret data?

I don't see any fool-proof way out of it, but maybe a mode (sketched
below) where we:

 1. Get the CDN data URLs (which for the purposes of this example I'm
    assuming are "base packs" containing at least some commits...)

 2. Do a follow-up request to the server where we e.g. pretend not to
    have the last N commits for the tips it advertises and ask it to
    serve those up inline, not from a CDN. If it can't, we know it's
    lying and trying an attack similar to the one described in
    50d3413740.

    Conversely, if it can, we can't know that it isn't lying anyway,
    but at that point it seems unlikely, since surely the useful thing
    is a delta against the recent-but-server-lied-about-having-it
    commit, at least in most cases....
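
Very roughly, the shape of step 2 -- no current option does this, the
URL and ref are made up, and a real implementation would live inside
fetch-pack rather than approximate "pretend not to have the last N
commits" with a shallow fetch from an empty scratch repo:

    # After indexing the CDN pack(s), probe the server from a scratch
    # repo that has no objects at all, so the most recent history has
    # to come back inline:
    git init --quiet /tmp/probe &&
    git -C /tmp/probe fetch --depth=3 \
        https://git.example.org/repo.git refs/heads/master
    # A server that can only answer by handing out CDN URIs again, or
    # that can't answer at all, would be treated as suspect, per the
    # attack described in 50d3413740.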

This would also have the advantage of addressing some of the
reliability concerns brought up elsewhere. I.e. once we're doing a
post-fetch sanity check anyway, a failure to download one of, say,
three CDN packs becomes recoverable, although the design outlined here
(understandably) would prefer that recovery to be done in another
"fetch" dialog so the server can keep the "CDN + my data should be
100% of the data" state.

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2019-04-24  9:09 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-02-23 23:38 [WIP 0/7] CDN offloading of fetch response Jonathan Tan
2019-02-23 23:38 ` [WIP 1/7] http: use --stdin and --keep when downloading pack Jonathan Tan
2019-02-23 23:38 ` [WIP 2/7] http: improve documentation of http_pack_request Jonathan Tan
2019-02-23 23:38 ` [WIP 3/7] http-fetch: support fetching packfiles by URL Jonathan Tan
2019-02-23 23:38 ` [WIP 4/7] Documentation: order protocol v2 sections Jonathan Tan
2019-02-23 23:38 ` [WIP 5/7] Documentation: add Packfile URIs design doc Jonathan Tan
2019-02-23 23:39 ` [WIP 6/7] upload-pack: refactor reading of pack-objects out Jonathan Tan
2019-02-23 23:39 ` [WIP 7/7] upload-pack: send part of packfile response as uri Jonathan Tan
2019-02-24 15:54   ` Junio C Hamano
2019-02-25 21:04   ` Christian Couder
2019-02-26  1:53     ` Jonathan Nieder
2019-02-26  7:08       ` Christian Couder
2019-03-01  0:09   ` Josh Steadmon
2019-03-01  0:17     ` Jonathan Tan
2019-02-25 21:30 ` [WIP 0/7] CDN offloading of fetch response Christian Couder
2019-02-25 23:45   ` Jonathan Nieder
2019-02-26  8:30     ` Christian Couder
2019-02-26  9:12       ` Ævar Arnfjörð Bjarmason
2019-03-04  8:24         ` Christian Couder
2019-02-28 23:21       ` Jonathan Nieder
2019-03-04  8:54         ` Christian Couder
2019-03-08 21:55 ` [PATCH v2 0/8] " Jonathan Tan
2019-03-08 21:55   ` [PATCH v2 1/8] http: use --stdin when getting dumb HTTP pack Jonathan Tan
2019-03-08 21:55   ` [PATCH v2 2/8] http: improve documentation of http_pack_request Jonathan Tan
2019-03-08 21:55   ` [PATCH v2 3/8] http-fetch: support fetching packfiles by URL Jonathan Tan
2019-03-08 21:55   ` [PATCH v2 4/8] Documentation: order protocol v2 sections Jonathan Tan
2019-03-08 21:55   ` [PATCH v2 5/8] Documentation: add Packfile URIs design doc Jonathan Tan
2019-04-23  5:31     ` Jeff King
2019-04-23 20:38       ` Jonathan Tan
2019-04-23 22:18         ` Ævar Arnfjörð Bjarmason
2019-04-23 22:22           ` Jonathan Nieder
2019-04-23 22:30             ` Ævar Arnfjörð Bjarmason
2019-04-23 22:51               ` Jonathan Nieder
2019-04-23 22:11       ` Jonathan Nieder
2019-04-23 22:25         ` Ævar Arnfjörð Bjarmason
2019-04-23 22:48           ` Jonathan Nieder
2019-04-24  7:48             ` Ævar Arnfjörð Bjarmason
2019-04-24  3:01       ` Junio C Hamano
2019-03-08 21:55   ` [PATCH v2 6/8] upload-pack: refactor reading of pack-objects out Jonathan Tan
2019-03-08 21:55   ` [PATCH v2 7/8] fetch-pack: support more than one pack lockfile Jonathan Tan
2019-03-08 21:55   ` [PATCH v2 8/8] upload-pack: send part of packfile response as uri Jonathan Tan
2019-03-19 20:48   ` [PATCH v2 0/8] CDN offloading of fetch response Josh Steadmon
2019-04-23  5:21   ` Jeff King
2019-04-23 19:23     ` Jonathan Tan
2019-04-24  9:09   ` Ævar Arnfjörð Bjarmason

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git
