[WIP RFC 0/5] Design for offloading part of packfile response to CDN

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* [WIP RFC 0/5] Design for offloading part of packfile response to CDN
@ 2018-12-03 23:37 Jonathan Tan
  2018-12-03 23:37 ` [WIP RFC 1/5] Documentation: order protocol v2 sections Jonathan Tan
                   ` (5 more replies)
  0 siblings, 6 replies; 30+ messages in thread
From: Jonathan Tan @ 2018-12-03 23:37 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan

Some of us have been working on a design to improve the scalability of
Git servers by allowing them to offload part of the packfile response to
CDNs in this way: returning HTTP(S) URIs in fetch responses in addition
to packfiles.

This can reduce the load on individual Git servers and improves
proximity (by having data served from closer to the user).

I have included here a design document (patch 2) and a rough
implementation of the server (patch 5). Currently, the implementation
only allows replacing single blobs with URIs, but the protocol
improvement is designed in such a way as to allow independent
improvement of Git server implementations.

There is a potential issue: a server which produces both the URIs and
the packfile at roughly the same time (like the implementation in this
patch set) will not have sideband access until it has concluded sending
the URIs. Among other things, this means that the server cannot send
keepalive packets until quite late in the response. One solution to this
might be to add a feature that allows the server to use a sideband
throughout the whole response - and this has other benefits too like
allowing servers to inform the client throughout the whole fetch, not
just at the end.

Jonathan Tan (5):
  Documentation: order protocol v2 sections
  Documentation: add Packfile URIs design doc
  upload-pack: refactor reading of pack-objects out
  upload-pack: refactor writing of "packfile" line
  upload-pack: send part of packfile response as uri

 Documentation/technical/packfile-uri.txt |  83 +++++++++++++
 Documentation/technical/protocol-v2.txt  |  22 ++--
 builtin/pack-objects.c                   |  48 ++++++++
 fetch-pack.c                             |   9 ++
 t/t5702-protocol-v2.sh                   |  25 ++++
 upload-pack.c                            | 150 ++++++++++++++++-------
 6 files changed, 285 insertions(+), 52 deletions(-)
 create mode 100644 Documentation/technical/packfile-uri.txt

-- 
2.19.0.271.gfe8321ec05.dirty

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [WIP RFC 1/5] Documentation: order protocol v2 sections
  2018-12-03 23:37 [WIP RFC 0/5] Design for offloading part of packfile response to CDN Jonathan Tan
@ 2018-12-03 23:37 ` Jonathan Tan
  2018-12-05  4:10   ` Junio C Hamano
  2018-12-03 23:37 ` [WIP RFC 2/5] Documentation: add Packfile URIs design doc Jonathan Tan
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 30+ messages in thread
From: Jonathan Tan @ 2018-12-03 23:37 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan

The git command line expects Git servers to follow a specific order of
sections when transmitting protocol v2 responses, but this is not
explicit in the documentation. Make the order explicit.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 Documentation/technical/protocol-v2.txt | 18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/Documentation/technical/protocol-v2.txt b/Documentation/technical/protocol-v2.txt
index 09e4e0273f..345c00e08c 100644
--- a/Documentation/technical/protocol-v2.txt
+++ b/Documentation/technical/protocol-v2.txt
@@ -309,11 +309,11 @@ the 'wanted-refs' section in the server's response as explained below.
 
 The response of `fetch` is broken into a number of sections separated by
 delimiter packets (0001), with each section beginning with its section
-header.
+header. Most sections are sent only when the packfile is sent.
 
-    output = *section
-    section = (acknowledgments | shallow-info | wanted-refs | packfile)
-	      (flush-pkt | delim-pkt)
+    output = acknowledgements flush-pkt |
+	     [acknowledgments delim-pkt] [shallow-info delim-pkt]
+	     [wanted-refs delim-pkt] packfile flush-pkt
 
     acknowledgments = PKT-LINE("acknowledgments" LF)
 		      (nak | *ack)
@@ -335,9 +335,10 @@ header.
 	       *PKT-LINE(%x01-03 *%x00-ff)
 
     acknowledgments section
-	* If the client determines that it is finished with negotiations
-	  by sending a "done" line, the acknowledgments sections MUST be
-	  omitted from the server's response.
+	* If the client determines that it is finished with negotiations by
+	  sending a "done" line (thus requiring the server to send a packfile),
+	  the acknowledgments sections MUST be omitted from the server's
+	  response.
 
 	* Always begins with the section header "acknowledgments"
 
@@ -388,9 +389,6 @@ header.
 	  which the client has not indicated was shallow as a part of
 	  its request.
 
-	* This section is only included if a packfile section is also
-	  included in the response.
-
     wanted-refs section
 	* This section is only included if the client has requested a
 	  ref using a 'want-ref' line and if a packfile section is also
-- 
2.19.0.271.gfe8321ec05.dirty


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [WIP RFC 2/5] Documentation: add Packfile URIs design doc
  2018-12-03 23:37 [WIP RFC 0/5] Design for offloading part of packfile response to CDN Jonathan Tan
  2018-12-03 23:37 ` [WIP RFC 1/5] Documentation: order protocol v2 sections Jonathan Tan
@ 2018-12-03 23:37 ` Jonathan Tan
  2018-12-04  0:21   ` Stefan Beller
                     ` (3 more replies)
  2018-12-03 23:37 ` [WIP RFC 3/5] upload-pack: refactor reading of pack-objects out Jonathan Tan
                   ` (3 subsequent siblings)
  5 siblings, 4 replies; 30+ messages in thread
From: Jonathan Tan @ 2018-12-03 23:37 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 Documentation/technical/packfile-uri.txt | 83 ++++++++++++++++++++++++
 Documentation/technical/protocol-v2.txt  |  6 +-
 2 files changed, 88 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/technical/packfile-uri.txt

diff --git a/Documentation/technical/packfile-uri.txt b/Documentation/technical/packfile-uri.txt
new file mode 100644
index 0000000000..6535801486
--- /dev/null
+++ b/Documentation/technical/packfile-uri.txt
@@ -0,0 +1,83 @@
+Packfile URIs
+=============
+
+This feature allows servers to serve part of their packfile response as URIs.
+This allows server designs that improve scalability in bandwidth and CPU usage
+(for example, by serving some data through a CDN), and (in the future) provides
+some measure of resumability to clients.
+
+This feature is available only in protocol version 2.
+
+Protocol
+--------
+
+The server advertises `packfile-uris`.
+
+If the client replies with the following arguments:
+
+ * packfile-uris
+ * thin-pack
+ * ofs-delta
+
+when the server sends the packfile, it MAY send a `packfile-uris` section
+directly before the `packfile` section (right after `wanted-refs` if it is
+sent) containing HTTP(S) URIs. See protocol-v2.txt for the documentation of
+this section.
+
+Clients then should understand that the returned packfile could be incomplete,
+and that it needs to download all the given URIs before the fetch or clone is
+complete. Each URI should point to a Git packfile (which may be a thin pack and
+which may contain offset deltas).
+
+Server design
+-------------
+
+The server can be trivially made compatible with the proposed protocol by
+having it advertise `packfile-uris`, tolerating the client sending
+`packfile-uris`, and never sending any `packfile-uris` section. But we should
+include some sort of non-trivial implementation in the Minimum Viable Product,
+at least so that we can test the client.
+
+This is the implementation: a feature, marked experimental, that allows the
+server to be configured by one or more `uploadpack.blobPackfileUri=<sha1>
+<uri>` entries. Whenever the list of objects to be sent is assembled, a blob
+with the given sha1 can be replaced by the given URI. This allows, for example,
+servers to delegate serving of large blobs to CDNs.
+
+Client design
+-------------
+
+While fetching, the client needs to remember the list of URIs and cannot
+declare that the fetch is complete until all URIs have been downloaded as
+packfiles.
+
+The division of work (initial fetch + additional URIs) introduces convenient
+points for resumption of an interrupted clone - such resumption can be done
+after the Minimum Viable Product (see "Future work").
+
+The client can inhibit this feature (i.e. refrain from sending the
+`packfile-urls` parameter) by passing --no-packfile-urls to `git fetch`.
+
+Future work
+-----------
+
+The protocol design allows some evolution of the server and client without any
+need for protocol changes, so only a small-scoped design is included here to
+form the MVP. For example, the following can be done:
+
+ * On the server, a long-running process that takes in entire requests and
+   outputs a list of URIs and the corresponding inclusion and exclusion sets of
+   objects. This allows, e.g., signed URIs to be used and packfiles for common
+   requests to be cached.
+ * On the client, resumption of clone. If a clone is interrupted, information
+   could be recorded in the repository's config and a "clone-resume" command
+   can resume the clone in progress. (Resumption of subsequent fetches is more
+   difficult because that must deal with the user wanting to use the repository
+   even after the fetch was interrupted.)
+
+There are some possible features that will require a change in protocol:
+
+ * Additional HTTP headers (e.g. authentication)
+ * Byte range support
+ * Different file formats referenced by URIs (e.g. raw object)
+
diff --git a/Documentation/technical/protocol-v2.txt b/Documentation/technical/protocol-v2.txt
index 345c00e08c..2cb1c41742 100644
--- a/Documentation/technical/protocol-v2.txt
+++ b/Documentation/technical/protocol-v2.txt
@@ -313,7 +313,8 @@ header. Most sections are sent only when the packfile is sent.
 
     output = acknowledgements flush-pkt |
 	     [acknowledgments delim-pkt] [shallow-info delim-pkt]
-	     [wanted-refs delim-pkt] packfile flush-pkt
+	     [wanted-refs delim-pkt] [packfile-uris delim-pkt]
+	     packfile flush-pkt
 
     acknowledgments = PKT-LINE("acknowledgments" LF)
 		      (nak | *ack)
@@ -331,6 +332,9 @@ header. Most sections are sent only when the packfile is sent.
 		  *PKT-LINE(wanted-ref LF)
     wanted-ref = obj-id SP refname
 
+    packfile-uris = PKT-LINE("packfile-uris" LF) *packfile-uri
+    packfile-uri = PKT-LINE("uri" SP *%x20-ff LF)
+
     packfile = PKT-LINE("packfile" LF)
 	       *PKT-LINE(%x01-03 *%x00-ff)
 
-- 
2.19.0.271.gfe8321ec05.dirty


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [WIP RFC 3/5] upload-pack: refactor reading of pack-objects out
  2018-12-03 23:37 [WIP RFC 0/5] Design for offloading part of packfile response to CDN Jonathan Tan
  2018-12-03 23:37 ` [WIP RFC 1/5] Documentation: order protocol v2 sections Jonathan Tan
  2018-12-03 23:37 ` [WIP RFC 2/5] Documentation: add Packfile URIs design doc Jonathan Tan
@ 2018-12-03 23:37 ` Jonathan Tan
  2018-12-04  0:30   ` Stefan Beller
  2018-12-05  6:30   ` Junio C Hamano
  2018-12-03 23:37 ` [WIP RFC 4/5] upload-pack: refactor writing of "packfile" line Jonathan Tan
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 30+ messages in thread
From: Jonathan Tan @ 2018-12-03 23:37 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan

Subsequent patches will change how the output of pack-objects is
processed, so extract that processing into its own function.

Currently, at most 1 character can be buffered (in the "buffered" local
variable). One of those patches will require a larger buffer, so replace
that "buffered" local variable with a buffer array.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 upload-pack.c | 80 +++++++++++++++++++++++++++++----------------------
 1 file changed, 46 insertions(+), 34 deletions(-)

diff --git a/upload-pack.c b/upload-pack.c
index 5e81f1ff24..cec43e0a46 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -101,14 +101,51 @@ static int write_one_shallow(const struct commit_graft *graft, void *cb_data)
 	return 0;
 }
 
+struct output_state {
+	char buffer[8193];
+	int used;
+};
+
+static int read_pack_objects_stdout(int outfd, struct output_state *os)
+{
+	/* Data ready; we keep the last byte to ourselves
+	 * in case we detect broken rev-list, so that we
+	 * can leave the stream corrupted.  This is
+	 * unfortunate -- unpack-objects would happily
+	 * accept a valid packdata with trailing garbage,
+	 * so appending garbage after we pass all the
+	 * pack data is not good enough to signal
+	 * breakage to downstream.
+	 */
+	ssize_t readsz;
+
+	readsz = xread(outfd, os->buffer + os->used,
+		       sizeof(os->buffer) - os->used);
+	if (readsz < 0) {
+		return readsz;
+	}
+	os->used += readsz;
+
+	if (os->used > 1) {
+		send_client_data(1, os->buffer, os->used - 1);
+		os->buffer[0] = os->buffer[os->used - 1];
+		os->used = 1;
+	} else {
+		send_client_data(1, os->buffer, os->used);
+		os->used = 0;
+	}
+
+	return readsz;
+}
+
 static void create_pack_file(const struct object_array *have_obj,
 			     const struct object_array *want_obj)
 {
 	struct child_process pack_objects = CHILD_PROCESS_INIT;
-	char data[8193], progress[128];
+	struct output_state output_state = {0};
+	char progress[128];
 	char abort_msg[] = "aborting due to possible repository "
 		"corruption on the remote side.";
-	int buffered = -1;
 	ssize_t sz;
 	int i;
 	FILE *pipe_fd;
@@ -235,39 +272,15 @@ static void create_pack_file(const struct object_array *have_obj,
 			continue;
 		}
 		if (0 <= pu && (pfd[pu].revents & (POLLIN|POLLHUP))) {
-			/* Data ready; we keep the last byte to ourselves
-			 * in case we detect broken rev-list, so that we
-			 * can leave the stream corrupted.  This is
-			 * unfortunate -- unpack-objects would happily
-			 * accept a valid packdata with trailing garbage,
-			 * so appending garbage after we pass all the
-			 * pack data is not good enough to signal
-			 * breakage to downstream.
-			 */
-			char *cp = data;
-			ssize_t outsz = 0;
-			if (0 <= buffered) {
-				*cp++ = buffered;
-				outsz++;
-			}
-			sz = xread(pack_objects.out, cp,
-				  sizeof(data) - outsz);
-			if (0 < sz)
-				;
-			else if (sz == 0) {
+			int result = read_pack_objects_stdout(pack_objects.out,
+							      &output_state);
+
+			if (result == 0) {
 				close(pack_objects.out);
 				pack_objects.out = -1;
-			}
-			else
+			} else if (result < 0) {
 				goto fail;
-			sz += outsz;
-			if (1 < sz) {
-				buffered = data[sz-1] & 0xFF;
-				sz--;
 			}
-			else
-				buffered = -1;
-			send_client_data(1, data, sz);
 		}
 
 		/*
@@ -292,9 +305,8 @@ static void create_pack_file(const struct object_array *have_obj,
 	}
 
 	/* flush the data */
-	if (0 <= buffered) {
-		data[0] = buffered;
-		send_client_data(1, data, 1);
+	if (output_state.used > 0) {
+		send_client_data(1, output_state.buffer, output_state.used);
 		fprintf(stderr, "flushed.\n");
 	}
 	if (use_sideband)
-- 
2.19.0.271.gfe8321ec05.dirty


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [WIP RFC 4/5] upload-pack: refactor writing of "packfile" line
  2018-12-03 23:37 [WIP RFC 0/5] Design for offloading part of packfile response to CDN Jonathan Tan
                   ` (2 preceding siblings ...)
  2018-12-03 23:37 ` [WIP RFC 3/5] upload-pack: refactor reading of pack-objects out Jonathan Tan
@ 2018-12-03 23:37 ` Jonathan Tan
  2018-12-06  6:35   ` Junio C Hamano
  2018-12-03 23:37 ` [WIP RFC 5/5] upload-pack: send part of packfile response as uri Jonathan Tan
  2018-12-04  0:01 ` [WIP RFC 0/5] Design for offloading part of packfile response to CDN Stefan Beller
  5 siblings, 1 reply; 30+ messages in thread
From: Jonathan Tan @ 2018-12-03 23:37 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan

A subsequent patch allows pack-objects to output additional information
(in addition to the packfile that it currently outputs). This means that
we must hold off on writing the "packfile" section header to the client
before we process the output of pack-objects, so move the writing of
the "packfile" section header to read_pack_objects_stdout().

Unfortunately, this also means that we cannot send keepalive packets
until pack-objects starts sending out the packfile, since the sideband
is only established when the "packfile" section header is sent.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 upload-pack.c | 47 ++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 36 insertions(+), 11 deletions(-)

diff --git a/upload-pack.c b/upload-pack.c
index cec43e0a46..aa2589b858 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -104,9 +104,12 @@ static int write_one_shallow(const struct commit_graft *graft, void *cb_data)
 struct output_state {
 	char buffer[8193];
 	int used;
+	unsigned packfile_started : 1;
+	struct strbuf progress_buf;
 };
 
-static int read_pack_objects_stdout(int outfd, struct output_state *os)
+static int read_pack_objects_stdout(int outfd, struct output_state *os,
+				    int use_protocol_v2)
 {
 	/* Data ready; we keep the last byte to ourselves
 	 * in case we detect broken rev-list, so that we
@@ -126,6 +129,12 @@ static int read_pack_objects_stdout(int outfd, struct output_state *os)
 	}
 	os->used += readsz;
 
+	if (!os->packfile_started) {
+		os->packfile_started = 1;
+		if (use_protocol_v2)
+			packet_write_fmt(1, "packfile\n");
+	}
+
 	if (os->used > 1) {
 		send_client_data(1, os->buffer, os->used - 1);
 		os->buffer[0] = os->buffer[os->used - 1];
@@ -138,8 +147,17 @@ static int read_pack_objects_stdout(int outfd, struct output_state *os)
 	return readsz;
 }
 
+static void flush_progress_buf(struct strbuf *progress_buf)
+{
+	if (!progress_buf->len)
+		return;
+	send_client_data(2, progress_buf->buf, progress_buf->len);
+	strbuf_reset(progress_buf);
+}
+
 static void create_pack_file(const struct object_array *have_obj,
-			     const struct object_array *want_obj)
+			     const struct object_array *want_obj,
+			     int use_protocol_v2)
 {
 	struct child_process pack_objects = CHILD_PROCESS_INIT;
 	struct output_state output_state = {0};
@@ -260,9 +278,13 @@ static void create_pack_file(const struct object_array *have_obj,
 			 */
 			sz = xread(pack_objects.err, progress,
 				  sizeof(progress));
-			if (0 < sz)
-				send_client_data(2, progress, sz);
-			else if (sz == 0) {
+			if (0 < sz) {
+				if (output_state.packfile_started)
+					send_client_data(2, progress, sz);
+				else
+					strbuf_add(&output_state.progress_buf,
+						   progress, sz);
+			} else if (sz == 0) {
 				close(pack_objects.err);
 				pack_objects.err = -1;
 			}
@@ -273,8 +295,10 @@ static void create_pack_file(const struct object_array *have_obj,
 		}
 		if (0 <= pu && (pfd[pu].revents & (POLLIN|POLLHUP))) {
 			int result = read_pack_objects_stdout(pack_objects.out,
-							      &output_state);
-
+							      &output_state,
+							      use_protocol_v2);
+			if (output_state.packfile_started)
+				flush_progress_buf(&output_state.progress_buf);
 			if (result == 0) {
 				close(pack_objects.out);
 				pack_objects.out = -1;
@@ -293,12 +317,14 @@ static void create_pack_file(const struct object_array *have_obj,
 		 * protocol to say anything, so those clients are just out of
 		 * luck.
 		 */
-		if (!ret && use_sideband) {
+		if (!ret && use_sideband && output_state.packfile_started) {
 			static const char buf[] = "0005\1";
 			write_or_die(1, buf, 5);
 		}
 	}
 
+	flush_progress_buf(&output_state.progress_buf);
+
 	if (finish_command(&pack_objects)) {
 		error("git upload-pack: git-pack-objects died with error.");
 		goto fail;
@@ -1094,7 +1120,7 @@ void upload_pack(struct upload_pack_options *options)
 	if (want_obj.nr) {
 		struct object_array have_obj = OBJECT_ARRAY_INIT;
 		get_common_commits(&have_obj, &want_obj);
-		create_pack_file(&have_obj, &want_obj);
+		create_pack_file(&have_obj, &want_obj, 0);
 	}
 }
 
@@ -1475,8 +1501,7 @@ int upload_pack_v2(struct repository *r, struct argv_array *keys,
 			send_wanted_ref_info(&data);
 			send_shallow_info(&data, &want_obj);
 
-			packet_write_fmt(1, "packfile\n");
-			create_pack_file(&have_obj, &want_obj);
+			create_pack_file(&have_obj, &want_obj, 1);
 			state = FETCH_DONE;
 			break;
 		case FETCH_DONE:
-- 
2.19.0.271.gfe8321ec05.dirty


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [WIP RFC 5/5] upload-pack: send part of packfile response as uri
  2018-12-03 23:37 [WIP RFC 0/5] Design for offloading part of packfile response to CDN Jonathan Tan
                   ` (3 preceding siblings ...)
  2018-12-03 23:37 ` [WIP RFC 4/5] upload-pack: refactor writing of "packfile" line Jonathan Tan
@ 2018-12-03 23:37 ` Jonathan Tan
  2018-12-04 20:09   ` Stefan Beller
  2018-12-04  0:01 ` [WIP RFC 0/5] Design for offloading part of packfile response to CDN Stefan Beller
  5 siblings, 1 reply; 30+ messages in thread
From: Jonathan Tan @ 2018-12-03 23:37 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan

This is a partial implementation of upload-pack sending part of its
packfile response as URIs.

The client is not fully implemented - it knows to ignore the
"packfile-uris" section, but because it does not actually fetch those
URIs, the returned packfile is incomplete. A test is included to show
that the appropriate URI is indeed transmitted, and that the returned
packfile is lacking exactly the expected object.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 builtin/pack-objects.c | 48 ++++++++++++++++++++++++++++++++++++++++++
 fetch-pack.c           |  9 ++++++++
 t/t5702-protocol-v2.sh | 25 ++++++++++++++++++++++
 upload-pack.c          | 37 ++++++++++++++++++++++++++++----
 4 files changed, 115 insertions(+), 4 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index e7ea206c08..2abbddd3cb 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -117,6 +117,15 @@ enum missing_action {
 static enum missing_action arg_missing_action;
 static show_object_fn fn_show_object;
 
+struct configured_exclusion {
+	struct oidmap_entry e;
+	char *uri;
+};
+static struct oidmap configured_exclusions;
+
+static int exclude_configured_blobs;
+static struct oidset excluded_by_config;
+
 /*
  * stats
  */
@@ -831,6 +840,23 @@ static off_t write_reused_pack(struct hashfile *f)
 	return reuse_packfile_offset - sizeof(struct pack_header);
 }
 
+static void write_excluded_by_configs(void)
+{
+	struct oidset_iter iter;
+	const struct object_id *oid;
+
+	oidset_iter_init(&excluded_by_config, &iter);
+	while ((oid = oidset_iter_next(&iter))) {
+		struct configured_exclusion *ex =
+			oidmap_get(&configured_exclusions, oid);
+
+		if (!ex)
+			BUG("configured exclusion wasn't configured");
+		write_in_full(1, ex->uri, strlen(ex->uri));
+		write_in_full(1, "\n", 1);
+	}
+}
+
 static const char no_split_warning[] = N_(
 "disabling bitmap writing, packs are split due to pack.packSizeLimit"
 );
@@ -1124,6 +1150,12 @@ static int want_object_in_pack(const struct object_id *oid,
 		}
 	}
 
+	if (exclude_configured_blobs &&
+	    oidmap_get(&configured_exclusions, oid)) {
+		oidset_insert(&excluded_by_config, oid);
+		return 0;
+	}
+
 	return 1;
 }
 
@@ -2728,6 +2760,19 @@ static int git_pack_config(const char *k, const char *v, void *cb)
 			    pack_idx_opts.version);
 		return 0;
 	}
+	if (!strcmp(k, "uploadpack.blobpackfileuri")) {
+		struct configured_exclusion *ex = xmalloc(sizeof(*ex));
+		const char *end;
+
+		if (parse_oid_hex(v, &ex->e.oid, &end) || *end != ' ')
+			die(_("value of uploadpack.blobpackfileuri must be "
+			      "of the form '<sha-1> <uri>' (got '%s')"), v);
+		if (oidmap_get(&configured_exclusions, &ex->e.oid))
+			die(_("object already configured in another "
+			      "uploadpack.blobpackfileuri (got '%s')"), v);
+		ex->uri = xstrdup(end + 1);
+		oidmap_put(&configured_exclusions, ex);
+	}
 	return git_default_config(k, v, cb);
 }
 
@@ -3314,6 +3359,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 			 N_("do not pack objects in promisor packfiles")),
 		OPT_BOOL(0, "delta-islands", &use_delta_islands,
 			 N_("respect islands during delta compression")),
+		OPT_BOOL(0, "exclude-configured-blobs", &exclude_configured_blobs,
+			 N_("respect uploadpack.blobpackfileuri")),
 		OPT_END(),
 	};
 
@@ -3487,6 +3534,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		return 0;
 	if (nr_result)
 		prepare_pack(window, depth);
+	write_excluded_by_configs();
 	write_pack_file();
 	if (progress)
 		fprintf_ln(stderr,
diff --git a/fetch-pack.c b/fetch-pack.c
index 9691046e64..6e1985ab55 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -1413,6 +1413,15 @@ static struct ref *do_fetch_pack_v2(struct fetch_pack_args *args,
 				receive_wanted_refs(&reader, sought, nr_sought);
 
 			/* get the pack */
+			if (process_section_header(&reader, "packfile-uris", 1)) {
+				/* skip the whole section */
+				process_section_header(&reader, "packfile-uris", 0);
+				while (packet_reader_read(&reader) == PACKET_READ_NORMAL) {
+					/* do nothing */
+				}
+				if (reader.status != PACKET_READ_DELIM)
+					die("expected DELIM");
+			}
 			process_section_header(&reader, "packfile", 0);
 			if (get_pack(args, fd, pack_lockfile))
 				die(_("git fetch-pack: fetch failed."));
diff --git a/t/t5702-protocol-v2.sh b/t/t5702-protocol-v2.sh
index 0f2b09ebb8..ccb1fc510e 100755
--- a/t/t5702-protocol-v2.sh
+++ b/t/t5702-protocol-v2.sh
@@ -588,6 +588,31 @@ test_expect_success 'when server does not send "ready", expect FLUSH' '
 	test_i18ngrep "expected no other sections to be sent after no .ready." err
 '
 
+test_expect_success 'part of packfile response provided as URI' '
+	rm -rf "$HTTPD_DOCUMENT_ROOT_PATH/http_parent" http_child log &&
+
+	git init "$HTTPD_DOCUMENT_ROOT_PATH/http_parent" &&
+	echo my-blob >"$HTTPD_DOCUMENT_ROOT_PATH/http_parent/my-blob" &&
+	git -C "$HTTPD_DOCUMENT_ROOT_PATH/http_parent/" add my-blob &&
+	git -C "$HTTPD_DOCUMENT_ROOT_PATH/http_parent/" commit -m x &&
+
+	git -C "$HTTPD_DOCUMENT_ROOT_PATH/http_parent/" hash-object my-blob >h &&
+	git -C "$HTTPD_DOCUMENT_ROOT_PATH/http_parent/" config \
+		"uploadpack.blobpackfileuri" \
+		"$(cat h) https://example.com/a-uri" &&
+
+	# NEEDSWORK: "git clone" fails here because it ignores the URI provided
+	# instead of fetching it.
+	test_must_fail env GIT_TRACE_PACKET="$(pwd)/log" \
+		git -c protocol.version=2 clone \
+		"$HTTPD_URL/smart/http_parent" http_child 2>err &&
+	# Although "git clone" fails, we can still check that the server
+	# provided the URI we requested and that the error message pinpoints
+	# the object that is missing.
+	grep "clone< uri https://example.com/a-uri" log &&
+	test_i18ngrep "did not receive expected object $(cat h)" err
+'
+
 stop_httpd
 
 test_done
diff --git a/upload-pack.c b/upload-pack.c
index aa2589b858..a8eef697ec 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -104,6 +104,7 @@ static int write_one_shallow(const struct commit_graft *graft, void *cb_data)
 struct output_state {
 	char buffer[8193];
 	int used;
+	unsigned packfile_uris_started : 1;
 	unsigned packfile_started : 1;
 	struct strbuf progress_buf;
 };
@@ -129,10 +130,35 @@ static int read_pack_objects_stdout(int outfd, struct output_state *os,
 	}
 	os->used += readsz;
 
-	if (!os->packfile_started) {
-		os->packfile_started = 1;
-		if (use_protocol_v2)
-			packet_write_fmt(1, "packfile\n");
+	while (!os->packfile_started) {
+		char *p;
+		if (os->used >= 4 && !memcmp(os->buffer, "PACK", 4)) {
+			os->packfile_started = 1;
+			if (use_protocol_v2) {
+				if (os->packfile_uris_started)
+					packet_delim(1);
+				packet_write_fmt(1, "packfile\n");
+			}
+			break;
+		}
+		if ((p = memchr(os->buffer, '\n', os->used))) {
+			if (!os->packfile_uris_started) {
+				os->packfile_uris_started = 1;
+				if (!use_protocol_v2)
+					BUG("packfile_uris requires protocol v2");
+				packet_write_fmt(1, "packfile-uris\n");
+			}
+			*p = '\0';
+			packet_write_fmt(1, "uri %s\n", os->buffer);
+
+			os->used -= p - os->buffer + 1;
+			memmove(os->buffer, p, os->used);
+		} else {
+			/*
+			 * Incomplete line.
+			 */
+			return readsz;
+		}
 	}
 
 	if (os->used > 1) {
@@ -205,6 +231,9 @@ static void create_pack_file(const struct object_array *have_obj,
 					 filter_options.filter_spec);
 		}
 	}
+	if (use_protocol_v2)
+		argv_array_push(&pack_objects.args,
+				"--exclude-configured-blobs");
 
 	pack_objects.in = -1;
 	pack_objects.out = -1;
-- 
2.19.0.271.gfe8321ec05.dirty


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [WIP RFC 0/5] Design for offloading part of packfile response to CDN
  2018-12-03 23:37 [WIP RFC 0/5] Design for offloading part of packfile response to CDN Jonathan Tan
                   ` (4 preceding siblings ...)
  2018-12-03 23:37 ` [WIP RFC 5/5] upload-pack: send part of packfile response as uri Jonathan Tan
@ 2018-12-04  0:01 ` Stefan Beller
  5 siblings, 0 replies; 30+ messages in thread
From: Stefan Beller @ 2018-12-04  0:01 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git

On Mon, Dec 3, 2018 at 3:37 PM Jonathan Tan <jonathantanmy@google.com> wrote:

>
> There is a potential issue: a server which produces both the URIs and
> the packfile at roughly the same time (like the implementation in this
> patch set) will not have sideband access until it has concluded sending
> the URIs. Among other things, this means that the server cannot send
> keepalive packets until quite late in the response. One solution to this
> might be to add a feature that allows the server to use a sideband
> throughout the whole response - and this has other benefits too like
> allowing servers to inform the client throughout the whole fetch, not
> just at the end.

While side band sounds like the right thing to do, we could also
sending (NULL)-URIs within this feature.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [WIP RFC 2/5] Documentation: add Packfile URIs design doc
  2018-12-03 23:37 ` [WIP RFC 2/5] Documentation: add Packfile URIs design doc Jonathan Tan
@ 2018-12-04  0:21   ` Stefan Beller
  2018-12-04  1:54   ` brian m. carlson
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 30+ messages in thread
From: Stefan Beller @ 2018-12-04  0:21 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git

Thanks for bringing this design to the list!

> diff --git a/Documentation/technical/protocol-v2.txt b/Documentation/technical/protocol-v2.txt
> index 345c00e08c..2cb1c41742 100644
> --- a/Documentation/technical/protocol-v2.txt
> +++ b/Documentation/technical/protocol-v2.txt
> @@ -313,7 +313,8 @@ header. Most sections are sent only when the packfile is sent.
>
>      output = acknowledgements flush-pkt |
>              [acknowledgments delim-pkt] [shallow-info delim-pkt]
> -            [wanted-refs delim-pkt] packfile flush-pkt
> +            [wanted-refs delim-pkt] [packfile-uris delim-pkt]
> +            packfile flush-pkt

While this is an RFC and incomplete, we'd need to remember to
add packfile-uris to the capabilities list above, stating that it requires
thin-pack and ofs-delta to be sent, and what to expect from it.

The mention of  --no-packfile-urls in the Client design above
seems to imply we'd want to turn it on by default, which I thought
was not the usual stance how we introduce new things.

An odd way of disabling it would be --no-thin-pack, hoping the
client side implementation abides by the implied requirements.

>      acknowledgments = PKT-LINE("acknowledgments" LF)
>                       (nak | *ack)
> @@ -331,6 +332,9 @@ header. Most sections are sent only when the packfile is sent.
>                   *PKT-LINE(wanted-ref LF)
>      wanted-ref = obj-id SP refname
>
> +    packfile-uris = PKT-LINE("packfile-uris" LF) *packfile-uri
> +    packfile-uri = PKT-LINE("uri" SP *%x20-ff LF)

Is the *%x20-ff a fancy way of saying obj-id?

While the server is configured with pairs of (oid URL),
we would not need to send the exact oid to the client
as that is what the client can figure out on its own by reading
the downloaded pack.

Instead we could send an integrity hash (i.e. the packfile
downloaded from "uri" is expected to hash to $oid here)

Thanks,
Stefan

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [WIP RFC 3/5] upload-pack: refactor reading of pack-objects out
  2018-12-03 23:37 ` [WIP RFC 3/5] upload-pack: refactor reading of pack-objects out Jonathan Tan
@ 2018-12-04  0:30   ` Stefan Beller
  2018-12-05  6:30   ` Junio C Hamano
  1 sibling, 0 replies; 30+ messages in thread
From: Stefan Beller @ 2018-12-04  0:30 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git

On Mon, Dec 3, 2018 at 3:37 PM Jonathan Tan <jonathantanmy@google.com> wrote:
>
> Subsequent patches will change how the output of pack-objects is
> processed, so extract that processing into its own function.
>
> Currently, at most 1 character can be buffered (in the "buffered" local
> variable). One of those patches will require a larger buffer, so replace
> that "buffered" local variable with a buffer array.

This buffering sounds oddly similar to the pkt reader which can buffer
at most one pkt, the difference being that we'd buffer bytes
instead of pkts.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [WIP RFC 2/5] Documentation: add Packfile URIs design doc
  2018-12-03 23:37 ` [WIP RFC 2/5] Documentation: add Packfile URIs design doc Jonathan Tan
  2018-12-04  0:21   ` Stefan Beller
@ 2018-12-04  1:54   ` brian m. carlson
  2018-12-04 19:29     ` Jonathan Tan
  2019-02-19 13:44     ` Ævar Arnfjörð Bjarmason
  2018-12-05  5:02   ` Junio C Hamano
  2019-02-19 14:28   ` Ævar Arnfjörð Bjarmason
  3 siblings, 2 replies; 30+ messages in thread
From: brian m. carlson @ 2018-12-04  1:54 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git

[-- Attachment #1: Type: text/plain, Size: 3058 bytes --]

On Mon, Dec 03, 2018 at 03:37:35PM -0800, Jonathan Tan wrote:
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
>  Documentation/technical/packfile-uri.txt | 83 ++++++++++++++++++++++++
>  Documentation/technical/protocol-v2.txt  |  6 +-
>  2 files changed, 88 insertions(+), 1 deletion(-)
>  create mode 100644 Documentation/technical/packfile-uri.txt
> 
> diff --git a/Documentation/technical/packfile-uri.txt b/Documentation/technical/packfile-uri.txt
> new file mode 100644
> index 0000000000..6535801486
> --- /dev/null
> +++ b/Documentation/technical/packfile-uri.txt
> @@ -0,0 +1,83 @@
> +Packfile URIs
> +=============
> +
> +This feature allows servers to serve part of their packfile response as URIs.
> +This allows server designs that improve scalability in bandwidth and CPU usage
> +(for example, by serving some data through a CDN), and (in the future) provides
> +some measure of resumability to clients.
> +
> +This feature is available only in protocol version 2.
> +
> +Protocol
> +--------
> +
> +The server advertises `packfile-uris`.
> +
> +If the client replies with the following arguments:
> +
> + * packfile-uris
> + * thin-pack
> + * ofs-delta
> +
> +when the server sends the packfile, it MAY send a `packfile-uris` section
> +directly before the `packfile` section (right after `wanted-refs` if it is
> +sent) containing HTTP(S) URIs. See protocol-v2.txt for the documentation of
> +this section.
> +
> +Clients then should understand that the returned packfile could be incomplete,
> +and that it needs to download all the given URIs before the fetch or clone is
> +complete. Each URI should point to a Git packfile (which may be a thin pack and
> +which may contain offset deltas).


Some thoughts here:

First, I'd like to see a section (and a bit in the implementation)
requiring HTTPS if the original protocol is secure (SSH or HTTPS).
Allowing the server to downgrade to HTTP, even by accident, would be a
security problem.

Second, this feature likely should be opt-in for SSH. One issue I've
seen repeatedly is that people don't want to use HTTPS to fetch things
when they're using SSH for Git. Many people in corporate environments
have proxies that break HTTP for non-browser use cases[0], and using SSH
is the only way that they can make a functional Git connection.

Third, I think the server needs to be required to both support Range
headers and never change the content of a URI, so that we can have
resumable clone implicit in this design. There are some places in the
world where connections are poor and fetching even the initial packfile
at once might be a problem. (I've seen such questions on Stack
Overflow, for example.)

Having said that, I think overall this is a good idea and I'm glad to
see a proposal for it.

[0] For example, a naughty-word filter may corrupt or block certain byte
sequences that occur incidentally in the pack stream.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 868 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [WIP RFC 2/5] Documentation: add Packfile URIs design doc
  2018-12-04  1:54   ` brian m. carlson
@ 2018-12-04 19:29     ` Jonathan Tan
  2019-02-19 13:22       ` Christian Couder
  2019-02-19 13:44     ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 30+ messages in thread
From: Jonathan Tan @ 2018-12-04 19:29 UTC (permalink / raw)
  To: sandals; +Cc: jonathantanmy, git

> Some thoughts here:
> 
> First, I'd like to see a section (and a bit in the implementation)
> requiring HTTPS if the original protocol is secure (SSH or HTTPS).
> Allowing the server to downgrade to HTTP, even by accident, would be a
> security problem.
> 
> Second, this feature likely should be opt-in for SSH. One issue I've
> seen repeatedly is that people don't want to use HTTPS to fetch things
> when they're using SSH for Git. Many people in corporate environments
> have proxies that break HTTP for non-browser use cases[0], and using SSH
> is the only way that they can make a functional Git connection.

Good points about SSH support and the client needing to control which
protocols the server will send URIs for. I'll include a line in the
client request in which the client can specify which protocols it is OK
with.

> Third, I think the server needs to be required to both support Range
> headers and never change the content of a URI, so that we can have
> resumable clone implicit in this design. There are some places in the
> world where connections are poor and fetching even the initial packfile
> at once might be a problem. (I've seen such questions on Stack
> Overflow, for example.)

Good points. I'll add these in the next revision.

> Having said that, I think overall this is a good idea and I'm glad to
> see a proposal for it.

Thanks, and thanks for your comments too.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [WIP RFC 5/5] upload-pack: send part of packfile response as uri
  2018-12-03 23:37 ` [WIP RFC 5/5] upload-pack: send part of packfile response as uri Jonathan Tan
@ 2018-12-04 20:09   ` Stefan Beller
  0 siblings, 0 replies; 30+ messages in thread
From: Stefan Beller @ 2018-12-04 20:09 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git

On Mon, Dec 3, 2018 at 3:38 PM Jonathan Tan <jonathantanmy@google.com> wrote:
>
> This is a partial implementation of upload-pack sending part of its
> packfile response as URIs.

It does so by implementing a new flag `--exclude-configured-blobs`
in pack-objects, which would change the output of pack-objects to
output a list of URLs (of the excluded blobs) followed by the
pack to be asked for.

This design seems easy to implement as then upload-pack
can just parse the output and only needs to insert
"packfile" and "packfile-uris\n" at the appropriate places
of the stream, otherwise it just passes through the information
obtained from pack-objects.

The design as-is would make for hard documentation of
pack-objects (its output is not just a pack anymore when that
flag is given, but a highly optimized byte stream).

Initially I did not anticipate this to be one of the major design problems
as I assumed we'd want to use this pack feature over broadly (e.g.
eventually by offloading most of the objects into a base pack that
is just always included as the likelihood for any object in there is
very high on initial clone), but it makes total sense to only
output the URIs that we actually need.

An alternative that comes very close to the current situation
would be to either pass a file path or file descriptor (that upload-pack
listens to in parallel) to pack-objects as an argument of the new flag.
Then we would not need to splice the protocol sections into the single
output stream, but we could announce the sections, then flush
the URIs and then flush the pack afterwards.

I looked at this quickly, but that would need either extensions in
run-command.c for setting up the new fd for us, as there we already
have OS specific code for these setups, or we'd have to duplicate
some of the logic here, which doesn't enthuse me either.

So maybe we'd create a temp file via mkstemp and pass
the file name to pack-objects for writing the URIs and then
we'd just need to stream that file?

> +       # NEEDSWORK: "git clone" fails here because it ignores the URI provided
> +       # instead of fetching it.
> +       test_must_fail env GIT_TRACE_PACKET="$(pwd)/log" \
> +               git -c protocol.version=2 clone \
> +               "$HTTPD_URL/smart/http_parent" http_child 2>err &&
> +       # Although "git clone" fails, we can still check that the server
> +       # provided the URI we requested and that the error message pinpoints
> +       # the object that is missing.
> +       grep "clone< uri https://example.com/a-uri" log &&
> +       test_i18ngrep "did not receive expected object $(cat h)" err

That is a nice test!

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [WIP RFC 1/5] Documentation: order protocol v2 sections
  2018-12-03 23:37 ` [WIP RFC 1/5] Documentation: order protocol v2 sections Jonathan Tan
@ 2018-12-05  4:10   ` Junio C Hamano
  2018-12-06 22:54     ` Jonathan Tan
  0 siblings, 1 reply; 30+ messages in thread
From: Junio C Hamano @ 2018-12-05  4:10 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git

Jonathan Tan <jonathantanmy@google.com> writes:

> The git command line expects Git servers to follow a specific order of

"Command line"?  It sounds like you are talking about the order of
command line arguments and options, but apparently that is not what
you are doing.  Is it "The git over-the-wire protocol"?

> +    output = acknowledgements flush-pkt |
> +	     [acknowledgments delim-pkt] [shallow-info delim-pkt]
> +	     [wanted-refs delim-pkt] packfile flush-pkt

So the output can be either 

 - acks followed by flush (and nothing else) or

 - (possibly) acks, followed by (possibly) shallow, followed by
   (possibly) wanted-refs, followed by the pack stream and flush at
   the end.

> @@ -335,9 +335,10 @@ header.
>  	       *PKT-LINE(%x01-03 *%x00-ff)
>  
>      acknowledgments section
> -	* If the client determines that it is finished with negotiations
> -	  by sending a "done" line, the acknowledgments sections MUST be
> -	  omitted from the server's response.
> +	* If the client determines that it is finished with negotiations by
> +	  sending a "done" line (thus requiring the server to send a packfile),
> +	  the acknowledgments sections MUST be omitted from the server's
> +	  response.

OK.  

>  	* Always begins with the section header "acknowledgments"
>  
> @@ -388,9 +389,6 @@ header.
>  	  which the client has not indicated was shallow as a part of
>  	  its request.
>  
> -	* This section is only included if a packfile section is also
> -	  included in the response.
> -

Earlier, we said that shallow-info is not given when packfile is not
there.  That is captured in the updated EBNF above.  We don't have a
corresponding removal of a bullet point for wanted-refs section below
but probably that is because the original did not have corresponding
bullet point to begin with.

>      wanted-refs section
>  	* This section is only included if the client has requested a
>  	  ref using a 'want-ref' line and if a packfile section is also

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [WIP RFC 2/5] Documentation: add Packfile URIs design doc
  2018-12-03 23:37 ` [WIP RFC 2/5] Documentation: add Packfile URIs design doc Jonathan Tan
  2018-12-04  0:21   ` Stefan Beller
  2018-12-04  1:54   ` brian m. carlson
@ 2018-12-05  5:02   ` Junio C Hamano
  2018-12-05  5:55     ` Junio C Hamano
  2018-12-06 23:16     ` Jonathan Tan
  2019-02-19 14:28   ` Ævar Arnfjörð Bjarmason
  3 siblings, 2 replies; 30+ messages in thread
From: Junio C Hamano @ 2018-12-05  5:02 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git

Jonathan Tan <jonathantanmy@google.com> writes:

> +This feature allows servers to serve part of their packfile response as URIs.
> +This allows server designs that improve scalability in bandwidth and CPU usage
> +(for example, by serving some data through a CDN), and (in the future) provides
> +some measure of resumability to clients.

Without reading the remainder, this makes readers anticipate a few
good things ;-)

 - "part of", so pre-generated constant material can be given from
   CDN and then followed-up by "filling the gaps" small packfile,
   perhaps?

 - The "part of" transmission may not bring the repository up to
   date wrt to the "want" objects; would this feature involve "you
   asked history up to these commits, but with this pack-uri, you'll
   be getting history up to these (somewhat stale) commits"?

Anyway, let's read on.

> +This feature is available only in protocol version 2.
> +
> +Protocol
> +--------
> +
> +The server advertises `packfile-uris`.
> +
> +If the client replies with the following arguments:
> +
> + * packfile-uris
> + * thin-pack
> + * ofs-delta

"with the following" meaning "with all of the following", or "with
any of the following"?  Is there a reason why the server side must
require that the client understands and is willing to accept a
thin-pack when wanting to use packfile-uris?  The same question for
the ofs-delta.

When the pregenerated constant material the server plans to hand out
the uris for was prepared by using ofs-delta encoding, the server
cannot give the uri to it when the client does not want ofs-delta
encoded packfile, but it feels somewhat strange that we require the
most capable client at the protocol level.  After all, the server
side could prepare one with ofs-delta and another without ofs-delta
and depending on what the client is capable of, hand out different
URIs, if it wanted to.

The reason why I care is because thin and ofs will *NOT* stay
forever be the only optional features of the pack format.  We may
invent yet another such optional 'frotz' feature, which may greatly
help the efficiency of the packfile encoding, hence it may be
preferrable to always generate a CDN packfile with that feature, in
addition to thin and ofs.  Would we add 'frotz' to the above list in
the documentation, then?  What would happen to existing servers and
clients written before that time then?

My recommendation is to drop the mention of "thin" and "ofs" from
the above list, and also from the following paragraph.  The "it MAY
send" will serve as a practical escape clause to allow a server/CDN
implementation that *ALWAYS* prepares pregenerated material that can
only be digested by clients that supports thin and ofs.  Such a server
can send packfile-URIs only when all of the three are given by the
client and be compliant.  And such an update to the proposed document
would allow a more diskful server to prepare both thin and non-thin
pregenerated packs and choose which one to give to the client depending
on the capability.

> +when the server sends the packfile, it MAY send a `packfile-uris` section
> +directly before the `packfile` section (right after `wanted-refs` if it is
> +sent) containing HTTP(S) URIs. See protocol-v2.txt for the documentation of
> +this section.

So, this is OK, but

> +Clients then should understand that the returned packfile could be incomplete,
> +and that it needs to download all the given URIs before the fetch or clone is
> +complete. Each URI should point to a Git packfile (which may be a thin pack and
> +which may contain offset deltas).

weaken or remove the (parenthetical comment) in the last sentence,
and replace the beginning of the section with something like

	If the client replies with 'packfile-uris', when the server
	sends the packfile, it MAY send a `packfile-uris` section...

You may steal what I wrote in the above response to help the
server-side folks to decide how to actually implement the "it MAY
send a packfile-uris" part in the document.

> +Server design
> +-------------
> +
> +The server can be trivially made compatible with the proposed protocol by
> +having it advertise `packfile-uris`, tolerating the client sending
> +`packfile-uris`, and never sending any `packfile-uris` section. But we should
> +include some sort of non-trivial implementation in the Minimum Viable Product,
> +at least so that we can test the client.
> +
> +This is the implementation: a feature, marked experimental, that allows the
> +server to be configured by one or more `uploadpack.blobPackfileUri=<sha1>
> +<uri>` entries. Whenever the list of objects to be sent is assembled, a blob
> +with the given sha1 can be replaced by the given URI. This allows, for example,
> +servers to delegate serving of large blobs to CDNs.

;-)

> +Client design
> +-------------
> +
> +While fetching, the client needs to remember the list of URIs and cannot
> +declare that the fetch is complete until all URIs have been downloaded as
> +packfiles.
> +
> +The division of work (initial fetch + additional URIs) introduces convenient
> +points for resumption of an interrupted clone - such resumption can be done
> +after the Minimum Viable Product (see "Future work").
> +
> +The client can inhibit this feature (i.e. refrain from sending the
> +`packfile-urls` parameter) by passing --no-packfile-urls to `git fetch`.

OK, this comes back to what I alluded to at the beginning.  We could
respond to a full-clone request by feeding a series of packfile-uris
and some ref information, perhaps like this:

	* Grab this packfile and update your remote-tracking refs
          and tags to these values; you'd be as if you cloned the
          project when it was at v1.0.

	* When you are done with the above, grab this packfile and
          update your remote-tracking refs and tags to these values;
          you'd be as if you cloned the project when it was at v2.0.

	* When you are done with the above, grab this packfile and
          update your remote-tracking refs and tags to these values;
          you'd be as if you cloned the project when it was at v3.0.

	...

	* When you are done with the above, here is the remaining
          packdata to bring you fully up to date with your original
          "want"s.

and before fully reading the proposal, I anticipated that it was
what you were going to describe.  The major difference is "up to the
packdata given to you so far, you'd be as if you fetched these" ref
information, which would allow you to be interrupted and then simply
resume, without having to remember the set of packfile-uris yet to
be processed across a fetch/clone failure.  If you sucessfully fetch
packfile for ..v1.0, you can update the remote-tracking refs to
match as if you fetched back when that was the most recent state of
the project, and then if you failed while transferring packfile for
v1.0..v2.0, the resuming would just reissue "git fetch" internally.

I think what you proposed, i.e. without the "with the data up to
this packfile, you have history to these objects", would also work,
even though it requires us to remember more of what we learned
during the initial attempt throughout retrying failed transfers.

> +Future work
> +-----------
> +
> +The protocol design allows some evolution of the server and client without any
> +need for protocol changes, so only a small-scoped design is included here to
> +form the MVP. For example, the following can be done:
> +
> + * On the server, a long-running process that takes in entire requests and
> +   outputs a list of URIs and the corresponding inclusion and exclusion sets of
> +   objects. This allows, e.g., signed URIs to be used and packfiles for common
> +   requests to be cached.
> + * On the client, resumption of clone. If a clone is interrupted, information
> +   could be recorded in the repository's config and a "clone-resume" command
> +   can resume the clone in progress. (Resumption of subsequent fetches is more
> +   difficult because that must deal with the user wanting to use the repository
> +   even after the fetch was interrupted.)
> +
> +There are some possible features that will require a change in protocol:
> +
> + * Additional HTTP headers (e.g. authentication)
> + * Byte range support
> + * Different file formats referenced by URIs (e.g. raw object)
> +
> diff --git a/Documentation/technical/protocol-v2.txt b/Documentation/technical/protocol-v2.txt
> index 345c00e08c..2cb1c41742 100644
> --- a/Documentation/technical/protocol-v2.txt
> +++ b/Documentation/technical/protocol-v2.txt
> @@ -313,7 +313,8 @@ header. Most sections are sent only when the packfile is sent.
>  
>      output = acknowledgements flush-pkt |
>  	     [acknowledgments delim-pkt] [shallow-info delim-pkt]
> -	     [wanted-refs delim-pkt] packfile flush-pkt
> +	     [wanted-refs delim-pkt] [packfile-uris delim-pkt]
> +	     packfile flush-pkt
>  
>      acknowledgments = PKT-LINE("acknowledgments" LF)
>  		      (nak | *ack)
> @@ -331,6 +332,9 @@ header. Most sections are sent only when the packfile is sent.
>  		  *PKT-LINE(wanted-ref LF)
>      wanted-ref = obj-id SP refname
>  
> +    packfile-uris = PKT-LINE("packfile-uris" LF) *packfile-uri
> +    packfile-uri = PKT-LINE("uri" SP *%x20-ff LF)
> +
>      packfile = PKT-LINE("packfile" LF)
>  	       *PKT-LINE(%x01-03 *%x00-ff)

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [WIP RFC 2/5] Documentation: add Packfile URIs design doc
  2018-12-05  5:02   ` Junio C Hamano
@ 2018-12-05  5:55     ` Junio C Hamano
  2018-12-06 23:16     ` Jonathan Tan
  1 sibling, 0 replies; 30+ messages in thread
From: Junio C Hamano @ 2018-12-05  5:55 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git

Junio C Hamano <gitster@pobox.com> writes:

> So, this is OK, but
>
>> +Clients then should understand that the returned packfile could be incomplete,
>> +and that it needs to download all the given URIs before the fetch or clone is
>> +complete. Each URI should point to a Git packfile (which may be a thin pack and
>> +which may contain offset deltas).
>
> weaken or remove the (parenthetical comment) in the last sentence,
> and replace the beginning of the section with something like
>
> 	If the client replies with 'packfile-uris', when the server
> 	sends the packfile, it MAY send a `packfile-uris` section...
>
> You may steal what I wrote in the above response to help the
> server-side folks to decide how to actually implement the "it MAY
> send a packfile-uris" part in the document.

By the way, I do agree with the practical consideration the design
you described makes.  For a pregenerated pack that brings you from
v1.0 to v2.0, "thin" would roughly save the transfer of one full
checkout (compressed, of course), and "ofs" would also save several
bytes per object.  Compared to a single pack that delivers everything
the fetcher wants, concatenation of packs without "thin" to transfer
the same set of objects would cost quite a lot more.

And I do not think we should care too much about fetchers that lack
either thin or ofs, so I'd imagine that any client that ask for
packfile-uris would also advertise thin and ofs as well, so in
practice, a request with packfile-uris that lack thin or ofs would
be pretty rare and requiring all three and requiring only one would
not make much practical difference.  It's just that I think singling
out these two capabilities as hard requirements at the protocol
level is wrong.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [WIP RFC 3/5] upload-pack: refactor reading of pack-objects out
  2018-12-03 23:37 ` [WIP RFC 3/5] upload-pack: refactor reading of pack-objects out Jonathan Tan
  2018-12-04  0:30   ` Stefan Beller
@ 2018-12-05  6:30   ` Junio C Hamano
  1 sibling, 0 replies; 30+ messages in thread
From: Junio C Hamano @ 2018-12-05  6:30 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git

Jonathan Tan <jonathantanmy@google.com> writes:

> +struct output_state {
> +	char buffer[8193];
> +	int used;
> +};
> +
> +static int read_pack_objects_stdout(int outfd, struct output_state *os)
> +{

The naming feels quite odd.  We are downstream of pack-objects and
reading its standard output stream, so "read stdout-of-pack-objects"
is not wrong per-se, but it may be just me.  Stepping back and what
the function is really about often helps us to understand what we
are doing.

I think the function you extracted from the existing logic is
responsible for relaying the data read from pack-objects to the
fetch-pack client on the other side of the wire.  So from that point
of view, singling out "read" as if it is a primary thing the
function does is already suboptimal.  Perhaps

	static int relay_pack_data(int in, struct output_state *os)

because the fd is what we "read" from (hence, 'in', not 'out'), and
relaying is what we do and reading is only one half of doing so?

> +	/* Data ready; we keep the last byte to ourselves
> +	 * in case we detect broken rev-list, so that we
> +	 * can leave the stream corrupted.  This is
> +	 * unfortunate -- unpack-objects would happily
> +	 * accept a valid packdata with trailing garbage,
> +	 * so appending garbage after we pass all the
> +	 * pack data is not good enough to signal
> +	 * breakage to downstream.
> +	 */

Yeah, I recall writing this funky and unfortunate logic in 2006.
Perhaps we want to update the comment style to make it a bit more
modern?

The "Data ready;" part of the comment applies more to what the
caller of this logic does (i.e. poll returned and revents indicates
we can read, and that is why we are calling into this logic).  The
remainder is the comment that is relevat to this logic.

> +	ssize_t readsz;
> +
> +	readsz = xread(outfd, os->buffer + os->used,
> +		       sizeof(os->buffer) - os->used);

So we read what we could read in the remaining buffer.

I notice that you alloated 8k+1 bytes; would this code end up
reading that 8k+1 bytes in full under the right condition?  I am
mainly wondering where the need for +1 comes from.

> +	if (readsz < 0) {
> +		return readsz;
> +	}

OK, report an error to the caller by returning negative.  I am
guessing that you'd have more code in this block that currently have
only one statement in the future steps, perhaps.

> +	os->used += readsz;
> +
> +	if (os->used > 1) {
> +		send_client_data(1, os->buffer, os->used - 1);
> +		os->buffer[0] = os->buffer[os->used - 1];

OK, this corresponds to the "*cp++ = buffered"  in the original just
before xread().

> +		os->used = 1;
> +	} else {
> +		send_client_data(1, os->buffer, os->used);
> +		os->used = 0;

I am not sure if the code is correct when os->used happens to be 1
(shouldn't we hold the byte, skip the call to send_client_data(),
and go back to poll() to expect more data?), but this is a faithful
code movement and rewrite of the original.

I've read the caller (below, snipped) and the conversion looked
sensible there as well.  The final flushing of the byte(s) held back
in *os also looks good.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [WIP RFC 4/5] upload-pack: refactor writing of "packfile" line
  2018-12-03 23:37 ` [WIP RFC 4/5] upload-pack: refactor writing of "packfile" line Jonathan Tan
@ 2018-12-06  6:35   ` Junio C Hamano
  2018-12-06 23:25     ` Jonathan Tan
  0 siblings, 1 reply; 30+ messages in thread
From: Junio C Hamano @ 2018-12-06  6:35 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git

Jonathan Tan <jonathantanmy@google.com> writes:

> @@ -126,6 +129,12 @@ static int read_pack_objects_stdout(int outfd, struct output_state *os)
>  	}
>  	os->used += readsz;
>  
> +	if (!os->packfile_started) {
> +		os->packfile_started = 1;
> +		if (use_protocol_v2)
> +			packet_write_fmt(1, "packfile\n");

If we fix this function so that the only byte in the buffer is held
back without emitted when os->used == 1 as I alluded to, this may
have to be done a bit later, as with such a change, it is no longer
guaranteed that send_client_data() will be called after this point.

> +	}
> +
>  	if (os->used > 1) {
>  		send_client_data(1, os->buffer, os->used - 1);
>  		os->buffer[0] = os->buffer[os->used - 1];

> +static void flush_progress_buf(struct strbuf *progress_buf)
> +{
> +	if (!progress_buf->len)
> +		return;
> +	send_client_data(2, progress_buf->buf, progress_buf->len);
> +	strbuf_reset(progress_buf);
> +}

Interesting.

>  static void create_pack_file(const struct object_array *have_obj,
> -			     const struct object_array *want_obj)
> +			     const struct object_array *want_obj,
> +			     int use_protocol_v2)
>  {
>  	struct child_process pack_objects = CHILD_PROCESS_INIT;
>  	struct output_state output_state = {0};
> @@ -260,9 +278,13 @@ static void create_pack_file(const struct object_array *have_obj,
>  			 */
>  			sz = xread(pack_objects.err, progress,
>  				  sizeof(progress));
> -			if (0 < sz)
> -				send_client_data(2, progress, sz);
> -			else if (sz == 0) {
> +			if (0 < sz) {
> +				if (output_state.packfile_started)
> +					send_client_data(2, progress, sz);
> +				else
> +					strbuf_add(&output_state.progress_buf,
> +						   progress, sz);

Isn't progress output that goes to the channel #2 pretty much
independent from the payload stream that carries the pkt-line 
command like "packfile" plus the raw pack stream?  It somehow
feels like an oxymoron to _buffer_ progress indicator, as it
defeats the whole point of progress report to buffer it.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [WIP RFC 1/5] Documentation: order protocol v2 sections
  2018-12-05  4:10   ` Junio C Hamano
@ 2018-12-06 22:54     ` Jonathan Tan
  2018-12-09  0:15       ` Junio C Hamano
  0 siblings, 1 reply; 30+ messages in thread
From: Jonathan Tan @ 2018-12-06 22:54 UTC (permalink / raw)
  To: gitster; +Cc: jonathantanmy, git

> > The git command line expects Git servers to follow a specific order of
> 
> "Command line"?  It sounds like you are talking about the order of
> command line arguments and options, but apparently that is not what
> you are doing.  Is it "The git over-the-wire protocol"?

I meant to say the current Git implementation, as opposed to what is
written in the specification. I'll replace it with "The current C Git
implementation".

> Earlier, we said that shallow-info is not given when packfile is not
> there.  That is captured in the updated EBNF above.  We don't have a
> corresponding removal of a bullet point for wanted-refs section below
> but probably that is because the original did not have corresponding
> bullet point to begin with.

That's because the corresponding bullet point had other information.
Quoted in full below:

> 	* This section is only included if the client has requested a
> 	  ref using a 'want-ref' line and if a packfile section is also
> 	  included in the response.

I could reword it to "If a packfile section is included in the response,
this section is only included if the client has requested a ref using a
'want-ref' line", but I don't think that is significantly clearer.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [WIP RFC 2/5] Documentation: add Packfile URIs design doc
  2018-12-05  5:02   ` Junio C Hamano
  2018-12-05  5:55     ` Junio C Hamano
@ 2018-12-06 23:16     ` Jonathan Tan
  1 sibling, 0 replies; 30+ messages in thread
From: Jonathan Tan @ 2018-12-06 23:16 UTC (permalink / raw)
  To: gitster; +Cc: jonathantanmy, git

> > +This feature allows servers to serve part of their packfile response as URIs.
> > +This allows server designs that improve scalability in bandwidth and CPU usage
> > +(for example, by serving some data through a CDN), and (in the future) provides
> > +some measure of resumability to clients.
> 
> Without reading the remainder, this makes readers anticipate a few
> good things ;-)
> 
>  - "part of", so pre-generated constant material can be given from
>    CDN and then followed-up by "filling the gaps" small packfile,
>    perhaps?

Yes :-)

>  - The "part of" transmission may not bring the repository up to
>    date wrt to the "want" objects; would this feature involve "you
>    asked history up to these commits, but with this pack-uri, you'll
>    be getting history up to these (somewhat stale) commits"?

It could be, but not necessarily. In my current WIP implementation, for
example, pack URIs don't give you any commits at all (and thus, no
history) - only blobs. Quite a few people first think of the "stale
clone then top-up" case, though - I wonder if it would be a good idea to
give the blob example in this paragraph in order to put people in the
right frame of mind.

> > +If the client replies with the following arguments:
> > +
> > + * packfile-uris
> > + * thin-pack
> > + * ofs-delta
> 
> "with the following" meaning "with all of the following", or "with
> any of the following"?  Is there a reason why the server side must
> require that the client understands and is willing to accept a
> thin-pack when wanting to use packfile-uris?  The same question for
> the ofs-delta.

"All of the following", but from your later comments, we probably don't
need this section anyway.

> My recommendation is to drop the mention of "thin" and "ofs" from
> the above list, and also from the following paragraph.  The "it MAY
> send" will serve as a practical escape clause to allow a server/CDN
> implementation that *ALWAYS* prepares pregenerated material that can
> only be digested by clients that supports thin and ofs.  Such a server
> can send packfile-URIs only when all of the three are given by the
> client and be compliant.  And such an update to the proposed document
> would allow a more diskful server to prepare both thin and non-thin
> pregenerated packs and choose which one to give to the client depending
> on the capability.

That is true - we can just let the server decide. I'll update the patch
accordingly, and state that the URIs should point to packs with features
like thin-pack and ofs-delta only if the client has declared that it
supports them.

> > +Clients then should understand that the returned packfile could be incomplete,
> > +and that it needs to download all the given URIs before the fetch or clone is
> > +complete. Each URI should point to a Git packfile (which may be a thin pack and
> > +which may contain offset deltas).
> 
> weaken or remove the (parenthetical comment) in the last sentence,
> and replace the beginning of the section with something like
> 
> 	If the client replies with 'packfile-uris', when the server
> 	sends the packfile, it MAY send a `packfile-uris` section...
> 
> You may steal what I wrote in the above response to help the
> server-side folks to decide how to actually implement the "it MAY
> send a packfile-uris" part in the document.

OK, will do.

> OK, this comes back to what I alluded to at the beginning.  We could
> respond to a full-clone request by feeding a series of packfile-uris
> and some ref information, perhaps like this:
> 
> 	* Grab this packfile and update your remote-tracking refs
>           and tags to these values; you'd be as if you cloned the
>           project when it was at v1.0.
> 
> 	* When you are done with the above, grab this packfile and
>           update your remote-tracking refs and tags to these values;
>           you'd be as if you cloned the project when it was at v2.0.
> 
> 	* When you are done with the above, grab this packfile and
>           update your remote-tracking refs and tags to these values;
>           you'd be as if you cloned the project when it was at v3.0.
> 
> 	...
> 
> 	* When you are done with the above, here is the remaining
>           packdata to bring you fully up to date with your original
>           "want"s.
> 
> and before fully reading the proposal, I anticipated that it was
> what you were going to describe.  The major difference is "up to the
> packdata given to you so far, you'd be as if you fetched these" ref
> information, which would allow you to be interrupted and then simply
> resume, without having to remember the set of packfile-uris yet to
> be processed across a fetch/clone failure.  If you sucessfully fetch
> packfile for ..v1.0, you can update the remote-tracking refs to
> match as if you fetched back when that was the most recent state of
> the project, and then if you failed while transferring packfile for
> v1.0..v2.0, the resuming would just reissue "git fetch" internally.

The "up to" would work if we had the stale clone + periodic "upgrades"
arrangement you describe, but not when, for example, we just want to
separate large blobs out. If we were to insist on attaching ref
information to each packfile URI (or turn the packfiles into bundles),
it is true that we can have resumable fetch, although I haven't fully
thought out the implications of letting the user modify the repository
while a fetch is in progress. (What happens if the user wipes out their
object store in between fetching one packfile and the next, for
example?) That is why I only talked about resumable clone, not resumable
fetch.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [WIP RFC 4/5] upload-pack: refactor writing of "packfile" line
  2018-12-06  6:35   ` Junio C Hamano
@ 2018-12-06 23:25     ` Jonathan Tan
  2018-12-07  0:22       ` Junio C Hamano
  0 siblings, 1 reply; 30+ messages in thread
From: Jonathan Tan @ 2018-12-06 23:25 UTC (permalink / raw)
  To: gitster; +Cc: jonathantanmy, git

> Jonathan Tan <jonathantanmy@google.com> writes:
> 
> > @@ -126,6 +129,12 @@ static int read_pack_objects_stdout(int outfd, struct output_state *os)
> >  	}
> >  	os->used += readsz;
> >  
> > +	if (!os->packfile_started) {
> > +		os->packfile_started = 1;
> > +		if (use_protocol_v2)
> > +			packet_write_fmt(1, "packfile\n");
> 
> If we fix this function so that the only byte in the buffer is held
> back without emitted when os->used == 1 as I alluded to, this may
> have to be done a bit later, as with such a change, it is no longer
> guaranteed that send_client_data() will be called after this point.

I'm not sure what you mean about there being no guarantee that
send_client_data() is not called - in create_pack_file(), there is an
"if (output_state.used > 0)" line (previously "if (0 <= buffered)") that
outputs anything remaining.

> Isn't progress output that goes to the channel #2 pretty much
> independent from the payload stream that carries the pkt-line 
> command like "packfile" plus the raw pack stream?  It somehow
> feels like an oxymoron to _buffer_ progress indicator, as it
> defeats the whole point of progress report to buffer it.

Yes, it is - I don't fully like this part of the design. I alluded to a
similar issue (keepalive) in the toplevel email [1] and that it might be
better if the server can send sideband throughout the whole response -
perhaps that should be investigated first. If we had sideband throughout
the whole response, we wouldn't need to buffer the progress indicator.

[1] https://public-inbox.org/git/cover.1543879256.git.jonathantanmy@google.com/

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [WIP RFC 4/5] upload-pack: refactor writing of "packfile" line
  2018-12-06 23:25     ` Jonathan Tan
@ 2018-12-07  0:22       ` Junio C Hamano
  0 siblings, 0 replies; 30+ messages in thread
From: Junio C Hamano @ 2018-12-07  0:22 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git

Jonathan Tan <jonathantanmy@google.com> writes:

>> Jonathan Tan <jonathantanmy@google.com> writes:
>> 
>> > @@ -126,6 +129,12 @@ static int read_pack_objects_stdout(int outfd, struct output_state *os)
>> >  	}
>> >  	os->used += readsz;
>> >  
>> > +	if (!os->packfile_started) {
>> > +		os->packfile_started = 1;
>> > +		if (use_protocol_v2)
>> > +			packet_write_fmt(1, "packfile\n");
>> 
>> If we fix this function so that the only byte in the buffer is held
>> back without emitted when os->used == 1 as I alluded to, this may
>> have to be done a bit later, as with such a change, it is no longer
>> guaranteed that send_client_data() will be called after this point.
>
> I'm not sure what you mean about there being no guarantee that
> send_client_data() is not called - in create_pack_file(), there is an
> "if (output_state.used > 0)" line (previously "if (0 <= buffered)") that
> outputs anything remaining.

I was referring to this part of the review on the previous step,
which you may not yet have read.

    OK, this corresponds to the "*cp++ = buffered"  in the original just
    before xread().

    > +		os->used = 1;
    > +	} else {
    > +		send_client_data(1, os->buffer, os->used);
    > +		os->used = 0;

    I am not sure if the code is correct when os->used happens to be 1
    (shouldn't we hold the byte, skip the call to send_client_data(),
    and go back to poll() to expect more data?), but this is a faithful
    code movement and rewrite of the original.

The point of this logic is to make sure we always hold back some
bytes and do not feed *all* the bytes to the other side by calling
"send-client-data" until we made sure the upstream of what we are
relaying (pack-objects?) successfully exited, but it looks to me
that the "else" clause above ends up flushing everything when
os->used is 1, which goes against the whole purpose of the code.

And the "fix" I was alluding to was to update that "else" clause to
make it a no-op that keeps os->used non-zero, which would not call
send-client-data.

When that fix happens, the part that early in the function this
patch added "now we know we will call send-client-data, so let's say
'here comes packdata' unless we have already said that" is making
the decision too early.  Depending on the value of os->used when we
enter the code and the number of bytes xread() reads from the
upstream, we might not call send-client-data yet (namely, when we
have no buffered data and we happened to get only one byte).

> ... it might be
> better if the server can send sideband throughout the whole response -
> perhaps that should be investigated first.

Yup.  It just looked quite crazy, and it is even more crazy to
buffer keepalives ;-)

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [WIP RFC 1/5] Documentation: order protocol v2 sections
  2018-12-06 22:54     ` Jonathan Tan
@ 2018-12-09  0:15       ` Junio C Hamano
  0 siblings, 0 replies; 30+ messages in thread
From: Junio C Hamano @ 2018-12-09  0:15 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git

Jonathan Tan <jonathantanmy@google.com> writes:

>> > The git command line expects Git servers to follow a specific order of
>> 
>> "Command line"?  It sounds like you are talking about the order of
>> command line arguments and options, but apparently that is not what
>> you are doing.  Is it "The git over-the-wire protocol"?
>
> I meant to say the current Git implementation, as opposed to what is
> written in the specification. I'll replace it with "The current C Git
> implementation".

Yeah, that would avoid confusing future readers; sounds good.

>> Earlier, we said that shallow-info is not given when packfile is not
>> there.  That is captured in the updated EBNF above.  We don't have a
>> corresponding removal of a bullet point for wanted-refs section below
>> but probably that is because the original did not have corresponding
>> bullet point to begin with.
>
> That's because the corresponding bullet point had other information.
> Quoted in full below:
>
>> 	* This section is only included if the client has requested a
>> 	  ref using a 'want-ref' line and if a packfile section is also
>> 	  included in the response.
>
> I could reword it to "If a packfile section is included in the response,
> this section is only included if the client has requested a ref using a
> 'want-ref' line", but I don't think that is significantly clearer.

I don't either.  I didn't mean to suggest to change anything in this
part.  I was just giving an observation---two parallel things do not
get updates in tandem, and that is because they were not described
the same way to begin with, which was a good enough explanation.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [WIP RFC 2/5] Documentation: add Packfile URIs design doc
  2018-12-04 19:29     ` Jonathan Tan
@ 2019-02-19 13:22       ` Christian Couder
  2019-02-19 20:10         ` Jonathan Tan
  0 siblings, 1 reply; 30+ messages in thread
From: Christian Couder @ 2019-02-19 13:22 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: brian m. carlson, git

On Tue, Dec 4, 2018 at 8:31 PM Jonathan Tan <jonathantanmy@google.com> wrote:
>
> > Some thoughts here:
> >
> > First, I'd like to see a section (and a bit in the implementation)
> > requiring HTTPS if the original protocol is secure (SSH or HTTPS).
> > Allowing the server to downgrade to HTTP, even by accident, would be a
> > security problem.
> >
> > Second, this feature likely should be opt-in for SSH. One issue I've
> > seen repeatedly is that people don't want to use HTTPS to fetch things
> > when they're using SSH for Git. Many people in corporate environments
> > have proxies that break HTTP for non-browser use cases[0], and using SSH
> > is the only way that they can make a functional Git connection.
>
> Good points about SSH support and the client needing to control which
> protocols the server will send URIs for. I'll include a line in the
> client request in which the client can specify which protocols it is OK
> with.

What if a client is ok to fetch from some servers but not others (for
example github.com and gitlab.com but nothing else)?

Or what if a client is ok to fetch using SSH from some servers and
HTTPS from other servers but nothing else?

I also wonder in general how this would interact with promisor/partial
clone remotes.

When we discussed promisor/partial clone remotes in the thread
following this email:

https://public-inbox.org/git/20181016174304.GA221682@aiede.svl.corp.google.com/

it looked like you were ok with having many promisor remotes, which I
think could fill the same use cases especially related to large
objects.

As clients would configure promisor remotes explicitly, there would be
no issues about which protocol and servers are allowed or not.

If the issue is that you want the server to decide which promisor
remotes would be used without the client having to do anything, maybe
that could be something added on top of the possibility to have many
promisor remotes.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [WIP RFC 2/5] Documentation: add Packfile URIs design doc
  2018-12-04  1:54   ` brian m. carlson
  2018-12-04 19:29     ` Jonathan Tan
@ 2019-02-19 13:44     ` Ævar Arnfjörð Bjarmason
  2019-02-21  1:09       ` brian m. carlson
  1 sibling, 1 reply; 30+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-02-19 13:44 UTC (permalink / raw)
  To: brian m. carlson; +Cc: Jonathan Tan, git


On Tue, Dec 04 2018, brian m. carlson wrote:

> On Mon, Dec 03, 2018 at 03:37:35PM -0800, Jonathan Tan wrote:
>> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
>> ---
>>  Documentation/technical/packfile-uri.txt | 83 ++++++++++++++++++++++++
>>  Documentation/technical/protocol-v2.txt  |  6 +-
>>  2 files changed, 88 insertions(+), 1 deletion(-)
>>  create mode 100644 Documentation/technical/packfile-uri.txt
>>
>> diff --git a/Documentation/technical/packfile-uri.txt b/Documentation/technical/packfile-uri.txt
>> new file mode 100644
>> index 0000000000..6535801486
>> --- /dev/null
>> +++ b/Documentation/technical/packfile-uri.txt
>> @@ -0,0 +1,83 @@
>> +Packfile URIs
>> +=============
>> +
>> +This feature allows servers to serve part of their packfile response as URIs.
>> +This allows server designs that improve scalability in bandwidth and CPU usage
>> +(for example, by serving some data through a CDN), and (in the future) provides
>> +some measure of resumability to clients.
>> +
>> +This feature is available only in protocol version 2.
>> +
>> +Protocol
>> +--------
>> +
>> +The server advertises `packfile-uris`.
>> +
>> +If the client replies with the following arguments:
>> +
>> + * packfile-uris
>> + * thin-pack
>> + * ofs-delta
>> +
>> +when the server sends the packfile, it MAY send a `packfile-uris` section
>> +directly before the `packfile` section (right after `wanted-refs` if it is
>> +sent) containing HTTP(S) URIs. See protocol-v2.txt for the documentation of
>> +this section.
>> +
>> +Clients then should understand that the returned packfile could be incomplete,
>> +and that it needs to download all the given URIs before the fetch or clone is
>> +complete. Each URI should point to a Git packfile (which may be a thin pack and
>> +which may contain offset deltas).
>
>
> First, I'd like to see a section (and a bit in the implementation)
> requiring HTTPS if the original protocol is secure (SSH or HTTPS).
> Allowing the server to downgrade to HTTP, even by accident, would be a
> security problem.

Maybe I've misunderstood the design (I'm writing some other follow-up
E-Mails in this thread which might clarify things for me), but I don't
see why.

We get the ref advertisement from the server. We don't need to trust the
CDN server or the transport layer. We just download whatever we get from
there, validate the packfile with SHA-1 (and in the future SHA-256). It
doesn't matter if the CDN transport is insecure.

You can do this offline with git today, you don't need to trust me to
trust that my copy of git.git I give you on a sketchy USB stick is
genuine. Just unpack it, then compare the SHA-1s you get with:

    git ls-remote https://github.com/git/git.git

So this is a case similar to Debian's where they distribute packages
over http, but manifests over https: https://whydoesaptnotusehttps.com

> Second, this feature likely should be opt-in for SSH. One issue I've
> seen repeatedly is that people don't want to use HTTPS to fetch things
> when they're using SSH for Git. Many people in corporate environments
> have proxies that break HTTP for non-browser use cases[0], and using SSH
> is the only way that they can make a functional Git connection.

Yeah, there should definitely be accommodations for such clients, per my
reading clients can always ignore the CDN and proceed with a normal
negotiation. Isn't that enough, or is something extra needed?

> Third, I think the server needs to be required to both support Range
> headers and never change the content of a URI, so that we can have
> resumable clone implicit in this design. There are some places in the
> world where connections are poor and fetching even the initial packfile
> at once might be a problem. (I've seen such questions on Stack
> Overflow, for example.)

I think this should be a MAY not a MUST in RFC 2119 terms. There's still
many users who might want to offload things to a very dumb CDN, such as
Debian where they don't control their own mirrors, but might want to
offload a 1GB packfile download to some random university's Debian
mirror.

Such a download (over http) will work most of the time. If it's not
resumable it still sucks less than no CDN at all, and client can always
fall back if the CDN breaks, which they should be doing anyway in case
of other sorts of issues.

> Having said that, I think overall this is a good idea and I'm glad to
> see a proposal for it.
>
> [0] For example, a naughty-word filter may corrupt or block certain byte
> sequences that occur incidentally in the pack stream.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [WIP RFC 2/5] Documentation: add Packfile URIs design doc
  2018-12-03 23:37 ` [WIP RFC 2/5] Documentation: add Packfile URIs design doc Jonathan Tan
                     ` (2 preceding siblings ...)
  2018-12-05  5:02   ` Junio C Hamano
@ 2019-02-19 14:28   ` Ævar Arnfjörð Bjarmason
  2019-02-19 22:06     ` Jonathan Tan
  3 siblings, 1 reply; 30+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-02-19 14:28 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git

On Tue, Dec 04 2018, Jonathan Tan wrote:

I meant to follow-up after Git Merge, but didn't remember until this
thread was bumped.

But some things I'd like to clarify / am concerned about...

> +when the server sends the packfile, it MAY send a `packfile-uris` section
> +directly before the `packfile` section (right after `wanted-refs` if it is
> +sent) containing HTTP(S) URIs. See protocol-v2.txt for the documentation of
> +this section.
> +
> +Clients then should understand that the returned packfile could be incomplete,
> +and that it needs to download all the given URIs before the fetch or clone is
> +complete. Each URI should point to a Git packfile (which may be a thin pack and
> +which may contain offset deltas).
> [...]
> +This is the implementation: a feature, marked experimental, that allows the
> +server to be configured by one or more `uploadpack.blobPackfileUri=<sha1>
> +<uri>` entries. Whenever the list of objects to be sent is assembled, a blob
> +with the given sha1 can be replaced by the given URI. This allows, for example,
> +servers to delegate serving of large blobs to CDNs.

Okey, so the server advertisement is not just "<urls>" but <oid><url>
pairs. More on this later...

> +While fetching, the client needs to remember the list of URIs and cannot
> +declare that the fetch is complete until all URIs have been downloaded as
> +packfiles.

And this. I don't quite understand this well enough, but maybe it helps
if I talk about what I'd expect out of CDN offloading. It comes down to
three things:

 * The server should be able to point to some "seed" packfiles *without*
   necessarily knowing what OIDs are in it, or have to tell the client.

 * The client should be able to just blindly get this data ("I guess
   this is where most of it is"), unpack it, see what OIDs it has, and
   *then* without initiating a new connection continue a want/have
   dialog.

   This effectively "bootstraps" a "clone" mid way into an arbitrary
   "fetch".

 * There should be no requirement that a client successfully downloads
   the advertised CDNs, for fault handling (also discussed in
   https://public-inbox.org/git/87lg2b6gg0.fsf@evledraar.gmail.com/)

More concretely, I'd like to have a setup where a server can just dumbly
point to some URL that probably has most of the data, without having any
idea what OIDs are in it. So that e.g. some machine entirely
disconnected from the server (and with just a regular clone) can
continually generating an up-to-date-enough packfile.

I don't see how this is compatible with the server needing to send a
bunch of "<oid> <url>" lines, or why a client "cannot declare that the
fetch is complete until all URIs have been downloaded as
packfiles". Can't it fall back on the normal dialog?

Other thoughts:

 * If there isn't such a close coordination between git server & CDN, is
   there a case for having pack *.idx files on the CDN, so clients can
   inspect them to see if they'd like to download the full referenced
   pack?

 * Without the server needing to know enough about the packs to
   advertise "<oid> <url>" is there a way to e.g. advertise 4x packs to
   clients:

       big.pack, last-month.pack, last-week.pack, last-day.pack

   Or some other optimistic negotiation where clients, even ones just
   doing regular fetches, can seek to get more up-to-date with one of
   the more recent packs before doing the first fetch in 3 days?

   In the past I'd toyed with creating a similar "not quite CDN" setup
   using git-bundle.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [WIP RFC 2/5] Documentation: add Packfile URIs design doc
  2019-02-19 13:22       ` Christian Couder
@ 2019-02-19 20:10         ` Jonathan Tan
  2019-02-22 11:35           ` Christian Couder
  0 siblings, 1 reply; 30+ messages in thread
From: Jonathan Tan @ 2019-02-19 20:10 UTC (permalink / raw)
  To: christian.couder; +Cc: jonathantanmy, sandals, git

> > Good points about SSH support and the client needing to control which
> > protocols the server will send URIs for. I'll include a line in the
> > client request in which the client can specify which protocols it is OK
> > with.
> 
> What if a client is ok to fetch from some servers but not others (for
> example github.com and gitlab.com but nothing else)?
> 
> Or what if a client is ok to fetch using SSH from some servers and
> HTTPS from other servers but nothing else?

The objects received from the various CDNs are still rehashed by the
client (so they are identified with the correct name), and if the client
is fetching from a server, presumably it can trust the URLs it receives
(just like it trusts ref names, and so on). Do you know of a specific
case in which a client wants to fetch from some servers but not others?
(In any case, if this happens, the client can just disable the CDN
support.)

> I also wonder in general how this would interact with promisor/partial
> clone remotes.
> 
> When we discussed promisor/partial clone remotes in the thread
> following this email:
> 
> https://public-inbox.org/git/20181016174304.GA221682@aiede.svl.corp.google.com/
> 
> it looked like you were ok with having many promisor remotes, which I
> think could fill the same use cases especially related to large
> objects.
>
> As clients would configure promisor remotes explicitly, there would be
> no issues about which protocol and servers are allowed or not.
> 
> If the issue is that you want the server to decide which promisor
> remotes would be used without the client having to do anything, maybe
> that could be something added on top of the possibility to have many
> promisor remotes.

It's true that there is a slight overlap with respect to large objects,
but this protocol can also handle large sets of objects being offloaded
to CDN, not only single ones. (The included implementation only handles
single objects, as a minimum viable product, but it is conceivable that
the server implementation is later expanded to allow offloading of sets
of objects.)

And this protocol is meant to be able to use CDNs to help serve objects,
whether single objects or sets of objects. In the case of promisor
remotes, the thing we fetch from has to be a Git server. (We could use
dumb HTTP from a CDN, but that defeats the purpose in at least one way -
with dumb HTTP, we have to fetch objects individually, but with URL
support, we can fetch objects as sets too.)

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [WIP RFC 2/5] Documentation: add Packfile URIs design doc
  2019-02-19 14:28   ` Ævar Arnfjörð Bjarmason
@ 2019-02-19 22:06     ` Jonathan Tan
  0 siblings, 0 replies; 30+ messages in thread
From: Jonathan Tan @ 2019-02-19 22:06 UTC (permalink / raw)
  To: avarab; +Cc: jonathantanmy, git

> > +when the server sends the packfile, it MAY send a `packfile-uris` section
> > +directly before the `packfile` section (right after `wanted-refs` if it is
> > +sent) containing HTTP(S) URIs. See protocol-v2.txt for the documentation of
> > +this section.
> > +
> > +Clients then should understand that the returned packfile could be incomplete,
> > +and that it needs to download all the given URIs before the fetch or clone is
> > +complete. Each URI should point to a Git packfile (which may be a thin pack and
> > +which may contain offset deltas).
> > [...]
> > +This is the implementation: a feature, marked experimental, that allows the
> > +server to be configured by one or more `uploadpack.blobPackfileUri=<sha1>
> > +<uri>` entries. Whenever the list of objects to be sent is assembled, a blob
> > +with the given sha1 can be replaced by the given URI. This allows, for example,
> > +servers to delegate serving of large blobs to CDNs.
> 
> Okey, so the server advertisement is not just "<urls>" but <oid><url>
> pairs. More on this later...

Actually, the server advertisement is just "<urls>". (The OID is there
to tell the server which object to omit if it sends the URL.) But I see
that the rest of your comments still stand.

> More concretely, I'd like to have a setup where a server can just dumbly
> point to some URL that probably has most of the data, without having any
> idea what OIDs are in it. So that e.g. some machine entirely
> disconnected from the server (and with just a regular clone) can
> continually generating an up-to-date-enough packfile.

Thanks for the concrete use case. Server ignorance would work in this
case, since the client can concisely communicate to the server what
objects it obtained from the CDN (in this case, through "have" lines),
but it does not seem to work in the general case (e.g. offloading large
blobs; or the CDN serving a pack suitable for a shallow clone -
containing all objects referenced by the last few commits, whether
changed in that commit or not).

In this case, maybe the batch job can also inform the server which
commit the CDN is prepared to serve.

> I don't see how this is compatible with the server needing to send a
> bunch of "<oid> <url>" lines, or why a client "cannot declare that the
> fetch is complete until all URIs have been downloaded as
> packfiles". Can't it fall back on the normal dialog?

As stated above, the server advertisement is just "<url>", but you're
right that the server still needs to know their corresponding OIDs (or
have some knowledge like "this pack contains all objects in between this
commit and that commit").

I was thinking that there is no normal dialog to be had with this
protocol, since (as above) in the general case, the client cannot
concisely communicate what objects it obtained from the CDN.

> Other thoughts:
> 
>  * If there isn't such a close coordination between git server & CDN, is
>    there a case for having pack *.idx files on the CDN, so clients can
>    inspect them to see if they'd like to download the full referenced
>    pack?

I'm not sure if I understand this fully, but off the top of my head, the
.idx file doesn't contain relations between objects, so I don't think
the client has enough information to decide if it wants to download the
corresponding packfile.

>  * Without the server needing to know enough about the packs to
>    advertise "<oid> <url>" is there a way to e.g. advertise 4x packs to
>    clients:
> 
>        big.pack, last-month.pack, last-week.pack, last-day.pack
> 
>    Or some other optimistic negotiation where clients, even ones just
>    doing regular fetches, can seek to get more up-to-date with one of
>    the more recent packs before doing the first fetch in 3 days?
> 
>    In the past I'd toyed with creating a similar "not quite CDN" setup
>    using git-bundle.

I think such optimistic downloading of packs during a regular fetch
would only work in a partial clone where "holes" are tolerated. (That
does bring up the possibility of having a fetch mode in which we
download potentially incomplete packfiles into a partial repo and then
"completing" the repo through a not-yet-implemented process, but I
haven't thought through this.)

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [WIP RFC 2/5] Documentation: add Packfile URIs design doc
  2019-02-19 13:44     ` Ævar Arnfjörð Bjarmason
@ 2019-02-21  1:09       ` brian m. carlson
  2019-02-22  9:34         ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 30+ messages in thread
From: brian m. carlson @ 2019-02-21  1:09 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Jonathan Tan, git

[-- Attachment #1: Type: text/plain, Size: 3331 bytes --]

On Tue, Feb 19, 2019 at 02:44:31PM +0100, Ævar Arnfjörð Bjarmason wrote:
> 
> On Tue, Dec 04 2018, brian m. carlson wrote:
> > First, I'd like to see a section (and a bit in the implementation)
> > requiring HTTPS if the original protocol is secure (SSH or HTTPS).
> > Allowing the server to downgrade to HTTP, even by accident, would be a
> > security problem.
> 
> Maybe I've misunderstood the design (I'm writing some other follow-up
> E-Mails in this thread which might clarify things for me), but I don't
> see why.
> 
> We get the ref advertisement from the server. We don't need to trust the
> CDN server or the transport layer. We just download whatever we get from
> there, validate the packfile with SHA-1 (and in the future SHA-256). It
> doesn't matter if the CDN transport is insecure.
> 
> You can do this offline with git today, you don't need to trust me to
> trust that my copy of git.git I give you on a sketchy USB stick is
> genuine. Just unpack it, then compare the SHA-1s you get with:
> 
>     git ls-remote https://github.com/git/git.git
> 
> So this is a case similar to Debian's where they distribute packages
> over http, but manifests over https: https://whydoesaptnotusehttps.com

This assumes that integrity of the data is the only reason you'd want to
use HTTPS. There's also confidentiality. Perhaps a user is downloading
data that will help them circumvent the Great Firewall of China. A
downgrade to HTTP could result in a long prison sentence.

Furthermore, some ISPs tamper with headers to allow tracking, and some
environments (e.g. schools and libraries) perform opportunistic
filtering on HTTP connections to filter certain content (and a lot of
this filtering is really simplistic).

Moreover, Google is planning on using this and filters in place of Git
LFS for large objects. I expect that if this approach becomes viable, it
may actually grow authentication functionality, or, depending on how the
series uses the existing code, it may already have it. In such a case,
we should not allow authentication to go over a plaintext connection
when the user thinks that the connection they're using is encrypted
(since they used an SSH or HTTPS URL to clone or fetch).

Downgrades from HTTPS to HTTP are generally considered CVE-worthy. We
need to make sure that we refuse to allow a downgrade on the client
side, even if the server ignores our request for a secure protocol.

> > Second, this feature likely should be opt-in for SSH. One issue I've
> > seen repeatedly is that people don't want to use HTTPS to fetch things
> > when they're using SSH for Git. Many people in corporate environments
> > have proxies that break HTTP for non-browser use cases[0], and using SSH
> > is the only way that they can make a functional Git connection.
> 
> Yeah, there should definitely be accommodations for such clients, per my
> reading clients can always ignore the CDN and proceed with a normal
> negotiation. Isn't that enough, or is something extra needed?

I think at least a config option and a command line flag are needed to
be able to turn CDN usage off. There needs to be an easy way for people
in broken environments to circumvent the breakage.
-- 
brian m. carlson: Houston, Texas, US
OpenPGP: https://keybase.io/bk2204

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 868 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [WIP RFC 2/5] Documentation: add Packfile URIs design doc
  2019-02-21  1:09       ` brian m. carlson
@ 2019-02-22  9:34         ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 30+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2019-02-22  9:34 UTC (permalink / raw)
  To: brian m. carlson; +Cc: Jonathan Tan, git


On Thu, Feb 21 2019, brian m. carlson wrote:

> On Tue, Feb 19, 2019 at 02:44:31PM +0100, Ævar Arnfjörð Bjarmason wrote:
>>
>> On Tue, Dec 04 2018, brian m. carlson wrote:
>> > First, I'd like to see a section (and a bit in the implementation)
>> > requiring HTTPS if the original protocol is secure (SSH or HTTPS).
>> > Allowing the server to downgrade to HTTP, even by accident, would be a
>> > security problem.
>>
>> Maybe I've misunderstood the design (I'm writing some other follow-up
>> E-Mails in this thread which might clarify things for me), but I don't
>> see why.
>>
>> We get the ref advertisement from the server. We don't need to trust the
>> CDN server or the transport layer. We just download whatever we get from
>> there, validate the packfile with SHA-1 (and in the future SHA-256). It
>> doesn't matter if the CDN transport is insecure.
>>
>> You can do this offline with git today, you don't need to trust me to
>> trust that my copy of git.git I give you on a sketchy USB stick is
>> genuine. Just unpack it, then compare the SHA-1s you get with:
>>
>>     git ls-remote https://github.com/git/git.git
>>
>> So this is a case similar to Debian's where they distribute packages
>> over http, but manifests over https: https://whydoesaptnotusehttps.com
>
> This assumes that integrity of the data is the only reason you'd want to
> use HTTPS. There's also confidentiality. Perhaps a user is downloading
> data that will help them circumvent the Great Firewall of China. A
> downgrade to HTTP could result in a long prison sentence.
>
> Furthermore, some ISPs tamper with headers to allow tracking, and some
> environments (e.g. schools and libraries) perform opportunistic
> filtering on HTTP connections to filter certain content (and a lot of
> this filtering is really simplistic).
>
> Moreover, Google is planning on using this and filters in place of Git
> LFS for large objects. I expect that if this approach becomes viable, it
> may actually grow authentication functionality, or, depending on how the
> series uses the existing code, it may already have it. In such a case,
> we should not allow authentication to go over a plaintext connection
> when the user thinks that the connection they're using is encrypted
> (since they used an SSH or HTTPS URL to clone or fetch).
>
> Downgrades from HTTPS to HTTP are generally considered CVE-worthy. We
> need to make sure that we refuse to allow a downgrade on the client
> side, even if the server ignores our request for a secure protocol.

All good points, I definitely agree we shouldn't do downgrading by
default for the reasons you've outlined, and e.g. make this an opt-in.

I'm just mindful that git's used as infrastructure in a lot of unusual
case, e.g. something like what apt's doing (after carefully weighing
http v.s. https for their use-case).

So I think providing some optional escape hatch is still a good idea.

>> > Second, this feature likely should be opt-in for SSH. One issue I've
>> > seen repeatedly is that people don't want to use HTTPS to fetch things
>> > when they're using SSH for Git. Many people in corporate environments
>> > have proxies that break HTTP for non-browser use cases[0], and using SSH
>> > is the only way that they can make a functional Git connection.
>>
>> Yeah, there should definitely be accommodations for such clients, per my
>> reading clients can always ignore the CDN and proceed with a normal
>> negotiation. Isn't that enough, or is something extra needed?
>
> I think at least a config option and a command line flag are needed to
> be able to turn CDN usage off. There needs to be an easy way for people
> in broken environments to circumvent the breakage.

Yeah, but let's try hard to make it Just Work. I.e. if in the middle of
the dialog the CDN connection is broken can we retry then, and if that
fails just continue with negotiation against the server?

As opposed to erroring by default, and the user needing to retry with
some config option...

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [WIP RFC 2/5] Documentation: add Packfile URIs design doc
  2019-02-19 20:10         ` Jonathan Tan
@ 2019-02-22 11:35           ` Christian Couder
  0 siblings, 0 replies; 30+ messages in thread
From: Christian Couder @ 2019-02-22 11:35 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: brian m. carlson, git

On Tue, Feb 19, 2019 at 9:10 PM Jonathan Tan <jonathantanmy@google.com> wrote:
>
> > > Good points about SSH support and the client needing to control which
> > > protocols the server will send URIs for. I'll include a line in the
> > > client request in which the client can specify which protocols it is OK
> > > with.
> >
> > What if a client is ok to fetch from some servers but not others (for
> > example github.com and gitlab.com but nothing else)?
> >
> > Or what if a client is ok to fetch using SSH from some servers and
> > HTTPS from other servers but nothing else?
>
> The objects received from the various CDNs are still rehashed by the
> client (so they are identified with the correct name), and if the client
> is fetching from a server, presumably it can trust the URLs it receives
> (just like it trusts ref names, and so on). Do you know of a specific
> case in which a client wants to fetch from some servers but not others?

For example I think the Great Firewall of China lets people in China
use GitHub.com but not Google.com. So if people start configuring
their repos on GitHub so that they send packs that contain Google.com
CDN URLs (or actually anything that the Firewall blocks), it might
create many problems for users in China if they don't have a way to
opt out of receiving packs with those kind of URLs.

> (In any case, if this happens, the client can just disable the CDN
> support.)

Would this mean that people in China will not be able to use the
feature at all, because too many of their clones could be blocked? Or
that they will have to create forks to mirror any interesting repo and
 reconfigure those forks to work well from China?

> > I also wonder in general how this would interact with promisor/partial
> > clone remotes.
> >
> > When we discussed promisor/partial clone remotes in the thread
> > following this email:
> >
> > https://public-inbox.org/git/20181016174304.GA221682@aiede.svl.corp.google.com/
> >
> > it looked like you were ok with having many promisor remotes, which I
> > think could fill the same use cases especially related to large
> > objects.
> >
> > As clients would configure promisor remotes explicitly, there would be
> > no issues about which protocol and servers are allowed or not.
> >
> > If the issue is that you want the server to decide which promisor
> > remotes would be used without the client having to do anything, maybe
> > that could be something added on top of the possibility to have many
> > promisor remotes.
>
> It's true that there is a slight overlap with respect to large objects,
> but this protocol can also handle large sets of objects being offloaded
> to CDN, not only single ones.

Isn't partial clone also designed to handle large sets of objects?

> (The included implementation only handles
> single objects, as a minimum viable product, but it is conceivable that
> the server implementation is later expanded to allow offloading of sets
> of objects.)
>
> And this protocol is meant to be able to use CDNs to help serve objects,
> whether single objects or sets of objects. In the case of promisor
> remotes, the thing we fetch from has to be a Git server.

When we discussed the plan for many promisor remotes, Jonathan Nieder
(in the email linked above) suggested:

 2. Simplifying the protocol for fetching missing objects so that it
    can be satisfied by a lighter weight object storage system than
    a full Git server.  The ODB helpers introduced in this series are
    meant to speak such a simpler protocol since they are only used
    for one-off requests of a collection of missing objects instead of
    needing to understand refs, Git's negotiation, etc.

and I agreed with that point.

Is there something that you don't like in many promisor remotes?

> (We could use
> dumb HTTP from a CDN, but that defeats the purpose in at least one way -
> with dumb HTTP, we have to fetch objects individually, but with URL
> support, we can fetch objects as sets too.)

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2019-02-22 11:35 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-12-03 23:37 [WIP RFC 0/5] Design for offloading part of packfile response to CDN Jonathan Tan
2018-12-03 23:37 ` [WIP RFC 1/5] Documentation: order protocol v2 sections Jonathan Tan
2018-12-05  4:10   ` Junio C Hamano
2018-12-06 22:54     ` Jonathan Tan
2018-12-09  0:15       ` Junio C Hamano
2018-12-03 23:37 ` [WIP RFC 2/5] Documentation: add Packfile URIs design doc Jonathan Tan
2018-12-04  0:21   ` Stefan Beller
2018-12-04  1:54   ` brian m. carlson
2018-12-04 19:29     ` Jonathan Tan
2019-02-19 13:22       ` Christian Couder
2019-02-19 20:10         ` Jonathan Tan
2019-02-22 11:35           ` Christian Couder
2019-02-19 13:44     ` Ævar Arnfjörð Bjarmason
2019-02-21  1:09       ` brian m. carlson
2019-02-22  9:34         ` Ævar Arnfjörð Bjarmason
2018-12-05  5:02   ` Junio C Hamano
2018-12-05  5:55     ` Junio C Hamano
2018-12-06 23:16     ` Jonathan Tan
2019-02-19 14:28   ` Ævar Arnfjörð Bjarmason
2019-02-19 22:06     ` Jonathan Tan
2018-12-03 23:37 ` [WIP RFC 3/5] upload-pack: refactor reading of pack-objects out Jonathan Tan
2018-12-04  0:30   ` Stefan Beller
2018-12-05  6:30   ` Junio C Hamano
2018-12-03 23:37 ` [WIP RFC 4/5] upload-pack: refactor writing of "packfile" line Jonathan Tan
2018-12-06  6:35   ` Junio C Hamano
2018-12-06 23:25     ` Jonathan Tan
2018-12-07  0:22       ` Junio C Hamano
2018-12-03 23:37 ` [WIP RFC 5/5] upload-pack: send part of packfile response as uri Jonathan Tan
2018-12-04 20:09   ` Stefan Beller
2018-12-04  0:01 ` [WIP RFC 0/5] Design for offloading part of packfile response to CDN Stefan Beller

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).