git@vger.kernel.org mailing list mirror (one of many)
From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Derrick Stolee <stolee@gmail.com>
Cc: git@vger.kernel.org, Junio C Hamano <gitster@pobox.com>,
	Jeff King <peff@peff.net>, Patrick Steinhardt <ps@pks.im>,
	Christian Couder <christian.couder@gmail.com>,
	Albert Cui <albertqcui@gmail.com>,
	Jonathan Tan <jonathantanmy@google.com>,
	Jonathan Nieder <jrnieder@gmail.com>,
	"brian m . carlson" <sandals@crustytoothpaste.net>,
	"Robin H . Johnson" <robbat2@gentoo.org>
Subject: Re: [PATCH 0/3] bundle-uri: "dumb" static CDN offloading, spec & server implementation
Date: Sat, 30 Oct 2021 09:21:38 +0200
Message-ID: <211030.86mtmr3po0.gmgdl@evledraar.gmail.com>
In-Reply-To: <c9c4e1e7-aaa2-bbad-355b-8439fad93fa7@gmail.com>


On Fri, Oct 29 2021, Derrick Stolee wrote:

> On 10/25/2021 5:25 PM, Ævar Arnfjörð Bjarmason wrote:
>> This implements a new "bundle-uri" protocol v2 extension, which allows
>> servers to advertise *.bundle files which clients can pre-seed their
>> full "clone"'s or incremental "fetch"'s from.
>> 
>> This is both an alternative to, and complimentary to the existing
>> "packfile-uri" mechanism, i.e. servers and/or clients can pick one or
>> both, but would generally pick one over the other.
>> 
>> This "bundle-uri" mechanism has the advantage of being dumber, and
>> offloads more complexity from the server side to the client
>> side.
>
> Generally, I like that using bundles presents an easier way to serve
> static content from an alternative source and then let Git's fetch
> negotiation catch up with the remainder.
>
> However, after inspecting your design and talking to some GitHub
> engineers who know more about CDNs and general internet things than I
> do, I want to propose an alternative design. I think this new design
> is simultaneously more flexible as well as promotes further decoupling
> of the origin Git server and the bundle contents.
>
> Your proposed design extends protocol v2 to let the client request a
> list of bundle URIs from the origin server. However, this still requires
> the origin server to know about this list. [...]

Interesting, more below...

> Further, your implementation focuses on the server side without
> integrating with the client.

Do you mean these 3 patches we're discussing now? Yes, that's the
server-side and protocol specification only, because I figured talking
about just the spec might be helpful.

But as noted in the cover letter and previously on-list, I have a larger
set of patches to implement the client behavior. An old RFC version of
that is here (I've since changed some things...):
https://lore.kernel.org/git/RFC-cover-00.13-0000000000-20210805T150534Z-avarab@gmail.com/

I mean, you commented on those too, so I'm not sure if that's what you
meant, but just for context...

> I propose that we flip this around. The "bundle server" should know
> which bundles are available at which URIs, and the client should contact
> the bundle server directly for a "table of contents" that lists these
> URIs, along with metadata related to each URI. The origin Git server
> then would only need to store the list of bundle servers and the URIs
> to their table of contents. The client could then pick from among those
> bundle servers (probably by ping time, or randomly) to start the bundle
> downloads.

I hadn't considered the server not advertising the list, but pointing to
another URI that has the list. I was thinking that the server would be
close enough to whatever's generating the list that updating the list
there wouldn't be a meaningful limitation for anyone.

But you seem to have a use-case for it. I'd be curious to hear why
specifically, but in any case that's easy to support in the client
patches I have.

There's a point at which we get the list of URIs from the server. To
support your case the server would just advertise the one TOC URI.

Then, similarly to the "packfile-uri" special-case of handling a *.bundle
instead of a PACK that I noted in [1], the downloader would just spot
"oh, this isn't a bundle, but a list of URIs", then fetch those (even
recursively), and eventually get to *.bundle files.
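
To make that concrete, here's a rough Python sketch of such a downloader
loop. It's not what my patches do. The "one URI per line" list format,
the resolve_bundles() name and the injected fetch() callable are all made
up for illustration; only the two bundle signatures are real.

```python
from typing import Callable, Iterator, Optional, Set

# Signatures that open a git bundle file (bundle format v2 and v3).
BUNDLE_SIGNATURES = (b"# v2 git bundle\n", b"# v3 git bundle\n")


def is_bundle(payload: bytes) -> bool:
    """True if the payload starts with a known bundle signature."""
    return payload.startswith(BUNDLE_SIGNATURES)


def resolve_bundles(
    uri: str,
    fetch: Callable[[str], bytes],
    seen: Optional[Set[str]] = None,
    max_depth: int = 8,
) -> Iterator[str]:
    """Yield the URIs of actual bundles reachable from `uri`.

    ASSUMPTION: anything that is not a bundle is a list with one URI per
    line. The real list format was not settled in this thread.
    """
    seen = set() if seen is None else seen
    if uri in seen or max_depth < 0:
        return  # guard against cycles and runaway nesting
    seen.add(uri)
    payload = fetch(uri)
    if is_bundle(payload):
        yield uri
        return
    for line in payload.decode("utf-8").splitlines():
        if line.strip():
            yield from resolve_bundles(line.strip(), fetch, seen, max_depth - 1)
```

A real client would only need the first few bytes of a response to tell
the two cases apart, rather than the whole payload as above.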

> To summarize, there are two pieces here, that can be implemented at
> different times:
>
> 1. Create a specification for a "bundle server" that doesn't need to
>    speak the Git protocol at all. This could be a REST API specification
>    using well-established standards such as JSON for the table of
>    contents.
>
> 2. Create a way for the origin Git server to advertise known bundle
>    servers to clients so they can automatically benefit from faster
>    downloads without needing to know about bundle servers.
>
> There are a few key benefits to this approach:
>
>  * Further decoupling. The origin Git server doesn't need to know how
>    the bundle server organizes its bundles. This allows maximum flexibility
>    depending on whether the bundles are stored in something like a CDN
>    (where bundles can't be too big) or some kind of blob storage (where
>    they can have arbitrarily large size).
>
>  * The bundle servers could be run completely independently from the
>    origin Git server. Organizations could run their own bundle servers to
>    host data in the same building as their build farms. As long as they
>    can configure the bundle location at clone/fetch time, the origin Git
>    server doesn't need to be involved.
>
> While I didn't go so far as to create a clear standard or implement a
> prototype in the Git codebase, I created a very simple prototype [1] using
> a python script that parses a JSON table of contents and downloads
> bundles into the Git repository. Then, I made a 'clone.sh' script that
> initializes a repository using the bundle fetcher and fetching the
> remainder from the origin Git server. I even computed static bundles for
> the git.git repository based on where 'master' has been over several days
> in the past month, to give an example of incremental bundles. You can
> test the approach all the way to including the fetch to github.com (note
> how the GitHub servers were not modified in any way for this).
>
> [1] https://github.com/derrickstolee/bundles
>
> There are a lot of limitations to the prototype, but it hopefully
> demonstrates the possibility of using something other than the Git protocol
> to solve these problems.

In your proposal the TOC bundle itself doesn't need to speak the git
protocol.

But as soon as we specify such a thing all of that becomes a part of
the git protocol at large in any meaningful way, i.e. git.git's client
and any other client that wants to implement the full protocol at large
would now need to understand not only pkt-line but also ship a JSON
decoder etc.

I don't see an inherent problem with us wanting to support some nested
encoding format as part of the protocol, but JSON seems like a
particularly bad choice. It's specified as UTF-8 only (or rather, "a
Unicode encoding"), so you can't stick both valid UTF-8 and binary data
into it.

Our refs on the other hand don't conform to that, so having a JSON
format means you can never have something that refers to refnames, which
seems like an odd limitation given that we're talking about bundles,
whose own header already has that information.
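
As a quick illustration of the encoding problem (a sketch only, with a
made-up refname and a hypothetical can_be_json_string() helper), a
refname containing a lone Latin-1 byte cannot be carried as a JSON
string without inventing some extra escaping layer on top:

```python
import json

# A refname containing the single byte 0xE9 (Latin-1 "é"); not valid UTF-8.
refname = b"refs/heads/caf\xe9"


def can_be_json_string(raw: bytes) -> bool:
    """True if the raw bytes can be carried losslessly as a JSON string."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    # Round-trip through JSON to confirm nothing is lost.
    return json.loads(json.dumps(text)).encode("utf-8") == raw


print(can_be_json_string(b"refs/heads/main"))  # True
print(can_be_json_string(refname))             # False
```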

> Let me know if you are interested in switching your approach to something
> more like what I propose here. There are many more questions about what
> information could/should be located in the table of contents and how it can
> be extended in the future. I'm interested to explore that space with you.

As noted above, the TOC part of this seems interesting, and I don't see
a reason not to implement that.

But as noted in [1] I really don't see why it would be a good idea to
implement a side-format that's encoding a limited subset of what you'd
find in bundle headers.

Specifically on the meta-information you're proposing:

== requires

In your example you've added a monolithic "requires" relationship
between bundles, saying "This assumes that the bundles can be ordered".

But that's not something you can assume for actual bundle files,
i.e. the prerequisite relationship is per-reftip. It's not the case that
a given bundle requires another bundle; rather, tips found in them may
or may not depend on other prerequisites.

If you're creating bundles that contain only one tip there's a 1:1
mapping to what you're proposing with "requires", but otherwise there
isn't.
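
For reference, here's a minimal Python sketch of reading a v2 bundle
header. The parse_bundle_header() name is made up and the object IDs in
any example input would be placeholders. The point is only that the
header names prerequisite commits by object ID next to a list of tips,
and never names another bundle file:

```python
from typing import Dict, List, Tuple


def parse_bundle_header(header: bytes) -> Tuple[List[bytes], Dict[bytes, bytes]]:
    """Split a v2 bundle header into (prerequisite OIDs, {refname: tip OID})."""
    lines = header.split(b"\n")
    if lines[0] != b"# v2 git bundle":
        raise ValueError("not a v2 bundle header")
    prerequisites: List[bytes] = []
    tips: Dict[bytes, bytes] = {}
    for line in lines[1:]:
        if not line:
            break  # a blank line ends the header; pack data follows
        if line.startswith(b"-"):
            # "-<oid> <comment>": a commit the receiver must already have
            prerequisites.append(line[1:].split(b" ", 1)[0])
        else:
            # "<oid> <refname>": a tip this bundle provides
            oid, refname = line.split(b" ", 1)
            tips[refname] = oid
    return prerequisites, tips
```

Whether a given tip can be used then depends on which of those commits
the receiver already has, not on which other bundle files it downloaded.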

== timestamp

"This allows us to reexamine the table of contents and only download the
bundles that are newer than that timestamp."

We're usually going to be fetching these over http(s), why duplicate
what you can already get if the server just takes care to create unique
filenames (e.g. as a function of the SHA of their contents), and then
provides appropriate caching headers to a client so that they'll be
cached forever?

I think that gives you everything you'd like out of the "timestamp" and
more, the "more" being that since it's part of a protocol that's already
standard you'd have e.g. intermediate caching proxies understanding this
implicitly, in addition to the git client itself.

So on a network that's, say, locally unpacking https connections to a
public CDN you could have a local caching proxy for your N local
clients, as opposed to a custom "timestamp" value, which only each local
git client will understand.
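
A sketch of what I mean on the publishing side, in Python. The choice of
SHA-256, the one-year max-age and the publish_name_and_headers() name are
my assumptions for illustration, not anything in the patches:

```python
import hashlib
from typing import Dict, Tuple


def publish_name_and_headers(bundle: bytes) -> Tuple[str, Dict[str, str]]:
    """Return a content-derived filename and cache headers for a bundle."""
    # ASSUMPTION: SHA-256 of the file contents; any stable hash would do.
    digest = hashlib.sha256(bundle).hexdigest()
    headers = {
        # The name changes whenever the content does, so the response
        # can be cached by clients and proxies for as long as they like.
        "Cache-Control": "public, max-age=31536000, immutable",
        "ETag": f'"{digest}"',
    }
    return f"{digest}.bundle", headers
```

A client or proxy that has already seen a given filename never needs to
ask for it again, which is what "timestamp" was trying to achieve.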

== Generally

Sorry, I've got to run, so I haven't addressed all the things you
brought up, but generally I think that the TOC idea is a good one.

I just don't see a reason why most/all of the other bits shouldn't be
leaning into either the bundle header (and for any TOC shortcut, dump it
as-is, as noted in [1]), or in the case of "timestamp" lean into the
properties of the transport protocol.

And just generally on overall protocol complexity, wouldn't it be OK if
any such TOC is just in pkt-line format?

We could just provide a git plumbing tool to spew that out, and having
some static server job call that once and serve up a plain file from
then on doesn't seem like a big restriction, and would mean that any
git client code wouldn't need to deal with another encoding format.
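
To show how little is involved, here's a Python sketch of such a file.
The pkt-line framing itself is the existing one (4 hex digits of length
that count themselves, "0000" as flush), but "one URI per pkt-line" and
the encode_toc()/decode_toc() names are purely hypothetical:

```python
from typing import Iterable, List

FLUSH_PKT = b"0000"
MAX_PKT_LEN = 65520  # total size of one pkt-line, length prefix included


def pkt_line(payload: bytes) -> bytes:
    """Frame one payload: 4 hex digits of total length, then the payload."""
    length = len(payload) + 4  # the length prefix counts itself
    if length > MAX_PKT_LEN:
        raise ValueError("payload too large for one pkt-line")
    return b"%04x" % length + payload


def encode_toc(uris: Iterable[str]) -> bytes:
    """ASSUMPTION: the TOC is one URI per pkt-line, ended by a flush-pkt."""
    return b"".join(pkt_line(u.encode("utf-8") + b"\n") for u in uris) + FLUSH_PKT


def decode_toc(data: bytes) -> List[str]:
    """Read URIs back until the flush-pkt."""
    uris: List[str] = []
    pos = 0
    while True:
        length = int(data[pos:pos + 4].decode("ascii"), 16)
        if length == 0:
            return uris
        if length < 4:
            raise ValueError("unexpected special packet in TOC")
        uris.append(data[pos + 4:pos + length].decode("utf-8").rstrip("\n"))
        pos += length
```

A static server job could write the output of something like
encode_toc() to a file once and serve it as-is.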

1. https://lore.kernel.org/git/211027.86a6iuxk3x.gmgdl@evledraar.gmail.com/
