git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Derrick Stolee <derrickstolee@github.com>
To: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>, git@vger.kernel.org
Cc: Junio C Hamano <gitster@pobox.com>,
	Jonathan Tan <jonathantanmy@google.com>,
	Jonathan Nieder <jrnieder@gmail.com>,
	Albert Cui <albertqcui@gmail.com>,
	"Robin H . Johnson" <robbat2@gentoo.org>,
	Teng Long <dyroneteng@gmail.com>
Subject: Re: [RFC PATCH v2 00/13] bundle-uri: a "dumb CDN" for git
Date: Fri, 11 Mar 2022 16:28:17 -0500	[thread overview]
Message-ID: <d745488d-d70e-0b1d-ce08-ce276a9d4310@github.com> (raw)
In-Reply-To: <RFC-cover-v2-00.13-00000000000-20220311T155841Z-avarab@gmail.com>

On 3/11/2022 11:24 AM, Ævar Arnfjörð Bjarmason wrote:
> Per recent discussion[1] this is my not-quite-feature-complete version
> of the bundle-uri capability.  This was sent to the list in some form
> beforfe in [2] and [3].
> 
> Recently Derrick Stolee has sent an alternate implementation of some
> of the same ideas in [4]. Per [1] we're planning to work together on
> getting a version of this into git that makes everyone happy, sending
> what I've got here is the first step in that.

Thanks! It's good to see your intended end-to-end for comparison
before we start the combined effort. I look forward to the additional
details coming next week, because there are a lot of optimizations in
there that will inform our direction.
 
> A high-level summary of the important differences in my approach &
> Derrick's (which I hope I'm summarizing fairly here) is that his
> approach optionally adds a bundle TOC format, that format allows you
> to define topology relationships between bundles to guide a
> (returning) client in what it needs to fetch.
> 
> Whereas the idea in this series is to lean entirely on the client
> downloading bundles & inferring what needs to be done via the
> tip/prereqs listed in the header format of the (existing, not changed
> here) bundle format.
> 
> Both have pros & cons, I started trying to summarize those, but let's
> leave that for later.

I know you were in a rush to deliver this, so I'm going to assume
that "leave that for later" was "I didn't have time to write that
here". The very high-level comparison that I gathered from our
chat were these points:

 1. My version uses the TOC and its "timestamp" heuristic to
    minimize how much data the client needs to download on a "no-op"
    fetch. Your version requires that the client downloads some
    initial range of data from each advertised URI.

 2. My version lets the TOC sit at one well-known URI that can be
    advertised independently of the origin server. I don't see an
    equivalent in yours (so far).

 3. The TOC in my version allows the server advertisement be an "OR"
    (download from any of these locations... some might be closer to
    you than others) and yours is an "AND" (this is the list of
    bundles... you probably need all of them for a full clone). This
    difference is something that can be worked into the advertisement
    to allow both modes, if that is a valuable mechanism.

I'm also interested to see how you allow for someone to create their
own local bundle server that is independent of the origin Git server
(so the bundle-uri advertisement does not help them see the bundles).
Perhaps that's just not part of your design, so will be part of the
combined effort.

> There's also some high-level "journey" differences in the
> two. E.g. Derrick implemented the ability to have "git bundle" itself
> fetch bundles remotely, I don't have that and instead lean entirely on
> the protocol and "fetch". Those differences really aren't important,
> and we can have our cake & eat it too on that front. I.e. end up with
> some sensible intersection (or union) of the tooling.

I agree that this is something to work out later. I think it is nice
to allow the user a way to download bundles from a specific URI if
they happen to have one, but that could easily be embedded into a
'git fetch <bundle-URI>' kind of command. Please redesign this as you
see fit when combining efforts.

> I ran out of time to finish up some of what I had on this topic this
> week, but figured (especially since I'd promised to get it done this
> week) to send what I have now for discussion.
> 
> Things missing & reader's notes:
> 
>  * I had some amendmends to the protocol I meant to distill further
>    into the protocol docs at [5]. Basically omitting the ability to
>    transmit key-values and to have it just be a list of URIs with an
>    optional <header> for each one, which is purely a server-to-client
>    aid (i.e. those headers will be what you'll find in the pointed-to
>    bundles).

As long as early versions of the client can ignore the extra key-value
pairs advertised by later versions without issue, it makes sense to
avoid this in early versions.

It would be nice to delay the use of advertising these headers inline
with the advertisement until more of the idea is made concrete. In
particular, the more we can strip out things in early versions that
can be applied later as an optimization, the better. I'm thinking
specifically about how your incremental fetch story will download only
the headers of the bundles to discover which ones are necessary. That
can also be used in the TOC with a timestamp heuristic to discover
that the client already has all of the information from the latest
bundle, even though the server timestamp advanced. Showing the value
to that case _plus_ the "AND" case of bundle-uri advertisements
would be a nice justification for the complication involved there.

> * This series goes up to "clone", but I also have incremental fetch
>   patches. I ran into an (easily solvable bug) in that & thought it
>   was best to omit it for now. It'll be here soon.
> 
>   Basically for incremental fetch we'll in parallel get the headers
>   for remote bundles, and then do an early abort of those downloads
>   that we deem that we don't need.

This is the main thing I was missing in our earlier discussions
(in August and October): this feature of downloading the headers for
the remote bundles is critical for allowing incremental fetch to
work in your model. It's a clever way to solve the problem.

I'm interested to see how well it performs in real-world scenarios.

I'm imagining a way to incrementally build things from simplest to
most complicated, and it goes in this order:

 0. Implement 'git clone --bundle-uri=<X> <url>' where we expect
    a bundle at the given uri <X>.

 1. Implement 'git clone <url>' to understand a bundle-uri
    advertisement with AND (get all bundles and unbundle them in
    some order before fetching) and OR (get any _one_ of these
    full bundles) logic.

 2. Extend the bundle downloading to understand a TOC, allowing
    the OR advertisement to advertise TOC (perhaps guarded with
    some metadata in the advertisement).

 3. For the TOC model, allow 'git fetch' to update from the TOC.

 4. Extend the AND advertisements to do parallel header-only
    downloads to integrate with 'git fetch'. The implementation
    of these pieces also improve performance of the TOC model.

This is me just spitballing of a way that we can make incremental
progress towards this feature without needing to go super-deep
into one model or the other before we are able to test this
example against real-world bundle server prototypes.

>   Clever (but all standard & boring) use of HTTP caching headers
>   between client & servers then allows the client to not request the
>   same thing again and again. I.e. want less server load on your CDN?
>   Just have the bundles be unique URLs and set appropriate caching
>   headers.

I was hoping that the TOC model would avoid the need for cleverness
at all, but I'm interested to see what we can do with these tricks
in all cases.
 
> * A problem with this implementation (and Derrick's, I believe) is
>   that it keeps a server waiting while we twiddle our thumbs and wget
>   (or equivalent) the bundle(s) locally. If you e.g. clone
>   "chromium.git" the server will get tired of waiting, drop the
>   connection, and unless the bundle is 100% up-to-date the "clone"
>   will fail.

This is absolutely the case with my implementation. I call it out
with comments, but I didn't have a solution in mind other than
"disconnect if necessary, then reconnect."
 
>   The solution to this is to get the bundle headers in parallel, and
>   as soon as we've got them present the OIDs in the headers as "HAVE"
>   to the server, which'll then send us an incremental PACK (and even
>   possibly a packfile-uri) for the difference between those bundle(s)
>   and what its tips are.
> 
>   We can then simply disconnect, download/index-pack everything, and
>   do a connectivity checkat the end.

This is a clever idea and is absolutely something that can be done
as a later step (even after step 4 from my outline above).

>   This requires some careful error handling, e.g. if the resulting
>   repo isn't valid we'll need to retry, and the current negotiation
>   API isn't very happy to negotiate on the basis of OIDs we don't have
>   in the object store (but the protocol handles it just fine).

This is exactly why we shouldn't over-index on that idea too early,
but definitely keep it in our back pockets for a future improvement.

Thanks for the detailed cover letter. I hope to look more closely
at the patches themselves for feedback early next week.

Thanks,
-Stolee

  parent reply	other threads:[~2022-03-11 22:45 UTC|newest]

Thread overview: 77+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-10-25 21:25 [PATCH 0/3] bundle-uri: "dumb" static CDN offloading, spec & server implementation Ævar Arnfjörð Bjarmason
2021-10-25 21:25 ` [PATCH 1/3] leak tests: mark t5701-git-serve.sh as passing SANITIZE=leak Ævar Arnfjörð Bjarmason
2021-10-25 21:25 ` [PATCH 2/3] protocol v2: specify static seeding of clone/fetch via "bundle-uri" Ævar Arnfjörð Bjarmason
2021-10-26 14:00   ` Derrick Stolee
2021-10-26 15:00     ` Ævar Arnfjörð Bjarmason
2021-10-27  1:55       ` Derrick Stolee
2021-10-27 17:49         ` Ævar Arnfjörð Bjarmason
2021-10-27  2:01   ` Derrick Stolee
2021-10-27  8:29     ` Ævar Arnfjörð Bjarmason
2021-10-27 16:31       ` Derrick Stolee
2021-10-27 18:01         ` Ævar Arnfjörð Bjarmason
2021-10-27 19:23           ` Derrick Stolee
2021-10-27 20:22             ` Ævar Arnfjörð Bjarmason
2021-10-29 18:30               ` Derrick Stolee
2021-10-30 14:51           ` Philip Oakley
2021-10-25 21:25 ` [PATCH 3/3] bundle-uri client: add "bundle-uri" parsing + tests Ævar Arnfjörð Bjarmason
2021-10-26 14:05   ` Derrick Stolee
2021-10-29 18:46 ` [PATCH 0/3] bundle-uri: "dumb" static CDN offloading, spec & server implementation Derrick Stolee
2021-10-30  7:21   ` Ævar Arnfjörð Bjarmason
2021-11-01 21:00     ` Derrick Stolee
2021-11-01 23:18       ` Ævar Arnfjörð Bjarmason
2022-03-11 16:24 ` [RFC PATCH v2 00/13] bundle-uri: a "dumb CDN" for git Ævar Arnfjörð Bjarmason
2022-03-11 16:24   ` [RFC PATCH v2 01/13] protocol v2: add server-side "bundle-uri" skeleton Ævar Arnfjörð Bjarmason
2022-03-11 16:24   ` [RFC PATCH v2 02/13] bundle-uri docs: add design notes Ævar Arnfjörð Bjarmason
2022-03-11 16:24   ` [RFC PATCH v2 03/13] bundle-uri client: add "bundle-uri" parsing + tests Ævar Arnfjörð Bjarmason
2022-03-11 16:24   ` [RFC PATCH v2 04/13] connect.c: refactor sending of agent & object-format Ævar Arnfjörð Bjarmason
2022-03-11 16:24   ` [RFC PATCH v2 05/13] bundle-uri client: add minimal NOOP client Ævar Arnfjörð Bjarmason
2022-03-11 16:24   ` [RFC PATCH v2 06/13] bundle-uri client: add "git ls-remote-bundle-uri" Ævar Arnfjörð Bjarmason
2022-03-11 16:24   ` [RFC PATCH v2 07/13] bundle-uri client: add transfer.injectBundleURI support Ævar Arnfjörð Bjarmason
2022-03-11 16:24   ` [RFC PATCH v2 08/13] bundle-uri client: add boolean transfer.bundleURI setting Ævar Arnfjörð Bjarmason
2022-03-11 16:24   ` [RFC PATCH v2 09/13] fetch-pack: add a deref_without_lazy_fetch_extended() Ævar Arnfjörð Bjarmason
2022-03-11 16:24   ` [RFC PATCH v2 10/13] fetch-pack: move --keep=* option filling to a function Ævar Arnfjörð Bjarmason
2022-03-11 16:24   ` [RFC PATCH v2 11/13] bundle.h: make "fd" version of read_bundle_header() public Ævar Arnfjörð Bjarmason
2022-03-11 16:24   ` [RFC PATCH v2 12/13] bundle-uri client: support for bundle-uri with "clone" Ævar Arnfjörð Bjarmason
2022-03-11 16:24   ` [RFC PATCH v2 13/13] bundle-uri: make the download program configurable Ævar Arnfjörð Bjarmason
2022-03-11 21:28   ` Derrick Stolee [this message]
2022-04-18 17:23   ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 01/36] connect.c: refactor sending of agent & object-format Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 02/36] dir API: add a generalized path_match_flags() function Ævar Arnfjörð Bjarmason
2022-04-21 17:26       ` Derrick Stolee
2022-04-18 17:23     ` [RFC PATCH v2 03/36] fetch-pack: add a deref_without_lazy_fetch_extended() Ævar Arnfjörð Bjarmason
2022-04-21 17:28       ` Derrick Stolee
2022-04-18 17:23     ` [RFC PATCH v2 04/36] fetch-pack: move --keep=* option filling to a function Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 05/36] http: make http_get_file() external Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 06/36] remote: move relative_url() Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 07/36] remote: allow relative_url() to return an absolute url Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 08/36] bundle.h: make "fd" version of read_bundle_header() public Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 09/36] protocol v2: add server-side "bundle-uri" skeleton Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 10/36] bundle-uri client: add "bundle-uri" parsing + tests Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 11/36] bundle-uri client: add minimal NOOP client Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 12/36] bundle-uri client: add "git ls-remote-bundle-uri" Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 13/36] bundle-uri client: add transfer.injectBundleURI support Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 14/36] bundle-uri client: add boolean transfer.bundleURI setting Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 15/36] bundle-uri client: support for bundle-uri with "clone" Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 16/36] bundle-uri: make the download program configurable Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 17/36] remote-curl: add 'get' capability Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 18/36] bundle: implement 'fetch' command for direct bundles Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 19/36] bundle: parse table of contents during 'fetch' Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 20/36] bundle: add --filter option to 'fetch' Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 21/36] bundle: allow relative URLs in table of contents Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 22/36] bundle: make it easy to call 'git bundle fetch' Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 23/36] clone: add --bundle-uri option Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 24/36] clone: --bundle-uri cannot be combined with --depth Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 25/36] bundle: only fetch bundles if timestamp is new Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 26/36] fetch: fetch bundles before fetching original data Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 27/36] protocol-caps: implement cap_features() Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 28/36] serve: understand but do not advertise 'features' capability Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 29/36] serve: advertise 'features' when config exists Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 30/36] connect: implement get_recommended_features() Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 31/36] transport: add connections for 'features' capability Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 32/36] clone: use server-recommended bundle URI Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 33/36] t5601: basic bundle URI test Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 34/36] protocol v2: add server-side "bundle-uri" skeleton (docs) Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 35/36] bundle-uri docs: add design notes Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 36/36] docs: document bundle URI standard Ævar Arnfjörð Bjarmason
2022-04-21 19:54     ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Derrick Stolee
2022-04-22  9:37       ` Ævar Arnfjörð Bjarmason

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=d745488d-d70e-0b1d-ce08-ce276a9d4310@github.com \
    --to=derrickstolee@github.com \
    --cc=albertqcui@gmail.com \
    --cc=avarab@gmail.com \
    --cc=dyroneteng@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jonathantanmy@google.com \
    --cc=jrnieder@gmail.com \
    --cc=robbat2@gentoo.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).