git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
To: Derrick Stolee <stolee@gmail.com>
Cc: git@vger.kernel.org, Junio C Hamano <gitster@pobox.com>,
	Jeff King <peff@peff.net>, Patrick Steinhardt <ps@pks.im>,
	Christian Couder <christian.couder@gmail.com>,
	Albert Cui <albertqcui@gmail.com>,
	Jonathan Tan <jonathantanmy@google.com>,
	Jonathan Nieder <jrnieder@gmail.com>,
	"brian m . carlson" <sandals@crustytoothpaste.net>,
	"Robin H . Johnson" <robbat2@gentoo.org>
Subject: Re: [PATCH 2/3] protocol v2: specify static seeding of clone/fetch via "bundle-uri"
Date: Wed, 27 Oct 2021 22:22:28 +0200	[thread overview]
Message-ID: <211027.86a6iuxk3x.gmgdl@evledraar.gmail.com> (raw)
In-Reply-To: <e3c923b4-cce0-372c-31f4-9f091239ad8b@gmail.com>


On Wed, Oct 27 2021, Derrick Stolee wrote:

> On 10/27/2021 2:01 PM, Ævar Arnfjörð Bjarmason wrote:
>> 
>> On Wed, Oct 27 2021, Derrick Stolee wrote:
>> 
>>> On 10/27/2021 4:29 AM, Ævar Arnfjörð Bjarmason wrote:
>>>>
>>>> On Tue, Oct 26 2021, Derrick Stolee wrote:
>>>>
>>>>> On 10/25/2021 5:25 PM, Ævar Arnfjörð Bjarmason wrote:
>>>>>> Add a server-side implementation of a new "bundle-uri" command to
>>>>>> protocol v2. As discussed in the updated "protocol-v2.txt" this will
>>>>>> allow conforming clients to optionally seed their initial clones or
>>>>>> incremental fetches from URLs containing "*.bundle" files created with
>>>>>> "git bundle create".
>>>>>
>>>>> ...
>>>>>
>>>>>> +DISCUSSION of bundle-uri
>>>>>> +^^^^^^^^^^^^^^^^^^^^^^^^
>>>>>> +
>>>>>> +The intent of the feature is optimize for server resource consumption
>>>>>> +in the common case by changing the common case of fetching a very
>>>>>> +large PACK during linkgit:git-clone[1] into a smaller incremental
>>>>>> +fetch.
>>>>>> +
>>>>>> +It also allows servers to achieve better caching in combination with
>>>>>> +an `uploadpack.packObjectsHook` (see linkgit:git-config[1]).
>>>>>> +
>>>>>> +By having new clones or fetches be a more predictable and common
>>>>>> +negotiation against the tips of recently produces *.bundle file(s).
>>>>>> +Servers might even pre-generate the results of such negotiations for
>>>>>> +the `uploadpack.packObjectsHook` as new pushes come in.
>>>>>> +
>>>>>> +I.e. the server would anticipate that fresh clones will download a
>>>>>> +known bundle, followed by catching up to the current state of the
>>>>>> +repository using ref tips found in that bundle (or bundles).
>>>>>> +
>>>>>> +PROTOCOL for bundle-uri
>>>>>> +^^^^^^^^^^^^^^^^^^^^^^^
>>>>>> +
>>>>>> +A `bundle-uri` request takes no arguments, and as noted above does not
>>>>>> +currently advertise a capability value. Both may be added in the
>>>>>> +future.
>>>>>
>>>>> One thing I realized was missing from this proposal is any interaction
>>>>> with partial clone. It would be disappointing if we could not advertise
>>>>> bundles of commit-and-tree packfiles for blobless partial clones.
>>>>>
>>>>> There is currently no way for the client to signal the filter type
>>>>> during this command. Not having any way to extend to include that
>>>>> seems like an oversight we should remedy before committing to a
>>>>> protocol that can't be extended.
>>>>>
>>>>> (This also seems like a good enough reason to group the URIs into a
>>>>> struct-like storage, because the filter type could be stored next to
>>>>> the URI.)
>>>>
>>>> I'll update the docs to note that. I'd definitely like to leave out any
>>>> implementation of filter/shallow for an initial iteration of this for
>>>> simplicity, but the protocol keyword/behavior is open-ended enough to
>>>> permit any extension.
>>>
>>> It would be good to be explicit about how this would work. Looking at
>>> it fresh, it seems that the server could send multiple bundle URIs with
>>> the extra metadata to say which ones have a filter (and what that filter
>>> is). The client could then check if a bundle matches the given filter.
>>>
>>> But this is a bit inverted: the filter mechanism currently has the client
>>> request a given filter and the server responds with _at least_ that much
>>> data. This allows the server to ignore things like pathspec-filters or
>>> certain size-based filters. If the client just ignores a bundle URI
>>> because it doesn't match the exact filter, this could lead the client to
>>> ask for the data without a bundle, even if it would be faster to just
>>> download the advertised bundle.
>>>
>>> For this reason, I think it would be valuable for the client to tell
>>> the server the intended filter, and the server responds with bundle
>>> URIs that contain a subset of the information that would be provided
>>> by a later fetch request with that filter.
>>>
>>>> I.e. the server can start advertising "bundle-uri=shallow", and future
>>>> clients can request arbitrary key-value pairs in addition to just
>>>> "bundle-uri" now.
>>>>
>>>> Having said that I think that *probably* this is something that'll never
>>>> be implemented, but maybe I'll eat my words there.
>> 
>> I didn't mean to elide past "filter", but was just using "shallow" as a
>> short-hand for one thing in the "fetch" dialog that a client can mention
>> that'll impact PACK generation, just like filter.
>> 
>> Having thought about this a bit more, I think it should be an invariant
>> in any bundle-uri design that the server shouldn't communicate any
>> side-channel information whatsoever about a bundle it advertises, if
>> that information can't be discovered in the header of that bundle file.
>> 
>> Mind you, this means throwing out large parts of my current proposed
>> over-the-wire design, but I think for the better. I.e. the whole
>> response part where we communicate:
>> 
>>     (bundle-uri (SP bundle-feature-key (=bundle-feature-val)?)* LF)*
>>     flush-pkt
>> 
>> Would just become something like:
>> 
>>     (bundle-uri delim-pkt bundle-header? delim-pkt)*
>>     flush-pkt
>> 
>> I.e. we'd optionally transfer the content of the bundle header (content
>> up to the first "\n\n") to the client, but *only* ever as a shorthand
>> for saving the client a roundtrip.
>
> It still seems like we're better off letting the client request a
> filter and have the server present URIs that the client can use,
> and the server can choose to ignore the filter or provide URIs that
> are specific to that filter.[...]

I've tested this a bit now and think there's no way to create such a
bundle currently. I.e. try:

    git clone --filter=blob:none --single-branch --no-tags https://github.com/git/git.git
    cd git
    git config --unset remote.origin.partialclonefilter
    git config --unset remote.origin.promisor

You'll get:
    
    $ GIT_TRACE_PACKET=1 git bundle create --version=3 master.bdl master
    Enumerating objects: 306784, done.
    Counting objects: 100% (306784/306784), done.
    Compressing objects: 100% (69265/69265), done.
    fatal: unable to read c85385dc03228450cb7fb6d306252038a91b47e6
    error: pack-objects died

If you didn't do that config munging we'd create the pack, but it would
be inflated with the blobs (after going back and getting them from the
server).

So aside from any questions of how you'd hypothetically communicate your
desire to get such bundle from the server, I don't think it could serve
one up.

So I think this is moot until the bundle format itself could support
it. I'll need to "git bundle [verify|unbundle]" whatever I get on the
other end.

I really don't mean this in any way as dodging the desirability of this
feature. I'd really like to have it too. I think implementing it should
be relatively simple, and I've got an implementation in mind that makes
this future-proof for anything else we'd like to add.

I.e. if you look at that v3 format bundle you'll see:
    
    $ head -c 100 master.bdl
    # v3 git bundle
    @object-format=sha1
    e9e5ba39a78c8f5057262d49e261b42a8660d5b9 refs/heads/master
    
    PACK

Wouldn't this just be a matter of including extra lines with:

    # I'm assuming that the promisor url can be assumed to be "the url
    # we cloned this from", but maybe we need @remoteURL=https://....
    @promisor=true
    @filter = blob:none

I.e. exactly corresponding to the .git/config we'd end up with,
config. We'd then (I think) create .git/objects/pack/*.promisor with the
OIDs of each of the inflated tips (I'm not familiar with .promisor
files).

And a thing I need to include in the bundle-uri protocol is that the
client should not just include a "bundle-uri" attribute, but have a
"value" describing the bundle format it accepts. I.e. now:

    bundle-uri=v3,object-format

And for supporting the above:

    bundle-uri=v3,object-format,promisor,filter

I.e. currently we die on any bundle capability except "object-format",
if we're going to discover what to send we'd like a less crappy way than
parsing the version from the "agent" field.

> Sending the full list and making the client decide what it wants seems
> like it will be more complicated than necessary. However, I'll
> withhold complete judgement until I see a full proposal in a v2.

I'm very thankful for the thorough review, and it's exciting that you'd
like to use this feature in some form, and I'll definitely do my best to
support (and if not, future-proof) any use-cases you have in mind.

But I really don't get how this wouldn't effectively be functionally
indistinguishable from packfile-uri, sans a nit here and there.

I can see the convenience of having say 100 bundles, advertising 5, and
then after a full negotiation dialog pointing the equivalent of a
packfile-uri at a *.bundle file, just because that's what you happen to
have around already. If bundle-uri is your main static file distribution
you don't want a duplicate *.pack (without the bundle header) just for
that.

I think a logical extension of the packfile-uri feature for those that
need extended negotiation before deciding on the static URL would be to
teach the packfile-uri downloader to ignore an optional bundle header of
any PACK it finds at a URL (which would not be the same as this
proposal), just to support that use-case.

But, isn't that essentially what you'd want in those cases?

Spewing a "here's my bundles" at a client gets it started quickly, and
also has the side-benefit of making those assets more cachable, as well
as creating a known base for the caching of any subsequent "...and a
PACK negotiation to make it fully up-to-date" request.

The bundles are also our de-facto sneakernet and format, and can be used
for incremental replication. All of which is also a sweet spot for
bundle-uri, i.e. the combination of being able to re-use already
"replicated" files for CDN-ing, and providing wider access to CDN
features for "dumb" servers.

But once we're in dialog with a client to discuss its arbitrary filter
preferences before giving it a URL we're going to be most likely
implementing that as a mode of upload-pack.c anyway, and when it spews
out an optional URL at the end of that dialog....

>> I think the sweet spot for "bundle-uri" is to advertise a small number
>> of bundles that encompass common clone/fetch patterns. I.e. something
>> like a bundle for a full clone with the repo data up to today, and maybe
>> a couple of other bundles covering a time period that clients would be
>> likely to incrementally update within, e.g. 1 week ago..today &&
>> yesterday..now.
>> 
>> I agree that adding say "full clone, --depth=1" and "full clone, no
>> blobs" etc. to that might be *very* useful for some deployments, but per
>> the above I think we should really add that to the bundle format first,
>> not protocol v2.
>  
> I'm focusing my interest in this topic not on "how can we make what we
> already do faster?" but "how can we unlock scale not previously
> possible?" Allowing blobless clones is an important part of this, in
> my opinion, so it is my _default_ mode of operating.

*nod*, I'm keen to support it using something like what's described
above. I.e. stick it as new headers in the bundle format, then be able
to advertise those for common cases. I'd think most of these would be a
few combinations (usually something copy/pasted from a relevant dev
guide), e.g. "all history, no blobs", "main branch, no tags, no blobs"
etc.

Aside from this bundle-uri protocol proposal being able to sneakernet a
repo you cloned like that around as-is seems highly desirable, and just
a feature gap we added when those features were added to "git fetch" and
friends, but not "git bundle".

  reply	other threads:[~2021-10-27 20:56 UTC|newest]

Thread overview: 77+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-10-25 21:25 [PATCH 0/3] bundle-uri: "dumb" static CDN offloading, spec & server implementation Ævar Arnfjörð Bjarmason
2021-10-25 21:25 ` [PATCH 1/3] leak tests: mark t5701-git-serve.sh as passing SANITIZE=leak Ævar Arnfjörð Bjarmason
2021-10-25 21:25 ` [PATCH 2/3] protocol v2: specify static seeding of clone/fetch via "bundle-uri" Ævar Arnfjörð Bjarmason
2021-10-26 14:00   ` Derrick Stolee
2021-10-26 15:00     ` Ævar Arnfjörð Bjarmason
2021-10-27  1:55       ` Derrick Stolee
2021-10-27 17:49         ` Ævar Arnfjörð Bjarmason
2021-10-27  2:01   ` Derrick Stolee
2021-10-27  8:29     ` Ævar Arnfjörð Bjarmason
2021-10-27 16:31       ` Derrick Stolee
2021-10-27 18:01         ` Ævar Arnfjörð Bjarmason
2021-10-27 19:23           ` Derrick Stolee
2021-10-27 20:22             ` Ævar Arnfjörð Bjarmason [this message]
2021-10-29 18:30               ` Derrick Stolee
2021-10-30 14:51           ` Philip Oakley
2021-10-25 21:25 ` [PATCH 3/3] bundle-uri client: add "bundle-uri" parsing + tests Ævar Arnfjörð Bjarmason
2021-10-26 14:05   ` Derrick Stolee
2021-10-29 18:46 ` [PATCH 0/3] bundle-uri: "dumb" static CDN offloading, spec & server implementation Derrick Stolee
2021-10-30  7:21   ` Ævar Arnfjörð Bjarmason
2021-11-01 21:00     ` Derrick Stolee
2021-11-01 23:18       ` Ævar Arnfjörð Bjarmason
2022-03-11 16:24 ` [RFC PATCH v2 00/13] bundle-uri: a "dumb CDN" for git Ævar Arnfjörð Bjarmason
2022-03-11 16:24   ` [RFC PATCH v2 01/13] protocol v2: add server-side "bundle-uri" skeleton Ævar Arnfjörð Bjarmason
2022-03-11 16:24   ` [RFC PATCH v2 02/13] bundle-uri docs: add design notes Ævar Arnfjörð Bjarmason
2022-03-11 16:24   ` [RFC PATCH v2 03/13] bundle-uri client: add "bundle-uri" parsing + tests Ævar Arnfjörð Bjarmason
2022-03-11 16:24   ` [RFC PATCH v2 04/13] connect.c: refactor sending of agent & object-format Ævar Arnfjörð Bjarmason
2022-03-11 16:24   ` [RFC PATCH v2 05/13] bundle-uri client: add minimal NOOP client Ævar Arnfjörð Bjarmason
2022-03-11 16:24   ` [RFC PATCH v2 06/13] bundle-uri client: add "git ls-remote-bundle-uri" Ævar Arnfjörð Bjarmason
2022-03-11 16:24   ` [RFC PATCH v2 07/13] bundle-uri client: add transfer.injectBundleURI support Ævar Arnfjörð Bjarmason
2022-03-11 16:24   ` [RFC PATCH v2 08/13] bundle-uri client: add boolean transfer.bundleURI setting Ævar Arnfjörð Bjarmason
2022-03-11 16:24   ` [RFC PATCH v2 09/13] fetch-pack: add a deref_without_lazy_fetch_extended() Ævar Arnfjörð Bjarmason
2022-03-11 16:24   ` [RFC PATCH v2 10/13] fetch-pack: move --keep=* option filling to a function Ævar Arnfjörð Bjarmason
2022-03-11 16:24   ` [RFC PATCH v2 11/13] bundle.h: make "fd" version of read_bundle_header() public Ævar Arnfjörð Bjarmason
2022-03-11 16:24   ` [RFC PATCH v2 12/13] bundle-uri client: support for bundle-uri with "clone" Ævar Arnfjörð Bjarmason
2022-03-11 16:24   ` [RFC PATCH v2 13/13] bundle-uri: make the download program configurable Ævar Arnfjörð Bjarmason
2022-03-11 21:28   ` [RFC PATCH v2 00/13] bundle-uri: a "dumb CDN" for git Derrick Stolee
2022-04-18 17:23   ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 01/36] connect.c: refactor sending of agent & object-format Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 02/36] dir API: add a generalized path_match_flags() function Ævar Arnfjörð Bjarmason
2022-04-21 17:26       ` Derrick Stolee
2022-04-18 17:23     ` [RFC PATCH v2 03/36] fetch-pack: add a deref_without_lazy_fetch_extended() Ævar Arnfjörð Bjarmason
2022-04-21 17:28       ` Derrick Stolee
2022-04-18 17:23     ` [RFC PATCH v2 04/36] fetch-pack: move --keep=* option filling to a function Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 05/36] http: make http_get_file() external Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 06/36] remote: move relative_url() Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 07/36] remote: allow relative_url() to return an absolute url Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 08/36] bundle.h: make "fd" version of read_bundle_header() public Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 09/36] protocol v2: add server-side "bundle-uri" skeleton Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 10/36] bundle-uri client: add "bundle-uri" parsing + tests Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 11/36] bundle-uri client: add minimal NOOP client Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 12/36] bundle-uri client: add "git ls-remote-bundle-uri" Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 13/36] bundle-uri client: add transfer.injectBundleURI support Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 14/36] bundle-uri client: add boolean transfer.bundleURI setting Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 15/36] bundle-uri client: support for bundle-uri with "clone" Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 16/36] bundle-uri: make the download program configurable Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 17/36] remote-curl: add 'get' capability Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 18/36] bundle: implement 'fetch' command for direct bundles Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 19/36] bundle: parse table of contents during 'fetch' Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 20/36] bundle: add --filter option to 'fetch' Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 21/36] bundle: allow relative URLs in table of contents Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 22/36] bundle: make it easy to call 'git bundle fetch' Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 23/36] clone: add --bundle-uri option Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 24/36] clone: --bundle-uri cannot be combined with --depth Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 25/36] bundle: only fetch bundles if timestamp is new Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 26/36] fetch: fetch bundles before fetching original data Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 27/36] protocol-caps: implement cap_features() Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 28/36] serve: understand but do not advertise 'features' capability Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 29/36] serve: advertise 'features' when config exists Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 30/36] connect: implement get_recommended_features() Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 31/36] transport: add connections for 'features' capability Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 32/36] clone: use server-recommended bundle URI Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 33/36] t5601: basic bundle URI test Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 34/36] protocol v2: add server-side "bundle-uri" skeleton (docs) Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 35/36] bundle-uri docs: add design notes Ævar Arnfjörð Bjarmason
2022-04-18 17:23     ` [RFC PATCH v2 36/36] docs: document bundle URI standard Ævar Arnfjörð Bjarmason
2022-04-21 19:54     ` [RFC PATCH v2 00/36] bundle-uri: a "dumb CDN" for git + TOC format Derrick Stolee
2022-04-22  9:37       ` Ævar Arnfjörð Bjarmason

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=211027.86a6iuxk3x.gmgdl@evledraar.gmail.com \
    --to=avarab@gmail.com \
    --cc=albertqcui@gmail.com \
    --cc=christian.couder@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=jonathantanmy@google.com \
    --cc=jrnieder@gmail.com \
    --cc=peff@peff.net \
    --cc=ps@pks.im \
    --cc=robbat2@gentoo.org \
    --cc=sandals@crustytoothpaste.net \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).