Re: [PATCH 01/25] docs: document bundle URI standard

From: Derrick Stolee <derrickstolee@github.com>
To: Elijah Newren <newren@gmail.com>,
	Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>
Cc: "Git Mailing List" <git@vger.kernel.org>,
	"Junio C Hamano" <gitster@pobox.com>,
	"Taylor Blau" <me@ttaylorr.com>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Subject: Re: [PATCH 01/25] docs: document bundle URI standard
Date: Wed, 2 Mar 2022 09:06:53 -0500
Message-ID: <4f5f4751-c047-b9de-28a7-6ee3c31826f0@github.com>
In-Reply-To: <CABPp-BEXgmGW=Lk5-JE6bc1F8RbGidDVjALAZraeZ-2_u476gg@mail.gmail.com>

On 3/1/2022 9:28 PM, Elijah Newren wrote:
> On Wed, Feb 23, 2022 at 10:31 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>>
>> From: Derrick Stolee <derrickstolee@github.com>
>>
>> Introduce the idea of bundle URIs to the Git codebase through an
>> aspirational design document. This document includes the full design
>> intended to include the feature in its fully-implemented form. This will
>> take several steps as detailed in the Implementation Plan section.
>>
>> By committing this document now, it can be used to motivate changes
>> necessary to reach these final goals. The design can still be altered as
>> new information is discovered.
>>
>> Signed-off-by: Derrick Stolee <derrickstolee@github.com>
>> ---
>>  Documentation/technical/bundle-uri.txt | 404 +++++++++++++++++++++++++
>>  1 file changed, 404 insertions(+)
>>  create mode 100644 Documentation/technical/bundle-uri.txt
>>
>> diff --git a/Documentation/technical/bundle-uri.txt b/Documentation/technical/bundle-uri.txt
>> new file mode 100644
>> index 00000000000..5c0b9e8e3ef
>> --- /dev/null
>> +++ b/Documentation/technical/bundle-uri.txt
>> @@ -0,0 +1,404 @@
>> +Bundle URIs
>> +===========
>> +
>> +Bundle URIs are locations where Git can download one or more bundles in
>> +order to bootstrap the object database in advance of fetching the remaining
>> +objects from a remote.
>> +
>> +One goal is to speed up clones and fetches for users with poor network
>> +connectivity to the origin server. Another benefit is to allow heavy users,
>> +such as CI build farms, to use local resources for the majority of Git data
>> +and thereby reduce the load on the origin server.
>> +
>> +To enable the bundle URI feature, users can specify a bundle URI using
>> +command-line options or the origin server can advertise one or more URIs
>> +via a protocol v2 capability.
>> +
>> +Server requirements
>> +-------------------
>> +
>> +To provide a server-side implementation of bundle servers, no other parts
>> +of the Git protocol are required. This allows server maintainers to use
>> +static content solutions such as CDNs in order to serve the bundle files.
>> +
>> +At the current scope of the bundle URI feature, all URIs are expected to
>> +be HTTP(S) URLs where content is downloaded to a local file using a `GET`
>> +request to that URL. The server could include authentication requirements
>> +to those requests with the aim of triggering the configured credential
>> +helper for secure access.
> 
> So folks using ssh to clone, who have never configured a credential
> helper before, might need to start doing so.  This makes sense and I
> don't think I see a way around it, but we might want to call it out a
> bit more prominently.  Cloning over https seems to be rare in the
> various setups I've seen (I know there are others where it's common,
> just noting that many users may never have had to use https for
> cloning before), so this is a potentially big point for users to be
> aware of in terms of setup.

We could even go so far as to skip the credential manager if the
Git remote is SSH, so that bundle URIs work only when they require no
authentication. Likely, we will want clear knobs that the user can
toggle for how to behave when a bundle URI is advertised, with modes
such as:

* always attempt with authentication
* always attempt, but skip authentication when Git remote is not HTTP(S)
* attempt only when Git remote is HTTP(S)
* never attempt

These are all separate from the bundle URI standard being documented
here. They would instead be saved for the last set of patches, which
allow a server to advertise a bundle URI at clone time.
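
As an illustration only, the four modes listed above could be modeled
as a small policy function. This is a sketch in Python, not code from
the patch series: the enum, its values, and the `bundle_plan` helper
are hypothetical names, and no such config knob exists in the RFC.

```python
from enum import Enum
from typing import Tuple
from urllib.parse import urlsplit


class BundleAuthMode(Enum):
    """Hypothetical policy values mirroring the four modes in the list."""
    ALWAYS_WITH_AUTH = "always"
    SKIP_AUTH_FOR_NON_HTTP = "skip-auth-non-http"
    ONLY_HTTP_REMOTE = "only-http"
    NEVER = "never"


def remote_is_http(remote_url: str) -> bool:
    # scp-like remotes (git@host:path) have no URL scheme and count as non-HTTP.
    return urlsplit(remote_url).scheme in ("http", "https")


def bundle_plan(mode: BundleAuthMode, remote_url: str) -> Tuple[bool, bool]:
    """Return (attempt_bundle_download, allow_credential_helper)."""
    is_http = remote_is_http(remote_url)
    if mode is BundleAuthMode.NEVER:
        return (False, False)
    if mode is BundleAuthMode.ALWAYS_WITH_AUTH:
        return (True, True)
    if mode is BundleAuthMode.SKIP_AUTH_FOR_NON_HTTP:
        return (True, is_http)
    # ONLY_HTTP_REMOTE: attempt (and authenticate) only for HTTP(S) remotes.
    return (is_http, is_http)
```

With this shape, an SSH clone under the second mode still tries the
bundle URI but never triggers a credential prompt.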

>> +bundle.tableOfContents.version::
>> +       This value provides a version number for the table of contents. If
>> +       a future Git change enables a feature that needs the Git client to
>> +       react to a new key in the table of contents file, then this version
>> +       will increment. The only current version number is 1, and if any
>> +       other value is specified then Git will fail to use this file.
> 
> What does "Git will fail to use this file" mean?  Does it mean Git
> will throw an error?  clone/fetch without the aid of bundle uris but
> show a warning?  something else?

I mean "Git will continue as if the bundle URI was not specified". It would
show a warning, probably. This could be converted into a failure if valuable
for the user, but I don't expect that will be the default behavior.
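
A minimal sketch of that "warn and continue" behavior, written in
Python for illustration. It assumes the table of contents has already
been parsed into a flat dictionary of config-style keys; the
`usable_toc` helper and that dictionary layout are hypothetical, not
taken from the patches.

```python
import warnings
from typing import Optional

SUPPORTED_TOC_VERSION = 1  # the only version the quoted document defines


def usable_toc(toc: dict) -> Optional[dict]:
    """Return the table of contents if its version is understood, else None.

    Returning None means "continue as if no bundle URI was specified";
    the caller then goes straight to the normal fetch from the remote.
    """
    version = toc.get("bundle.tableOfContents.version")
    if version != SUPPORTED_TOC_VERSION:
        warnings.warn(
            f"ignoring bundle table of contents with unsupported version {version!r}"
        )
        return None
    return toc
```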

>> +bundle.<id>.timestamp::
>> +       (Optional) This value is the number of seconds since Unix epoch
>> +       (UTC) that this bundle was created. This is used as an approximation
>> +       of a point in time that the bundle matches the data available at
>> +       the origin server.
> 
> As an approximation, is there a risk of drift where the user has
> timestamp A which is very close to B and makes decisions based upon it
> which results in their not getting dependent objects they need?  Or is
> it just an optimization for them to only download certain bundles and
> look at them, and then they'll iteratively go back and download more
> (as per the 'requires' field below) if they don't have enough objects
> to unbundle what they previously downloaded?

The client never generates the timestamp. It only saves the timestamp
from the most-recent bundle it downloaded. The only risk of timestamp
drift is if the server has multiple machines generating different sets
of bundles, and places those machines behind a load balancer.

This is something the server can control, likely by having one job
generate the bundle set and then distributing them to various storage
locations.

>> +Cloning with Bundle URIs
>> +------------------------
>> +
>> +The primary need for bundle URIs is to speed up clones. The Git client
>> +will interact with bundle URIs according to the following flow:
>> +
>> +1. The user specifies a bundle URI with the `--bundle-uri` command-line
>> +   option _or_ the client discovers a bundle URI that was advertised by
>> +   the remote server.
>> +
>> +2. The client downloads the file at the bundle URI. If it is a bundle, then
>> +   it is unbundled with the refs being stored in `refs/bundle/*`.
>> +
>> +3. If the file is instead a table of contents, then the bundles with
>> +   matching `filter` settings are sorted by `timestamp` (if present),
>> +   and the most-recent bundle is downloaded.
>> +
>> +4. If the current bundle header mentions negative commit OIDs that are not
>> +   in the object database, then download the `requires` bundle and try
>> +   again.
>> +
>> +5. After inspecting a bundle with no negative commit OIDs (or all OIDs are
>> +   already in the object database somehow), then unbundle all of the
>> +   bundles in reverse order, placing references within `refs/bundle/*`.
>> +
>> +6. The client performs a fetch negotiation with the origin server, using
>> +   the `refs/bundle/*` references as `have`s and the server's ref
>> +   advertisement as `want`s. This results in a pack-file containing the
>> +   remaining objects requested by the clone but not in the bundles.
> 
> Does step 6 potentially involve a new, second connection to the origin
> server?  I'm wondering about timeouts closing the original connection
> while the client is downloading the bundle uris.  Will the client
> handle that automatically, or will they potentially be forced to
> re-issue the clone/fetch command?  I'm also wondering if we want to be
> "nice" and pre-emptively close the original connection to the server
> while we fetch the bundles -- for example, some servers have a
> threadpool for processing fetch/clone requests and will only serve a
> limited number; IIRC Gerrit operates this way.  I have no idea if
> that's a good idea or a horrible one.  If a second connection is tried
> automatically, will the user potentially be forced to re-enter
> connection credentials again?  And is there a risk that after the
> second connection, there are new bundle uris for the client to go
> fetch (and/or a removal of the original ones, e.g. replacing the old
> "daily" bundle with a new one)?  Does this possibly cause us some
> timestamp confusion as I noted earlier?

If the user is cloning over HTTPS, then the connections are stateless
and this is not any different than how it works today.

When using SSH, we will probably want to close the SSH connection on
the client and then reopen it to avoid keeping that connection open
during the download. The implementation in this RFC does _not_ do this,
but I think it would be valuable to do.

>> +Note that during a clone we expect that all bundles will be required. The
>> +client could be extended to download all bundles in parallel, though they
>> +need to be unbundled in the correct order.
> 
> What does required mean?  That the origin server can refuse to service
> requests if the client does not have commits found in said bundles?
> That new enough Git clients are expected to download all the bundles
> (and no config option will be provided to users to just do traditional
> negotiation without first downloading them)?  Something else?

The assumption I'm making here is that all but one bundle in the table
of contents contains a thin pack, depending on an "earlier" bundle.
Unbundling would fail if the client started with any bundle other
than the earliest one.

The benefit of this assumption is that we could also implement parallel
downloads of all bundles in the future.
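
To make the ordering constraint concrete, here is a rough Python
sketch of walking the `requires` links from the newest bundle and
producing an oldest-first unbundle order. It is not the RFC
implementation. The `BundleEntry` type and `unbundle_order` function
are hypothetical, and for simplicity each entry carries its negative
commit OIDs directly, whereas a real client would only learn them
after downloading the bundle and reading its header.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class BundleEntry:
    """One hypothetical table-of-contents entry (`bundle.<id>.*`)."""
    id: str
    timestamp: int
    requires: Optional[str] = None  # id of the bundle this one depends on
    prerequisites: Set[str] = field(default_factory=set)  # negative commit OIDs


def unbundle_order(toc: Dict[str, BundleEntry], have: Set[str]) -> List[str]:
    """Follow `requires` links from the newest bundle; return ids oldest-first.

    Raises ValueError on a cycle or a dead end, which are cases where a
    client would give up on bundles and use the normal protocol instead.
    """
    current = max(toc.values(), key=lambda b: b.timestamp)
    chain: List[str] = []
    seen: Set[str] = set()
    while True:
        if current.id in seen:
            raise ValueError("cycle in requires links")
        seen.add(current.id)
        chain.append(current.id)
        if current.prerequisites <= have:
            break  # everything this bundle needs is already present
        if current.requires is None or current.requires not in toc:
            raise ValueError("missing prerequisites and no requires link")
        current = toc[current.requires]
    return list(reversed(chain))
```

During a fresh clone `have` is empty, so the walk reaches the earliest
bundle and every bundle on the chain is unbundled, oldest first.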

This assumes that there is no way to organize the bundles to communicate
that some users want only the objects reachable from the default branch
while others want every reachable object. Such an organization
would require extra information to describe two "parallel" lists of
bundles that could be selected for each of those categories. If such an
organization is valuable, then the table of contents can be extended with
information to communicate such an organization. The downside is that
clients with this "v1" version would download extra data based on this
assumption.

> If users are doing a single-branch clone or a shallow clone, will the
> requirements still hold?  (I'm not a fan of shallow clones, but they
> are sadly used in a number of places and I'm curious how the features
> interact or conflict.)

The current specification does not focus on shallow clones. The TOC
could be extended to say "this bundle is for a shallow clone of 
commit <X>" if that was valuable.

For single-branch clones, my expectation is that the bundles will
give the user more information than they need for that clone. The
negotiation will find out what they need from that branch that was
not in the bundles, but the bundles will also contain a lot of
objects that are not reachable from that ref. (This is also up to
the discretion of the bundle server operator, since they could
focus on only objects reachable from a subset of refs, minimizing
the bundle data while increasing the potential size of that
follow-up fetch.)

>> +If a table of contents is used and it contains
>> +`bundle.tableOfContents.forFetch = true`, then the client can store a
>> +config value indicating to reuse this URI for later `git fetch` commands.
>> +In this case, the client will also want to store the maximum timestamp of
>> +a downloaded bundle.
>> +
>> +Fetching with Bundle URIs
>> +-------------------------
>> +
>> +When the client fetches new data, it can decide to fetch from bundle
>> +servers before fetching from the origin remote. This could be done via
>> +a command-line option, but it is more likely useful to use a config value
>> +such as the one specified during the clone.
>> +
>> +The fetch operation follows the same procedure to download bundles from a
>> +table of contents (although we do _not_ want to use parallel downloads
>> +here). We expect that the process will end because all negative commit
>> +OIDs in a thin bundle are already in the object database.
>> +
>> +A further optimization is that the client can avoid downloading any
>> +bundles if their timestamps are not larger than the stored timestamp.
>> +After fetching new bundles, this local timestamp value is updated.
> 
> What about the transition period where repositories were cloned before
> bundle URIs became a thing (or were turned on within an organization),
> and the user then goes to fetch?  Will git go and download a bunch of
> useless large bundles (and maybe one small useful one) the day this
> feature is turned on, making users think this is a bad feature?
> 
> Should git perhaps treat the already-cloned case without a stored
> timestamp as a request to store a timestamp of "now", making it ignore
> the current bundles?  (If so, are there races where it later goes to
> grab a bundle slightly newer than "now" but which depends on an older
> bundle that has some objects we are missing?)

I expect that users who already cloned will never configure their
repositories to include a bundle server.

That said, if you run 'git bundle fetch <uri>' in an existing
repository, then it will fetch only the newest bundle and see if you
already have all of its negative refs. If so, then it stops and that
is the only bundle that is downloaded. Its timestamp is stored for
the next 'git bundle fetch'.
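
The stored-timestamp bookkeeping can be sketched as follows. This is
illustrative Python under stated assumptions, not the code in the
series: `plan_fetch` is a hypothetical helper, and the mapping from
bundle id to `timestamp` stands in for a parsed table of contents.

```python
from typing import Dict, List, Optional, Tuple


def plan_fetch(
    timestamps: Dict[str, int], stored: Optional[int]
) -> Tuple[List[str], Optional[int]]:
    """Return (candidate bundle ids, newest first; timestamp to store).

    Bundles whose timestamp is not larger than the stored value are
    skipped. With no stored value (an existing clone that never used
    bundles), every bundle is a candidate, but the client still starts
    with the newest and stops once its negative OIDs are all present.
    The returned timestamp should only be saved after a successful
    download.
    """
    candidates = sorted(
        (bid for bid, ts in timestamps.items() if stored is None or ts > stored),
        key=lambda bid: timestamps[bid],
        reverse=True,
    )
    new_stored = timestamps[candidates[0]] if candidates else stored
    return candidates, new_stored
```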

In the case where the server starts advertising a bundle URI, a
'git fetch' will not start using that URI. That check only happens
during 'git clone' (as currently designed).

>> +Error Conditions
>> +----------------
>> +
>> +If the Git client discovers something unexpected while downloading
>> +information according to a bundle URI or the table of contents found at
>> +that location, then Git can ignore that data and continue as if it was not
>> +given a bundle URI. The remote Git server is the ultimate source of truth,
>> +not the bundle URI.
> 
> This seems to contradict the earlier statement that for clones all
> bundle URIs would be "required".  I like the idea of bundle URIs only
> being an optimization that can be ignored, just noting the potential
> confusion.

Perhaps I misnamed this section. These are things that could go wrong with
a bundle server connection, and in such a case Git should recover by
transitioning to the normal Git protocol to fetch the objects.

>> +
>> +Here are a few example error conditions:
>> +
>> +* The client fails to connect with a server at the given URI or a connection
>> +  is lost without any chance to recover.
>> +
>> +* The client receives a response other than `200 OK` (such as `404 Not Found`,
>> +  `401 Not Authorized`, or `500 Internal Server Error`).
>> +
>> +* The client receives data that is not parsable as a bundle or table of
>> +  contents.
>> +
>> +* The table of contents describes a directed cycle in the
>> +  `bundle.<id>.requires` links.
>> +
>> +* A bundle includes a filter that does not match expectations.
>> +
>> +* The client cannot unbundle the bundles because the negative commit OIDs
>> +  are not in the object database and there are no more
>> +  `bundle.<id>.requires` links to follow.
> 
> Should these result in warnings so that folks can diagnose slower
> connections, or should they be squelched?  (I'm thinking particularly
> of the `401 Not Authorized` case in combination with users never
> having had to use a credential helper before.)

There is a lot of work to be done around polishing the user ergonomics
here, and that is an interesting thing to consider for a second round
after the basic standard is established. I appreciate that you are
already thinking about the user experience in these corner cases.

>> +
>> +There are also situations that could be seen as wasteful, but are not
>> +error conditions:
>> +
>> +* The downloaded bundles contain more information than is requested by
>> +  the clone or fetch request. A primary example is if the user requests
>> +  a clone with `--single-branch` but downloads bundles that store every
>> +  reachable commit from all `refs/heads/*` references. This might be
>> +  initially wasteful, but perhaps these objects will become reachable by
>> +  a later ref update that the client cares about.
> 
> Ah, this answers my --single-branch question.  Still curious about the
> --shallow misfeature (yeah, I'm a bit opinionated) and how it
> interacts, though.

(Hopefully my previous reply to this topic is helpful.)

>> +* A bundle download during a `git fetch` contains objects already in the
>> +  object database. This is probably unavoidable if we are using bundles
>> +  for fetches, since the client will almost always be slightly ahead of
>> +  the bundle servers after performing its "catch-up" fetch to the remote
>> +  server. This extra work is most wasteful when the client is fetching
>> +  much more frequently than the server is computing bundles, such as if
>> +  the client is using hourly prefetches with background maintenance, but
>> +  the server is computing bundles weekly. For this reason, the client
>> +  should not use bundle URIs for fetch unless the server has explicitly
>> +  recommended it through the `bundle.tableOfContents.forFetch = true`
>> +  value.
> 
> Makes sense, and certainly reduces my worry about the "transition
> period" where users have existing clones that pre-dated the
> introduction of the bundle URI feature.  But I'm still kind of curious
> about how we handle that transition for folks that have recommended
> their bundleUris for fetches.

By "that transition" I believe you are talking about configuring bundle
URIs on an existing repository. My earlier reply on that hopefully
eases your concerns somewhat. Ævar also has some ideas about downloading
only the header of a bundle and examining it to see if we need the rest
of it, which would further reduce the amount of data necessary for a
first fetch from a bundle URI.
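
For readers unfamiliar with the bundle layout, the header-only idea
works because a bundle begins with a short text header (a signature
line, then prerequisite lines starting with `-`, then ref lines, then
a blank line) before the pack data. Below is a small Python sketch of
reading just that header from a stream. The function name
`read_bundle_header` is an assumption for illustration; how a client
would request only the first bytes over HTTP is not covered here.

```python
import io
from typing import BinaryIO, Dict, List, Tuple


def read_bundle_header(stream: BinaryIO) -> Tuple[List[str], Dict[str, str]]:
    """Return (prerequisite OIDs, {refname: oid}), stopping before the pack."""
    signature = stream.readline().rstrip(b"\n")
    if signature not in (b"# v2 git bundle", b"# v3 git bundle"):
        raise ValueError("not a Git bundle")
    prerequisites: List[str] = []
    refs: Dict[str, str] = {}
    while True:
        line = stream.readline()
        if line == b"":
            raise ValueError("truncated bundle header")
        line = line.rstrip(b"\n")
        if line == b"":
            break  # a blank line ends the header; pack data follows
        text = line.decode("utf-8", errors="replace")
        if text.startswith("@"):
            continue  # v3 capability line such as @object-format=...
        if text.startswith("-"):
            # "-<oid> <comment>": keep only the object id
            prerequisites.append(text[1:].split(" ", 1)[0])
        else:
            oid, refname = text.split(" ", 1)
            refs[refname] = oid
    return prerequisites, refs


# Usage with an in-memory stand-in for the first bytes of a download.
sample = (
    b"# v2 git bundle\n"
    + b"-" + b"a" * 40 + b" older commit subject\n"
    + b"b" * 40 + b" refs/heads/main\n"
    + b"\n"
    + b"PACK"
)
prereqs, refs = read_bundle_header(io.BytesIO(sample))
```

If every OID in `prereqs` is already in the object database, the rest
of the bundle is worth downloading. Otherwise the client knows it
needs an earlier bundle before spending bandwidth on this one.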

Thanks for taking the time to read and give detailed thoughts on this
proposal!

Thanks,
-Stolee
