Re: [PATCH 1/6] docs: document bundle URI standard

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

From: Derrick Stolee <derrickstolee@github.com>
To: Junio C Hamano <gitster@pobox.com>,
	Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org, me@ttaylorr.com, newren@gmail.com,
	avarab@gmail.com, dyroneteng@gmail.com,
	Johannes.Schindelin@gmx.de
Subject: Re: [PATCH 1/6] docs: document bundle URI standard
Date: Wed, 8 Jun 2022 15:20:45 -0400	[thread overview]
Message-ID: <9244644f-2c80-f23d-f054-3f0c961696af@github.com> (raw)
In-Reply-To: <xmqqsfoh4ery.fsf@gitster.g>

On 6/6/2022 6:18 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

>> +Assuming a `200 OK` response from the server, the content at the URL is
>> +expected to be of one of two forms:
>> +
>> +1. Bundle: A Git bundle file of version 2 or higher.
>> +
>> +2. Bundle List: A plain-text file that is parsable using Git's
>> +   config file parser. This file describes one or more bundles that are
>> +   accessible from other URIs.
>> +
>> +Any other data provided by the server is considered erroneous.
> 
> How does a client tell which one it got?  Do we register mimetype
> with iana to use for these two types of files, and have the HTTP
> downloader to use the information?

My implementation is much dumber than that: it first attempts to
parse the file as a bundle (looking for a bundle header) and then
attempts to parse it as a config file after that. If neither
succeed, then an error is thrown.

>> +bundle.list.mode::
>> +	(Required) This value has one of two values: `all` and `any`. When `all`
>> +	is specified, then the client should expect to need all of the listed
>> +	bundle URIs that match their repository's requirements. When `any` is
>> +	specified, then the client should expect that any one of the bundle URIs
>> +	that match their repository's requirements will suffice. Typically, the
>> +	`any` option is used to list a number of different bundle servers
>> +	located in different geographies.
> 
> OK. Presumably, if there are two sets of bundles, A and B, that
> consist of 3 and 2 bundle files respectively, and either one of
> these two sets is sufficient to help the client, then we'd have a
> bundle.list of type 'any', with two bundle.<id>.uri, that point at
> (sub) bundle.list of type 'all' in which these 3 or 2 bundle files
> are contained?  I am just wondering why we need 'all' and 'any', and
> at the same time why these two are sufficient for our needs.

Necessary: The origin Git server may want to advertise a list of
geo-distributed bundle servers, but not need to know the exact list
of bundles at each of those locations. The client can choose from
"any" advertised bundle URI, then download a bundle list from that
URI and download "all" bundles it advertised.

Sufficient: I can see a few different ways that we could want to
have something in-between "any" and "all" and that is: "The bundles
break into 'buckets', so pick any bucket and get all within that
bucket." This is already planned as part of the bundle.<id>.filter,
which creates different 'buckets'. So, if we need a partitioning
like this, then we can rely on functionality-based partitions.

Finally, we can always extend this in the future. If we want to
add a new mode "use-key-X" or something, then bundle providers
could start using it, but knowing that older clients would not
understand it and would lose the ability to use their bundles. This
is only a performance issue, not a correctness issue.

I'm definitely trying to minimize the need for these kinds of
extensions while keeping the spec small enough to implement in a
reasonable amount of time.
 
>> +bundle.list.heuristic::
>> +	If this string-valued key exists, then the bundle list is designed to
>> +  work well with incremental `git fetch` commands. The heuristic signals
>> +  that there are additional keys available for each bundle that help
>> +  determine which subset of bundles the client should download.
> 
> Funny indentation?

Thanks. Editor confusion with .txt files, apparently. Fixed now.

>> +The remaining keys include an `<id>` segment which is a server-designated
>> +name for each available bundle.
>> +
>> +bundle.<id>.uri::
>> +	(Required) This string value is the URI for downloading bundle `<id>`.
>> +	If the URI begins with a protocol (`http://` or `https://`) then the URI
>> +	is absolute. Otherwise, the URI is interpreted as relative to the URI
>> +	used for the bundle list. If the URI begins with `/`, then that relative
>> +	path is relative to the domain name used for the bundle list. (This use
>> +	of relative paths is intended to make it easier to distribute a set of
>> +	bundles across a large number of servers or CDNs with different domain
>> +	names.)
> 
> I have no objection to a host-relative URI notation, but is it
> something we need to spell out here?  I am mostly interested in
> making sure that we do not deviate a practice that is already used
> to point at different resource at the same server.  If the way we
> specify host-relative is unnecessarily different from the way
> existing "internet" programs (say, a web browser) work, that would
> be embarrasing, unless there is a very good reason for us to be
> different.

Since it requires careful implementation, I thought the detail along
with the justification would fit in this technical document. I can
avoid including that when describing the config options inside the
user-facing docs.

>> +bundle.<id>.list::
>> +	This boolean value indicates whether the client should expect the
>> +	content from this URI to be a list (if `true`) or a bundle (if `false`).
>> +	This is typically used when `bundle.list.mode` is `any`.
> 
> OK, so the type of a (sub) bundle.list can be specified using this
> without having the HTTP(s) server annotate the resource with
> mimetype when the thing gets actually downloaded.  It still leaves
> the issue of bootstrapping the system.  If the server advises bundle
> URI when the client contacts, presumably that first-contact
> bundle.*.uri can be annotated with the bundle.*.list at the same
> time, but the model allows the client to learn bundles independently
> from the server, and it still is a bit unclear how we tell.  Of
> course, we can examine the contents of a file that was obtained from
> a bundle URI, a file that parses correctly as a config-like file is
> very unlikely to be a valid bundle file, and we need to be prepared
> to deal with a corrupt resource downloaded from a bundle URI anyway,
> so this may not be a huge deal.

Right. We can inspect the file with our existing tools to see if they
fit the format. It might be worth doing some fuzz testing on these
parsers to be sure there isn't a surprising way to trick them into
doing strange things.

>> +bundle.<id>.filter::
>> +	This string value represents an object filter that should also appear in
>> +	the header of this bundle. The server uses this value to differentiate
>> +	different kinds of bundles from which the client can choose those that
>> +	match their object filters.
> 
> Is it an error to have .filter defined for a bundle URI whose .list
> says "true"?  Or does it mean all bundle files that make up the list
> share the same object filter?

While this would not be the typical situation, a bundle provider could
choose to combine these and the client would expect a list where all of
the .filter values match the one here. Of course, it would not be a
_failure_ if that wasn't true, but the client would ignore any bundles
it finds where .filter doesn't match.

>> +bundle.<id>.timestamp::
>> +	This value is the number of seconds since Unix epoch (UTC) that this
>> +	bundle was created. This is used as an approximation of a point in time
>> +	that the bundle matches the data available at the origin server. This is
>> +	used when `bundle.list.heuristic=timestamp`.
> 
> Name of this field should be better than 'timestamp'; we should say
> timestamp of creation (or last modification if the same name can be
> reused).

How about creationToken? That communicates that we don't really care
what the number is as long as it comes from an increasing sequence
controlled by the bundle provider.

>> +bundle.<id>.requires::
>> +	This string value represents the ID of another bundle. When present, the
>> +	server is indicating that this bundle contains a thin packfile. If the
>> +	client does not have all necessary objects to unbundle this packfile,
>> +	then the client can download the bundle with the `requires` ID and try
>> +	again. (Note: it may be beneficial to allow the server to specify
>> +	multiple `requires` bundles.) This is used when
>> +	`bundle.list.heuristic=timestamp`.
> 
> So, bundle.list.mode can say 'any', with three <id>s in it, but
> bundle.1.requires can point at '2', while bundle.2.requires can
> point at '1', and bundle.3.requires can be emtpy, in which case you
> can either fetch 1&2 or 3 alone.  Is that the idea?

The idea is that if I download bundle '1' and I can't unbundle it
(because I'm missing some required refs), then I can look at bundle.1.requires
to get any missing refs. If that is '2', then I download that. It then
continues in a chain.

The 'any' means "start anywhere", but I also don't expect a provider to use
.requires without the (maybe-to-be-renamed) timestamp heuristic. We could also
make that be a hard-coded statement: ".requires will be ignored unless mode=all
and heuristic=X"

>> +bundle.<id>.location::
>> +	This string value advertises a real-world location from where the bundle
>> +	URI is served. This can be used to present the user with an option for
>> +	which bundle URI to use. This is only valuable when `bundle.list.mode`
>> +	is `any`.
> 
> I am afraid I do not follow.  Do you mean, by "a real-world
> location", we write things like "America/Los_Angeles" and
> "Asia/Tokyo" in this field, so people can tell which one is cheaper
> to get to?  Do we make any further specification to make it machine
> usable in any way (I suspect machines would rather measure the
> latency and throughput against bundle.<id>.uri and .location is
> meant purely for human consumption)?

The intention is to be human-readable, for a user-facing prompt.
This could be for an interactive "chooser" or just letting the user
know "this is the location of the bundle URI I picked".

If we want the computer to automatically select, then using ping
latency would be a better way forward. Even in that case, it would
be helpful to tell the user "I discovered the closest bundle URI
is <location>".

>> +Here is an example bundle list using the Git config format:
>> +
>> +```
>> +[bundle "list"]
>> +	version = 1
>> +	mode = all
>> +	heuristic = timestamp
> 
> In all mode, how does heuristic help?  Doesn't mode=all by
> definition require you to grab everything anyway?

The heuristic is for incremental fetches, when you already have
some Git object data locally and don't want to download every
single bundle if you don't need to. (I think I have a step-by-step
flow of this lower in the doc.)

>> +[bundle "2022-02-09-1644442601-daily"]
>> +	uri = https://bundles.fake.com/git/git/2022-02-09-1644442601-daily.bundle
> 
> example.com (cf. RFC6761)?

Sure. Thanks.

>> +	timestamp = 1644442601
>> +	requires = 2022-02-02-1643842562
>> +
>> +[bundle "2022-02-02-1643842562"]
>> +	uri = https://bundles.fake.com/git/git/2022-02-02-1643842562.bundle
>> +	timestamp = 1643842562
>> +
>> +[bundle "2022-02-09-1644442631-daily-blobless"]
>> +	uri = 2022-02-09-1644442631-daily-blobless.bundle
>> +	timestamp = 1644442631
>> +	requires = 2022-02-02-1643842568-blobless
>> +	filter = blob:none
>> +
>> +[bundle "2022-02-02-1643842568-blobless"]
>> +	uri = /git/git/2022-02-02-1643842568-blobless.bundle
>> +	timestamp = 1643842568
>> +	filter = blob:none
>> +```
>> +
>> +This example uses `bundle.list.mode=all` as well as the
>> +`bundle.<id>.timestamp` heuristic. It also uses the `bundle.<id>.filter`
>> +options to present two parallel sets of bundles: one for full clones and
>> +another for blobless partial clones.
>> +
>> +Suppose that this bundle list was found at the URI
>> +`https://bundles.fake.com/git/git/` and so the two blobless bundles have
>> +the following fully-expanded URIs:
>> +
>> +* `https://bundles.fake.com/git/git/2022-02-09-1644442631-daily-blobless.bundle`
>> +* `https://bundles.fake.com/git/git/2022-02-02-1643842568-blobless.bundle`
> 
> So,... is the idea that bundle.list.mode=all does *not* mean "you
> need to get all of them"?  Rather, you first group bundles with the
> same filter, attribute, and then for each group with the same filter,
> you'd need to grab all of them?  IOW, if you are interested in a
> full clone, you can ignore <id>'s with non-empty bundle.<id>.filter 
> and grab all the rest?
> 
> If so, then I can see how the design makes sense.  I still do not
> know how heuristic kicks in, though.
> 
> ANother thing I noticed.  The above scheme makes it impossible to
> have <id> that happens to be "list".  I think the variables that
> apply to the entire list should be given two-level names, i.e.
> 
>       [bundle]
> 	version = 1
> 	mode = all
> 	heuristic = timestamp
>       [bundle "2022-02-09-1644442631-daily"]
> 	uri = ...

This then means that <id> can't be "version", "mode", or "heuristic",
or any other possible key that we use in the future, right? Using "list"
helped with this.

Perhaps we could do this:

	[bundles]
		version = 1
		mode = all
		heuristic = timestamp
	[bundle "id1"]
		uri = ...

Notably: "bundles" refers to the full list, while "bundle" refers
to a single bundle at a time. It makes the situation slightly more
complicated from the server side (we reserve bundles.* and bundle.*
for this advertisement).

>> +The client could choose an arbitrary bundle URI as an option _or_ select
>> +the URI with lowest latency by some exploratory checks.
> 
> Some places may have higher latency but great throughput.
> 
> The review for the rest of the document will be left for another
> sitting.

Thanks for your careful reading!
-Stolee

next prev parent reply	other threads:[~2022-06-08 19:20 UTC|newest]

Thread overview: 64+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-06-06 19:55 [PATCH 0/6] bundle URIs: design doc and initial git fetch --bundle-uri implementation Derrick Stolee via GitGitGadget
2022-06-06 19:55 ` [PATCH 1/6] docs: document bundle URI standard Derrick Stolee via GitGitGadget
2022-06-06 22:18   ` Junio C Hamano
2022-06-08 19:20     ` Derrick Stolee [this message]
2022-06-08 19:27       ` Junio C Hamano
2022-06-08 20:44         ` Junio C Hamano
2022-06-08 20:39       ` Junio C Hamano
2022-06-08 20:52         ` Derrick Stolee
2022-06-07  0:33   ` Junio C Hamano
2022-06-08 19:46     ` Derrick Stolee
2022-06-08 21:01       ` Junio C Hamano
2022-06-09 16:00         ` Derrick Stolee
2022-06-09 17:56           ` Junio C Hamano
2022-06-09 18:27             ` Ævar Arnfjörð Bjarmason
2022-06-09 19:39             ` Derrick Stolee
2022-06-09 20:13               ` Junio C Hamano
2022-06-21 19:34       ` Derrick Stolee
2022-06-21 20:16         ` Junio C Hamano
2022-06-21 21:10           ` Derrick Stolee
2022-06-21 21:33             ` Junio C Hamano
2022-06-06 19:55 ` [PATCH 2/6] remote-curl: add 'get' capability Derrick Stolee via GitGitGadget
2022-07-21 22:59   ` Junio C Hamano
2022-06-06 19:55 ` [PATCH 3/6] bundle-uri: create basic file-copy logic Derrick Stolee via GitGitGadget
2022-06-06 19:55 ` [PATCH 4/6] fetch: add --bundle-uri option Derrick Stolee via GitGitGadget
2022-06-06 19:55 ` [PATCH 5/6] bundle-uri: add support for http(s):// and file:// Derrick Stolee via GitGitGadget
2022-06-06 19:55 ` [PATCH 6/6] fetch: add 'refs/bundle/' to log.excludeDecoration Derrick Stolee via GitGitGadget
2022-06-29 20:40 ` [PATCH v2 0/6] bundle URIs: design doc and initial git fetch --bundle-uri implementation Derrick Stolee via GitGitGadget
2022-06-29 20:40   ` [PATCH v2 1/6] docs: document bundle URI standard Derrick Stolee via GitGitGadget
2022-07-18  9:20     ` SZEDER Gábor
2022-07-21 12:09     ` Matthew John Cheetham
2022-07-22 13:52       ` Derrick Stolee
2022-07-22 16:03       ` Derrick Stolee
2022-07-21 21:39     ` Josh Steadmon
2022-07-22 13:15       ` Derrick Stolee
2022-07-22 15:01       ` Derrick Stolee
2022-06-29 20:40   ` [PATCH v2 2/6] remote-curl: add 'get' capability Derrick Stolee via GitGitGadget
2022-07-21 21:41     ` Josh Steadmon
2022-06-29 20:40   ` [PATCH v2 3/6] bundle-uri: create basic file-copy logic Derrick Stolee via GitGitGadget
2022-07-21 21:45     ` Josh Steadmon
2022-07-22 13:18       ` Derrick Stolee
2022-06-29 20:40   ` [PATCH v2 4/6] fetch: add --bundle-uri option Derrick Stolee via GitGitGadget
2022-06-29 20:40   ` [PATCH v2 5/6] bundle-uri: add support for http(s):// and file:// Derrick Stolee via GitGitGadget
2022-06-29 20:40   ` [PATCH v2 6/6] fetch: add 'refs/bundle/' to log.excludeDecoration Derrick Stolee via GitGitGadget
2022-07-21 21:47     ` Josh Steadmon
2022-07-22 13:20       ` Derrick Stolee
2022-07-21 21:48   ` [PATCH v2 0/6] bundle URIs: design doc and initial git fetch --bundle-uri implementation Josh Steadmon
2022-07-21 21:56     ` Junio C Hamano
2022-07-25 13:53   ` [PATCH v3 0/2] " Derrick Stolee via GitGitGadget
2022-07-25 13:53     ` [PATCH v3 1/2] docs: document bundle URI standard Derrick Stolee via GitGitGadget
2022-07-28  1:23       ` tenglong.tl
2022-08-01 13:42         ` Derrick Stolee
2022-07-25 13:53     ` [PATCH v3 2/2] bundle-uri: add example bundle organization Derrick Stolee via GitGitGadget
2022-08-04 16:09       ` Matthew John Cheetham
2022-08-04 17:39         ` Derrick Stolee
2022-08-04 20:29           ` Ævar Arnfjörð Bjarmason
2022-08-05 18:29             ` Derrick Stolee
2022-07-25 20:05     ` [PATCH v3 0/2] bundle URIs: design doc and initial git fetch --bundle-uri implementation Josh Steadmon
2022-08-09 13:12     ` [PATCH v4 0/2] bundle URIs: design doc Derrick Stolee via GitGitGadget
2022-08-09 13:12       ` [PATCH v4 1/2] docs: document bundle URI standard Derrick Stolee via GitGitGadget
2022-10-04 19:48         ` Philip Oakley
2022-08-09 13:12       ` [PATCH v4 2/2] bundle-uri: add example bundle organization Derrick Stolee via GitGitGadget
2022-08-09 13:49       ` [PATCH v4 0/2] bundle URIs: design doc Phillip Wood
2022-08-09 15:50         ` Derrick Stolee
2022-08-11 15:42           ` Phillip Wood

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=9244644f-2c80-f23d-f054-3f0c961696af@github.com \
    --to=derrickstolee@github.com \
    --cc=Johannes.Schindelin@gmx.de \
    --cc=avarab@gmail.com \
    --cc=dyroneteng@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=gitster@pobox.com \
    --cc=me@ttaylorr.com \
    --cc=newren@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).