git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH 0/6] bundle URIs: design doc and initial git fetch --bundle-uri implementation
@ 2022-06-06 19:55 Derrick Stolee via GitGitGadget
  2022-06-06 19:55 ` [PATCH 1/6] docs: document bundle URI standard Derrick Stolee via GitGitGadget
                   ` (6 more replies)
  0 siblings, 7 replies; 64+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-06-06 19:55 UTC (permalink / raw)
  To: git
  Cc: gitster, me, newren, avarab, dyroneteng, Johannes.Schindelin,
	Derrick Stolee

This is the first of series towards building the bundle URI feature as
discussed in previous RFCs, specifically pulled directly out of [5]:

[1]
https://lore.kernel.org/git/RFC-cover-00.13-0000000000-20210805T150534Z-avarab@gmail.com/
[2]
https://lore.kernel.org/git/cover-0.3-00000000000-20211025T211159Z-avarab@gmail.com/
[3]
https://lore.kernel.org/git/pull.1160.git.1645641063.gitgitgadget@gmail.com
[4]
https://lore.kernel.org/git/RFC-cover-v2-00.36-00000000000-20220418T165545Z-avarab@gmail.com/
[5]
https://lore.kernel.org/git/pull.1234.git.1653072042.gitgitgadget@gmail.com

The first patch details the long-term design and goals of the bundle URI
feature, including complicated features such as the bundle-uri protocol v2
verb and bundle lists with heuristics.

However, then intention is to start small with the simplest features that
allow user benefit as soon as possible. In that direction, the rest of this
series creates the ability to run 'git fetch --bundle-uri=' to skip fetching
from any remote and instead download the file at the given . Currently, that
data is expected to be a bundle, which Git will then unbundle and modify the
refs to be in the 'refs/bundle/' namespace.

Currently, the can be a literal filename, a file:// URI, or an http[s]://
URI. Tests are added for both of these cases.

As outlined in [5], the next steps after this are:

 1. Add 'git clone --bundle-uri=' to run this 'git fetch --bundle-uri=' step
    before doing a fetch negotiation with the origin remote.
 2. Allow parsing a bundle list as a config file at the given URI. The
    key-value format is unified with the protocol v2 verb (coming in (3)).
 3. Implement the protocol v2 verb, re-using the bundle list logic from (2).
    Use this to auto-discover bundle URIs during 'git clone' (behind a
    config option).
 4. Implement the 'timestamp' heuristic, allowing incremental 'git fetch'
    commands to download a bundle list from a configured URI, and only
    download bundles that are new based on the timestamp values.

As mentioned in the design document, this is not all that is possible. For
instance, Ævar's suggestion to download only the bundle headers can be used
as a second heuristic (and as an augmentation of the timestamp heuristic).

Thanks, -Stolee

Derrick Stolee (6):
  docs: document bundle URI standard
  remote-curl: add 'get' capability
  bundle-uri: create basic file-copy logic
  fetch: add --bundle-uri option
  bundle-uri: add support for http(s):// and file://
  fetch: add 'refs/bundle/' to log.excludeDecoration

 Documentation/fetch-options.txt        |   6 +
 Documentation/git-fetch.txt            |   1 +
 Documentation/gitremote-helpers.txt    |   9 +
 Documentation/technical/bundle-uri.txt | 475 +++++++++++++++++++++++++
 Makefile                               |   1 +
 builtin/fetch.c                        |  10 +
 bundle-uri.c                           | 166 +++++++++
 bundle-uri.h                           |  14 +
 remote-curl.c                          |  33 ++
 t/t5557-http-get.sh                    |  37 ++
 t/t5558-fetch-bundle-uri.sh            |  77 ++++
 transport-helper.c                     |   5 +-
 12 files changed, 833 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/technical/bundle-uri.txt
 create mode 100644 bundle-uri.c
 create mode 100644 bundle-uri.h
 create mode 100755 t/t5557-http-get.sh
 create mode 100755 t/t5558-fetch-bundle-uri.sh


base-commit: 89c6e450fe4a919ecb6fa698005a935531c732cf
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1248%2Fderrickstolee%2Fbundle-redo%2Ffetch-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1248/derrickstolee/bundle-redo/fetch-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1248
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 1/6] docs: document bundle URI standard
  2022-06-06 19:55 [PATCH 0/6] bundle URIs: design doc and initial git fetch --bundle-uri implementation Derrick Stolee via GitGitGadget
@ 2022-06-06 19:55 ` Derrick Stolee via GitGitGadget
  2022-06-06 22:18   ` Junio C Hamano
  2022-06-07  0:33   ` Junio C Hamano
  2022-06-06 19:55 ` [PATCH 2/6] remote-curl: add 'get' capability Derrick Stolee via GitGitGadget
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 64+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-06-06 19:55 UTC (permalink / raw)
  To: git
  Cc: gitster, me, newren, avarab, dyroneteng, Johannes.Schindelin,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

Introduce the idea of bundle URIs to the Git codebase through an
aspirational design document. This document includes the full design
intended to include the feature in its fully-implemented form. This will
take several steps as detailed in the Implementation Plan section.

By committing this document now, it can be used to motivate changes
necessary to reach these final goals. The design can still be altered as
new information is discovered.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 Documentation/technical/bundle-uri.txt | 475 +++++++++++++++++++++++++
 1 file changed, 475 insertions(+)
 create mode 100644 Documentation/technical/bundle-uri.txt

diff --git a/Documentation/technical/bundle-uri.txt b/Documentation/technical/bundle-uri.txt
new file mode 100644
index 00000000000..6657ba079ab
--- /dev/null
+++ b/Documentation/technical/bundle-uri.txt
@@ -0,0 +1,475 @@
+Bundle URIs
+===========
+
+Bundle URIs are locations where Git can download one or more bundles in
+order to bootstrap the object database in advance of fetching the remaining
+objects from a remote.
+
+One goal is to speed up clones and fetches for users with poor network
+connectivity to the origin server. Another benefit is to allow heavy users,
+such as CI build farms, to use local resources for the majority of Git data
+and thereby reducing the load on the origin server.
+
+To enable the bundle URI feature, users can specify a bundle URI using
+command-line options or the origin server can advertise one or more URIs
+via a protocol v2 capability.
+
+Design Goals
+------------
+
+The bundle URI standard aims to be flexible enough to satisfy multiple
+workloads. The bundle provider and the Git client have several choices in
+how they create and consume bundle URIs.
+
+* Bundles can have whatever name the server desires. This name could refer
+  to immutable data by using a hash of the bundle contents. However, this
+  means that a new URI will be needed after every update of the content.
+  This might be acceptable if the server is advertising the URI (and the
+  server is aware of new bundles being generated) but would not be
+  ergonomic for users using the command line option.
+
+* The bundles could be organized specifically for bootstrapping full
+  clones, but could also be organized with the intention of bootstrapping
+  incremental fetches. The bundle provider must decide on one of several
+  organization schemes to minimize client downloads during incremental
+  fetches, but the Git client can also choose whether to use bundles for
+  either of these operations.
+
+* The bundle provider can choose to support full clones, partial clones,
+  or both. The client can detect which bundles are appropriate for the
+  repository's partial clone filter, if any.
+
+* The bundle provider can use a single bundle (for clones only), or a
+  list of bundles. When using a list of bundles, the provider can specify
+  whether or not the client needs _all_ of the bundle URIs for a full
+  clone, or if _any_ one of the bundle URIs is sufficient. This allows the
+  bundle provider to use different URIs for different geographies.
+
+* The bundle provider can organize the bundles using heuristics, such as
+  timestamps or tokens, to help the client prevent downloading bundles it
+  does not need. When the bundle provider does not provide these
+  heuristics, the client can use optimizations to minimize how much of the
+  data is downloaded.
+
+* The bundle provider does not need to be associated with the Git server.
+  The client can choose to use the bundle provider without it being
+  advertised by the Git server.
+
+* The client can choose to discover bundle providers that are advertised
+  by the Git server. This could happen during `git clone`, during
+  `git fetch`, both, or neither. The user can choose which combination
+  works best for them.
+
+* The client can choose to configure a bundle provider manually at any
+  time. The client can also choose to specify a bundle provider manually
+  as a command-line option to `git clone`.
+
+Each repository is different and every Git server has different needs.
+Hopefully the bundle URI feature is flexible enough to satisfy all needs.
+If not, then the feature can be extended through its versioning mechanism.
+
+Server requirements
+-------------------
+
+To provide a server-side implementation of bundle servers, no other parts
+of the Git protocol are required. This allows server maintainers to use
+static content solutions such as CDNs in order to serve the bundle files.
+
+At the current scope of the bundle URI feature, all URIs are expected to
+be HTTP(S) URLs where content is downloaded to a local file using a `GET`
+request to that URL. The server could include authentication requirements
+to those requests with the aim of triggering the configured credential
+helper for secure access. (Future extensions could use "file://" URIs or
+SSH URIs.)
+
+Assuming a `200 OK` response from the server, the content at the URL is
+expected to be of one of two forms:
+
+1. Bundle: A Git bundle file of version 2 or higher.
+
+2. Bundle List: A plain-text file that is parsable using Git's
+   config file parser. This file describes one or more bundles that are
+   accessible from other URIs.
+
+Any other data provided by the server is considered erroneous.
+
+Bundle Lists
+------------
+
+The Git server can advertise bundle URIs using a set of `key=value` pairs.
+A bundle URI can also serve a plain-text file in the Git config format
+containing these same `key=value` pairs. In both cases, we consider this
+to be a _bundle list_. The pairs specify information about the bundles
+that the client can use to make decisions for which bundles to download
+and which to ignore.
+
+A few keys focus on properties of the list itself.
+
+bundle.list.version::
+	(Required) This value provides a version number for the table of
+	contents. If a future Git change enables a feature that needs the Git
+	client to react to a new key in the bundle list file, then this version
+	will increment. The only current version number is 1, and if any other
+	value is specified then Git will fail to use this file.
+
+bundle.list.mode::
+	(Required) This value has one of two values: `all` and `any`. When `all`
+	is specified, then the client should expect to need all of the listed
+	bundle URIs that match their repository's requirements. When `any` is
+	specified, then the client should expect that any one of the bundle URIs
+	that match their repository's requirements will suffice. Typically, the
+	`any` option is used to list a number of different bundle servers
+	located in different geographies.
+
+bundle.list.heuristic::
+	If this string-valued key exists, then the bundle list is designed to
+  work well with incremental `git fetch` commands. The heuristic signals
+  that there are additional keys available for each bundle that help
+  determine which subset of bundles the client should download.
+
+The remaining keys include an `<id>` segment which is a server-designated
+name for each available bundle.
+
+bundle.<id>.uri::
+	(Required) This string value is the URI for downloading bundle `<id>`.
+	If the URI begins with a protocol (`http://` or `https://`) then the URI
+	is absolute. Otherwise, the URI is interpreted as relative to the URI
+	used for the bundle list. If the URI begins with `/`, then that relative
+	path is relative to the domain name used for the bundle list. (This use
+	of relative paths is intended to make it easier to distribute a set of
+	bundles across a large number of servers or CDNs with different domain
+	names.)
+
+bundle.<id>.list::
+	This boolean value indicates whether the client should expect the
+	content from this URI to be a list (if `true`) or a bundle (if `false`).
+	This is typically used when `bundle.list.mode` is `any`.
+
+bundle.<id>.filter::
+	This string value represents an object filter that should also appear in
+	the header of this bundle. The server uses this value to differentiate
+	different kinds of bundles from which the client can choose those that
+	match their object filters.
+
+bundle.<id>.timestamp::
+	This value is the number of seconds since Unix epoch (UTC) that this
+	bundle was created. This is used as an approximation of a point in time
+	that the bundle matches the data available at the origin server. This is
+	used when `bundle.list.heuristic=timestamp`.
+
+bundle.<id>.requires::
+	This string value represents the ID of another bundle. When present, the
+	server is indicating that this bundle contains a thin packfile. If the
+	client does not have all necessary objects to unbundle this packfile,
+	then the client can download the bundle with the `requires` ID and try
+	again. (Note: it may be beneficial to allow the server to specify
+	multiple `requires` bundles.) This is used when
+	`bundle.list.heuristic=timestamp`.
+
+bundle.<id>.location::
+	This string value advertises a real-world location from where the bundle
+	URI is served. This can be used to present the user with an option for
+	which bundle URI to use. This is only valuable when `bundle.list.mode`
+	is `any`.
+
+Here is an example bundle list using the Git config format:
+
+```
+[bundle "list"]
+	version = 1
+	mode = all
+	heuristic = timestamp
+
+[bundle "2022-02-09-1644442601-daily"]
+	uri = https://bundles.fake.com/git/git/2022-02-09-1644442601-daily.bundle
+	timestamp = 1644442601
+	requires = 2022-02-02-1643842562
+
+[bundle "2022-02-02-1643842562"]
+	uri = https://bundles.fake.com/git/git/2022-02-02-1643842562.bundle
+	timestamp = 1643842562
+
+[bundle "2022-02-09-1644442631-daily-blobless"]
+	uri = 2022-02-09-1644442631-daily-blobless.bundle
+	timestamp = 1644442631
+	requires = 2022-02-02-1643842568-blobless
+	filter = blob:none
+
+[bundle "2022-02-02-1643842568-blobless"]
+	uri = /git/git/2022-02-02-1643842568-blobless.bundle
+	timestamp = 1643842568
+	filter = blob:none
+```
+
+This example uses `bundle.list.mode=all` as well as the
+`bundle.<id>.timestamp` heuristic. It also uses the `bundle.<id>.filter`
+options to present two parallel sets of bundles: one for full clones and
+another for blobless partial clones.
+
+Suppose that this bundle list was found at the URI
+`https://bundles.fake.com/git/git/` and so the two blobless bundles have
+the following fully-expanded URIs:
+
+* `https://bundles.fake.com/git/git/2022-02-09-1644442631-daily-blobless.bundle`
+* `https://bundles.fake.com/git/git/2022-02-02-1643842568-blobless.bundle`
+
+Advertising Bundle URIs
+-----------------------
+
+If a user knows a bundle URI for the repository they are cloning, then
+they can specify that URI manually through a command-line option. However,
+a Git host may want to advertise bundle URIs during the clone operation,
+helping users unaware of the feature.
+
+The only thing required for this feature is that the server can advertise
+one or more bundle URIs. This advertisement takes the form of a new
+protocol v2 capability specifically for discovering bundle URIs.
+
+The client could choose an arbitrary bundle URI as an option _or_ select
+the URI with lowest latency by some exploratory checks. It is up to the
+bundle provider to decide if having multiple URIs is preferable to a
+single URI that is geodistributed through server-side infrastructure.
+
+Cloning with Bundle URIs
+------------------------
+
+The primary need for bundle URIs is to speed up clones. The Git client
+will interact with bundle URIs according to the following flow:
+
+1. The user specifies a bundle URI with the `--bundle-uri` command-line
+   option _or_ the client discovers a bundle list advertised by the
+   Git server.
+
+2. If the downloaded data from a bundle URI is a bundle, then the client
+   inspects the bundle headers to check that the negative commit OIDs are
+   present in the client repository. If some are missing, then the client
+   delays unbundling until other bundles have been unbundled, making those
+   OIDs present. When all required OIDs are present, the client unbundles
+   that data using a refspec. The default refspec is
+   `+refs/heads/*:refs/bundles/*`, but this can be configured.
+
+3. If the file is instead a bundle list, then the client inspects the
+   `bundle.list.mode` to see if the list is of the `all` or `any` form.
+
+   a. If `bundle.list.mode=all`, then the client considers all bundle
+      URIs. The list is reduced based on the `bundle.<id>.filter` options
+      matching the client repository's partial clone filter. Then, all
+      bundle URIs are requested. If the `bundle.<id>.timestamp` heuristic
+      is provided, then the bundles are downloaded in reverse-
+      chronological order, stopping when a bundle has all required OIDs.
+      The bundles can then be unbundled in chronological order. The client
+      stores the latest timestamp as a heuristic for avoiding future
+      downloads if the bundle list does not advertise newer bundles.
+
+   b. If `bundle.list.mode=any`, then the client can choose any one of the
+      bundle URIs to inspect. The client can use a variety of ways to
+      choose among these URIs. The client can also fallback to another URI
+      if the initial choice fails to return a result.
+
+Note that during a clone we expect that all bundles will be required, and
+heuristics such as `bundle.<uri>.timestamp` can be used to download bundles
+in chronological order or in parallel.
+
+If a given bundle URI is a bundle list with a `bundle.list.heuristic`
+value, then the client can choose to store that URI as its chosen bundle
+URI. The client can then navigate directly to that URI during later `git
+fetch` calls.
+
+When downloading bundle URIs, the client can choose to inspect the initial
+content before committing to downloading the entire content. This may
+provide enough information to determine if the URI is a bundle list or
+a bundle. In the case of a bundle, the client may inspect the bundle
+header to determine that all advertised tips are already in the client
+repository and cancel the remaining download.
+
+Fetching with Bundle URIs
+-------------------------
+
+When the client fetches new data, it can decide to fetch from bundle
+servers before fetching from the origin remote. This could be done via a
+command-line option, but it is more likely useful to use a config value
+such as the one specified during the clone.
+
+The fetch operation follows the same procedure to download bundles from a
+bundle list (although we do _not_ want to use parallel downloads here). We
+expect that the process will end when all negative commit OIDs in a thin
+bundle are already in the object database.
+
+When using the `timestamp` heuristic, the client can avoid downloading any
+bundles if their timestamps are not larger than the stored timestamp.
+After fetching new bundles, this local timestamp value is updated.
+
+If the bundle provider does not provide a heuristic, then the client
+should attempt to inspect the bundle headers before downloading the full
+bundle data in case the bundle tips already exist in the client
+repository.
+
+Error Conditions
+----------------
+
+If the Git client discovers something unexpected while downloading
+information according to a bundle URI or the bundle list found at that
+location, then Git can ignore that data and continue as if it was not
+given a bundle URI. The remote Git server is the ultimate source of truth,
+not the bundle URI.
+
+Here are a few example error conditions:
+
+* The client fails to connect with a server at the given URI or a connection
+  is lost without any chance to recover.
+
+* The client receives a response other than `200 OK` (such as `404 Not Found`,
+  `401 Not Authorized`, or `500 Internal Server Error`). The client should
+  use the `credential.helper` to attempt authentication after the first
+  `401 Not Authorized` response, but a second such response is a failure.
+
+* The client receives data that is not parsable as a bundle or table of
+  contents.
+
+* The bundle list describes a directed cycle in the
+  `bundle.<id>.requires` links.
+
+* A bundle includes a filter that does not match expectations.
+
+* The client cannot unbundle the bundles because the negative commit OIDs
+  are not in the object database and there are no more
+  `bundle.<id>.requires` links to follow.
+
+There are also situations that could be seen as wasteful, but are not
+error conditions:
+
+* The downloaded bundles contain more information than is requested by
+  the clone or fetch request. A primary example is if the user requests
+  a clone with `--single-branch` but downloads bundles that store every
+  reachable commit from all `refs/heads/*` references. This might be
+  initially wasteful, but perhaps these objects will become reachable by
+  a later ref update that the client cares about.
+
+* A bundle download during a `git fetch` contains objects already in the
+  object database. This is probably unavoidable if we are using bundles
+  for fetches, since the client will almost always be slightly ahead of
+  the bundle servers after performing its "catch-up" fetch to the remote
+  server. This extra work is most wasteful when the client is fetching
+  much more frequently than the server is computing bundles, such as if
+  the client is using hourly prefetches with background maintenance, but
+  the server is computing bundles weekly. For this reason, the client
+  should not use bundle URIs for fetch unless the server has explicitly
+  recommended it through the `bundle.list.forFetch = true` value.
+
+Implementation Plan
+-------------------
+
+This design document is being submitted on its own as an aspirational
+document, with the goal of implementing all of the mentioned client
+features over the course of several patch series. Here is a potential
+outline for submitting these features:
+
+1. Integrate bundle URIs into `git clone` with a `--bundle-uri` option.
+   This will include a new `git fetch --bundle-uri` mode for use as the
+   implementation underneath `git clone`. The initial version here will
+   expect a single bundle at the given URI.
+
+2. Implement the ability to parse a bundle list from a bundle URI and
+   update the `git fetch --bundle-uri` logic to properly distinguish
+   between `bundle.list.mode` options. Specifically design the feature so
+   that the config format parsing feeds a list of key-value pairs into the
+   bundle list logic.
+
+3. Create the `bundle-uri` protocol v2 verb so Git servers can advertise
+   bundle URIs using the key-value pairs. Plug into the existing key-value
+   input to the bundle list logic. Allow `git clone` to discover these
+   bundle URIs and bootstrap the client repository from the bundle data.
+   (This choice is an opt-in via a config option and a command-line
+   option.)
+
+4. Allow the client to understand the `bundle.list.forFetch` configuration
+   and the `bundle.<id>.timestamp` heuristic. When `git clone` discovers a
+   bundle URI with `bundle.list.forFetch=true`, it configures the client
+   repository to check that bundle URI during later `git fetch <remote>`
+   commands.
+
+5. Allow clients to discover bundle URIs during `git fetch` and configure
+   a bundle URI for later fetches if `bundle.list.forFetch=true`.
+
+6. Implement the "inspect headers" heuristic to reduce data downloads when
+   the `bundle.<id>.timestamp` heuristic is not available.
+
+As these features are reviewed, this plan might be updated. We also expect
+that new designs will be discovered and implemented as this feature
+matures and becomes used in real-world scenarios.
+
+Related Work: Packfile URIs
+---------------------------
+
+The Git protocol already has a capability where the Git server can list
+a set of URLs along with the packfile response when serving a client
+request. The client is then expected to download the packfiles at those
+locations in order to have a complete understanding of the response.
+
+This mechanism is used by the Gerrit server (implemented with JGit) and
+has been effective at reducing CPU load and improving user performance for
+clones.
+
+A major downside to this mechanism is that the origin server needs to know
+_exactly_ what is in those packfiles, and the packfiles need to be available
+to the user for some time after the server has responded. This coupling
+between the origin and the packfile data is difficult to manage.
+
+Further, this implementation is extremely hard to make work with fetches.
+
+Related Work: GVFS Cache Servers
+--------------------------------
+
+The GVFS Protocol [2] is a set of HTTP endpoints designed independently of
+the Git project before Git's partial clone was created. One feature of this
+protocol is the idea of a "cache server" which can be colocated with build
+machines or developer offices to transfer Git data without overloading the
+central server.
+
+The endpoint that VFS for Git is famous for is the `GET /gvfs/objects/{oid}`
+endpoint, which allows downloading an object on-demand. This is a critical
+piece of the filesystem virtualization of that product.
+
+However, a more subtle need is the `GET /gvfs/prefetch?lastPackTimestamp=<t>`
+endpoint. Given an optional timestamp, the cache server responds with a list
+of precomputed packfiles containing the commits and trees that were introduced
+in those time intervals.
+
+The cache server computes these "prefetch" packfiles using the following
+strategy:
+
+1. Every hour, an "hourly" pack is generated with a given timestamp.
+2. Nightly, the previous 24 hourly packs are rolled up into a "daily" pack.
+3. Nightly, all prefetch packs more than 30 days old are rolled up into
+   one pack.
+
+When a user runs `gvfs clone` or `scalar clone` against a repo with cache
+servers, the client requests all prefetch packfiles, which is at most
+`24 + 30 + 1` packfiles downloading only commits and trees. The client
+then follows with a request to the origin server for the references, and
+attempts to checkout that tip reference. (There is an extra endpoint that
+helps get all reachable trees from a given commit, in case that commit
+was not already in a prefetch packfile.)
+
+During a `git fetch`, a hook requests the prefetch endpoint using the
+most-recent timestamp from a previously-downloaded prefetch packfile.
+Only the list of packfiles with later timestamps are downloaded. Most
+users fetch hourly, so they get at most one hourly prefetch pack. Users
+whose machines have been off or otherwise have not fetched in over 30 days
+might redownload all prefetch packfiles. This is rare.
+
+It is important to note that the clients always contact the origin server
+for the refs advertisement, so the refs are frequently "ahead" of the
+prefetched pack data. The missing objects are downloaded on-demand using
+the `GET gvfs/objects/{oid}` requests, when needed by a command such as
+`git checkout` or `git log`. Some Git optimizations disable checks that
+would cause these on-demand downloads to be too aggressive.
+
+See Also
+--------
+
+[1] https://lore.kernel.org/git/RFC-cover-00.13-0000000000-20210805T150534Z-avarab@gmail.com/
+    An earlier RFC for a bundle URI feature.
+
+[2] https://github.com/microsoft/VFSForGit/blob/master/Protocol.md
+    The GVFS Protocol
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 2/6] remote-curl: add 'get' capability
  2022-06-06 19:55 [PATCH 0/6] bundle URIs: design doc and initial git fetch --bundle-uri implementation Derrick Stolee via GitGitGadget
  2022-06-06 19:55 ` [PATCH 1/6] docs: document bundle URI standard Derrick Stolee via GitGitGadget
@ 2022-06-06 19:55 ` Derrick Stolee via GitGitGadget
  2022-07-21 22:59   ` Junio C Hamano
  2022-06-06 19:55 ` [PATCH 3/6] bundle-uri: create basic file-copy logic Derrick Stolee via GitGitGadget
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 64+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-06-06 19:55 UTC (permalink / raw)
  To: git
  Cc: gitster, me, newren, avarab, dyroneteng, Johannes.Schindelin,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

A future change will want a way to download a file over HTTP(S) using
the simplest of download mechanisms. We do not want to assume that the
server on the other side understands anything about the Git protocol but
could be a simple static web server.

Create the new 'get' capability for the remote helpers which advertises
that the 'get' command is avalable. A caller can send a line containing
'get <url> <path>' to download the file at <url> into the file at
<path>.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 Documentation/gitremote-helpers.txt |  9 +++++++
 remote-curl.c                       | 33 +++++++++++++++++++++++++
 t/t5557-http-get.sh                 | 37 +++++++++++++++++++++++++++++
 transport-helper.c                  |  5 +++-
 4 files changed, 83 insertions(+), 1 deletion(-)
 create mode 100755 t/t5557-http-get.sh

diff --git a/Documentation/gitremote-helpers.txt b/Documentation/gitremote-helpers.txt
index 6f1e269ae43..ed8da428c98 100644
--- a/Documentation/gitremote-helpers.txt
+++ b/Documentation/gitremote-helpers.txt
@@ -168,6 +168,9 @@ Supported commands: 'list', 'import'.
 	Can guarantee that when a clone is requested, the received
 	pack is self contained and is connected.
 
+'get'::
+	Can use the 'get' command to download a file from a given URI.
+
 If a helper advertises 'connect', Git will use it if possible and
 fall back to another capability if the helper requests so when
 connecting (see the 'connect' command under COMMANDS).
@@ -418,6 +421,12 @@ Supported if the helper has the "connect" capability.
 +
 Supported if the helper has the "stateless-connect" capability.
 
+'get' <uri> <path>::
+	Downloads the file from the given `<uri>` to the given `<path>`. If
+	`<path>.temp` exists, then Git assumes that the `.temp` file is a
+	partial download from a previous attempt and will resume the
+	download from that position.
+
 If a fatal error occurs, the program writes the error message to
 stderr and exits. The caller should expect that a suitable error
 message has been printed if the child closes the connection without
diff --git a/remote-curl.c b/remote-curl.c
index 67f178b1120..f005419f872 100644
--- a/remote-curl.c
+++ b/remote-curl.c
@@ -1276,6 +1276,34 @@ static void parse_fetch(struct strbuf *buf)
 	strbuf_reset(buf);
 }
 
+static void parse_get(struct strbuf *buf)
+{
+	struct http_get_options opts = { 0 };
+	struct strbuf url = STRBUF_INIT;
+	struct strbuf path = STRBUF_INIT;
+	const char *p, *space;
+
+	if (!skip_prefix(buf->buf, "get ", &p))
+		die(_("http transport does not support %s"), buf->buf);
+
+	space = strchr(p, ' ');
+
+	if (!space)
+		die(_("protocol error: expected '<url> <path>', missing space"));
+
+	strbuf_add(&url, p, space - p);
+	strbuf_addstr(&path, space + 1);
+
+	if (http_get_file(url.buf, path.buf, &opts))
+		die(_("failed to download file at URL '%s'"), url.buf);
+
+	strbuf_release(&url);
+	strbuf_release(&path);
+	printf("\n");
+	fflush(stdout);
+	strbuf_reset(buf);
+}
+
 static int push_dav(int nr_spec, const char **specs)
 {
 	struct child_process child = CHILD_PROCESS_INIT;
@@ -1549,9 +1577,14 @@ int cmd_main(int argc, const char **argv)
 				printf("unsupported\n");
 			fflush(stdout);
 
+		} else if (skip_prefix(buf.buf, "get ", &arg)) {
+			parse_get(&buf);
+			fflush(stdout);
+
 		} else if (!strcmp(buf.buf, "capabilities")) {
 			printf("stateless-connect\n");
 			printf("fetch\n");
+			printf("get\n");
 			printf("option\n");
 			printf("push\n");
 			printf("check-connectivity\n");
diff --git a/t/t5557-http-get.sh b/t/t5557-http-get.sh
new file mode 100755
index 00000000000..1fd4ded3eb1
--- /dev/null
+++ b/t/t5557-http-get.sh
@@ -0,0 +1,37 @@
+#!/bin/sh
+
+test_description='test downloading a file by URL'
+
+. ./test-lib.sh
+
+. "$TEST_DIRECTORY"/lib-httpd.sh
+start_httpd
+
+test_expect_success 'get by URL: 404' '
+	url="$HTTPD_URL/none.txt" &&
+	cat >input <<-EOF &&
+	capabilities
+	get $url file1
+	EOF
+
+	test_must_fail git remote-http $url $url <input 2>err &&
+	test_path_is_missing file1 &&
+	grep "failed to download file at URL" err &&
+	rm file1.temp
+'
+
+test_expect_success 'get by URL: 200' '
+	echo data >"$HTTPD_DOCUMENT_ROOT_PATH/exists.txt" &&
+
+	url="$HTTPD_URL/exists.txt" &&
+	cat >input <<-EOF &&
+	capabilities
+	get $url file2
+
+	EOF
+
+	GIT_TRACE2_PERF=1 git remote-http $url $url <input &&
+	test_cmp "$HTTPD_DOCUMENT_ROOT_PATH/exists.txt" file2
+'
+
+test_done
diff --git a/transport-helper.c b/transport-helper.c
index b4dbbabb0c2..dfbeaebe40c 100644
--- a/transport-helper.c
+++ b/transport-helper.c
@@ -33,7 +33,8 @@ struct helper_data {
 		check_connectivity : 1,
 		no_disconnect_req : 1,
 		no_private_update : 1,
-		object_format : 1;
+		object_format : 1,
+		get : 1;
 
 	/*
 	 * As an optimization, the transport code may invoke fetch before
@@ -210,6 +211,8 @@ static struct child_process *get_helper(struct transport *transport)
 			data->no_private_update = 1;
 		} else if (starts_with(capname, "object-format")) {
 			data->object_format = 1;
+		} else if (!strcmp(capname, "get")) {
+			data->get = 1;
 		} else if (mandatory) {
 			die(_("unknown mandatory capability %s; this remote "
 			      "helper probably needs newer version of Git"),
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 3/6] bundle-uri: create basic file-copy logic
  2022-06-06 19:55 [PATCH 0/6] bundle URIs: design doc and initial git fetch --bundle-uri implementation Derrick Stolee via GitGitGadget
  2022-06-06 19:55 ` [PATCH 1/6] docs: document bundle URI standard Derrick Stolee via GitGitGadget
  2022-06-06 19:55 ` [PATCH 2/6] remote-curl: add 'get' capability Derrick Stolee via GitGitGadget
@ 2022-06-06 19:55 ` Derrick Stolee via GitGitGadget
  2022-06-06 19:55 ` [PATCH 4/6] fetch: add --bundle-uri option Derrick Stolee via GitGitGadget
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 64+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-06-06 19:55 UTC (permalink / raw)
  To: git
  Cc: gitster, me, newren, avarab, dyroneteng, Johannes.Schindelin,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

Before implementing a way to fetch bundles into a repository, create the
basic logic. Assume that the URI is actually a file path. Future logic
will make this more careful to other protocols.

For now, we also only succeed if the content at the URI is a bundle
file, not a bundle list. Bundle lists will be implemented in a future
change.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 Makefile     |  1 +
 bundle-uri.c | 92 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 bundle-uri.h | 14 ++++++++
 3 files changed, 107 insertions(+)
 create mode 100644 bundle-uri.c
 create mode 100644 bundle-uri.h

diff --git a/Makefile b/Makefile
index f8bccfab5e9..8f27310836d 100644
--- a/Makefile
+++ b/Makefile
@@ -887,6 +887,7 @@ LIB_OBJS += blob.o
 LIB_OBJS += bloom.o
 LIB_OBJS += branch.o
 LIB_OBJS += bulk-checkin.o
+LIB_OBJS += bundle-uri.o
 LIB_OBJS += bundle.o
 LIB_OBJS += cache-tree.o
 LIB_OBJS += cbtree.o
diff --git a/bundle-uri.c b/bundle-uri.c
new file mode 100644
index 00000000000..07b4bfe4e11
--- /dev/null
+++ b/bundle-uri.c
@@ -0,0 +1,92 @@
+#include "cache.h"
+#include "bundle-uri.h"
+#include "bundle.h"
+#include "object-store.h"
+#include "refs.h"
+#include "run-command.h"
+
+static void find_temp_filename(struct strbuf *name)
+{
+	int fd;
+	/*
+	 * Find a temporray filename that is available. This is briefly
+	 * racy, but unlikely to collide.
+	 */
+	fd = odb_mkstemp(name, "bundles/tmp_uri_XXXXXX");
+	if (fd < 0)
+		die(_("failed to create temporary file"));
+	close(fd);
+	unlink(name->buf);
+}
+
+static int copy_uri_to_file(const char *uri, const char *file)
+{
+	/* Copy as a file */
+	return copy_file(file, uri, 0444);
+}
+
+static int unbundle_from_file(struct repository *r, const char *file)
+{
+	int result = 0;
+	int bundle_fd;
+	struct bundle_header header = BUNDLE_HEADER_INIT;
+	struct strvec extra_index_pack_args = STRVEC_INIT;
+	struct string_list_item *refname;
+	struct strbuf bundle_ref = STRBUF_INIT;
+	size_t bundle_prefix_len;
+
+	if ((bundle_fd = read_bundle_header(file, &header)) < 0)
+		return 1;
+
+	result = unbundle(r, &header, bundle_fd, &extra_index_pack_args);
+
+	/*
+	 * Convert all refs/heads/ from the bundle into refs/bundles/
+	 * in the local repository.
+	 */
+	strbuf_addstr(&bundle_ref, "refs/bundles/");
+	bundle_prefix_len = bundle_ref.len;
+
+	for_each_string_list_item(refname, &header.references) {
+		struct object_id *oid = refname->util;
+		struct object_id old_oid;
+		const char *branch_name;
+		int has_old;
+
+		if (!skip_prefix(refname->string, "refs/heads/", &branch_name))
+			continue;
+
+		strbuf_setlen(&bundle_ref, bundle_prefix_len);
+		strbuf_addstr(&bundle_ref, branch_name);
+
+		has_old = !read_ref(bundle_ref.buf, &old_oid);
+		update_ref("fetched bundle", bundle_ref.buf, oid,
+			   has_old ? &old_oid : NULL,
+			   REF_SKIP_OID_VERIFICATION,
+			   UPDATE_REFS_MSG_ON_ERR);
+	}
+
+	bundle_header_release(&header);
+	return result;
+}
+
+int fetch_bundle_uri(struct repository *r, const char *uri)
+{
+	int result = 0;
+	struct strbuf filename = STRBUF_INIT;
+
+	find_temp_filename(&filename);
+	if ((result = copy_uri_to_file(uri, filename.buf)))
+		goto cleanup;
+
+	if ((result = !is_bundle(filename.buf, 0)))
+		goto cleanup;
+
+	if ((result = unbundle_from_file(r, filename.buf)))
+		goto cleanup;
+
+cleanup:
+	unlink(filename.buf);
+	strbuf_release(&filename);
+	return result;
+}
diff --git a/bundle-uri.h b/bundle-uri.h
new file mode 100644
index 00000000000..8a152f1ef14
--- /dev/null
+++ b/bundle-uri.h
@@ -0,0 +1,14 @@
+#ifndef BUNDLE_URI_H
+#define BUNDLE_URI_H
+
+struct repository;
+
+/**
+ * Fetch data from the given 'uri' and unbundle the bundle data found
+ * based on that information.
+ *
+ * Returns non-zero if no bundle information is found at the given 'uri'.
+ */
+int fetch_bundle_uri(struct repository *r, const char *uri);
+
+#endif
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 4/6] fetch: add --bundle-uri option
  2022-06-06 19:55 [PATCH 0/6] bundle URIs: design doc and initial git fetch --bundle-uri implementation Derrick Stolee via GitGitGadget
                   ` (2 preceding siblings ...)
  2022-06-06 19:55 ` [PATCH 3/6] bundle-uri: create basic file-copy logic Derrick Stolee via GitGitGadget
@ 2022-06-06 19:55 ` Derrick Stolee via GitGitGadget
  2022-06-06 19:55 ` [PATCH 5/6] bundle-uri: add support for http(s):// and file:// Derrick Stolee via GitGitGadget
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 64+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-06-06 19:55 UTC (permalink / raw)
  To: git
  Cc: gitster, me, newren, avarab, dyroneteng, Johannes.Schindelin,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

Teach 'git fetch' a new --bundle-uri=<uri> option which changes the mode
from fetching from a remote using the Git protocol to fetching a bundle
from the given <uri>. See Documentation/technical/bundle-uri.txt for
more information on the design of this feature.

This implementation is limited to the most basic version of the feature.
We expect the content at that URI to be a bundle file, not a bundle
list. Bundle lists will be implemented later.

This implementation is sufficient for a bundle provider to create a
single bootstrap bundle for a large repository. The user would bootstrap
a repository using a sequence of Git commands, such as:

 1. git init <repo> && cd <repo>
 2. git fetch --bundle-uri=<uri>
 3. git remote add origin <url>
 4. git fetch origin
 5. git checkout FETCH_HEAD

Later changes will make this seamless within a 'git clone' command, but
this implementation is large enough to delay that integration.

Currently, this option supports local filenames. Other protocols will be
added in the future.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 Documentation/fetch-options.txt |  5 +++++
 Documentation/git-fetch.txt     |  1 +
 builtin/fetch.c                 | 10 ++++++++++
 bundle-uri.c                    |  8 ++++++--
 t/t5558-fetch-bundle-uri.sh     | 34 +++++++++++++++++++++++++++++++++
 5 files changed, 56 insertions(+), 2 deletions(-)
 create mode 100755 t/t5558-fetch-bundle-uri.sh

diff --git a/Documentation/fetch-options.txt b/Documentation/fetch-options.txt
index 622bd84768b..09bd1feeed8 100644
--- a/Documentation/fetch-options.txt
+++ b/Documentation/fetch-options.txt
@@ -317,3 +317,8 @@ endif::git-pull[]
 -6::
 --ipv6::
 	Use IPv6 addresses only, ignoring IPv4 addresses.
+
+--bundle-uri=<uri>::
+	Instead of fetching from a remote, fetch a bundle from the given
+	`<uri>` and unbundle the data into the local repository. The refs
+	in the bundle will be stored under the `refs/bundle/*` namespace.
diff --git a/Documentation/git-fetch.txt b/Documentation/git-fetch.txt
index e9d364669af..4fd8911b336 100644
--- a/Documentation/git-fetch.txt
+++ b/Documentation/git-fetch.txt
@@ -13,6 +13,7 @@ SYNOPSIS
 'git fetch' [<options>] <group>
 'git fetch' --multiple [<options>] [(<repository> | <group>)...]
 'git fetch' --all [<options>]
+'git fetch' --bundle-uri=<uri> [<options>]
 
 
 DESCRIPTION
diff --git a/builtin/fetch.c b/builtin/fetch.c
index e3791f09ed5..cb0d2fbe82c 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -29,6 +29,7 @@
 #include "commit-graph.h"
 #include "shallow.h"
 #include "worktree.h"
+#include "bundle-uri.h"
 
 #define FORCED_UPDATES_DELAY_WARNING_IN_MS (10 * 1000)
 
@@ -37,6 +38,7 @@ static const char * const builtin_fetch_usage[] = {
 	N_("git fetch [<options>] <group>"),
 	N_("git fetch --multiple [<options>] [(<repository> | <group>)...]"),
 	N_("git fetch --all [<options>]"),
+	N_("git fetch --bundle-uri=<uri> [<options>]"),
 	NULL
 };
 
@@ -86,6 +88,7 @@ static struct string_list negotiation_tip = STRING_LIST_INIT_NODUP;
 static int fetch_write_commit_graph = -1;
 static int stdin_refspecs = 0;
 static int negotiate_only;
+static const char *bundle_uri;
 
 static int git_fetch_config(const char *k, const char *v, void *cb)
 {
@@ -224,6 +227,8 @@ static struct option builtin_fetch_options[] = {
 		 N_("write the commit-graph after fetching")),
 	OPT_BOOL(0, "stdin", &stdin_refspecs,
 		 N_("accept refspecs from stdin")),
+	OPT_STRING(0, "bundle-uri", &bundle_uri, N_("uri"),
+		   N_("download bundle data from the given URI instead of from a remote")),
 	OPT_END()
 };
 
@@ -2181,6 +2186,11 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
 	if (dry_run)
 		write_fetch_head = 0;
 
+	if (bundle_uri) {
+		result = fetch_bundle_uri(the_repository, bundle_uri);
+		goto cleanup;
+	}
+
 	if (all) {
 		if (argc == 1)
 			die(_("fetch --all does not take a repository argument"));
diff --git a/bundle-uri.c b/bundle-uri.c
index 07b4bfe4e11..095779352e7 100644
--- a/bundle-uri.c
+++ b/bundle-uri.c
@@ -76,11 +76,15 @@ int fetch_bundle_uri(struct repository *r, const char *uri)
 	struct strbuf filename = STRBUF_INIT;
 
 	find_temp_filename(&filename);
-	if ((result = copy_uri_to_file(uri, filename.buf)))
+	if ((result = copy_uri_to_file(uri, filename.buf))) {
+		error(_("failed to download bundle from URI '%s'"), uri);
 		goto cleanup;
+	}
 
-	if ((result = !is_bundle(filename.buf, 0)))
+	if ((result = !is_bundle(filename.buf, 0))) {
+		error(_("file at URI '%s' is not a bundle"), uri);
 		goto cleanup;
+	}
 
 	if ((result = unbundle_from_file(r, filename.buf)))
 		goto cleanup;
diff --git a/t/t5558-fetch-bundle-uri.sh b/t/t5558-fetch-bundle-uri.sh
new file mode 100755
index 00000000000..381e56cac20
--- /dev/null
+++ b/t/t5558-fetch-bundle-uri.sh
@@ -0,0 +1,34 @@
+#!/bin/sh
+
+test_description='test fetching bundles with --bundle-uri'
+
+. ./test-lib.sh
+
+test_expect_success 'fail to fetch from non-existent file' '
+	test_must_fail git fetch --bundle-uri="$(pwd)/does-not-exist" 2>err &&
+	grep "failed to download bundle from URI" err
+'
+
+test_expect_success 'fail to fetch from non-bundle file' '
+	echo bogus >bogus &&
+	test_must_fail git fetch --bundle-uri="$(pwd)/bogus" 2>err &&
+	grep "is not a bundle" err
+'
+
+test_expect_success 'create bundle' '
+	git init fetch-from &&
+	git -C fetch-from checkout -b topic &&
+	test_commit -C fetch-from A &&
+	test_commit -C fetch-from B &&
+	git -C fetch-from bundle create B.bundle topic
+'
+
+test_expect_success 'fetch file bundle' '
+	git init fetch-to &&
+	git -C fetch-to fetch --bundle-uri="$(pwd)/fetch-from/B.bundle" &&
+	git -C fetch-to rev-parse refs/bundles/topic >actual &&
+	git -C fetch-from rev-parse topic >expect &&
+	test_cmp expect actual
+'
+
+test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 5/6] bundle-uri: add support for http(s):// and file://
  2022-06-06 19:55 [PATCH 0/6] bundle URIs: design doc and initial git fetch --bundle-uri implementation Derrick Stolee via GitGitGadget
                   ` (3 preceding siblings ...)
  2022-06-06 19:55 ` [PATCH 4/6] fetch: add --bundle-uri option Derrick Stolee via GitGitGadget
@ 2022-06-06 19:55 ` Derrick Stolee via GitGitGadget
  2022-06-06 19:55 ` [PATCH 6/6] fetch: add 'refs/bundle/' to log.excludeDecoration Derrick Stolee via GitGitGadget
  2022-06-29 20:40 ` [PATCH v2 0/6] bundle URIs: design doc and initial git fetch --bundle-uri implementation Derrick Stolee via GitGitGadget
  6 siblings, 0 replies; 64+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-06-06 19:55 UTC (permalink / raw)
  To: git
  Cc: gitster, me, newren, avarab, dyroneteng, Johannes.Schindelin,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The previous change created the logic for copying a file by name. Now,
first inspect the URI for an HTTP(S) prefix and use git-remote-https as
the way to download the data at that URI. Otherwise, check to see if
file:// is present and modify the prefix accordingly.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 bundle-uri.c                | 65 ++++++++++++++++++++++++++++++++++++-
 t/t5558-fetch-bundle-uri.sh | 37 +++++++++++++++++++++
 2 files changed, 101 insertions(+), 1 deletion(-)

diff --git a/bundle-uri.c b/bundle-uri.c
index 095779352e7..576ced271cb 100644
--- a/bundle-uri.c
+++ b/bundle-uri.c
@@ -19,10 +19,73 @@ static void find_temp_filename(struct strbuf *name)
 	unlink(name->buf);
 }
 
+static int download_https_uri_to_file(const char *uri, const char *file)
+{
+	int result = 0;
+	struct child_process cp = CHILD_PROCESS_INIT;
+	FILE *child_in = NULL, *child_out = NULL;
+	struct strbuf line = STRBUF_INIT;
+	int found_get = 0;
+
+	strvec_pushl(&cp.args, "git-remote-https", "origin", uri, NULL);
+	cp.in = -1;
+	cp.out = -1;
+
+	if (start_command(&cp))
+		return 1;
+
+	child_in = fdopen(cp.in, "w");
+	if (!child_in) {
+		result = 1;
+		goto cleanup;
+	}
+
+	child_out = fdopen(cp.out, "r");
+	if (!child_out) {
+		result = 1;
+		goto cleanup;
+	}
+
+	fprintf(child_in, "capabilities\n");
+	fflush(child_in);
+
+	while (!strbuf_getline(&line, child_out)) {
+		if (!line.len)
+			break;
+		if (!strcmp(line.buf, "get"))
+			found_get = 1;
+	}
+	strbuf_release(&line);
+
+	if (!found_get) {
+		result = error(_("insufficient capabilities"));
+		goto cleanup;
+	}
+
+	fprintf(child_in, "get %s %s\n\n", uri, file);
+
+cleanup:
+	if (child_in)
+		fclose(child_in);
+	if (finish_command(&cp))
+		return 1;
+	if (child_out)
+		fclose(child_out);
+	return result;
+}
+
 static int copy_uri_to_file(const char *uri, const char *file)
 {
+	const char *out;
+	if (skip_prefix(uri, "https:", &out) ||
+	    skip_prefix(uri, "http:", &out))
+		return download_https_uri_to_file(uri, file);
+
+	if (!skip_prefix(uri, "file://", &out))
+		out = uri;
+
 	/* Copy as a file */
-	return copy_file(file, uri, 0444);
+	return !!copy_file(file, out, 0);
 }
 
 static int unbundle_from_file(struct repository *r, const char *file)
diff --git a/t/t5558-fetch-bundle-uri.sh b/t/t5558-fetch-bundle-uri.sh
index 381e56cac20..919db6f4551 100755
--- a/t/t5558-fetch-bundle-uri.sh
+++ b/t/t5558-fetch-bundle-uri.sh
@@ -31,4 +31,41 @@ test_expect_success 'fetch file bundle' '
 	test_cmp expect actual
 '
 
+test_expect_success 'fetch file:// bundle' '
+	git init fetch-file &&
+	git -C fetch-file fetch --bundle-uri="file://$(pwd)/fetch-from/B.bundle" &&
+	git -C fetch-file rev-parse refs/bundles/topic >actual &&
+	git -C fetch-from rev-parse topic >expect &&
+	test_cmp expect actual
+'
+
+#########################################################################
+# HTTP tests begin here
+
+. "$TEST_DIRECTORY"/lib-httpd.sh
+start_httpd
+
+test_expect_success 'fail to fetch from non-existent HTTP URL' '
+	test_must_fail git fetch --bundle-uri="$HTTPD_URL/does-not-exist" 2>err &&
+	grep "failed to download bundle from URI" err
+'
+
+test_expect_success 'fail to fetch from non-bundle HTTP URL' '
+	echo bogus >"$HTTPD_DOCUMENT_ROOT_PATH/bogus" &&
+	test_must_fail git fetch --bundle-uri="$HTTPD_URL/bogus" 2>err &&
+	grep "is not a bundle" err
+'
+
+test_expect_success 'fetch HTTP bundle' '
+	cp fetch-from/B.bundle "$HTTPD_DOCUMENT_ROOT_PATH/B.bundle" &&
+	git init fetch-http &&
+	git -C fetch-http fetch --bundle-uri="$HTTPD_URL/B.bundle" &&
+	git -C fetch-http rev-parse refs/bundles/topic >actual &&
+	git -C fetch-from rev-parse topic >expect &&
+	test_cmp expect actual
+'
+
+# Do not add tests here unless they use the HTTP server, as they will
+# not run unless the HTTP dependencies exist.
+
 test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 6/6] fetch: add 'refs/bundle/' to log.excludeDecoration
  2022-06-06 19:55 [PATCH 0/6] bundle URIs: design doc and initial git fetch --bundle-uri implementation Derrick Stolee via GitGitGadget
                   ` (4 preceding siblings ...)
  2022-06-06 19:55 ` [PATCH 5/6] bundle-uri: add support for http(s):// and file:// Derrick Stolee via GitGitGadget
@ 2022-06-06 19:55 ` Derrick Stolee via GitGitGadget
  2022-06-29 20:40 ` [PATCH v2 0/6] bundle URIs: design doc and initial git fetch --bundle-uri implementation Derrick Stolee via GitGitGadget
  6 siblings, 0 replies; 64+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-06-06 19:55 UTC (permalink / raw)
  To: git
  Cc: gitster, me, newren, avarab, dyroneteng, Johannes.Schindelin,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

When fetching from a bundle URI, the branches of that bundle are stored
in a different ref namespace: refs/bundles/. This namespace is intended
to assist with later 'git fetch' negotiations with a Git server,
allowing the client to advertise which data it already has from a bundle
URI.

These references can be confusing for a user when they appear as a
decoration in 'git log' output. Add "refs/bundles/" to the multi-valued
log.excludeDecoration config value. This is similar to the way
"refs/prefetch/" is hidden by background prefetch operations in 'git
maintenance' as added by 96eaffebb (maintenance: set
log.excludeDecoration durin prefetch, 2021-01-19).

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 Documentation/fetch-options.txt |  3 ++-
 bundle-uri.c                    |  7 +++++++
 t/t5558-fetch-bundle-uri.sh     | 12 +++++++++---
 3 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/Documentation/fetch-options.txt b/Documentation/fetch-options.txt
index 09bd1feeed8..8b801bcc2f3 100644
--- a/Documentation/fetch-options.txt
+++ b/Documentation/fetch-options.txt
@@ -321,4 +321,5 @@ endif::git-pull[]
 --bundle-uri=<uri>::
 	Instead of fetching from a remote, fetch a bundle from the given
 	`<uri>` and unbundle the data into the local repository. The refs
-	in the bundle will be stored under the `refs/bundle/*` namespace.
+	in the bundle will be stored under the hidden `refs/bundle/*`
+	namespace.
diff --git a/bundle-uri.c b/bundle-uri.c
index 576ced271cb..a3ffe0d129e 100644
--- a/bundle-uri.c
+++ b/bundle-uri.c
@@ -1,6 +1,7 @@
 #include "cache.h"
 #include "bundle-uri.h"
 #include "bundle.h"
+#include "config.h"
 #include "object-store.h"
 #include "refs.h"
 #include "run-command.h"
@@ -152,6 +153,12 @@ int fetch_bundle_uri(struct repository *r, const char *uri)
 	if ((result = unbundle_from_file(r, filename.buf)))
 		goto cleanup;
 
+	git_config_set_multivar_gently("log.excludedecoration",
+					"refs/bundle/",
+					"refs/bundle/",
+					CONFIG_FLAGS_FIXED_VALUE |
+					CONFIG_FLAGS_MULTI_REPLACE);
+
 cleanup:
 	unlink(filename.buf);
 	strbuf_release(&filename);
diff --git a/t/t5558-fetch-bundle-uri.sh b/t/t5558-fetch-bundle-uri.sh
index 919db6f4551..563df6de5e3 100755
--- a/t/t5558-fetch-bundle-uri.sh
+++ b/t/t5558-fetch-bundle-uri.sh
@@ -28,7 +28,9 @@ test_expect_success 'fetch file bundle' '
 	git -C fetch-to fetch --bundle-uri="$(pwd)/fetch-from/B.bundle" &&
 	git -C fetch-to rev-parse refs/bundles/topic >actual &&
 	git -C fetch-from rev-parse topic >expect &&
-	test_cmp expect actual
+	test_cmp expect actual &&
+
+	test_config log.excludedecoration refs/bundle/
 '
 
 test_expect_success 'fetch file:// bundle' '
@@ -36,7 +38,9 @@ test_expect_success 'fetch file:// bundle' '
 	git -C fetch-file fetch --bundle-uri="file://$(pwd)/fetch-from/B.bundle" &&
 	git -C fetch-file rev-parse refs/bundles/topic >actual &&
 	git -C fetch-from rev-parse topic >expect &&
-	test_cmp expect actual
+	test_cmp expect actual &&
+
+	test_config log.excludedecoration refs/bundle/
 '
 
 #########################################################################
@@ -62,7 +66,9 @@ test_expect_success 'fetch HTTP bundle' '
 	git -C fetch-http fetch --bundle-uri="$HTTPD_URL/B.bundle" &&
 	git -C fetch-http rev-parse refs/bundles/topic >actual &&
 	git -C fetch-from rev-parse topic >expect &&
-	test_cmp expect actual
+	test_cmp expect actual &&
+
+	test_config log.excludedecoration refs/bundle/
 '
 
 # Do not add tests here unless they use the HTTP server, as they will
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/6] docs: document bundle URI standard
  2022-06-06 19:55 ` [PATCH 1/6] docs: document bundle URI standard Derrick Stolee via GitGitGadget
@ 2022-06-06 22:18   ` Junio C Hamano
  2022-06-08 19:20     ` Derrick Stolee
  2022-06-07  0:33   ` Junio C Hamano
  1 sibling, 1 reply; 64+ messages in thread
From: Junio C Hamano @ 2022-06-06 22:18 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, newren, avarab, dyroneteng, Johannes.Schindelin,
	Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Derrick Stolee <derrickstolee@github.com>
>
> Introduce the idea of bundle URIs to the Git codebase through an
> aspirational design document. This document includes the full design
> intended to include the feature in its fully-implemented form. This will
> take several steps as detailed in the Implementation Plan section.
>
> By committing this document now, it can be used to motivate changes
> necessary to reach these final goals. The design can still be altered as
> new information is discovered.
>
> Signed-off-by: Derrick Stolee <derrickstolee@github.com>
> ---
>  Documentation/technical/bundle-uri.txt | 475 +++++++++++++++++++++++++
>  1 file changed, 475 insertions(+)
>  create mode 100644 Documentation/technical/bundle-uri.txt
>
> diff --git a/Documentation/technical/bundle-uri.txt b/Documentation/technical/bundle-uri.txt
> new file mode 100644
> index 00000000000..6657ba079ab
> --- /dev/null
> +++ b/Documentation/technical/bundle-uri.txt
> @@ -0,0 +1,475 @@
> +Bundle URIs
> +===========
> +
> +Bundle URIs are locations where Git can download one or more bundles in
> +order to bootstrap the object database in advance of fetching the remaining
> +objects from a remote.
> +
> +One goal is to speed up clones and fetches for users with poor network
> +connectivity to the origin server. Another benefit is to allow heavy users,
> +such as CI build farms, to use local resources for the majority of Git data
> +and thereby reducing the load on the origin server.
> +
> +To enable the bundle URI feature, users can specify a bundle URI using
> +command-line options or the origin server can advertise one or more URIs
> +via a protocol v2 capability.
> +
> +Design Goals
> +------------
> +
> +The bundle URI standard aims to be flexible enough to satisfy multiple
> +workloads. The bundle provider and the Git client have several choices in
> +how they create and consume bundle URIs.
> +
> +* Bundles can have whatever name the server desires. This name could refer
> +  to immutable data by using a hash of the bundle contents. However, this
> +  means that a new URI will be needed after every update of the content.
> +  This might be acceptable if the server is advertising the URI (and the
> +  server is aware of new bundles being generated) but would not be
> +  ergonomic for users using the command line option.
> +
> +* The bundles could be organized specifically for bootstrapping full
> +  clones, but could also be organized with the intention of bootstrapping
> +  incremental fetches. The bundle provider must decide on one of several
> +  organization schemes to minimize client downloads during incremental
> +  fetches, but the Git client can also choose whether to use bundles for
> +  either of these operations.
> +
> +* The bundle provider can choose to support full clones, partial clones,
> +  or both. The client can detect which bundles are appropriate for the
> +  repository's partial clone filter, if any.
> +
> +* The bundle provider can use a single bundle (for clones only), or a
> +  list of bundles. When using a list of bundles, the provider can specify
> +  whether or not the client needs _all_ of the bundle URIs for a full
> +  clone, or if _any_ one of the bundle URIs is sufficient. This allows the
> +  bundle provider to use different URIs for different geographies.
> +
> +* The bundle provider can organize the bundles using heuristics, such as
> +  timestamps or tokens, to help the client prevent downloading bundles it
> +  does not need. When the bundle provider does not provide these
> +  heuristics, the client can use optimizations to minimize how much of the
> +  data is downloaded.
> +
> +* The bundle provider does not need to be associated with the Git server.
> +  The client can choose to use the bundle provider without it being
> +  advertised by the Git server.
> +
> +* The client can choose to discover bundle providers that are advertised
> +  by the Git server. This could happen during `git clone`, during
> +  `git fetch`, both, or neither. The user can choose which combination
> +  works best for them.
> +
> +* The client can choose to configure a bundle provider manually at any
> +  time. The client can also choose to specify a bundle provider manually
> +  as a command-line option to `git clone`.
> +
> +Each repository is different and every Git server has different needs.
> +Hopefully the bundle URI feature is flexible enough to satisfy all needs.
> +If not, then the feature can be extended through its versioning mechanism.
> +
> +Server requirements
> +-------------------
> +
> +To provide a server-side implementation of bundle servers, no other parts
> +of the Git protocol are required. This allows server maintainers to use
> +static content solutions such as CDNs in order to serve the bundle files.
> +
> +At the current scope of the bundle URI feature, all URIs are expected to
> +be HTTP(S) URLs where content is downloaded to a local file using a `GET`
> +request to that URL. The server could include authentication requirements
> +to those requests with the aim of triggering the configured credential
> +helper for secure access. (Future extensions could use "file://" URIs or
> +SSH URIs.)
> +
> +Assuming a `200 OK` response from the server, the content at the URL is
> +expected to be of one of two forms:
> +
> +1. Bundle: A Git bundle file of version 2 or higher.
> +
> +2. Bundle List: A plain-text file that is parsable using Git's
> +   config file parser. This file describes one or more bundles that are
> +   accessible from other URIs.
> +
> +Any other data provided by the server is considered erroneous.

How does a client tell which one it got?  Do we register mimetype
with iana to use for these two types of files, and have the HTTP
downloader to use the information?

> +Bundle Lists
> +------------
> +
> +The Git server can advertise bundle URIs using a set of `key=value` pairs.
> +A bundle URI can also serve a plain-text file in the Git config format
> +containing these same `key=value` pairs. In both cases, we consider this
> +to be a _bundle list_. The pairs specify information about the bundles
> +that the client can use to make decisions for which bundles to download
> +and which to ignore.
> +
> +A few keys focus on properties of the list itself.
> +
> +bundle.list.version::
> +	(Required) This value provides a version number for the table of
> +	contents. If a future Git change enables a feature that needs the Git
> +	client to react to a new key in the bundle list file, then this version
> +	will increment. The only current version number is 1, and if any other
> +	value is specified then Git will fail to use this file.
> +
> +bundle.list.mode::
> +	(Required) This value has one of two values: `all` and `any`. When `all`
> +	is specified, then the client should expect to need all of the listed
> +	bundle URIs that match their repository's requirements. When `any` is
> +	specified, then the client should expect that any one of the bundle URIs
> +	that match their repository's requirements will suffice. Typically, the
> +	`any` option is used to list a number of different bundle servers
> +	located in different geographies.

OK. Presumably, if there are two sets of bundles, A and B, that
consist of 3 and 2 bundle files respectively, and either one of
these two sets is sufficient to help the client, then we'd have a
bundle.list of type 'any', with two bundle.<id>.uri, that point at
(sub) bundle.list of type 'all' in which these 3 or 2 bundle files
are contained?  I am just wondering why we need 'all' and 'any', and
at the same time why these two are sufficient for our needs.

> +bundle.list.heuristic::
> +	If this string-valued key exists, then the bundle list is designed to
> +  work well with incremental `git fetch` commands. The heuristic signals
> +  that there are additional keys available for each bundle that help
> +  determine which subset of bundles the client should download.

Funny indentation?

> +The remaining keys include an `<id>` segment which is a server-designated
> +name for each available bundle.
> +
> +bundle.<id>.uri::
> +	(Required) This string value is the URI for downloading bundle `<id>`.
> +	If the URI begins with a protocol (`http://` or `https://`) then the URI
> +	is absolute. Otherwise, the URI is interpreted as relative to the URI
> +	used for the bundle list. If the URI begins with `/`, then that relative
> +	path is relative to the domain name used for the bundle list. (This use
> +	of relative paths is intended to make it easier to distribute a set of
> +	bundles across a large number of servers or CDNs with different domain
> +	names.)

I have no objection to a host-relative URI notation, but is it
something we need to spell out here?  I am mostly interested in
making sure that we do not deviate a practice that is already used
to point at different resource at the same server.  If the way we
specify host-relative is unnecessarily different from the way
existing "internet" programs (say, a web browser) work, that would
be embarrasing, unless there is a very good reason for us to be
different.

> +bundle.<id>.list::
> +	This boolean value indicates whether the client should expect the
> +	content from this URI to be a list (if `true`) or a bundle (if `false`).
> +	This is typically used when `bundle.list.mode` is `any`.

OK, so the type of a (sub) bundle.list can be specified using this
without having the HTTP(s) server annotate the resource with
mimetype when the thing gets actually downloaded.  It still leaves
the issue of bootstrapping the system.  If the server advises bundle
URI when the client contacts, presumably that first-contact
bundle.*.uri can be annotated with the bundle.*.list at the same
time, but the model allows the client to learn bundles independently
from the server, and it still is a bit unclear how we tell.  Of
course, we can examine the contents of a file that was obtained from
a bundle URI, a file that parses correctly as a config-like file is
very unlikely to be a valid bundle file, and we need to be prepared
to deal with a corrupt resource downloaded from a bundle URI anyway,
so this may not be a huge deal.

> +bundle.<id>.filter::
> +	This string value represents an object filter that should also appear in
> +	the header of this bundle. The server uses this value to differentiate
> +	different kinds of bundles from which the client can choose those that
> +	match their object filters.

Is it an error to have .filter defined for a bundle URI whose .list
says "true"?  Or does it mean all bundle files that make up the list
share the same object filter?

> +bundle.<id>.timestamp::
> +	This value is the number of seconds since Unix epoch (UTC) that this
> +	bundle was created. This is used as an approximation of a point in time
> +	that the bundle matches the data available at the origin server. This is
> +	used when `bundle.list.heuristic=timestamp`.

Name of this field should be better than 'timestamp'; we should say
timestamp of creation (or last modification if the same name can be
reused).

> +bundle.<id>.requires::
> +	This string value represents the ID of another bundle. When present, the
> +	server is indicating that this bundle contains a thin packfile. If the
> +	client does not have all necessary objects to unbundle this packfile,
> +	then the client can download the bundle with the `requires` ID and try
> +	again. (Note: it may be beneficial to allow the server to specify
> +	multiple `requires` bundles.) This is used when
> +	`bundle.list.heuristic=timestamp`.

So, bundle.list.mode can say 'any', with three <id>s in it, but
bundle.1.requires can point at '2', while bundle.2.requires can
point at '1', and bundle.3.requires can be emtpy, in which case you
can either fetch 1&2 or 3 alone.  Is that the idea?

> +bundle.<id>.location::
> +	This string value advertises a real-world location from where the bundle
> +	URI is served. This can be used to present the user with an option for
> +	which bundle URI to use. This is only valuable when `bundle.list.mode`
> +	is `any`.

I am afraid I do not follow.  Do you mean, by "a real-world
location", we write things like "America/Los_Angeles" and
"Asia/Tokyo" in this field, so people can tell which one is cheaper
to get to?  Do we make any further specification to make it machine
usable in any way (I suspect machines would rather measure the
latency and throughput against bundle.<id>.uri and .location is
meant purely for human consumption)?

> +Here is an example bundle list using the Git config format:
> +
> +```
> +[bundle "list"]
> +	version = 1
> +	mode = all
> +	heuristic = timestamp

In all mode, how does heuristic help?  Doesn't mode=all by
definition require you to grab everything anyway?

> +[bundle "2022-02-09-1644442601-daily"]
> +	uri = https://bundles.fake.com/git/git/2022-02-09-1644442601-daily.bundle

example.com (cf. RFC6761)?

> +	timestamp = 1644442601
> +	requires = 2022-02-02-1643842562
> +
> +[bundle "2022-02-02-1643842562"]
> +	uri = https://bundles.fake.com/git/git/2022-02-02-1643842562.bundle
> +	timestamp = 1643842562
> +
> +[bundle "2022-02-09-1644442631-daily-blobless"]
> +	uri = 2022-02-09-1644442631-daily-blobless.bundle
> +	timestamp = 1644442631
> +	requires = 2022-02-02-1643842568-blobless
> +	filter = blob:none
> +
> +[bundle "2022-02-02-1643842568-blobless"]
> +	uri = /git/git/2022-02-02-1643842568-blobless.bundle
> +	timestamp = 1643842568
> +	filter = blob:none
> +```
> +
> +This example uses `bundle.list.mode=all` as well as the
> +`bundle.<id>.timestamp` heuristic. It also uses the `bundle.<id>.filter`
> +options to present two parallel sets of bundles: one for full clones and
> +another for blobless partial clones.
> +
> +Suppose that this bundle list was found at the URI
> +`https://bundles.fake.com/git/git/` and so the two blobless bundles have
> +the following fully-expanded URIs:
> +
> +* `https://bundles.fake.com/git/git/2022-02-09-1644442631-daily-blobless.bundle`
> +* `https://bundles.fake.com/git/git/2022-02-02-1643842568-blobless.bundle`

So,... is the idea that bundle.list.mode=all does *not* mean "you
need to get all of them"?  Rather, you first group bundles with the
same filter, attribute, and then for each group with the same filter,
you'd need to grab all of them?  IOW, if you are interested in a
full clone, you can ignore <id>'s with non-empty bundle.<id>.filter 
and grab all the rest?

If so, then I can see how the design makes sense.  I still do not
know how heuristic kicks in, though.

ANother thing I noticed.  The above scheme makes it impossible to
have <id> that happens to be "list".  I think the variables that
apply to the entire list should be given two-level names, i.e.

      [bundle]
	version = 1
	mode = all
	heuristic = timestamp
      [bundle "2022-02-09-1644442631-daily"]
	uri = ...

> +The client could choose an arbitrary bundle URI as an option _or_ select
> +the URI with lowest latency by some exploratory checks.

Some places may have higher latency but great throughput.

The review for the rest of the document will be left for another
sitting.

Thanks.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/6] docs: document bundle URI standard
  2022-06-06 19:55 ` [PATCH 1/6] docs: document bundle URI standard Derrick Stolee via GitGitGadget
  2022-06-06 22:18   ` Junio C Hamano
@ 2022-06-07  0:33   ` Junio C Hamano
  2022-06-08 19:46     ` Derrick Stolee
  1 sibling, 1 reply; 64+ messages in thread
From: Junio C Hamano @ 2022-06-07  0:33 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, newren, avarab, dyroneteng, Johannes.Schindelin,
	Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +Cloning with Bundle URIs
> +------------------------
> +
> +The primary need for bundle URIs is to speed up clones. The Git client
> +will interact with bundle URIs according to the following flow:
> +
> +1. The user specifies a bundle URI with the `--bundle-uri` command-line
> +   option _or_ the client discovers a bundle list advertised by the
> +   Git server.
> +
> +2. If the downloaded data from a bundle URI is a bundle, then the client
> +   inspects the bundle headers to check that the negative commit OIDs are

Although "negative" would be understandable to pros, the commits
required to unbundle a bundle file are officially called
"prerequisite commits" (cf. "git bundle --help"), so that may be
easier to understand by ordinary readers.

> +   present in the client repository. If some are missing, then the client
> +   delays unbundling until other bundles have been unbundled, making those
> +   OIDs present. When all required OIDs are present, the client unbundles
> +   that data using a refspec. The default refspec is
> +   `+refs/heads/*:refs/bundles/*`, but this can be configured.

The refs/bundles/ appear in the document only here, and it is
unclear why we even want it (I am assuming this is against gc while
"git clone" is still running) or how we are going to retire it, if
ever.  If there are multiple bundle files involved in this "git clone",
to anchor objects that are necessary against "gc", don't we need to use
refs/bundles/<i>/* or something like that, where <i> is uniquely assigned
number locally?

> +3. If the file is instead a bundle list, then the client inspects the
> +   `bundle.list.mode` to see if the list is of the `all` or `any` form.

If the downloaded file is not a bundle (e.g. "git bundle list-heads"
barfs on it) and it is not parseable with our configuration parser,
do we error out, or do we pretend as if that bundle file or the TOC
did not exist (if the bundle list with mode=any at the higher level
has appropriate alternatives)?

> +   a. If `bundle.list.mode=all`, then the client considers all bundle
> +      URIs. The list is reduced based on the `bundle.<id>.filter` options
> +      matching the client repository's partial clone filter.

OK, this answers my earlier question nicely.  It probably means that
either the presentation order needs a bit of rethinking, or "we
group by .filter" needs to be mentioned a lot earlier.

> Then, all
> +      bundle URIs are requested. If the `bundle.<id>.timestamp` heuristic
> +      is provided, then the bundles are downloaded in reverse-
> +      chronological order, stopping when a bundle has all required OIDs.

Stop as soon as just one bundle has all the prerequisite objects, or
should we keep going until all bundles have their prerequisites
satisfied?  I presume it is the latter.

> +      The bundles can then be unbundled in chronological order. The client
> +      stores the latest timestamp as a heuristic for avoiding future
> +      downloads if the bundle list does not advertise newer bundles.

So we see a list, we start grabbing from new to old.  Newer ones
that are based on older ones may have dependencies, so we do not
unbndle until we have all the prerequisites for them.  The bundles
that satisfy their prerequisites are unbundled---that would give us
enough objects to play with.  What happens to the refs recorded in
them, though?

Is the timestamp per the serving host, or per the CDN host that
serve us bundle files, or...?  I guess it is premature to discuss it
here. "git clone" bootstraps from the advertisement made only by a
single serving host, so the single newest timestamp among the
bundles used from the bundle list is what we store here.  How that
timestamp is used is primarily of interest in future fetching, which
would be discussed later.

> +Fetching with Bundle URIs
> +-------------------------
> +
> +When the client fetches new data, it can decide to fetch from bundle
> +servers before fetching from the origin remote. This could be done via a
> +command-line option, but it is more likely useful to use a config value
> +such as the one specified during the clone.
> +
> +The fetch operation follows the same procedure to download bundles from a
> +bundle list (although we do _not_ want to use parallel downloads here). We
> +expect that the process will end when all negative commit OIDs in a thin
> +bundle are already in the object database.

I do not see why we do not want to use parallel download, though.
If our last bundle download was last month, and they have two newer
bundles since then, don't we want to grab both at the same time?
Wasn't that the point of recording the newest timestamp when "git
clone" grabbed bundles?

> +Error Conditions
> +----------------
> +
> +If the Git client discovers something unexpected while downloading
> +information according to a bundle URI or the bundle list found at that
> +location, then Git can ignore that data and continue as if it was not
> +given a bundle URI. The remote Git server is the ultimate source of truth,
> +not the bundle URI.
> +
> +Here are a few example error conditions:
> +
> +* The client fails to connect with a server at the given URI or a connection
> +  is lost without any chance to recover.
> +
> +* The client receives a response other than `200 OK` (such as `404 Not Found`,
> +  `401 Not Authorized`, or `500 Internal Server Error`). The client should
> +  use the `credential.helper` to attempt authentication after the first
> +  `401 Not Authorized` response, but a second such response is a failure.
> +
> +* The client receives data that is not parsable as a bundle or table of
> +  contents.

Is it an error if bundle.<id>.list and the contents disagree?

It is fine to call the possibility other than "a bundle file" "table
of contents", but then let's do so consistently throughout the document.
When we explain bundle.<id>.list, we should not call the other
possibility "list" but "table of contents", for example.

> +* The bundle list describes a directed cycle in the
> +  `bundle.<id>.requires` links.
> +
> +* A bundle includes a filter that does not match expectations.

Does this refer to a mismatch between the filter recorded in a
bundle and bundle.<id>.filter entry that described the bundle?

> +* The client cannot unbundle the bundles because the negative commit OIDs
> +  are not in the object database and there are no more
> +  `bundle.<id>.requires` links to follow.

Is a .requires link mandatory?  In a mode=all table of contents, we
should not have to have .requires at all.  In the above description
on how bundle files are downloaded and in what order in Clone and
Fetch operations, I didn't see any mention of .requires at all, but
I think there should be.  For example, the timestamp heuristics may
say the bundle A is the latest.  In a mode=any table of contents,
shouldn't bundles that contain prerequisite commits of the bundle A
be pointed by A's .requires fields?

> +4. Allow the client to understand the `bundle.list.forFetch` configuration
> +   and the `bundle.<id>.timestamp` heuristic. When `git clone` discovers a
> +   bundle URI with `bundle.list.forFetch=true`, it configures the client
> +   repository to check that bundle URI during later `git fetch <remote>`
> +   commands.

So bundle.list.forFetch is, unlike everything else we saw that
looked like a configuration variable in this document, a
configuration variable whose value is boolean?

Ah, no.  You mean the "git clone" sees a bundle URI, grabs it and
sees a table of contents, and in it, finds "bundle.forFetch" is set
to true?  Then "git fetch <remote>" is configured to also use bundle
URI?

It is unclear to me (with the information given here so far), why we
want this.  Isn't this something the responder to "git fetch" can
advertise over the wire?  If we leave a permanent record in the
resulting repository to do the bundle URI during 'fetch', wouldn't
it become more cumbersome to cancel (iow "no, you no longer want to
talk to me with bundle URI feature") from the server side?

> +5. Allow clients to discover bundle URIs during `git fetch` and configure
> +   a bundle URI for later fetches if `bundle.list.forFetch=true`.
> +
> +6. Implement the "inspect headers" heuristic to reduce data downloads when
> +   the `bundle.<id>.timestamp` heuristic is not available.

Sounds sensible, even though I do not offhand see why the "peek
header and stop" is any less useful when the timestamp heurisitc is
available.

> +A major downside to this mechanism is that the origin server needs to know
> +_exactly_ what is in those packfiles, and the packfiles need to be available
> +to the user for some time after the server has responded. This coupling
> +between the origin and the packfile data is difficult to manage.

Hmph.  I strongly suspect that there are Googlers on the list who
have been managing such JGit server installations.  Has this
"coupling" been difficult to manage for you guys in the real world?

> +Further, this implementation is extremely hard to make work with fetches.

IOW, they do this only for clones and not fetches?

> +Related Work: GVFS Cache Servers
> +--------------------------------
> ...
> +During a `git fetch`, a hook requests the prefetch endpoint using the
> +most-recent timestamp from a previously-downloaded prefetch packfile.
> +Only the list of packfiles with later timestamps are downloaded.

That sounds quite straight-forward.  Do you envision that their
incremental snapshot packfile chains can somehow be shared with the
bundle URI implementations?  Doesn't it make it more cumbersome that
this proposal uses the bundles as the encapsulation format, rather
than packfiles?  As you are sending extra pieces of information on
top of the payload in the form of table-of-contents already, I
wonder if bundle.<id>.uri should point at a bare packfile (instead
of a bundle), while multi-valued bundle.<id>.prerequisite give the
prerequisite objects?  The machinery that is already generating the
prefetch packfiles already know which packfile has what
prerequisites in it, so it rather looks simpler if the solution did
not involve bundles.

Thanks.  Here ends my two-part review on the doc.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/6] docs: document bundle URI standard
  2022-06-06 22:18   ` Junio C Hamano
@ 2022-06-08 19:20     ` Derrick Stolee
  2022-06-08 19:27       ` Junio C Hamano
  2022-06-08 20:39       ` Junio C Hamano
  0 siblings, 2 replies; 64+ messages in thread
From: Derrick Stolee @ 2022-06-08 19:20 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, me, newren, avarab, dyroneteng, Johannes.Schindelin

On 6/6/2022 6:18 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

>> +Assuming a `200 OK` response from the server, the content at the URL is
>> +expected to be of one of two forms:
>> +
>> +1. Bundle: A Git bundle file of version 2 or higher.
>> +
>> +2. Bundle List: A plain-text file that is parsable using Git's
>> +   config file parser. This file describes one or more bundles that are
>> +   accessible from other URIs.
>> +
>> +Any other data provided by the server is considered erroneous.
> 
> How does a client tell which one it got?  Do we register mimetype
> with iana to use for these two types of files, and have the HTTP
> downloader to use the information?

My implementation is much dumber than that: it first attempts to
parse the file as a bundle (looking for a bundle header) and then
attempts to parse it as a config file after that. If neither
succeed, then an error is thrown.

>> +bundle.list.mode::
>> +	(Required) This value has one of two values: `all` and `any`. When `all`
>> +	is specified, then the client should expect to need all of the listed
>> +	bundle URIs that match their repository's requirements. When `any` is
>> +	specified, then the client should expect that any one of the bundle URIs
>> +	that match their repository's requirements will suffice. Typically, the
>> +	`any` option is used to list a number of different bundle servers
>> +	located in different geographies.
> 
> OK. Presumably, if there are two sets of bundles, A and B, that
> consist of 3 and 2 bundle files respectively, and either one of
> these two sets is sufficient to help the client, then we'd have a
> bundle.list of type 'any', with two bundle.<id>.uri, that point at
> (sub) bundle.list of type 'all' in which these 3 or 2 bundle files
> are contained?  I am just wondering why we need 'all' and 'any', and
> at the same time why these two are sufficient for our needs.

Necessary: The origin Git server may want to advertise a list of
geo-distributed bundle servers, but not need to know the exact list
of bundles at each of those locations. The client can choose from
"any" advertised bundle URI, then download a bundle list from that
URI and download "all" bundles it advertised.

Sufficient: I can see a few different ways that we could want to
have something in-between "any" and "all" and that is: "The bundles
break into 'buckets', so pick any bucket and get all within that
bucket." This is already planned as part of the bundle.<id>.filter,
which creates different 'buckets'. So, if we need a partitioning
like this, then we can rely on functionality-based partitions.

Finally, we can always extend this in the future. If we want to
add a new mode "use-key-X" or something, then bundle providers
could start using it, but knowing that older clients would not
understand it and would lose the ability to use their bundles. This
is only a performance issue, not a correctness issue.

I'm definitely trying to minimize the need for these kinds of
extensions while keeping the spec small enough to implement in a
reasonable amount of time.
 
>> +bundle.list.heuristic::
>> +	If this string-valued key exists, then the bundle list is designed to
>> +  work well with incremental `git fetch` commands. The heuristic signals
>> +  that there are additional keys available for each bundle that help
>> +  determine which subset of bundles the client should download.
> 
> Funny indentation?

Thanks. Editor confusion with .txt files, apparently. Fixed now.

>> +The remaining keys include an `<id>` segment which is a server-designated
>> +name for each available bundle.
>> +
>> +bundle.<id>.uri::
>> +	(Required) This string value is the URI for downloading bundle `<id>`.
>> +	If the URI begins with a protocol (`http://` or `https://`) then the URI
>> +	is absolute. Otherwise, the URI is interpreted as relative to the URI
>> +	used for the bundle list. If the URI begins with `/`, then that relative
>> +	path is relative to the domain name used for the bundle list. (This use
>> +	of relative paths is intended to make it easier to distribute a set of
>> +	bundles across a large number of servers or CDNs with different domain
>> +	names.)
> 
> I have no objection to a host-relative URI notation, but is it
> something we need to spell out here?  I am mostly interested in
> making sure that we do not deviate a practice that is already used
> to point at different resource at the same server.  If the way we
> specify host-relative is unnecessarily different from the way
> existing "internet" programs (say, a web browser) work, that would
> be embarrasing, unless there is a very good reason for us to be
> different.

Since it requires careful implementation, I thought the detail along
with the justification would fit in this technical document. I can
avoid including that when describing the config options inside the
user-facing docs.

>> +bundle.<id>.list::
>> +	This boolean value indicates whether the client should expect the
>> +	content from this URI to be a list (if `true`) or a bundle (if `false`).
>> +	This is typically used when `bundle.list.mode` is `any`.
> 
> OK, so the type of a (sub) bundle.list can be specified using this
> without having the HTTP(s) server annotate the resource with
> mimetype when the thing gets actually downloaded.  It still leaves
> the issue of bootstrapping the system.  If the server advises bundle
> URI when the client contacts, presumably that first-contact
> bundle.*.uri can be annotated with the bundle.*.list at the same
> time, but the model allows the client to learn bundles independently
> from the server, and it still is a bit unclear how we tell.  Of
> course, we can examine the contents of a file that was obtained from
> a bundle URI, a file that parses correctly as a config-like file is
> very unlikely to be a valid bundle file, and we need to be prepared
> to deal with a corrupt resource downloaded from a bundle URI anyway,
> so this may not be a huge deal.

Right. We can inspect the file with our existing tools to see if they
fit the format. It might be worth doing some fuzz testing on these
parsers to be sure there isn't a surprising way to trick them into
doing strange things.

>> +bundle.<id>.filter::
>> +	This string value represents an object filter that should also appear in
>> +	the header of this bundle. The server uses this value to differentiate
>> +	different kinds of bundles from which the client can choose those that
>> +	match their object filters.
> 
> Is it an error to have .filter defined for a bundle URI whose .list
> says "true"?  Or does it mean all bundle files that make up the list
> share the same object filter?

While this would not be the typical situation, a bundle provider could
choose to combine these and the client would expect a list where all of
the .filter values match the one here. Of course, it would not be a
_failure_ if that wasn't true, but the client would ignore any bundles
it finds where .filter doesn't match.

>> +bundle.<id>.timestamp::
>> +	This value is the number of seconds since Unix epoch (UTC) that this
>> +	bundle was created. This is used as an approximation of a point in time
>> +	that the bundle matches the data available at the origin server. This is
>> +	used when `bundle.list.heuristic=timestamp`.
> 
> Name of this field should be better than 'timestamp'; we should say
> timestamp of creation (or last modification if the same name can be
> reused).

How about creationToken? That communicates that we don't really care
what the number is as long as it comes from an increasing sequence
controlled by the bundle provider.

>> +bundle.<id>.requires::
>> +	This string value represents the ID of another bundle. When present, the
>> +	server is indicating that this bundle contains a thin packfile. If the
>> +	client does not have all necessary objects to unbundle this packfile,
>> +	then the client can download the bundle with the `requires` ID and try
>> +	again. (Note: it may be beneficial to allow the server to specify
>> +	multiple `requires` bundles.) This is used when
>> +	`bundle.list.heuristic=timestamp`.
> 
> So, bundle.list.mode can say 'any', with three <id>s in it, but
> bundle.1.requires can point at '2', while bundle.2.requires can
> point at '1', and bundle.3.requires can be emtpy, in which case you
> can either fetch 1&2 or 3 alone.  Is that the idea?

The idea is that if I download bundle '1' and I can't unbundle it
(because I'm missing some required refs), then I can look at bundle.1.requires
to get any missing refs. If that is '2', then I download that. It then
continues in a chain.

The 'any' means "start anywhere", but I also don't expect a provider to use
.requires without the (maybe-to-be-renamed) timestamp heuristic. We could also
make that be a hard-coded statement: ".requires will be ignored unless mode=all
and heuristic=X"

>> +bundle.<id>.location::
>> +	This string value advertises a real-world location from where the bundle
>> +	URI is served. This can be used to present the user with an option for
>> +	which bundle URI to use. This is only valuable when `bundle.list.mode`
>> +	is `any`.
> 
> I am afraid I do not follow.  Do you mean, by "a real-world
> location", we write things like "America/Los_Angeles" and
> "Asia/Tokyo" in this field, so people can tell which one is cheaper
> to get to?  Do we make any further specification to make it machine
> usable in any way (I suspect machines would rather measure the
> latency and throughput against bundle.<id>.uri and .location is
> meant purely for human consumption)?

The intention is to be human-readable, for a user-facing prompt.
This could be for an interactive "chooser" or just letting the user
know "this is the location of the bundle URI I picked".

If we want the computer to automatically select, then using ping
latency would be a better way forward. Even in that case, it would
be helpful to tell the user "I discovered the closest bundle URI
is <location>".

>> +Here is an example bundle list using the Git config format:
>> +
>> +```
>> +[bundle "list"]
>> +	version = 1
>> +	mode = all
>> +	heuristic = timestamp
> 
> In all mode, how does heuristic help?  Doesn't mode=all by
> definition require you to grab everything anyway?

The heuristic is for incremental fetches, when you already have
some Git object data locally and don't want to download every
single bundle if you don't need to. (I think I have a step-by-step
flow of this lower in the doc.)

>> +[bundle "2022-02-09-1644442601-daily"]
>> +	uri = https://bundles.fake.com/git/git/2022-02-09-1644442601-daily.bundle
> 
> example.com (cf. RFC6761)?

Sure. Thanks.

>> +	timestamp = 1644442601
>> +	requires = 2022-02-02-1643842562
>> +
>> +[bundle "2022-02-02-1643842562"]
>> +	uri = https://bundles.fake.com/git/git/2022-02-02-1643842562.bundle
>> +	timestamp = 1643842562
>> +
>> +[bundle "2022-02-09-1644442631-daily-blobless"]
>> +	uri = 2022-02-09-1644442631-daily-blobless.bundle
>> +	timestamp = 1644442631
>> +	requires = 2022-02-02-1643842568-blobless
>> +	filter = blob:none
>> +
>> +[bundle "2022-02-02-1643842568-blobless"]
>> +	uri = /git/git/2022-02-02-1643842568-blobless.bundle
>> +	timestamp = 1643842568
>> +	filter = blob:none
>> +```
>> +
>> +This example uses `bundle.list.mode=all` as well as the
>> +`bundle.<id>.timestamp` heuristic. It also uses the `bundle.<id>.filter`
>> +options to present two parallel sets of bundles: one for full clones and
>> +another for blobless partial clones.
>> +
>> +Suppose that this bundle list was found at the URI
>> +`https://bundles.fake.com/git/git/` and so the two blobless bundles have
>> +the following fully-expanded URIs:
>> +
>> +* `https://bundles.fake.com/git/git/2022-02-09-1644442631-daily-blobless.bundle`
>> +* `https://bundles.fake.com/git/git/2022-02-02-1643842568-blobless.bundle`
> 
> So,... is the idea that bundle.list.mode=all does *not* mean "you
> need to get all of them"?  Rather, you first group bundles with the
> same filter, attribute, and then for each group with the same filter,
> you'd need to grab all of them?  IOW, if you are interested in a
> full clone, you can ignore <id>'s with non-empty bundle.<id>.filter 
> and grab all the rest?
> 
> If so, then I can see how the design makes sense.  I still do not
> know how heuristic kicks in, though.
> 
> ANother thing I noticed.  The above scheme makes it impossible to
> have <id> that happens to be "list".  I think the variables that
> apply to the entire list should be given two-level names, i.e.
> 
>       [bundle]
> 	version = 1
> 	mode = all
> 	heuristic = timestamp
>       [bundle "2022-02-09-1644442631-daily"]
> 	uri = ...

This then means that <id> can't be "version", "mode", or "heuristic",
or any other possible key that we use in the future, right? Using "list"
helped with this.

Perhaps we could do this:

	[bundles]
		version = 1
		mode = all
		heuristic = timestamp
	[bundle "id1"]
		uri = ...

Notably: "bundles" refers to the full list, while "bundle" refers
to a single bundle at a time. It makes the situation slightly more
complicated from the server side (we reserve bundles.* and bundle.*
for this advertisement).

>> +The client could choose an arbitrary bundle URI as an option _or_ select
>> +the URI with lowest latency by some exploratory checks.
> 
> Some places may have higher latency but great throughput.
> 
> The review for the rest of the document will be left for another
> sitting.

Thanks for your careful reading!
-Stolee

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/6] docs: document bundle URI standard
  2022-06-08 19:20     ` Derrick Stolee
@ 2022-06-08 19:27       ` Junio C Hamano
  2022-06-08 20:44         ` Junio C Hamano
  2022-06-08 20:39       ` Junio C Hamano
  1 sibling, 1 reply; 64+ messages in thread
From: Junio C Hamano @ 2022-06-08 19:27 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, me, newren, avarab,
	dyroneteng, Johannes.Schindelin

Derrick Stolee <derrickstolee@github.com> writes:

>> How does a client tell which one it got?  Do we register mimetype
>> with iana to use for these two types of files, and have the HTTP
>> downloader to use the information?
>
> My implementation is much dumber than that: it first attempts to
> parse the file as a bundle (looking for a bundle header) and then
> attempts to parse it as a config file after that. If neither
> succeed, then an error is thrown.

I think that is probably the best implementation after all.

We cannot trust what the other side tells us.  "They claimed that
this is a bundle file and not a table-of-contents, and it does look
like one, but it may be corrupt or even malicious file that may look
like a bundle file on surface, so let's carefully examine it" ought
to be the attitude the receiving side has towards the incoming data.

Thanks.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/6] docs: document bundle URI standard
  2022-06-07  0:33   ` Junio C Hamano
@ 2022-06-08 19:46     ` Derrick Stolee
  2022-06-08 21:01       ` Junio C Hamano
  2022-06-21 19:34       ` Derrick Stolee
  0 siblings, 2 replies; 64+ messages in thread
From: Derrick Stolee @ 2022-06-08 19:46 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, me, newren, avarab, dyroneteng, Johannes.Schindelin

On 6/6/2022 8:33 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> +Cloning with Bundle URIs
>> +------------------------
>> +
>> +The primary need for bundle URIs is to speed up clones. The Git client
>> +will interact with bundle URIs according to the following flow:
>> +
>> +1. The user specifies a bundle URI with the `--bundle-uri` command-line
>> +   option _or_ the client discovers a bundle list advertised by the
>> +   Git server.
>> +
>> +2. If the downloaded data from a bundle URI is a bundle, then the client
>> +   inspects the bundle headers to check that the negative commit OIDs are
> 
> Although "negative" would be understandable to pros, the commits
> required to unbundle a bundle file are officially called
> "prerequisite commits" (cf. "git bundle --help"), so that may be
> easier to understand by ordinary readers.

Ok. I can work to replace this language throughout.
 
>> +   present in the client repository. If some are missing, then the client
>> +   delays unbundling until other bundles have been unbundled, making those
>> +   OIDs present. When all required OIDs are present, the client unbundles
>> +   that data using a refspec. The default refspec is
>> +   `+refs/heads/*:refs/bundles/*`, but this can be configured.
> 
> The refs/bundles/ appear in the document only here, and it is
> unclear why we even want it (I am assuming this is against gc while
> "git clone" is still running) or how we are going to retire it, if
> ever.  If there are multiple bundle files involved in this "git clone",
> to anchor objects that are necessary against "gc", don't we need to use
> refs/bundles/<i>/* or something like that, where <i> is uniquely assigned
> number locally?

The real reason to keep them in refs/bundles/ is because then those
refs can be used in the incremental 'git fetch' after downloading the
bundles (in perpetuity) while not stomping refs/heads or refs/remotes/

>> +3. If the file is instead a bundle list, then the client inspects the
>> +   `bundle.list.mode` to see if the list is of the `all` or `any` form.
> 
> If the downloaded file is not a bundle (e.g. "git bundle list-heads"
> barfs on it) and it is not parseable with our configuration parser,
> do we error out, or do we pretend as if that bundle file or the TOC
> did not exist (if the bundle list with mode=any at the higher level
> has appropriate alternatives)?

I think the best thing to do would be to fail as gracefully as
possible here. In particular: give a warning that the remote did not
give anything helpful, but continue with the normal fetching from the
remote without help from bundles.

>> +   a. If `bundle.list.mode=all`, then the client considers all bundle
>> +      URIs. The list is reduced based on the `bundle.<id>.filter` options
>> +      matching the client repository's partial clone filter.
> 
> OK, this answers my earlier question nicely.  It probably means that
> either the presentation order needs a bit of rethinking, or "we
> group by .filter" needs to be mentioned a lot earlier.

Ok. It may even need to be _implemented_ a lot earlier, looking at my
current outline of future changes.

>> Then, all
>> +      bundle URIs are requested. If the `bundle.<id>.timestamp` heuristic
>> +      is provided, then the bundles are downloaded in reverse-
>> +      chronological order, stopping when a bundle has all required OIDs.
> 
> Stop as soon as just one bundle has all the prerequisite objects, or
> should we keep going until all bundles have their prerequisites
> satisfied?  I presume it is the latter.

True. We keep going until all have their required OIDs. With a .requires
chain of "1 requires 2" and "2 requires 3", it is possible that "1"
has a prerequisite commit that actually exists in "3", but somehow the
bundle "2" can unbundle in the client's repository.

One could argue that the "1" bundle should have a direct .requires
relationship on "3" (in addition to "2"), but for simplicity I think
it is OK to imply "transitive requires".

Using "all" here should make it clearer.

>> +      The bundles can then be unbundled in chronological order. The client
>> +      stores the latest timestamp as a heuristic for avoiding future
>> +      downloads if the bundle list does not advertise newer bundles.
> 
> So we see a list, we start grabbing from new to old.  Newer ones
> that are based on older ones may have dependencies, so we do not
> unbndle until we have all the prerequisites for them.  The bundles
> that satisfy their prerequisites are unbundled---that would give us
> enough objects to play with.  What happens to the refs recorded in
> them, though?

Those refs are translated into refs/bundles/. If multiple bundles
have the same refs/heads/ included, then the newest ones would
overwite those refs in refs/bundles/.

> Is the timestamp per the serving host, or per the CDN host that
> serve us bundle files, or...?  I guess it is premature to discuss it
> here. "git clone" bootstraps from the advertisement made only by a
> single serving host, so the single newest timestamp among the
> bundles used from the bundle list is what we store here.  How that
> timestamp is used is primarily of interest in future fetching, which
> would be discussed later.

The bundle provider is responsible for making the timestamps make
sense as an opaque increasing token.

>> +Fetching with Bundle URIs
>> +-------------------------
>> +
>> +When the client fetches new data, it can decide to fetch from bundle
>> +servers before fetching from the origin remote. This could be done via a
>> +command-line option, but it is more likely useful to use a config value
>> +such as the one specified during the clone.
>> +
>> +The fetch operation follows the same procedure to download bundles from a
>> +bundle list (although we do _not_ want to use parallel downloads here). We
>> +expect that the process will end when all negative commit OIDs in a thin
>> +bundle are already in the object database.
> 
> I do not see why we do not want to use parallel download, though.
> If our last bundle download was last month, and they have two newer
> bundles since then, don't we want to grab both at the same time?
> Wasn't that the point of recording the newest timestamp when "git
> clone" grabbed bundles?

In theory, we have also been fetching from the origin, so we might have
already received all of the objects in that second (or first!) bundle.
By going in reverse-chronological order, we minimize the amount of data
downloaded (assuming that is the most expensive part of the operation).

This is something where we have room to experiment. With Ævar's idea of
downloading only the headers, we could download all headers in parallel
and halt the download for any bundles where we have all of the ref tips.

>> +Error Conditions
>> +----------------
>> +
>> +If the Git client discovers something unexpected while downloading
>> +information according to a bundle URI or the bundle list found at that
>> +location, then Git can ignore that data and continue as if it was not
>> +given a bundle URI. The remote Git server is the ultimate source of truth,
>> +not the bundle URI.
>> +
>> +Here are a few example error conditions:
>> +
>> +* The client fails to connect with a server at the given URI or a connection
>> +  is lost without any chance to recover.
>> +
>> +* The client receives a response other than `200 OK` (such as `404 Not Found`,
>> +  `401 Not Authorized`, or `500 Internal Server Error`). The client should
>> +  use the `credential.helper` to attempt authentication after the first
>> +  `401 Not Authorized` response, but a second such response is a failure.
>> +
>> +* The client receives data that is not parsable as a bundle or table of
>> +  contents.
> 
> Is it an error if bundle.<id>.list and the contents disagree?

I think this can be flexible. It is not difficult to treat the .list
value as advisory, so we can ignore disagreements. If we start making
decisions that hinge on the value of .list, then we can start treating
it as an error.

Or: maybe the .list value is so "advisory" that it is useless, and I
should just drop it from the schema.

> It is fine to call the possibility other than "a bundle file" "table
> of contents", but then let's do so consistently throughout the document.
> When we explain bundle.<id>.list, we should not call the other
> possibility "list" but "table of contents", for example.

Sorry, I had intended to find-and-replace all of these instances, but
missed some that cross line breaks.

>> +* The bundle list describes a directed cycle in the
>> +  `bundle.<id>.requires` links.
>> +
>> +* A bundle includes a filter that does not match expectations.
> 
> Does this refer to a mismatch between the filter recorded in a
> bundle and bundle.<id>.filter entry that described the bundle?

Yes. I can make that more explicit.

>> +* The client cannot unbundle the bundles because the negative commit OIDs
>> +  are not in the object database and there are no more
>> +  `bundle.<id>.requires` links to follow.
> 
> Is a .requires link mandatory?  In a mode=all table of contents, we
> should not have to have .requires at all.  In the above description
> on how bundle files are downloaded and in what order in Clone and
> Fetch operations, I didn't see any mention of .requires at all, but
> I think there should be.  For example, the timestamp heuristics may
> say the bundle A is the latest.  In a mode=any table of contents,
> shouldn't bundles that contain prerequisite commits of the bundle A
> be pointed by A's .requires fields?

It could be considered useful for the .timestamp heuristic, or we
could infer the .requires value as "the next most-recent bundle" in
that heuristic. (So maybe we remove it from the schema.)

With "all" but no heuristic, then we need everything without knowing
where to "start" and trying to minimize downloads.

>> +4. Allow the client to understand the `bundle.list.forFetch` configuration
>> +   and the `bundle.<id>.timestamp` heuristic. When `git clone` discovers a
>> +   bundle URI with `bundle.list.forFetch=true`, it configures the client
>> +   repository to check that bundle URI during later `git fetch <remote>`
>> +   commands.
> 
> So bundle.list.forFetch is, unlike everything else we saw that
> looked like a configuration variable in this document, a
> configuration variable whose value is boolean?
> 
> Ah, no.  You mean the "git clone" sees a bundle URI, grabs it and
> sees a table of contents, and in it, finds "bundle.forFetch" is set
> to true?  Then "git fetch <remote>" is configured to also use bundle
> URI?

Yes. The bundle provider is advertising "I'm working hard to help you
with your 'git fetches' after your initial clone."

> It is unclear to me (with the information given here so far), why we
> want this.  Isn't this something the responder to "git fetch" can
> advertise over the wire?  If we leave a permanent record in the
> resulting repository to do the bundle URI during 'fetch', wouldn't
> it become more cumbersome to cancel (iow "no, you no longer want to
> talk to me with bundle URI feature") from the server side?

I've tried very hard to make the bundle provider as independent from
the origin Git server as possible (including no relationship at all).
Even if the origin Git server knows about a bundle provider at a
given URI, it does not necessarily know if _at this moment_ the
provider is bundling for fetches.

>> +5. Allow clients to discover bundle URIs during `git fetch` and configure
>> +   a bundle URI for later fetches if `bundle.list.forFetch=true`.
>> +
>> +6. Implement the "inspect headers" heuristic to reduce data downloads when
>> +   the `bundle.<id>.timestamp` heuristic is not available.
> 
> Sounds sensible, even though I do not offhand see why the "peek
> header and stop" is any less useful when the timestamp heurisitc is
> available.

Yes, it is still helpful with the heuristic. I can remove that
phrasing. (This functionality is _required_ to help incremental
fetches when a heuristic is not provided.)

>> +A major downside to this mechanism is that the origin server needs to know
>> +_exactly_ what is in those packfiles, and the packfiles need to be available
>> +to the user for some time after the server has responded. This coupling
>> +between the origin and the packfile data is difficult to manage.
> 
> Hmph.  I strongly suspect that there are Googlers on the list who
> have been managing such JGit server installations.  Has this
> "coupling" been difficult to manage for you guys in the real world?
>
>> +Further, this implementation is extremely hard to make work with fetches.
> 
> IOW, they do this only for clones and not fetches?

The last time I spoke with Googlers about this subject, what I heard
from them was "Yes, this is hard to get right for incremental fetches,
but the benefit for clones is big enough that we are happy with just
that." If advancements have come about since then to work with
incremental fetches, then I'd love to hear about it.
 
>> +Related Work: GVFS Cache Servers
>> +--------------------------------
>> ...
>> +During a `git fetch`, a hook requests the prefetch endpoint using the
>> +most-recent timestamp from a previously-downloaded prefetch packfile.
>> +Only the list of packfiles with later timestamps are downloaded.
> 
> That sounds quite straight-forward.  Do you envision that their
> incremental snapshot packfile chains can somehow be shared with the
> bundle URI implementations?  Doesn't it make it more cumbersome that
> this proposal uses the bundles as the encapsulation format, rather
> than packfiles?  As you are sending extra pieces of information on
> top of the payload in the form of table-of-contents already, I
> wonder if bundle.<id>.uri should point at a bare packfile (instead
> of a bundle), while multi-valued bundle.<id>.prerequisite give the
> prerequisite objects?  The machinery that is already generating the
> prefetch packfiles already know which packfile has what
> prerequisites in it, so it rather looks simpler if the solution did
> not involve bundles.

The prefetch packfiles could be replaced with bundle URIs, if desired.

The reason bundles were not needed for the GVFS protocol was that
all other object data not in those prefetch packfiles is downloaded
via direct requests (one object per request or a batch of objects
requested from a list of OIDs) and not from an incremental fetch. The
VFS for Git clients only talk to the origin server for the ref
advertisement and talk to the cache servers for the objects necessary
to satisfy the client's Git command. (Also: the client pushes directly
to the origin server.)

So in this world, the bundle URIs could be used as a replacement for
downloading these prefetch packfiles (bundles with filter=blob:none)
but the bundled refs become useless to the client.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/6] docs: document bundle URI standard
  2022-06-08 19:20     ` Derrick Stolee
  2022-06-08 19:27       ` Junio C Hamano
@ 2022-06-08 20:39       ` Junio C Hamano
  2022-06-08 20:52         ` Derrick Stolee
  1 sibling, 1 reply; 64+ messages in thread
From: Junio C Hamano @ 2022-06-08 20:39 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, me, newren, avarab,
	dyroneteng, Johannes.Schindelin

Derrick Stolee <derrickstolee@github.com> writes:

> How about creationToken? That communicates that we don't really care
> what the number is as long as it comes from an increasing sequence
> controlled by the bundle provider.

Is it important for the clients that this value is tied to creation
time?  I am trying to come up with a word like "serial" that makes
it clear that the values have no meaning other than that they are
always growing in one direction to suggest as a replacement, but
failing to find a good one.  The word should probably not have
any connotation that they must be consecutive, but "serial" numbers
are usually dense.

>> Another thing I noticed.  The above scheme makes it impossible to
>> have <id> that happens to be "list".  I think the variables that
>> apply to the entire list should be given two-level names, i.e.
>> 
>>       [bundle]
>> 	version = 1
>> 	mode = all
>> 	heuristic = timestamp
>>       [bundle "2022-02-09-1644442631-daily"]
>> 	uri = ...
>
> This then means that <id> can't be "version", "mode", or "heuristic",
> or any other possible key that we use in the future, right?

Left ;-).

Two-level variable names and three-level variable names live
completely in a separate namespace (there is no D/F conflict).

    [bundle]
        version = 1
    [bundle "version"]
        url = ...
        mode = ...

is perfectly legit.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/6] docs: document bundle URI standard
  2022-06-08 19:27       ` Junio C Hamano
@ 2022-06-08 20:44         ` Junio C Hamano
  0 siblings, 0 replies; 64+ messages in thread
From: Junio C Hamano @ 2022-06-08 20:44 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, me, newren, avarab,
	dyroneteng, Johannes.Schindelin

Junio C Hamano <gitster@pobox.com> writes:

> Derrick Stolee <derrickstolee@github.com> writes:
>
>>> How does a client tell which one it got?  Do we register mimetype
>>> with iana to use for these two types of files, and have the HTTP
>>> downloader to use the information?
>>
>> My implementation is much dumber than that: it first attempts to
>> parse the file as a bundle (looking for a bundle header) and then
>> attempts to parse it as a config file after that. If neither
>> succeed, then an error is thrown.
>
> I think that is probably the best implementation after all.
>
> We cannot trust what the other side tells us.  "They claimed that
> this is a bundle file and not a table-of-contents, and it does look
> like one, but it may be corrupt or even malicious file that may look
> like a bundle file on surface, so let's carefully examine it" ought
> to be the attitude the receiving side has towards the incoming data.

With the above, I do not mean that this new mechanism must be more
paranoia than we already are.  

    $ git fetch bootstrap.bndl refs/*:refs/bundle/bootstrap/*

should already have sensible error checking, and we should use the
available mechanism.  But there of course are places the new feature
should be careful in its new code, for example, we may want to
unbundle all these bundles in quarantined area until we resolve all
the prerequisite objects and then move them out of the quarantine,
for example, if the new feature rolls its own code to unbundle
instead of invoking "git fetch" on it.  Even if it spawns "git fetch"
on it, it may have to choose the parameters carefully (e.g. the refmap
would want to avoid clobbering our own ref namespace, which you plan
to do).



^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/6] docs: document bundle URI standard
  2022-06-08 20:39       ` Junio C Hamano
@ 2022-06-08 20:52         ` Derrick Stolee
  0 siblings, 0 replies; 64+ messages in thread
From: Derrick Stolee @ 2022-06-08 20:52 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Derrick Stolee via GitGitGadget, git, me, newren, avarab,
	dyroneteng, Johannes.Schindelin

On 6/8/2022 4:39 PM, Junio C Hamano wrote:
> Derrick Stolee <derrickstolee@github.com> writes:
> 
>> How about creationToken? That communicates that we don't really care
>> what the number is as long as it comes from an increasing sequence
>> controlled by the bundle provider.
> 
> Is it important for the clients that this value is tied to creation
> time?  I am trying to come up with a word like "serial" that makes
> it clear that the values have no meaning other than that they are
> always growing in one direction to suggest as a replacement, but
> failing to find a good one.  The word should probably not have
> any connotation that they must be consecutive, but "serial" numbers
> are usually dense.
> 
>>> Another thing I noticed.  The above scheme makes it impossible to
>>> have <id> that happens to be "list".  I think the variables that
>>> apply to the entire list should be given two-level names, i.e.
>>>
>>>       [bundle]
>>> 	version = 1
>>> 	mode = all
>>> 	heuristic = timestamp
>>>       [bundle "2022-02-09-1644442631-daily"]
>>> 	uri = ...
>>
>> This then means that <id> can't be "version", "mode", or "heuristic",
>> or any other possible key that we use in the future, right?
> 
> Left ;-).
> 
> Two-level variable names and three-level variable names live
> completely in a separate namespace (there is no D/F conflict).
> 
>     [bundle]
>         version = 1
>     [bundle "version"]
>         url = ...
>         mode = ...
> 
> is perfectly legit.

Then I stand corrected. For some reason I thought this would
cause a problem, but I must have messed something _else_ up in
the process of testing it.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/6] docs: document bundle URI standard
  2022-06-08 19:46     ` Derrick Stolee
@ 2022-06-08 21:01       ` Junio C Hamano
  2022-06-09 16:00         ` Derrick Stolee
  2022-06-21 19:34       ` Derrick Stolee
  1 sibling, 1 reply; 64+ messages in thread
From: Junio C Hamano @ 2022-06-08 21:01 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, me, newren, avarab,
	dyroneteng, Johannes.Schindelin

Derrick Stolee <derrickstolee@github.com> writes:

>> Is it an error if bundle.<id>.list and the contents disagree?
>
> I think this can be flexible. It is not difficult to treat the .list
> value as advisory, so we can ignore disagreements. If we start making
> decisions that hinge on the value of .list, then we can start treating
> it as an error.
>
> Or: maybe the .list value is so "advisory" that it is useless, and I
> should just drop it from the schema.

I was wondering about the same thing ;-)

>>> +Related Work: GVFS Cache Servers
>>> +--------------------------------
>>> ...
>>> +During a `git fetch`, a hook requests the prefetch endpoint using the
>>> +most-recent timestamp from a previously-downloaded prefetch packfile.
>>> +Only the list of packfiles with later timestamps are downloaded.
>> 
>> That sounds quite straight-forward.  Do you envision that their
>> incremental snapshot packfile chains can somehow be shared with the
>> bundle URI implementations?  Doesn't it make it more cumbersome that
>> this proposal uses the bundles as the encapsulation format, rather
>> than packfiles?  As you are sending extra pieces of information on
>> top of the payload in the form of table-of-contents already, I
>> wonder if bundle.<id>.uri should point at a bare packfile (instead
>> of a bundle), while multi-valued bundle.<id>.prerequisite give the
>> prerequisite objects?  The machinery that is already generating the
>> prefetch packfiles already know which packfile has what
>> prerequisites in it, so it rather looks simpler if the solution did
>> not involve bundles.
>
> The prefetch packfiles could be replaced with bundle URIs, if desired.
> ...
> So in this world, the bundle URIs could be used as a replacement for
> downloading these prefetch packfiles (bundles with filter=blob:none)
> but the bundled refs become useless to the client.

That's all understandable, but what I was alluding to was to go in
the other direction.  Since "bundle URI" thing is new, while the
GVFS Cache Servers already use these prefetch packfiles, it could be
beneficial if the new thing can be done without bundle files and
instead with packfiles.  You are already generating these snapshot
packfiles for GVFS Cache Servers.  So if we can reuse them to also
serve "git clone" and "git fetch" clients, we can do so without
doubling the disk footprint.

Even if you scrapped the "bundle URI" and rebuilt it as the
"packfile URI" mechanism, the only change you need is to make
positive and negative refs, which were available in bundle files but
not stored in packfiles, available as a part of the metadata for
each packfile, no?  You'd be keeping track of associated metadata
(like the .timestamp and .requires fields) in addition to what is in
the bundle anyway, so...

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/6] docs: document bundle URI standard
  2022-06-08 21:01       ` Junio C Hamano
@ 2022-06-09 16:00         ` Derrick Stolee
  2022-06-09 17:56           ` Junio C Hamano
  0 siblings, 1 reply; 64+ messages in thread
From: Derrick Stolee @ 2022-06-09 16:00 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Derrick Stolee via GitGitGadget, git, me, newren, avarab,
	dyroneteng, Johannes.Schindelin

On 6/8/2022 5:01 PM, Junio C Hamano wrote:
> Derrick Stolee <derrickstolee@github.com> writes:
>>> That sounds quite straight-forward.  Do you envision that their
>>> incremental snapshot packfile chains can somehow be shared with the
>>> bundle URI implementations?  Doesn't it make it more cumbersome that
>>> this proposal uses the bundles as the encapsulation format, rather
>>> than packfiles?  As you are sending extra pieces of information on
>>> top of the payload in the form of table-of-contents already, I
>>> wonder if bundle.<id>.uri should point at a bare packfile (instead
>>> of a bundle), while multi-valued bundle.<id>.prerequisite give the
>>> prerequisite objects?  The machinery that is already generating the
>>> prefetch packfiles already know which packfile has what
>>> prerequisites in it, so it rather looks simpler if the solution did
>>> not involve bundles.
>>
>> The prefetch packfiles could be replaced with bundle URIs, if desired.
>> ...
>> So in this world, the bundle URIs could be used as a replacement for
>> downloading these prefetch packfiles (bundles with filter=blob:none)
>> but the bundled refs become useless to the client.
> 
> That's all understandable, but what I was alluding to was to go in
> the other direction.  Since "bundle URI" thing is new, while the
> GVFS Cache Servers already use these prefetch packfiles, it could be
> beneficial if the new thing can be done without bundle files and
> instead with packfiles.  You are already generating these snapshot
> packfiles for GVFS Cache Servers.  So if we can reuse them to also
> serve "git clone" and "git fetch" clients, we can do so without
> doubling the disk footprint.

Now I'm confused as to what you are trying to say, so let me back
up and start from the beginning. Hopefully, that brings clarity so
we can get to the root of my confusion.

The GVFS Cache Servers started as a way to have low-latency per-object
downloads to satisfy the filesystem virtualization feature of the
clients. This initially was going to be the _only_ way clients got
objects until we realized that commit and tree "misses" are very
expensive.

So, the "prefetch packfile" system was developed to use timestamp-
based packs that contain commits and trees. Clients would provide
their latest timestamp and the servers would provide the list of
packfiles to download.

Because the GVFS Protocol still has the "download objects on-demand"
feature, any objects that were needed that were not already in those
prefetch packfiles (including recently-pushed commits and trees)
could be downloaded by the clients on-demand.

This has been successful in production, and in particular is helpful
that cache servers can be maintained completely independently of the
origin Git server. There is some configuration to allow the origin
server to advertise the list of cache servers via the <url>/gvfs/config
REST API, but otherwise they are completely independent.

For years, I've been interested in bringing this kind of functionality
to Git proper, but struggled on multiple fronts:

1. The independence of the cache servers could not use packfile-URIs.

2. The way packfile-URIs happens _within_ a fetch negotiation makes it
   hard to integrate even if we didn't have this independence.

3. If the Git client directly downloaded these packfiles from the
   cache server, then how does it get the remaining objects from the
   origin server?

Ævar's observation that bundles also add ref tips to the packfile is
the key to breaking down this concern: these ref tips give us a way
to negotiate the difference between what the client already has
(including the bundles downloaded from a bundle provider) and what it
wants from the origin Git server. This all happens without any change
necessary to the origin Git server.

And thus, this bundle URI design came about. It takes all of the best
things about the GVFS Cache Server but then layers refs on top of the
time-based prefetch packfiles so a normal Git client can do that
"catch-up fetch" afterwards.

This motivated my "could we use the new bundle URI feature in the
old GVFS Cache Server environment?" comment:

I could imagine updating GVFS Cache Servers to generate bundles
instead (or also) and updating the VFS for Git clients to use the
bundle URI feature to download the data. However, for the sake of not
overloading the origin server with those incremental fetches, we would
probably keep the "only download missing objects on-demand" feature
in that environment. (Hence, the refs are useless to those clients.)

However, you seem to be hinting at "the GVFS Cache Servers seem to
work just fine, so why do we need bundles?" but I think that the
constraints of what is expected at the end of "git clone" or "git
fetch" require us to not "catch up later" and instead complete the
full download during the process. The refs in the bundles are critical
to making that work.
 
> Even if you scrapped the "bundle URI" and rebuilt it as the
> "packfile URI" mechanism, the only change you need is to make
> positive and negative refs, which were available in bundle files but
> not stored in packfiles, available as a part of the metadata for
> each packfile, no?  You'd be keeping track of associated metadata
> (like the .timestamp and .requires fields) in addition to what is in
> the bundle anyway, so...

From this comment, it seems you are suggesting that we augment the
packfile data being served by the packfile-URI feature in order to
include those positive/negative refs (as well as other metadata that
is included in the bundle URI design).

I see two major issues with that:

1. We don't have a way to add that metadata directly into packfiles,
   so we would need to update the file standard or update the
   packfile-URI protocol to include that metadata.

2. The only source of packfile-URI listings come as a response to the
   "git fetch" request to the origin Git server, so there is no way
   to allow an independent server to provide that data.

I hope I am going in the right direction here, but I likely
misunderstood some of your proposed alternatives.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/6] docs: document bundle URI standard
  2022-06-09 16:00         ` Derrick Stolee
@ 2022-06-09 17:56           ` Junio C Hamano
  2022-06-09 18:27             ` Ævar Arnfjörð Bjarmason
  2022-06-09 19:39             ` Derrick Stolee
  0 siblings, 2 replies; 64+ messages in thread
From: Junio C Hamano @ 2022-06-09 17:56 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, me, newren, avarab,
	dyroneteng, Johannes.Schindelin

Derrick Stolee <derrickstolee@github.com> writes:

> Ævar's observation that bundles also add ref tips to the packfile is
> the key to breaking down this concern: these ref tips give us a way
> to negotiate the difference between what the client already has
> (including the bundles downloaded from a bundle provider) and what it
> wants from the origin Git server. This all happens without any change
> necessary to the origin Git server.
>
> And thus, this bundle URI design came about. It takes all of the best
> things about the GVFS Cache Server but then layers refs on top of the
> time-based prefetch packfiles so a normal Git client can do that
> "catch-up fetch" afterwards.

Yup.  My observation was that (1) you would need ref tips in some
way, (2) you are conveying not just "here are the set of bundle
files", but "this bundle file has these associated attributes" (like
.timestamp, and .uri to fetch it from), in the table-of-contents the
clients are expected to obtain anyway, hence (3) you could, but you
do not need to, use bundle as a way to convey "packfile contents
plus refs" to the clients (iow, instead you can use packfile and
then report these refs information in the table-of-contents as more
"associated attributes" to the items listed in the table-of-contents).

> I could imagine updating GVFS Cache Servers to generate bundles
> instead (or also) and updating the VFS for Git clients to use the
> bundle URI feature to download the data.

I could, too, but I do not think that would buy us anything.  If an
existing system is happily working, I do not see a point in switching
it to use bundle.  What I was imagining is going the other way.  A
new thing being written, instead of basing it on bundles, can be
based on packfiles, and that would allow you to share the on-disk
packfiles between the two systems.

> However, you seem to be hinting at "the GVFS Cache Servers seem to
> work just fine, so why do we need bundles?" but I think that the
> constraints of what is expected at the end of "git clone" or "git
> fetch" require us to not "catch up later" and instead complete the
> full download during the process. The refs in the bundles are critical
> to making that work.

The refs are critical.  They do not have to be recorded in a bundle.

> I see two major issues with that:
>
> 1. We don't have a way to add that metadata directly into packfiles,
>    so we would need to update the file standard or update the
>    packfile-URI protocol to include that metadata.
>
> 2. The only source of packfile-URI listings come as a response to the
>    "git fetch" request to the origin Git server, so there is no way
>    to allow an independent server to provide that data.

It might be end up being the same thing at the end, but I was
thinking about starting from the "bundle URI standard" document.  I
am not yet interested in discussing "packfile URI" feature that
independently exists already in this discussion (but going this
route we _might_ be able to share data and some code with it---but
that was not where I am coming from).

Starting from "bundle URI standard" document at the beginning of the
thread, if we replace all the mentions of "bundle file" with
"packfile" in it, and then add .positiveRefs and .negativeRefs to
each "packfile" (renamed from "bundle file") as additional
"packfile.<id>.*" (renamed from "bundle.<id>.*") attributes, without
changing anything else, the result would be feature equivalent to
the original "bundle URI standard", I would think, but without
having to wrap a packfile in a bundle file?

> I hope I am going in the right direction here, but I likely
> misunderstood some of your proposed alternatives.

I wasn't seriously "proposing" an alternative.  It was just that it
looked wasteful to go to a separate format (i.e. bundle) when packfiles
should suffice, as you would be adding extra information that is not
in bundles via the table-of-contents anyway, and what is given by a
bundle that is missing in a packfile is only the refs information,
which should be trivial to add to the table-of-contents.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/6] docs: document bundle URI standard
  2022-06-09 17:56           ` Junio C Hamano
@ 2022-06-09 18:27             ` Ævar Arnfjörð Bjarmason
  2022-06-09 19:39             ` Derrick Stolee
  1 sibling, 0 replies; 64+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-06-09 18:27 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Derrick Stolee, Derrick Stolee via GitGitGadget, git, me, newren,
	dyroneteng, Johannes.Schindelin


On Thu, Jun 09 2022, Junio C Hamano wrote:

> Derrick Stolee <derrickstolee@github.com> writes:
> [...]
>> I hope I am going in the right direction here, but I likely
>> misunderstood some of your proposed alternatives.
>
> I wasn't seriously "proposing" an alternative.  It was just that it
> looked wasteful to go to a separate format (i.e. bundle) when packfiles
> should suffice, as you would be adding extra information that is not
> in bundles via the table-of-contents anyway, and what is given by a
> bundle that is missing in a packfile is only the refs information,
> which should be trivial to add to the table-of-contents.

One thing that got pointed out to me by someone interested in this
feature is that they'd be interested in serving up historical git
repositories with a repo.bundle file.

We're much better off with a format that includes refs for those use
cases, even though no version of the patches that have been kicked
around support this particular use-case yet (but it would be relatively
easy as a follow-up).

I also daresay that bundles will tend to integrate better into existing
infrastructure, since if you're doing incremental backups with them it's
a natural extension to throw those same incremental updates at a CDN and
serve them up with bundle-uri.

I also think we can come up with much better tooling for collections of
bundles than random packfiles, since they declare what "tip" they want,
and we're able to add arbitrary key-values to the header format in the
future (and Stolee already added one such key-value).

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/6] docs: document bundle URI standard
  2022-06-09 17:56           ` Junio C Hamano
  2022-06-09 18:27             ` Ævar Arnfjörð Bjarmason
@ 2022-06-09 19:39             ` Derrick Stolee
  2022-06-09 20:13               ` Junio C Hamano
  1 sibling, 1 reply; 64+ messages in thread
From: Derrick Stolee @ 2022-06-09 19:39 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Derrick Stolee via GitGitGadget, git, me, newren, avarab,
	dyroneteng, Johannes.Schindelin

On 6/9/2022 1:56 PM, Junio C Hamano wrote:
> Derrick Stolee <derrickstolee@github.com> writes:
> 
>> Ævar's observation that bundles also add ref tips to the packfile is
>> the key to breaking down this concern: these ref tips give us a way
>> to negotiate the difference between what the client already has
>> (including the bundles downloaded from a bundle provider) and what it
>> wants from the origin Git server. This all happens without any change
>> necessary to the origin Git server.
>>
>> And thus, this bundle URI design came about. It takes all of the best
>> things about the GVFS Cache Server but then layers refs on top of the
>> time-based prefetch packfiles so a normal Git client can do that
>> "catch-up fetch" afterwards.
> 
> Yup.  My observation was that (1) you would need ref tips in some
> way, (2) you are conveying not just "here are the set of bundle
> files", but "this bundle file has these associated attributes" (like
> .timestamp, and .uri to fetch it from), in the table-of-contents the
> clients are expected to obtain anyway, hence (3) you could, but you
> do not need to, use bundle as a way to convey "packfile contents
> plus refs" to the clients (iow, instead you can use packfile and
> then report these refs information in the table-of-contents as more
> "associated attributes" to the items listed in the table-of-contents).
...
> Starting from "bundle URI standard" document at the beginning of the
> thread, if we replace all the mentions of "bundle file" with
> "packfile" in it, and then add .positiveRefs and .negativeRefs to
> each "packfile" (renamed from "bundle file") as additional
> "packfile.<id>.*" (renamed from "bundle.<id>.*") attributes, without
> changing anything else, the result would be feature equivalent to
> the original "bundle URI standard", I would think, but without
> having to wrap a packfile in a bundle file?
> 
>> I hope I am going in the right direction here, but I likely
>> misunderstood some of your proposed alternatives.
> 
> I wasn't seriously "proposing" an alternative.  It was just that it
> looked wasteful to go to a separate format (i.e. bundle) when packfiles
> should suffice, as you would be adding extra information that is not
> in bundles via the table-of-contents anyway, and what is given by a
> bundle that is missing in a packfile is only the refs information,
> which should be trivial to add to the table-of-contents.
 
Ok, I've trimmed your latest reply to focus on the main point:
"Why bundles?"

You are right that we could use a table of contents to list the
metadata that we need (that is currently stored in the bundle
header) except for one case: the single bundle. If the provider
wants to skip the table of contents/bundle list and only provide
one file that bootstraps clones, then we need something more than
just a packfile.

This could be remedied by _requiring_ the table of contents with
the ref list, but it does lead to separation of the packfile from
the important ref information.

Further, the provider might want to cover a large number of refs,
not just the default ref. That would increase the size of the
table of contents more than necessary.

With these things in mind, I do still think bundles are a good
way to store and share this data.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/6] docs: document bundle URI standard
  2022-06-09 19:39             ` Derrick Stolee
@ 2022-06-09 20:13               ` Junio C Hamano
  0 siblings, 0 replies; 64+ messages in thread
From: Junio C Hamano @ 2022-06-09 20:13 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, me, newren, avarab,
	dyroneteng, Johannes.Schindelin

Derrick Stolee <derrickstolee@github.com> writes:

> Further, the provider might want to cover a large number of refs,
> not just the default ref. That would increase the size of the
> table of contents more than necessary.
>
> With these things in mind, I do still think bundles are a good
> way to store and share this data.

If you keep the refs and filter information separately from the
packdata (i.e. in the table-of-contents like I outlined in the
message you are responding to), one downside is that you lose these
pieces of information but still have packfiles, such an accident
would make the set of packfiles pretty much useless.

But if you have bundles, the filter information to be placed in the
table-of-contents can be recovered from them.  Which is much better.
I wonder if we should add more to the bundle, like what we would
write to the .timestamp field of the table-of-contents, though, if
we are to go in that direction.

Thanks.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/6] docs: document bundle URI standard
  2022-06-08 19:46     ` Derrick Stolee
  2022-06-08 21:01       ` Junio C Hamano
@ 2022-06-21 19:34       ` Derrick Stolee
  2022-06-21 20:16         ` Junio C Hamano
  1 sibling, 1 reply; 64+ messages in thread
From: Derrick Stolee @ 2022-06-21 19:34 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget
  Cc: git, me, newren, avarab, dyroneteng, Johannes.Schindelin

On 6/8/2022 3:46 PM, Derrick Stolee wrote:
> On 6/6/2022 8:33 PM, Junio C Hamano wrote:
>> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

I'm finally getting around to applying the recommended changes
locally, when I noticed...

>>> +   present in the client repository. If some are missing, then the client
>>> +   delays unbundling until other bundles have been unbundled, making those
>>> +   OIDs present. When all required OIDs are present, the client unbundles
>>> +   that data using a refspec. The default refspec is
>>> +   `+refs/heads/*:refs/bundles/*`, but this can be configured.
>>
>> The refs/bundles/ appear in the document only here, and it is
>> unclear why we even want it (I am assuming this is against gc while
>> "git clone" is still running) or how we are going to retire it, if
>> ever.  If there are multiple bundle files involved in this "git clone",
>> to anchor objects that are necessary against "gc", don't we need to use
>> refs/bundles/<i>/* or something like that, where <i> is uniquely assigned
>> number locally?
> 
> The real reason to keep them in refs/bundles/ is because then those
> refs can be used in the incremental 'git fetch' after downloading the
> bundles (in perpetuity) while not stomping refs/heads or refs/remotes/

...I completely ignored your "refs/bundles/<i>/*" suggestion, which is
an interesting way to allow dropping refs from this space, allowing GC
to clear up space over time.

I'm making note of this and will include it as a potential way forward
(while I also think about what the implementation would look like).

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/6] docs: document bundle URI standard
  2022-06-21 19:34       ` Derrick Stolee
@ 2022-06-21 20:16         ` Junio C Hamano
  2022-06-21 21:10           ` Derrick Stolee
  0 siblings, 1 reply; 64+ messages in thread
From: Junio C Hamano @ 2022-06-21 20:16 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, me, newren, avarab,
	dyroneteng, Johannes.Schindelin

Derrick Stolee <derrickstolee@github.com> writes:

>>> The refs/bundles/ appear in the document only here, and it is
>>> unclear why we even want it (I am assuming this is against gc while
>>> "git clone" is still running) or how we are going to retire it, if
>>> ever.  If there are multiple bundle files involved in this "git clone",
>>> to anchor objects that are necessary against "gc", don't we need to use
>>> refs/bundles/<i>/* or something like that, where <i> is uniquely assigned
>>> number locally?
>> 
>> The real reason to keep them in refs/bundles/ is because then those
>> refs can be used in the incremental 'git fetch' after downloading the
>> bundles (in perpetuity) while not stomping refs/heads or refs/remotes/
>
> ...I completely ignored your "refs/bundles/<i>/*" suggestion, which is
> an interesting way to allow dropping refs from this space, allowing GC
> to clear up space over time.

FWIW, I wasn't thinking about GC and expiration.  If bundle URI
thing can say "you need this, that and that other bundle" and cause
you to fetch three bundles, I thought that there needs a way for you
to record the tips of these three bundles---these three bundles
should not have to compete for refs/bundles/master, for example.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/6] docs: document bundle URI standard
  2022-06-21 20:16         ` Junio C Hamano
@ 2022-06-21 21:10           ` Derrick Stolee
  2022-06-21 21:33             ` Junio C Hamano
  0 siblings, 1 reply; 64+ messages in thread
From: Derrick Stolee @ 2022-06-21 21:10 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Derrick Stolee via GitGitGadget, git, me, newren, avarab,
	dyroneteng, Johannes.Schindelin

On 6/21/2022 4:16 PM, Junio C Hamano wrote:
> Derrick Stolee <derrickstolee@github.com> writes:
> 
>>>> The refs/bundles/ appear in the document only here, and it is
>>>> unclear why we even want it (I am assuming this is against gc while
>>>> "git clone" is still running) or how we are going to retire it, if
>>>> ever.  If there are multiple bundle files involved in this "git clone",
>>>> to anchor objects that are necessary against "gc", don't we need to use
>>>> refs/bundles/<i>/* or something like that, where <i> is uniquely assigned
>>>> number locally?
>>>
>>> The real reason to keep them in refs/bundles/ is because then those
>>> refs can be used in the incremental 'git fetch' after downloading the
>>> bundles (in perpetuity) while not stomping refs/heads or refs/remotes/
>>
>> ...I completely ignored your "refs/bundles/<i>/*" suggestion, which is
>> an interesting way to allow dropping refs from this space, allowing GC
>> to clear up space over time.
> 
> FWIW, I wasn't thinking about GC and expiration.  If bundle URI
> thing can say "you need this, that and that other bundle" and cause
> you to fetch three bundles, I thought that there needs a way for you
> to record the tips of these three bundles---these three bundles
> should not have to compete for refs/bundles/master, for example.
 
Not wanting to compete makes sense, but also we should usually
expect the "most recent" bundle to be the most recent version of the
branch. However, that ordering only makes sense when we have the
creationToken (nee timestamp) heuristic, so having distinct ref
spaces makes sense to avoid collisions.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/6] docs: document bundle URI standard
  2022-06-21 21:10           ` Derrick Stolee
@ 2022-06-21 21:33             ` Junio C Hamano
  0 siblings, 0 replies; 64+ messages in thread
From: Junio C Hamano @ 2022-06-21 21:33 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Derrick Stolee via GitGitGadget, git, me, newren, avarab,
	dyroneteng, Johannes.Schindelin

Derrick Stolee <derrickstolee@github.com> writes:

>> FWIW, I wasn't thinking about GC and expiration.  If bundle URI
>> thing can say "you need this, that and that other bundle" and cause
>> you to fetch three bundles, I thought that there needs a way for you
>> to record the tips of these three bundles---these three bundles
>> should not have to compete for refs/bundles/master, for example.
>  
> Not wanting to compete makes sense, but also we should usually
> expect the "most recent" bundle to be the most recent version of the
> branch. However, that ordering only makes sense when we have the
> creationToken (nee timestamp) heuristic, so having distinct ref
> spaces makes sense to avoid collisions.

I still do not see how keeping track of bundle tips in the longer
term with refs fits in the larger picture.  It's not like we are
keeping the bundle files (we are instead exploding them into the
object store, at which point the contents are mixed with all other
objects), and if we do so as part of a boot-strapping, we'd fetch
more history on top of what came from bundles, at which point we
no longer need these refs for protecting objects from collection.
And if the project rewinds and force pushes history, some objects
that originally came from these bundles can and should go stale and
be collected.

Also, if I am not mistaken, the table of contents does not record
the tip commits of each bundle, so keeping track of the bundle tips
we have seen does not help us optimizing download upon seeing a
toc---we'd need to look at the bundle data (at least the header part
that lists the tips) anyway.

Not complaining against the existence of refs/bundles/ hierarchy.
Just stating that I do not think I got the use of it documented in
the proposed design (perhaps it was described but I didn't
understand it, or perhaps it was under-described and needs
clarification).

Thanks.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH v2 0/6] bundle URIs: design doc and initial git fetch --bundle-uri implementation
  2022-06-06 19:55 [PATCH 0/6] bundle URIs: design doc and initial git fetch --bundle-uri implementation Derrick Stolee via GitGitGadget
                   ` (5 preceding siblings ...)
  2022-06-06 19:55 ` [PATCH 6/6] fetch: add 'refs/bundle/' to log.excludeDecoration Derrick Stolee via GitGitGadget
@ 2022-06-29 20:40 ` Derrick Stolee via GitGitGadget
  2022-06-29 20:40   ` [PATCH v2 1/6] docs: document bundle URI standard Derrick Stolee via GitGitGadget
                     ` (7 more replies)
  6 siblings, 8 replies; 64+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-06-29 20:40 UTC (permalink / raw)
  To: git
  Cc: gitster, me, newren, avarab, dyroneteng, Johannes.Schindelin,
	Derrick Stolee

This is the first of series towards building the bundle URI feature as
discussed in previous RFCs, specifically pulled directly out of [5]:

[1]
https://lore.kernel.org/git/RFC-cover-00.13-0000000000-20210805T150534Z-avarab@gmail.com/
[2]
https://lore.kernel.org/git/cover-0.3-00000000000-20211025T211159Z-avarab@gmail.com/
[3]
https://lore.kernel.org/git/pull.1160.git.1645641063.gitgitgadget@gmail.com
[4]
https://lore.kernel.org/git/RFC-cover-v2-00.36-00000000000-20220418T165545Z-avarab@gmail.com/
[5]
https://lore.kernel.org/git/pull.1234.git.1653072042.gitgitgadget@gmail.com

The first patch details the long-term design and goals of the bundle URI
feature, including complicated features such as the bundle-uri protocol v2
verb and bundle lists with heuristics.

However, then intention is to start small with the simplest features that
allow user benefit as soon as possible. In that direction, the rest of this
series creates the ability to run 'git fetch --bundle-uri=' to skip fetching
from any remote and instead download the file at the given . Currently, that
data is expected to be a bundle, which Git will then unbundle and modify the
refs to be in the 'refs/bundle/' namespace.

Currently, the can be a literal filename, a file:// URI, or an http[s]://
URI. Tests are added for both of these cases.

As outlined in [5], the next steps after this are:

 1. Add 'git clone --bundle-uri=' to run this 'git fetch --bundle-uri=' step
    before doing a fetch negotiation with the origin remote. [6]
 2. Allow parsing a bundle list as a config file at the given URI. The
    key-value format is unified with the protocol v2 verb (coming in (3)).
    [7]
 3. Implement the protocol v2 verb, re-using the bundle list logic from (2).
    Use this to auto-discover bundle URIs during 'git clone' (behind a
    config option). [8]
 4. Implement the 'creationToken' heuristic, allowing incremental 'git
    fetch' commands to download a bundle list from a configured URI, and
    only download bundles that are new based on the creation token values.
    [9]

I have prepared some of this work as pull requests on my personal fork so
curious readers can look ahead to where we are going:

[6] https://github.com/derrickstolee/git/pull/18 [7]
https://github.com/derrickstolee/git/pull/20 [8]
https://github.com/derrickstolee/git/pull/21 [9]
https://github.com/derrickstolee/git/pull/22

As mentioned in the design document, this is not all that is possible. For
instance, Ævar's suggestion to download only the bundle headers can be used
as a second heuristic (and as an augmentation of the timestamp heuristic).


Updates in v2
=============

 * The design document has been updated based on Junio's feedback.
 * The "bundle.list." keys are now just "bundle.".
 * The "timestamp" heuristic is now "creationToken".
 * More clarity on how Git parses data from the bundle URI.
 * Dropped some unnecessary bundle list keys (*.list, *.requires).

Thanks, -Stolee

Derrick Stolee (6):
  docs: document bundle URI standard
  remote-curl: add 'get' capability
  bundle-uri: create basic file-copy logic
  fetch: add --bundle-uri option
  bundle-uri: add support for http(s):// and file://
  fetch: add 'refs/bundle/' to log.excludeDecoration

 Documentation/fetch-options.txt        |   6 +
 Documentation/git-fetch.txt            |   1 +
 Documentation/gitremote-helpers.txt    |   9 +
 Documentation/technical/bundle-uri.txt | 464 +++++++++++++++++++++++++
 Makefile                               |   1 +
 builtin/fetch.c                        |  10 +
 bundle-uri.c                           | 167 +++++++++
 bundle-uri.h                           |  14 +
 remote-curl.c                          |  33 ++
 t/t5557-http-get.sh                    |  37 ++
 t/t5558-fetch-bundle-uri.sh            |  77 ++++
 transport-helper.c                     |   5 +-
 12 files changed, 823 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/technical/bundle-uri.txt
 create mode 100644 bundle-uri.c
 create mode 100644 bundle-uri.h
 create mode 100755 t/t5557-http-get.sh
 create mode 100755 t/t5558-fetch-bundle-uri.sh


base-commit: 89c6e450fe4a919ecb6fa698005a935531c732cf
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1248%2Fderrickstolee%2Fbundle-redo%2Ffetch-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1248/derrickstolee/bundle-redo/fetch-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/1248

Range-diff vs v1:

 1:  e771b2971d0 ! 1:  d444042dc4d docs: document bundle URI standard
     @@ Documentation/technical/bundle-uri.txt (new)
      +  bundle provider to use different URIs for different geographies.
      +
      +* The bundle provider can organize the bundles using heuristics, such as
     -+  timestamps or tokens, to help the client prevent downloading bundles it
     -+  does not need. When the bundle provider does not provide these
     -+  heuristics, the client can use optimizations to minimize how much of the
     -+  data is downloaded.
     ++  creation tokens, to help the client prevent downloading bundles it does
     ++  not need. When the bundle provider does not provide these heuristics,
     ++  the client can use optimizations to minimize how much of the data is
     ++  downloaded.
      +
      +* The bundle provider does not need to be associated with the Git server.
      +  The client can choose to use the bundle provider without it being
     @@ Documentation/technical/bundle-uri.txt (new)
      +SSH URIs.)
      +
      +Assuming a `200 OK` response from the server, the content at the URL is
     -+expected to be of one of two forms:
     -+
     -+1. Bundle: A Git bundle file of version 2 or higher.
     -+
     -+2. Bundle List: A plain-text file that is parsable using Git's
     -+   config file parser. This file describes one or more bundles that are
     -+   accessible from other URIs.
     ++inspected. First, Git attempts to parse the file as a bundle file of
     ++version 2 or higher. If the file is not a bundle, then the file is parsed
     ++as a plain-text file using Git's config parser. The key-value pairs in
     ++that config file are expected to describe a list of bundle URIs. If
     ++neither of these parse attempts succeed, then Git will report an error to
     ++the user that the bundle URI provided erroneous data.
      +
      +Any other data provided by the server is considered erroneous.
      +
     @@ Documentation/technical/bundle-uri.txt (new)
      +
      +A few keys focus on properties of the list itself.
      +
     -+bundle.list.version::
     -+	(Required) This value provides a version number for the table of
     -+	contents. If a future Git change enables a feature that needs the Git
     ++bundle.version::
     ++	(Required) This value provides a version number for the bundle
     ++	list. If a future Git change enables a feature that needs the Git
      +	client to react to a new key in the bundle list file, then this version
      +	will increment. The only current version number is 1, and if any other
      +	value is specified then Git will fail to use this file.
      +
     -+bundle.list.mode::
     ++bundle.mode::
      +	(Required) This value has one of two values: `all` and `any`. When `all`
      +	is specified, then the client should expect to need all of the listed
      +	bundle URIs that match their repository's requirements. When `any` is
     @@ Documentation/technical/bundle-uri.txt (new)
      +	`any` option is used to list a number of different bundle servers
      +	located in different geographies.
      +
     -+bundle.list.heuristic::
     ++bundle.heuristic::
      +	If this string-valued key exists, then the bundle list is designed to
     -+  work well with incremental `git fetch` commands. The heuristic signals
     -+  that there are additional keys available for each bundle that help
     -+  determine which subset of bundles the client should download.
     ++	work well with incremental `git fetch` commands. The heuristic signals
     ++	that there are additional keys available for each bundle that help
     ++	determine which subset of bundles the client should download.
      +
      +The remaining keys include an `<id>` segment which is a server-designated
      +name for each available bundle.
     @@ Documentation/technical/bundle-uri.txt (new)
      +	bundles across a large number of servers or CDNs with different domain
      +	names.)
      +
     -+bundle.<id>.list::
     -+	This boolean value indicates whether the client should expect the
     -+	content from this URI to be a list (if `true`) or a bundle (if `false`).
     -+	This is typically used when `bundle.list.mode` is `any`.
     -+
      +bundle.<id>.filter::
      +	This string value represents an object filter that should also appear in
      +	the header of this bundle. The server uses this value to differentiate
      +	different kinds of bundles from which the client can choose those that
      +	match their object filters.
      +
     -+bundle.<id>.timestamp::
     -+	This value is the number of seconds since Unix epoch (UTC) that this
     -+	bundle was created. This is used as an approximation of a point in time
     -+	that the bundle matches the data available at the origin server. This is
     -+	used when `bundle.list.heuristic=timestamp`.
     -+
     -+bundle.<id>.requires::
     -+	This string value represents the ID of another bundle. When present, the
     -+	server is indicating that this bundle contains a thin packfile. If the
     -+	client does not have all necessary objects to unbundle this packfile,
     -+	then the client can download the bundle with the `requires` ID and try
     -+	again. (Note: it may be beneficial to allow the server to specify
     -+	multiple `requires` bundles.) This is used when
     -+	`bundle.list.heuristic=timestamp`.
     ++bundle.<id>.creationToken::
     ++	This value is a nonnegative 64-bit integer used for sorting the bundles
     ++	the list. This is used to download a subset of bundles during a fetch
     ++	when `bundle.heuristic=creationToken`.
      +
      +bundle.<id>.location::
      +	This string value advertises a real-world location from where the bundle
      +	URI is served. This can be used to present the user with an option for
     -+	which bundle URI to use. This is only valuable when `bundle.list.mode`
     -+	is `any`.
     ++	which bundle URI to use or simply as an informative indicator of which
     ++	bundle URI was selected by Git. This is only valuable when
     ++	`bundle.mode` is `any`.
      +
      +Here is an example bundle list using the Git config format:
      +
      +```
     -+[bundle "list"]
     ++[bundle]
      +	version = 1
      +	mode = all
     -+	heuristic = timestamp
     ++	heuristic = creationToken
      +
      +[bundle "2022-02-09-1644442601-daily"]
     -+	uri = https://bundles.fake.com/git/git/2022-02-09-1644442601-daily.bundle
     ++	uri = https://bundles.example.com/git/git/2022-02-09-1644442601-daily.bundle
      +	timestamp = 1644442601
     -+	requires = 2022-02-02-1643842562
      +
      +[bundle "2022-02-02-1643842562"]
     -+	uri = https://bundles.fake.com/git/git/2022-02-02-1643842562.bundle
     ++	uri = https://bundles.example.com/git/git/2022-02-02-1643842562.bundle
      +	timestamp = 1643842562
      +
      +[bundle "2022-02-09-1644442631-daily-blobless"]
      +	uri = 2022-02-09-1644442631-daily-blobless.bundle
      +	timestamp = 1644442631
     -+	requires = 2022-02-02-1643842568-blobless
      +	filter = blob:none
      +
      +[bundle "2022-02-02-1643842568-blobless"]
     @@ Documentation/technical/bundle-uri.txt (new)
      +	filter = blob:none
      +```
      +
     -+This example uses `bundle.list.mode=all` as well as the
     -+`bundle.<id>.timestamp` heuristic. It also uses the `bundle.<id>.filter`
     ++This example uses `bundle.mode=all` as well as the
     ++`bundle.<id>.creationToken` heuristic. It also uses the `bundle.<id>.filter`
      +options to present two parallel sets of bundles: one for full clones and
      +another for blobless partial clones.
      +
      +Suppose that this bundle list was found at the URI
     -+`https://bundles.fake.com/git/git/` and so the two blobless bundles have
     ++`https://bundles.example.com/git/git/` and so the two blobless bundles have
      +the following fully-expanded URIs:
      +
     -+* `https://bundles.fake.com/git/git/2022-02-09-1644442631-daily-blobless.bundle`
     -+* `https://bundles.fake.com/git/git/2022-02-02-1643842568-blobless.bundle`
     ++* `https://bundles.example.com/git/git/2022-02-09-1644442631-daily-blobless.bundle`
     ++* `https://bundles.example.com/git/git/2022-02-02-1643842568-blobless.bundle`
      +
      +Advertising Bundle URIs
      +-----------------------
     @@ Documentation/technical/bundle-uri.txt (new)
      +protocol v2 capability specifically for discovering bundle URIs.
      +
      +The client could choose an arbitrary bundle URI as an option _or_ select
     -+the URI with lowest latency by some exploratory checks. It is up to the
     ++the URI with best performance by some exploratory checks. It is up to the
      +bundle provider to decide if having multiple URIs is preferable to a
      +single URI that is geodistributed through server-side infrastructure.
      +
     @@ Documentation/technical/bundle-uri.txt (new)
      +   Git server.
      +
      +2. If the downloaded data from a bundle URI is a bundle, then the client
     -+   inspects the bundle headers to check that the negative commit OIDs are
     -+   present in the client repository. If some are missing, then the client
     -+   delays unbundling until other bundles have been unbundled, making those
     -+   OIDs present. When all required OIDs are present, the client unbundles
     -+   that data using a refspec. The default refspec is
     -+   `+refs/heads/*:refs/bundles/*`, but this can be configured.
     ++   inspects the bundle headers to check that the prerequisite commit OIDs
     ++   are present in the client repository. If some are missing, then the
     ++   client delays unbundling until other bundles have been unbundled,
     ++   making those OIDs present. When all required OIDs are present, the
     ++   client unbundles that data using a refspec. The default refspec is
     ++   `+refs/heads/*:refs/bundles/*`, but this can be configured. These refs
     ++   are stored so that later `git fetch` negotiations can communicate the
     ++   bundled refs as `have`s, reducing the size of the fetch over the Git
     ++   protocol. To allow pruning refs from this ref namespace, Git may
     ++   introduce a numbered namespace (such as `refs/bundles/<i>/*`) such that
     ++   stale bundle refs can be deleted.
      +
      +3. If the file is instead a bundle list, then the client inspects the
     -+   `bundle.list.mode` to see if the list is of the `all` or `any` form.
     ++   `bundle.mode` to see if the list is of the `all` or `any` form.
      +
     -+   a. If `bundle.list.mode=all`, then the client considers all bundle
     ++   a. If `bundle.mode=all`, then the client considers all bundle
      +      URIs. The list is reduced based on the `bundle.<id>.filter` options
      +      matching the client repository's partial clone filter. Then, all
     -+      bundle URIs are requested. If the `bundle.<id>.timestamp` heuristic
     -+      is provided, then the bundles are downloaded in reverse-
     -+      chronological order, stopping when a bundle has all required OIDs.
     -+      The bundles can then be unbundled in chronological order. The client
     -+      stores the latest timestamp as a heuristic for avoiding future
     -+      downloads if the bundle list does not advertise newer bundles.
     -+
     -+   b. If `bundle.list.mode=any`, then the client can choose any one of the
     ++      bundle URIs are requested. If the `bundle.<id>.creationToken`
     ++      heuristic is provided, then the bundles are downloaded in decreasing
     ++      order by the creation token, stopping when a bundle has all required
     ++      OIDs. The bundles can then be unbundled in increasing creation token
     ++      order. The client stores the latest creation token as a heuristic
     ++      for avoiding future downloads if the bundle list does not advertise
     ++      bundles with larger creation tokens.
     ++
     ++   b. If `bundle.mode=any`, then the client can choose any one of the
      +      bundle URIs to inspect. The client can use a variety of ways to
      +      choose among these URIs. The client can also fallback to another URI
      +      if the initial choice fails to return a result.
      +
      +Note that during a clone we expect that all bundles will be required, and
     -+heuristics such as `bundle.<uri>.timestamp` can be used to download bundles
     -+in chronological order or in parallel.
     ++heuristics such as `bundle.<uri>.creationToken` can be used to download
     ++bundles in chronological order or in parallel.
      +
     -+If a given bundle URI is a bundle list with a `bundle.list.heuristic`
     ++If a given bundle URI is a bundle list with a `bundle.heuristic`
      +value, then the client can choose to store that URI as its chosen bundle
      +URI. The client can then navigate directly to that URI during later `git
      +fetch` calls.
     @@ Documentation/technical/bundle-uri.txt (new)
      +
      +The fetch operation follows the same procedure to download bundles from a
      +bundle list (although we do _not_ want to use parallel downloads here). We
     -+expect that the process will end when all negative commit OIDs in a thin
     -+bundle are already in the object database.
     ++expect that the process will end when all prerequisite commit OIDs in a
     ++thin bundle are already in the object database.
      +
     -+When using the `timestamp` heuristic, the client can avoid downloading any
     -+bundles if their timestamps are not larger than the stored timestamp.
     -+After fetching new bundles, this local timestamp value is updated.
     ++When using the `creationToken` heuristic, the client can avoid downloading
     ++any bundles if their creation tokenss are not larger than the stored
     ++creation token. After fetching new bundles, Git updates this local
     ++creation token.
      +
      +If the bundle provider does not provide a heuristic, then the client
      +should attempt to inspect the bundle headers before downloading the full
     @@ Documentation/technical/bundle-uri.txt (new)
      +  use the `credential.helper` to attempt authentication after the first
      +  `401 Not Authorized` response, but a second such response is a failure.
      +
     -+* The client receives data that is not parsable as a bundle or table of
     -+  contents.
     ++* The client receives data that is not parsable as a bundle or bundle list.
      +
      +* The bundle list describes a directed cycle in the
      +  `bundle.<id>.requires` links.
      +
      +* A bundle includes a filter that does not match expectations.
      +
     -+* The client cannot unbundle the bundles because the negative commit OIDs
     ++* The client cannot unbundle the bundles because the prerequisite commit OIDs
      +  are not in the object database and there are no more
      +  `bundle.<id>.requires` links to follow.
      +
     @@ Documentation/technical/bundle-uri.txt (new)
      +  the client is using hourly prefetches with background maintenance, but
      +  the server is computing bundles weekly. For this reason, the client
      +  should not use bundle URIs for fetch unless the server has explicitly
     -+  recommended it through the `bundle.list.forFetch = true` value.
     ++  recommended it through the `bundle.flags = forFetch` value.
      +
      +Implementation Plan
      +-------------------
     @@ Documentation/technical/bundle-uri.txt (new)
      +
      +2. Implement the ability to parse a bundle list from a bundle URI and
      +   update the `git fetch --bundle-uri` logic to properly distinguish
     -+   between `bundle.list.mode` options. Specifically design the feature so
     ++   between `bundle.mode` options. Specifically design the feature so
      +   that the config format parsing feeds a list of key-value pairs into the
      +   bundle list logic.
      +
     @@ Documentation/technical/bundle-uri.txt (new)
      +   (This choice is an opt-in via a config option and a command-line
      +   option.)
      +
     -+4. Allow the client to understand the `bundle.list.forFetch` configuration
     -+   and the `bundle.<id>.timestamp` heuristic. When `git clone` discovers a
     -+   bundle URI with `bundle.list.forFetch=true`, it configures the client
     -+   repository to check that bundle URI during later `git fetch <remote>`
     ++4. Allow the client to understand the `bundle.flag=forFetch` configuration
     ++   and the `bundle.<id>.creationToken` heuristic. When `git clone`
     ++   discovers a bundle URI with `bundle.flag=forFetch`, it configures the
     ++   client repository to check that bundle URI during later `git fetch <remote>`
      +   commands.
      +
      +5. Allow clients to discover bundle URIs during `git fetch` and configure
     -+   a bundle URI for later fetches if `bundle.list.forFetch=true`.
     ++   a bundle URI for later fetches if `bundle.flag=forFetch`.
      +
      +6. Implement the "inspect headers" heuristic to reduce data downloads when
     -+   the `bundle.<id>.timestamp` heuristic is not available.
     ++   the `bundle.<id>.creationToken` heuristic is not available.
      +
      +As these features are reviewed, this plan might be updated. We also expect
      +that new designs will be discovered and implemented as this feature
 2:  977f0af40fc = 2:  0a2cf60437f remote-curl: add 'get' capability
 3:  8a18acfbf05 ! 3:  abec47564fd bundle-uri: create basic file-copy logic
     @@ bundle-uri.c (new)
      +{
      +	int fd;
      +	/*
     -+	 * Find a temporray filename that is available. This is briefly
     ++	 * Find a temporary filename that is available. This is briefly
      +	 * racy, but unlikely to collide.
      +	 */
      +	fd = odb_mkstemp(name, "bundles/tmp_uri_XXXXXX");
     @@ bundle-uri.c (new)
      +	if ((bundle_fd = read_bundle_header(file, &header)) < 0)
      +		return 1;
      +
     -+	result = unbundle(r, &header, bundle_fd, &extra_index_pack_args);
     ++	if ((result = unbundle(r, &header, bundle_fd, &extra_index_pack_args)))
     ++		return 1;
      +
      +	/*
      +	 * Convert all refs/heads/ from the bundle into refs/bundles/
 4:  f9e416164e4 ! 4:  f6255ec5188 fetch: add --bundle-uri option
     @@ Commit message
           4. git fetch origin
           5. git checkout FETCH_HEAD
      
     +    Since the unbundle_from_file() in bundle-uri.c stores the bundled refs
     +    into the refs/bundle/* namespace, those references are useful for the
     +    fetch negotiation in step 4, reducing the size of the download in that
     +    step.
     +
          Later changes will make this seamless within a 'git clone' command, but
          this implementation is large enough to delay that integration.
      
 5:  923feef84e4 ! 5:  bfbd11b48bf bundle-uri: add support for http(s):// and file://
     @@ Metadata
       ## Commit message ##
          bundle-uri: add support for http(s):// and file://
      
     -    The previous change created the logic for copying a file by name. Now,
     -    first inspect the URI for an HTTP(S) prefix and use git-remote-https as
     -    the way to download the data at that URI. Otherwise, check to see if
     -    file:// is present and modify the prefix accordingly.
     +    The previous change created the 'git fetch --bundle-uri=<uri>' option.
     +    Currently, <uri> must be a filename.
     +
     +    Update copy_uri_to_file() to first inspect the URI for an HTTP(S) prefix
     +    and use git-remote-https as the way to download the data at that URI.
     +    Otherwise, check to see if file:// is present and modify the prefix
     +    accordingly.
      
          Signed-off-by: Derrick Stolee <derrickstolee@github.com>
      
 6:  5ca02b841db ! 6:  a217e9a0640 fetch: add 'refs/bundle/' to log.excludeDecoration
     @@ t/t5558-fetch-bundle-uri.sh: test_expect_success 'fetch file bundle' '
      -	test_cmp expect actual
      +	test_cmp expect actual &&
      +
     -+	test_config log.excludedecoration refs/bundle/
     ++	test_config -C fetch-to log.excludedecoration refs/bundle/
       '
       
       test_expect_success 'fetch file:// bundle' '
     @@ t/t5558-fetch-bundle-uri.sh: test_expect_success 'fetch file:// bundle' '
      -	test_cmp expect actual
      +	test_cmp expect actual &&
      +
     -+	test_config log.excludedecoration refs/bundle/
     ++	test_config -C fetch-file log.excludedecoration refs/bundle/
       '
       
       #########################################################################

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH v2 1/6] docs: document bundle URI standard
  2022-06-29 20:40 ` [PATCH v2 0/6] bundle URIs: design doc and initial git fetch --bundle-uri implementation Derrick Stolee via GitGitGadget
@ 2022-06-29 20:40   ` Derrick Stolee via GitGitGadget
  2022-07-18  9:20     ` SZEDER Gábor
                       ` (2 more replies)
  2022-06-29 20:40   ` [PATCH v2 2/6] remote-curl: add 'get' capability Derrick Stolee via GitGitGadget
                     ` (6 subsequent siblings)
  7 siblings, 3 replies; 64+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-06-29 20:40 UTC (permalink / raw)
  To: git
  Cc: gitster, me, newren, avarab, dyroneteng, Johannes.Schindelin,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

Introduce the idea of bundle URIs to the Git codebase through an
aspirational design document. This document includes the full design
intended to include the feature in its fully-implemented form. This will
take several steps as detailed in the Implementation Plan section.

By committing this document now, it can be used to motivate changes
necessary to reach these final goals. The design can still be altered as
new information is discovered.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 Documentation/technical/bundle-uri.txt | 464 +++++++++++++++++++++++++
 1 file changed, 464 insertions(+)
 create mode 100644 Documentation/technical/bundle-uri.txt

diff --git a/Documentation/technical/bundle-uri.txt b/Documentation/technical/bundle-uri.txt
new file mode 100644
index 00000000000..a0230a902a4
--- /dev/null
+++ b/Documentation/technical/bundle-uri.txt
@@ -0,0 +1,464 @@
+Bundle URIs
+===========
+
+Bundle URIs are locations where Git can download one or more bundles in
+order to bootstrap the object database in advance of fetching the remaining
+objects from a remote.
+
+One goal is to speed up clones and fetches for users with poor network
+connectivity to the origin server. Another benefit is to allow heavy users,
+such as CI build farms, to use local resources for the majority of Git data
+and thereby reducing the load on the origin server.
+
+To enable the bundle URI feature, users can specify a bundle URI using
+command-line options or the origin server can advertise one or more URIs
+via a protocol v2 capability.
+
+Design Goals
+------------
+
+The bundle URI standard aims to be flexible enough to satisfy multiple
+workloads. The bundle provider and the Git client have several choices in
+how they create and consume bundle URIs.
+
+* Bundles can have whatever name the server desires. This name could refer
+  to immutable data by using a hash of the bundle contents. However, this
+  means that a new URI will be needed after every update of the content.
+  This might be acceptable if the server is advertising the URI (and the
+  server is aware of new bundles being generated) but would not be
+  ergonomic for users using the command line option.
+
+* The bundles could be organized specifically for bootstrapping full
+  clones, but could also be organized with the intention of bootstrapping
+  incremental fetches. The bundle provider must decide on one of several
+  organization schemes to minimize client downloads during incremental
+  fetches, but the Git client can also choose whether to use bundles for
+  either of these operations.
+
+* The bundle provider can choose to support full clones, partial clones,
+  or both. The client can detect which bundles are appropriate for the
+  repository's partial clone filter, if any.
+
+* The bundle provider can use a single bundle (for clones only), or a
+  list of bundles. When using a list of bundles, the provider can specify
+  whether or not the client needs _all_ of the bundle URIs for a full
+  clone, or if _any_ one of the bundle URIs is sufficient. This allows the
+  bundle provider to use different URIs for different geographies.
+
+* The bundle provider can organize the bundles using heuristics, such as
+  creation tokens, to help the client prevent downloading bundles it does
+  not need. When the bundle provider does not provide these heuristics,
+  the client can use optimizations to minimize how much of the data is
+  downloaded.
+
+* The bundle provider does not need to be associated with the Git server.
+  The client can choose to use the bundle provider without it being
+  advertised by the Git server.
+
+* The client can choose to discover bundle providers that are advertised
+  by the Git server. This could happen during `git clone`, during
+  `git fetch`, both, or neither. The user can choose which combination
+  works best for them.
+
+* The client can choose to configure a bundle provider manually at any
+  time. The client can also choose to specify a bundle provider manually
+  as a command-line option to `git clone`.
+
+Each repository is different and every Git server has different needs.
+Hopefully the bundle URI feature is flexible enough to satisfy all needs.
+If not, then the feature can be extended through its versioning mechanism.
+
+Server requirements
+-------------------
+
+To provide a server-side implementation of bundle servers, no other parts
+of the Git protocol are required. This allows server maintainers to use
+static content solutions such as CDNs in order to serve the bundle files.
+
+At the current scope of the bundle URI feature, all URIs are expected to
+be HTTP(S) URLs where content is downloaded to a local file using a `GET`
+request to that URL. The server could include authentication requirements
+to those requests with the aim of triggering the configured credential
+helper for secure access. (Future extensions could use "file://" URIs or
+SSH URIs.)
+
+Assuming a `200 OK` response from the server, the content at the URL is
+inspected. First, Git attempts to parse the file as a bundle file of
+version 2 or higher. If the file is not a bundle, then the file is parsed
+as a plain-text file using Git's config parser. The key-value pairs in
+that config file are expected to describe a list of bundle URIs. If
+neither of these parse attempts succeed, then Git will report an error to
+the user that the bundle URI provided erroneous data.
+
+Any other data provided by the server is considered erroneous.
+
+Bundle Lists
+------------
+
+The Git server can advertise bundle URIs using a set of `key=value` pairs.
+A bundle URI can also serve a plain-text file in the Git config format
+containing these same `key=value` pairs. In both cases, we consider this
+to be a _bundle list_. The pairs specify information about the bundles
+that the client can use to make decisions for which bundles to download
+and which to ignore.
+
+A few keys focus on properties of the list itself.
+
+bundle.version::
+	(Required) This value provides a version number for the bundle
+	list. If a future Git change enables a feature that needs the Git
+	client to react to a new key in the bundle list file, then this version
+	will increment. The only current version number is 1, and if any other
+	value is specified then Git will fail to use this file.
+
+bundle.mode::
+	(Required) This value has one of two values: `all` and `any`. When `all`
+	is specified, then the client should expect to need all of the listed
+	bundle URIs that match their repository's requirements. When `any` is
+	specified, then the client should expect that any one of the bundle URIs
+	that match their repository's requirements will suffice. Typically, the
+	`any` option is used to list a number of different bundle servers
+	located in different geographies.
+
+bundle.heuristic::
+	If this string-valued key exists, then the bundle list is designed to
+	work well with incremental `git fetch` commands. The heuristic signals
+	that there are additional keys available for each bundle that help
+	determine which subset of bundles the client should download.
+
+The remaining keys include an `<id>` segment which is a server-designated
+name for each available bundle.
+
+bundle.<id>.uri::
+	(Required) This string value is the URI for downloading bundle `<id>`.
+	If the URI begins with a protocol (`http://` or `https://`) then the URI
+	is absolute. Otherwise, the URI is interpreted as relative to the URI
+	used for the bundle list. If the URI begins with `/`, then that relative
+	path is relative to the domain name used for the bundle list. (This use
+	of relative paths is intended to make it easier to distribute a set of
+	bundles across a large number of servers or CDNs with different domain
+	names.)
+
+bundle.<id>.filter::
+	This string value represents an object filter that should also appear in
+	the header of this bundle. The server uses this value to differentiate
+	different kinds of bundles from which the client can choose those that
+	match their object filters.
+
+bundle.<id>.creationToken::
+	This value is a nonnegative 64-bit integer used for sorting the bundles
+	the list. This is used to download a subset of bundles during a fetch
+	when `bundle.heuristic=creationToken`.
+
+bundle.<id>.location::
+	This string value advertises a real-world location from where the bundle
+	URI is served. This can be used to present the user with an option for
+	which bundle URI to use or simply as an informative indicator of which
+	bundle URI was selected by Git. This is only valuable when
+	`bundle.mode` is `any`.
+
+Here is an example bundle list using the Git config format:
+
+```
+[bundle]
+	version = 1
+	mode = all
+	heuristic = creationToken
+
+[bundle "2022-02-09-1644442601-daily"]
+	uri = https://bundles.example.com/git/git/2022-02-09-1644442601-daily.bundle
+	timestamp = 1644442601
+
+[bundle "2022-02-02-1643842562"]
+	uri = https://bundles.example.com/git/git/2022-02-02-1643842562.bundle
+	timestamp = 1643842562
+
+[bundle "2022-02-09-1644442631-daily-blobless"]
+	uri = 2022-02-09-1644442631-daily-blobless.bundle
+	timestamp = 1644442631
+	filter = blob:none
+
+[bundle "2022-02-02-1643842568-blobless"]
+	uri = /git/git/2022-02-02-1643842568-blobless.bundle
+	timestamp = 1643842568
+	filter = blob:none
+```
+
+This example uses `bundle.mode=all` as well as the
+`bundle.<id>.creationToken` heuristic. It also uses the `bundle.<id>.filter`
+options to present two parallel sets of bundles: one for full clones and
+another for blobless partial clones.
+
+Suppose that this bundle list was found at the URI
+`https://bundles.example.com/git/git/` and so the two blobless bundles have
+the following fully-expanded URIs:
+
+* `https://bundles.example.com/git/git/2022-02-09-1644442631-daily-blobless.bundle`
+* `https://bundles.example.com/git/git/2022-02-02-1643842568-blobless.bundle`
+
+Advertising Bundle URIs
+-----------------------
+
+If a user knows a bundle URI for the repository they are cloning, then
+they can specify that URI manually through a command-line option. However,
+a Git host may want to advertise bundle URIs during the clone operation,
+helping users unaware of the feature.
+
+The only thing required for this feature is that the server can advertise
+one or more bundle URIs. This advertisement takes the form of a new
+protocol v2 capability specifically for discovering bundle URIs.
+
+The client could choose an arbitrary bundle URI as an option _or_ select
+the URI with best performance by some exploratory checks. It is up to the
+bundle provider to decide if having multiple URIs is preferable to a
+single URI that is geodistributed through server-side infrastructure.
+
+Cloning with Bundle URIs
+------------------------
+
+The primary need for bundle URIs is to speed up clones. The Git client
+will interact with bundle URIs according to the following flow:
+
+1. The user specifies a bundle URI with the `--bundle-uri` command-line
+   option _or_ the client discovers a bundle list advertised by the
+   Git server.
+
+2. If the downloaded data from a bundle URI is a bundle, then the client
+   inspects the bundle headers to check that the prerequisite commit OIDs
+   are present in the client repository. If some are missing, then the
+   client delays unbundling until other bundles have been unbundled,
+   making those OIDs present. When all required OIDs are present, the
+   client unbundles that data using a refspec. The default refspec is
+   `+refs/heads/*:refs/bundles/*`, but this can be configured. These refs
+   are stored so that later `git fetch` negotiations can communicate the
+   bundled refs as `have`s, reducing the size of the fetch over the Git
+   protocol. To allow pruning refs from this ref namespace, Git may
+   introduce a numbered namespace (such as `refs/bundles/<i>/*`) such that
+   stale bundle refs can be deleted.
+
+3. If the file is instead a bundle list, then the client inspects the
+   `bundle.mode` to see if the list is of the `all` or `any` form.
+
+   a. If `bundle.mode=all`, then the client considers all bundle
+      URIs. The list is reduced based on the `bundle.<id>.filter` options
+      matching the client repository's partial clone filter. Then, all
+      bundle URIs are requested. If the `bundle.<id>.creationToken`
+      heuristic is provided, then the bundles are downloaded in decreasing
+      order by the creation token, stopping when a bundle has all required
+      OIDs. The bundles can then be unbundled in increasing creation token
+      order. The client stores the latest creation token as a heuristic
+      for avoiding future downloads if the bundle list does not advertise
+      bundles with larger creation tokens.
+
+   b. If `bundle.mode=any`, then the client can choose any one of the
+      bundle URIs to inspect. The client can use a variety of ways to
+      choose among these URIs. The client can also fallback to another URI
+      if the initial choice fails to return a result.
+
+Note that during a clone we expect that all bundles will be required, and
+heuristics such as `bundle.<uri>.creationToken` can be used to download
+bundles in chronological order or in parallel.
+
+If a given bundle URI is a bundle list with a `bundle.heuristic`
+value, then the client can choose to store that URI as its chosen bundle
+URI. The client can then navigate directly to that URI during later `git
+fetch` calls.
+
+When downloading bundle URIs, the client can choose to inspect the initial
+content before committing to downloading the entire content. This may
+provide enough information to determine if the URI is a bundle list or
+a bundle. In the case of a bundle, the client may inspect the bundle
+header to determine that all advertised tips are already in the client
+repository and cancel the remaining download.
+
+Fetching with Bundle URIs
+-------------------------
+
+When the client fetches new data, it can decide to fetch from bundle
+servers before fetching from the origin remote. This could be done via a
+command-line option, but it is more likely useful to use a config value
+such as the one specified during the clone.
+
+The fetch operation follows the same procedure to download bundles from a
+bundle list (although we do _not_ want to use parallel downloads here). We
+expect that the process will end when all prerequisite commit OIDs in a
+thin bundle are already in the object database.
+
+When using the `creationToken` heuristic, the client can avoid downloading
+any bundles if their creation tokenss are not larger than the stored
+creation token. After fetching new bundles, Git updates this local
+creation token.
+
+If the bundle provider does not provide a heuristic, then the client
+should attempt to inspect the bundle headers before downloading the full
+bundle data in case the bundle tips already exist in the client
+repository.
+
+Error Conditions
+----------------
+
+If the Git client discovers something unexpected while downloading
+information according to a bundle URI or the bundle list found at that
+location, then Git can ignore that data and continue as if it was not
+given a bundle URI. The remote Git server is the ultimate source of truth,
+not the bundle URI.
+
+Here are a few example error conditions:
+
+* The client fails to connect with a server at the given URI or a connection
+  is lost without any chance to recover.
+
+* The client receives a response other than `200 OK` (such as `404 Not Found`,
+  `401 Not Authorized`, or `500 Internal Server Error`). The client should
+  use the `credential.helper` to attempt authentication after the first
+  `401 Not Authorized` response, but a second such response is a failure.
+
+* The client receives data that is not parsable as a bundle or bundle list.
+
+* The bundle list describes a directed cycle in the
+  `bundle.<id>.requires` links.
+
+* A bundle includes a filter that does not match expectations.
+
+* The client cannot unbundle the bundles because the prerequisite commit OIDs
+  are not in the object database and there are no more
+  `bundle.<id>.requires` links to follow.
+
+There are also situations that could be seen as wasteful, but are not
+error conditions:
+
+* The downloaded bundles contain more information than is requested by
+  the clone or fetch request. A primary example is if the user requests
+  a clone with `--single-branch` but downloads bundles that store every
+  reachable commit from all `refs/heads/*` references. This might be
+  initially wasteful, but perhaps these objects will become reachable by
+  a later ref update that the client cares about.
+
+* A bundle download during a `git fetch` contains objects already in the
+  object database. This is probably unavoidable if we are using bundles
+  for fetches, since the client will almost always be slightly ahead of
+  the bundle servers after performing its "catch-up" fetch to the remote
+  server. This extra work is most wasteful when the client is fetching
+  much more frequently than the server is computing bundles, such as if
+  the client is using hourly prefetches with background maintenance, but
+  the server is computing bundles weekly. For this reason, the client
+  should not use bundle URIs for fetch unless the server has explicitly
+  recommended it through the `bundle.flags = forFetch` value.
+
+Implementation Plan
+-------------------
+
+This design document is being submitted on its own as an aspirational
+document, with the goal of implementing all of the mentioned client
+features over the course of several patch series. Here is a potential
+outline for submitting these features:
+
+1. Integrate bundle URIs into `git clone` with a `--bundle-uri` option.
+   This will include a new `git fetch --bundle-uri` mode for use as the
+   implementation underneath `git clone`. The initial version here will
+   expect a single bundle at the given URI.
+
+2. Implement the ability to parse a bundle list from a bundle URI and
+   update the `git fetch --bundle-uri` logic to properly distinguish
+   between `bundle.mode` options. Specifically design the feature so
+   that the config format parsing feeds a list of key-value pairs into the
+   bundle list logic.
+
+3. Create the `bundle-uri` protocol v2 verb so Git servers can advertise
+   bundle URIs using the key-value pairs. Plug into the existing key-value
+   input to the bundle list logic. Allow `git clone` to discover these
+   bundle URIs and bootstrap the client repository from the bundle data.
+   (This choice is an opt-in via a config option and a command-line
+   option.)
+
+4. Allow the client to understand the `bundle.flag=forFetch` configuration
+   and the `bundle.<id>.creationToken` heuristic. When `git clone`
+   discovers a bundle URI with `bundle.flag=forFetch`, it configures the
+   client repository to check that bundle URI during later `git fetch <remote>`
+   commands.
+
+5. Allow clients to discover bundle URIs during `git fetch` and configure
+   a bundle URI for later fetches if `bundle.flag=forFetch`.
+
+6. Implement the "inspect headers" heuristic to reduce data downloads when
+   the `bundle.<id>.creationToken` heuristic is not available.
+
+As these features are reviewed, this plan might be updated. We also expect
+that new designs will be discovered and implemented as this feature
+matures and becomes used in real-world scenarios.
+
+Related Work: Packfile URIs
+---------------------------
+
+The Git protocol already has a capability where the Git server can list
+a set of URLs along with the packfile response when serving a client
+request. The client is then expected to download the packfiles at those
+locations in order to have a complete understanding of the response.
+
+This mechanism is used by the Gerrit server (implemented with JGit) and
+has been effective at reducing CPU load and improving user performance for
+clones.
+
+A major downside to this mechanism is that the origin server needs to know
+_exactly_ what is in those packfiles, and the packfiles need to be available
+to the user for some time after the server has responded. This coupling
+between the origin and the packfile data is difficult to manage.
+
+Further, this implementation is extremely hard to make work with fetches.
+
+Related Work: GVFS Cache Servers
+--------------------------------
+
+The GVFS Protocol [2] is a set of HTTP endpoints designed independently of
+the Git project before Git's partial clone was created. One feature of this
+protocol is the idea of a "cache server" which can be colocated with build
+machines or developer offices to transfer Git data without overloading the
+central server.
+
+The endpoint that VFS for Git is famous for is the `GET /gvfs/objects/{oid}`
+endpoint, which allows downloading an object on-demand. This is a critical
+piece of the filesystem virtualization of that product.
+
+However, a more subtle need is the `GET /gvfs/prefetch?lastPackTimestamp=<t>`
+endpoint. Given an optional timestamp, the cache server responds with a list
+of precomputed packfiles containing the commits and trees that were introduced
+in those time intervals.
+
+The cache server computes these "prefetch" packfiles using the following
+strategy:
+
+1. Every hour, an "hourly" pack is generated with a given timestamp.
+2. Nightly, the previous 24 hourly packs are rolled up into a "daily" pack.
+3. Nightly, all prefetch packs more than 30 days old are rolled up into
+   one pack.
+
+When a user runs `gvfs clone` or `scalar clone` against a repo with cache
+servers, the client requests all prefetch packfiles, which is at most
+`24 + 30 + 1` packfiles downloading only commits and trees. The client
+then follows with a request to the origin server for the references, and
+attempts to checkout that tip reference. (There is an extra endpoint that
+helps get all reachable trees from a given commit, in case that commit
+was not already in a prefetch packfile.)
+
+During a `git fetch`, a hook requests the prefetch endpoint using the
+most-recent timestamp from a previously-downloaded prefetch packfile.
+Only the list of packfiles with later timestamps are downloaded. Most
+users fetch hourly, so they get at most one hourly prefetch pack. Users
+whose machines have been off or otherwise have not fetched in over 30 days
+might redownload all prefetch packfiles. This is rare.
+
+It is important to note that the clients always contact the origin server
+for the refs advertisement, so the refs are frequently "ahead" of the
+prefetched pack data. The missing objects are downloaded on-demand using
+the `GET gvfs/objects/{oid}` requests, when needed by a command such as
+`git checkout` or `git log`. Some Git optimizations disable checks that
+would cause these on-demand downloads to be too aggressive.
+
+See Also
+--------
+
+[1] https://lore.kernel.org/git/RFC-cover-00.13-0000000000-20210805T150534Z-avarab@gmail.com/
+    An earlier RFC for a bundle URI feature.
+
+[2] https://github.com/microsoft/VFSForGit/blob/master/Protocol.md
+    The GVFS Protocol
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v2 2/6] remote-curl: add 'get' capability
  2022-06-29 20:40 ` [PATCH v2 0/6] bundle URIs: design doc and initial git fetch --bundle-uri implementation Derrick Stolee via GitGitGadget
  2022-06-29 20:40   ` [PATCH v2 1/6] docs: document bundle URI standard Derrick Stolee via GitGitGadget
@ 2022-06-29 20:40   ` Derrick Stolee via GitGitGadget
  2022-07-21 21:41     ` Josh Steadmon
  2022-06-29 20:40   ` [PATCH v2 3/6] bundle-uri: create basic file-copy logic Derrick Stolee via GitGitGadget
                     ` (5 subsequent siblings)
  7 siblings, 1 reply; 64+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-06-29 20:40 UTC (permalink / raw)
  To: git
  Cc: gitster, me, newren, avarab, dyroneteng, Johannes.Schindelin,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

A future change will want a way to download a file over HTTP(S) using
the simplest of download mechanisms. We do not want to assume that the
server on the other side understands anything about the Git protocol but
could be a simple static web server.

Create the new 'get' capability for the remote helpers which advertises
that the 'get' command is avalable. A caller can send a line containing
'get <url> <path>' to download the file at <url> into the file at
<path>.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 Documentation/gitremote-helpers.txt |  9 +++++++
 remote-curl.c                       | 33 +++++++++++++++++++++++++
 t/t5557-http-get.sh                 | 37 +++++++++++++++++++++++++++++
 transport-helper.c                  |  5 +++-
 4 files changed, 83 insertions(+), 1 deletion(-)
 create mode 100755 t/t5557-http-get.sh

diff --git a/Documentation/gitremote-helpers.txt b/Documentation/gitremote-helpers.txt
index 6f1e269ae43..ed8da428c98 100644
--- a/Documentation/gitremote-helpers.txt
+++ b/Documentation/gitremote-helpers.txt
@@ -168,6 +168,9 @@ Supported commands: 'list', 'import'.
 	Can guarantee that when a clone is requested, the received
 	pack is self contained and is connected.
 
+'get'::
+	Can use the 'get' command to download a file from a given URI.
+
 If a helper advertises 'connect', Git will use it if possible and
 fall back to another capability if the helper requests so when
 connecting (see the 'connect' command under COMMANDS).
@@ -418,6 +421,12 @@ Supported if the helper has the "connect" capability.
 +
 Supported if the helper has the "stateless-connect" capability.
 
+'get' <uri> <path>::
+	Downloads the file from the given `<uri>` to the given `<path>`. If
+	`<path>.temp` exists, then Git assumes that the `.temp` file is a
+	partial download from a previous attempt and will resume the
+	download from that position.
+
 If a fatal error occurs, the program writes the error message to
 stderr and exits. The caller should expect that a suitable error
 message has been printed if the child closes the connection without
diff --git a/remote-curl.c b/remote-curl.c
index 67f178b1120..f005419f872 100644
--- a/remote-curl.c
+++ b/remote-curl.c
@@ -1276,6 +1276,34 @@ static void parse_fetch(struct strbuf *buf)
 	strbuf_reset(buf);
 }
 
+static void parse_get(struct strbuf *buf)
+{
+	struct http_get_options opts = { 0 };
+	struct strbuf url = STRBUF_INIT;
+	struct strbuf path = STRBUF_INIT;
+	const char *p, *space;
+
+	if (!skip_prefix(buf->buf, "get ", &p))
+		die(_("http transport does not support %s"), buf->buf);
+
+	space = strchr(p, ' ');
+
+	if (!space)
+		die(_("protocol error: expected '<url> <path>', missing space"));
+
+	strbuf_add(&url, p, space - p);
+	strbuf_addstr(&path, space + 1);
+
+	if (http_get_file(url.buf, path.buf, &opts))
+		die(_("failed to download file at URL '%s'"), url.buf);
+
+	strbuf_release(&url);
+	strbuf_release(&path);
+	printf("\n");
+	fflush(stdout);
+	strbuf_reset(buf);
+}
+
 static int push_dav(int nr_spec, const char **specs)
 {
 	struct child_process child = CHILD_PROCESS_INIT;
@@ -1549,9 +1577,14 @@ int cmd_main(int argc, const char **argv)
 				printf("unsupported\n");
 			fflush(stdout);
 
+		} else if (skip_prefix(buf.buf, "get ", &arg)) {
+			parse_get(&buf);
+			fflush(stdout);
+
 		} else if (!strcmp(buf.buf, "capabilities")) {
 			printf("stateless-connect\n");
 			printf("fetch\n");
+			printf("get\n");
 			printf("option\n");
 			printf("push\n");
 			printf("check-connectivity\n");
diff --git a/t/t5557-http-get.sh b/t/t5557-http-get.sh
new file mode 100755
index 00000000000..1fd4ded3eb1
--- /dev/null
+++ b/t/t5557-http-get.sh
@@ -0,0 +1,37 @@
+#!/bin/sh
+
+test_description='test downloading a file by URL'
+
+. ./test-lib.sh
+
+. "$TEST_DIRECTORY"/lib-httpd.sh
+start_httpd
+
+test_expect_success 'get by URL: 404' '
+	url="$HTTPD_URL/none.txt" &&
+	cat >input <<-EOF &&
+	capabilities
+	get $url file1
+	EOF
+
+	test_must_fail git remote-http $url $url <input 2>err &&
+	test_path_is_missing file1 &&
+	grep "failed to download file at URL" err &&
+	rm file1.temp
+'
+
+test_expect_success 'get by URL: 200' '
+	echo data >"$HTTPD_DOCUMENT_ROOT_PATH/exists.txt" &&
+
+	url="$HTTPD_URL/exists.txt" &&
+	cat >input <<-EOF &&
+	capabilities
+	get $url file2
+
+	EOF
+
+	GIT_TRACE2_PERF=1 git remote-http $url $url <input &&
+	test_cmp "$HTTPD_DOCUMENT_ROOT_PATH/exists.txt" file2
+'
+
+test_done
diff --git a/transport-helper.c b/transport-helper.c
index b4dbbabb0c2..dfbeaebe40c 100644
--- a/transport-helper.c
+++ b/transport-helper.c
@@ -33,7 +33,8 @@ struct helper_data {
 		check_connectivity : 1,
 		no_disconnect_req : 1,
 		no_private_update : 1,
-		object_format : 1;
+		object_format : 1,
+		get : 1;
 
 	/*
 	 * As an optimization, the transport code may invoke fetch before
@@ -210,6 +211,8 @@ static struct child_process *get_helper(struct transport *transport)
 			data->no_private_update = 1;
 		} else if (starts_with(capname, "object-format")) {
 			data->object_format = 1;
+		} else if (!strcmp(capname, "get")) {
+			data->get = 1;
 		} else if (mandatory) {
 			die(_("unknown mandatory capability %s; this remote "
 			      "helper probably needs newer version of Git"),
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v2 3/6] bundle-uri: create basic file-copy logic
  2022-06-29 20:40 ` [PATCH v2 0/6] bundle URIs: design doc and initial git fetch --bundle-uri implementation Derrick Stolee via GitGitGadget
  2022-06-29 20:40   ` [PATCH v2 1/6] docs: document bundle URI standard Derrick Stolee via GitGitGadget
  2022-06-29 20:40   ` [PATCH v2 2/6] remote-curl: add 'get' capability Derrick Stolee via GitGitGadget
@ 2022-06-29 20:40   ` Derrick Stolee via GitGitGadget
  2022-07-21 21:45     ` Josh Steadmon
  2022-06-29 20:40   ` [PATCH v2 4/6] fetch: add --bundle-uri option Derrick Stolee via GitGitGadget
                     ` (4 subsequent siblings)
  7 siblings, 1 reply; 64+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-06-29 20:40 UTC (permalink / raw)
  To: git
  Cc: gitster, me, newren, avarab, dyroneteng, Johannes.Schindelin,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

Before implementing a way to fetch bundles into a repository, create the
basic logic. Assume that the URI is actually a file path. Future logic
will make this more careful to other protocols.

For now, we also only succeed if the content at the URI is a bundle
file, not a bundle list. Bundle lists will be implemented in a future
change.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 Makefile     |  1 +
 bundle-uri.c | 93 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 bundle-uri.h | 14 ++++++++
 3 files changed, 108 insertions(+)
 create mode 100644 bundle-uri.c
 create mode 100644 bundle-uri.h

diff --git a/Makefile b/Makefile
index f8bccfab5e9..8f27310836d 100644
--- a/Makefile
+++ b/Makefile
@@ -887,6 +887,7 @@ LIB_OBJS += blob.o
 LIB_OBJS += bloom.o
 LIB_OBJS += branch.o
 LIB_OBJS += bulk-checkin.o
+LIB_OBJS += bundle-uri.o
 LIB_OBJS += bundle.o
 LIB_OBJS += cache-tree.o
 LIB_OBJS += cbtree.o
diff --git a/bundle-uri.c b/bundle-uri.c
new file mode 100644
index 00000000000..f0abbf434c8
--- /dev/null
+++ b/bundle-uri.c
@@ -0,0 +1,93 @@
+#include "cache.h"
+#include "bundle-uri.h"
+#include "bundle.h"
+#include "object-store.h"
+#include "refs.h"
+#include "run-command.h"
+
+static void find_temp_filename(struct strbuf *name)
+{
+	int fd;
+	/*
+	 * Find a temporary filename that is available. This is briefly
+	 * racy, but unlikely to collide.
+	 */
+	fd = odb_mkstemp(name, "bundles/tmp_uri_XXXXXX");
+	if (fd < 0)
+		die(_("failed to create temporary file"));
+	close(fd);
+	unlink(name->buf);
+}
+
+static int copy_uri_to_file(const char *uri, const char *file)
+{
+	/* Copy as a file */
+	return copy_file(file, uri, 0444);
+}
+
+static int unbundle_from_file(struct repository *r, const char *file)
+{
+	int result = 0;
+	int bundle_fd;
+	struct bundle_header header = BUNDLE_HEADER_INIT;
+	struct strvec extra_index_pack_args = STRVEC_INIT;
+	struct string_list_item *refname;
+	struct strbuf bundle_ref = STRBUF_INIT;
+	size_t bundle_prefix_len;
+
+	if ((bundle_fd = read_bundle_header(file, &header)) < 0)
+		return 1;
+
+	if ((result = unbundle(r, &header, bundle_fd, &extra_index_pack_args)))
+		return 1;
+
+	/*
+	 * Convert all refs/heads/ from the bundle into refs/bundles/
+	 * in the local repository.
+	 */
+	strbuf_addstr(&bundle_ref, "refs/bundles/");
+	bundle_prefix_len = bundle_ref.len;
+
+	for_each_string_list_item(refname, &header.references) {
+		struct object_id *oid = refname->util;
+		struct object_id old_oid;
+		const char *branch_name;
+		int has_old;
+
+		if (!skip_prefix(refname->string, "refs/heads/", &branch_name))
+			continue;
+
+		strbuf_setlen(&bundle_ref, bundle_prefix_len);
+		strbuf_addstr(&bundle_ref, branch_name);
+
+		has_old = !read_ref(bundle_ref.buf, &old_oid);
+		update_ref("fetched bundle", bundle_ref.buf, oid,
+			   has_old ? &old_oid : NULL,
+			   REF_SKIP_OID_VERIFICATION,
+			   UPDATE_REFS_MSG_ON_ERR);
+	}
+
+	bundle_header_release(&header);
+	return result;
+}
+
+int fetch_bundle_uri(struct repository *r, const char *uri)
+{
+	int result = 0;
+	struct strbuf filename = STRBUF_INIT;
+
+	find_temp_filename(&filename);
+	if ((result = copy_uri_to_file(uri, filename.buf)))
+		goto cleanup;
+
+	if ((result = !is_bundle(filename.buf, 0)))
+		goto cleanup;
+
+	if ((result = unbundle_from_file(r, filename.buf)))
+		goto cleanup;
+
+cleanup:
+	unlink(filename.buf);
+	strbuf_release(&filename);
+	return result;
+}
diff --git a/bundle-uri.h b/bundle-uri.h
new file mode 100644
index 00000000000..8a152f1ef14
--- /dev/null
+++ b/bundle-uri.h
@@ -0,0 +1,14 @@
+#ifndef BUNDLE_URI_H
+#define BUNDLE_URI_H
+
+struct repository;
+
+/**
+ * Fetch data from the given 'uri' and unbundle the bundle data found
+ * based on that information.
+ *
+ * Returns non-zero if no bundle information is found at the given 'uri'.
+ */
+int fetch_bundle_uri(struct repository *r, const char *uri);
+
+#endif
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v2 4/6] fetch: add --bundle-uri option
  2022-06-29 20:40 ` [PATCH v2 0/6] bundle URIs: design doc and initial git fetch --bundle-uri implementation Derrick Stolee via GitGitGadget
                     ` (2 preceding siblings ...)
  2022-06-29 20:40   ` [PATCH v2 3/6] bundle-uri: create basic file-copy logic Derrick Stolee via GitGitGadget
@ 2022-06-29 20:40   ` Derrick Stolee via GitGitGadget
  2022-06-29 20:40   ` [PATCH v2 5/6] bundle-uri: add support for http(s):// and file:// Derrick Stolee via GitGitGadget
                     ` (3 subsequent siblings)
  7 siblings, 0 replies; 64+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-06-29 20:40 UTC (permalink / raw)
  To: git
  Cc: gitster, me, newren, avarab, dyroneteng, Johannes.Schindelin,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

Teach 'git fetch' a new --bundle-uri=<uri> option which changes the mode
from fetching from a remote using the Git protocol to fetching a bundle
from the given <uri>. See Documentation/technical/bundle-uri.txt for
more information on the design of this feature.

This implementation is limited to the most basic version of the feature.
We expect the content at that URI to be a bundle file, not a bundle
list. Bundle lists will be implemented later.

This implementation is sufficient for a bundle provider to create a
single bootstrap bundle for a large repository. The user would bootstrap
a repository using a sequence of Git commands, such as:

 1. git init <repo> && cd <repo>
 2. git fetch --bundle-uri=<uri>
 3. git remote add origin <url>
 4. git fetch origin
 5. git checkout FETCH_HEAD

Since the unbundle_from_file() in bundle-uri.c stores the bundled refs
into the refs/bundle/* namespace, those references are useful for the
fetch negotiation in step 4, reducing the size of the download in that
step.

Later changes will make this seamless within a 'git clone' command, but
this implementation is large enough to delay that integration.

Currently, this option supports local filenames. Other protocols will be
added in the future.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 Documentation/fetch-options.txt |  5 +++++
 Documentation/git-fetch.txt     |  1 +
 builtin/fetch.c                 | 10 ++++++++++
 bundle-uri.c                    |  8 ++++++--
 t/t5558-fetch-bundle-uri.sh     | 34 +++++++++++++++++++++++++++++++++
 5 files changed, 56 insertions(+), 2 deletions(-)
 create mode 100755 t/t5558-fetch-bundle-uri.sh

diff --git a/Documentation/fetch-options.txt b/Documentation/fetch-options.txt
index 622bd84768b..09bd1feeed8 100644
--- a/Documentation/fetch-options.txt
+++ b/Documentation/fetch-options.txt
@@ -317,3 +317,8 @@ endif::git-pull[]
 -6::
 --ipv6::
 	Use IPv6 addresses only, ignoring IPv4 addresses.
+
+--bundle-uri=<uri>::
+	Instead of fetching from a remote, fetch a bundle from the given
+	`<uri>` and unbundle the data into the local repository. The refs
+	in the bundle will be stored under the `refs/bundle/*` namespace.
diff --git a/Documentation/git-fetch.txt b/Documentation/git-fetch.txt
index e9d364669af..4fd8911b336 100644
--- a/Documentation/git-fetch.txt
+++ b/Documentation/git-fetch.txt
@@ -13,6 +13,7 @@ SYNOPSIS
 'git fetch' [<options>] <group>
 'git fetch' --multiple [<options>] [(<repository> | <group>)...]
 'git fetch' --all [<options>]
+'git fetch' --bundle-uri=<uri> [<options>]
 
 
 DESCRIPTION
diff --git a/builtin/fetch.c b/builtin/fetch.c
index e3791f09ed5..cb0d2fbe82c 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -29,6 +29,7 @@
 #include "commit-graph.h"
 #include "shallow.h"
 #include "worktree.h"
+#include "bundle-uri.h"
 
 #define FORCED_UPDATES_DELAY_WARNING_IN_MS (10 * 1000)
 
@@ -37,6 +38,7 @@ static const char * const builtin_fetch_usage[] = {
 	N_("git fetch [<options>] <group>"),
 	N_("git fetch --multiple [<options>] [(<repository> | <group>)...]"),
 	N_("git fetch --all [<options>]"),
+	N_("git fetch --bundle-uri=<uri> [<options>]"),
 	NULL
 };
 
@@ -86,6 +88,7 @@ static struct string_list negotiation_tip = STRING_LIST_INIT_NODUP;
 static int fetch_write_commit_graph = -1;
 static int stdin_refspecs = 0;
 static int negotiate_only;
+static const char *bundle_uri;
 
 static int git_fetch_config(const char *k, const char *v, void *cb)
 {
@@ -224,6 +227,8 @@ static struct option builtin_fetch_options[] = {
 		 N_("write the commit-graph after fetching")),
 	OPT_BOOL(0, "stdin", &stdin_refspecs,
 		 N_("accept refspecs from stdin")),
+	OPT_STRING(0, "bundle-uri", &bundle_uri, N_("uri"),
+		   N_("download bundle data from the given URI instead of from a remote")),
 	OPT_END()
 };
 
@@ -2181,6 +2186,11 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
 	if (dry_run)
 		write_fetch_head = 0;
 
+	if (bundle_uri) {
+		result = fetch_bundle_uri(the_repository, bundle_uri);
+		goto cleanup;
+	}
+
 	if (all) {
 		if (argc == 1)
 			die(_("fetch --all does not take a repository argument"));
diff --git a/bundle-uri.c b/bundle-uri.c
index f0abbf434c8..f80a3b14103 100644
--- a/bundle-uri.c
+++ b/bundle-uri.c
@@ -77,11 +77,15 @@ int fetch_bundle_uri(struct repository *r, const char *uri)
 	struct strbuf filename = STRBUF_INIT;
 
 	find_temp_filename(&filename);
-	if ((result = copy_uri_to_file(uri, filename.buf)))
+	if ((result = copy_uri_to_file(uri, filename.buf))) {
+		error(_("failed to download bundle from URI '%s'"), uri);
 		goto cleanup;
+	}
 
-	if ((result = !is_bundle(filename.buf, 0)))
+	if ((result = !is_bundle(filename.buf, 0))) {
+		error(_("file at URI '%s' is not a bundle"), uri);
 		goto cleanup;
+	}
 
 	if ((result = unbundle_from_file(r, filename.buf)))
 		goto cleanup;
diff --git a/t/t5558-fetch-bundle-uri.sh b/t/t5558-fetch-bundle-uri.sh
new file mode 100755
index 00000000000..381e56cac20
--- /dev/null
+++ b/t/t5558-fetch-bundle-uri.sh
@@ -0,0 +1,34 @@
+#!/bin/sh
+
+test_description='test fetching bundles with --bundle-uri'
+
+. ./test-lib.sh
+
+test_expect_success 'fail to fetch from non-existent file' '
+	test_must_fail git fetch --bundle-uri="$(pwd)/does-not-exist" 2>err &&
+	grep "failed to download bundle from URI" err
+'
+
+test_expect_success 'fail to fetch from non-bundle file' '
+	echo bogus >bogus &&
+	test_must_fail git fetch --bundle-uri="$(pwd)/bogus" 2>err &&
+	grep "is not a bundle" err
+'
+
+test_expect_success 'create bundle' '
+	git init fetch-from &&
+	git -C fetch-from checkout -b topic &&
+	test_commit -C fetch-from A &&
+	test_commit -C fetch-from B &&
+	git -C fetch-from bundle create B.bundle topic
+'
+
+test_expect_success 'fetch file bundle' '
+	git init fetch-to &&
+	git -C fetch-to fetch --bundle-uri="$(pwd)/fetch-from/B.bundle" &&
+	git -C fetch-to rev-parse refs/bundles/topic >actual &&
+	git -C fetch-from rev-parse topic >expect &&
+	test_cmp expect actual
+'
+
+test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v2 5/6] bundle-uri: add support for http(s):// and file://
  2022-06-29 20:40 ` [PATCH v2 0/6] bundle URIs: design doc and initial git fetch --bundle-uri implementation Derrick Stolee via GitGitGadget
                     ` (3 preceding siblings ...)
  2022-06-29 20:40   ` [PATCH v2 4/6] fetch: add --bundle-uri option Derrick Stolee via GitGitGadget
@ 2022-06-29 20:40   ` Derrick Stolee via GitGitGadget
  2022-06-29 20:40   ` [PATCH v2 6/6] fetch: add 'refs/bundle/' to log.excludeDecoration Derrick Stolee via GitGitGadget
                     ` (2 subsequent siblings)
  7 siblings, 0 replies; 64+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-06-29 20:40 UTC (permalink / raw)
  To: git
  Cc: gitster, me, newren, avarab, dyroneteng, Johannes.Schindelin,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The previous change created the 'git fetch --bundle-uri=<uri>' option.
Currently, <uri> must be a filename.

Update copy_uri_to_file() to first inspect the URI for an HTTP(S) prefix
and use git-remote-https as the way to download the data at that URI.
Otherwise, check to see if file:// is present and modify the prefix
accordingly.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 bundle-uri.c                | 65 ++++++++++++++++++++++++++++++++++++-
 t/t5558-fetch-bundle-uri.sh | 37 +++++++++++++++++++++
 2 files changed, 101 insertions(+), 1 deletion(-)

diff --git a/bundle-uri.c b/bundle-uri.c
index f80a3b14103..4c793843a2a 100644
--- a/bundle-uri.c
+++ b/bundle-uri.c
@@ -19,10 +19,73 @@ static void find_temp_filename(struct strbuf *name)
 	unlink(name->buf);
 }
 
+static int download_https_uri_to_file(const char *uri, const char *file)
+{
+	int result = 0;
+	struct child_process cp = CHILD_PROCESS_INIT;
+	FILE *child_in = NULL, *child_out = NULL;
+	struct strbuf line = STRBUF_INIT;
+	int found_get = 0;
+
+	strvec_pushl(&cp.args, "git-remote-https", "origin", uri, NULL);
+	cp.in = -1;
+	cp.out = -1;
+
+	if (start_command(&cp))
+		return 1;
+
+	child_in = fdopen(cp.in, "w");
+	if (!child_in) {
+		result = 1;
+		goto cleanup;
+	}
+
+	child_out = fdopen(cp.out, "r");
+	if (!child_out) {
+		result = 1;
+		goto cleanup;
+	}
+
+	fprintf(child_in, "capabilities\n");
+	fflush(child_in);
+
+	while (!strbuf_getline(&line, child_out)) {
+		if (!line.len)
+			break;
+		if (!strcmp(line.buf, "get"))
+			found_get = 1;
+	}
+	strbuf_release(&line);
+
+	if (!found_get) {
+		result = error(_("insufficient capabilities"));
+		goto cleanup;
+	}
+
+	fprintf(child_in, "get %s %s\n\n", uri, file);
+
+cleanup:
+	if (child_in)
+		fclose(child_in);
+	if (finish_command(&cp))
+		return 1;
+	if (child_out)
+		fclose(child_out);
+	return result;
+}
+
 static int copy_uri_to_file(const char *uri, const char *file)
 {
+	const char *out;
+	if (skip_prefix(uri, "https:", &out) ||
+	    skip_prefix(uri, "http:", &out))
+		return download_https_uri_to_file(uri, file);
+
+	if (!skip_prefix(uri, "file://", &out))
+		out = uri;
+
 	/* Copy as a file */
-	return copy_file(file, uri, 0444);
+	return !!copy_file(file, out, 0);
 }
 
 static int unbundle_from_file(struct repository *r, const char *file)
diff --git a/t/t5558-fetch-bundle-uri.sh b/t/t5558-fetch-bundle-uri.sh
index 381e56cac20..919db6f4551 100755
--- a/t/t5558-fetch-bundle-uri.sh
+++ b/t/t5558-fetch-bundle-uri.sh
@@ -31,4 +31,41 @@ test_expect_success 'fetch file bundle' '
 	test_cmp expect actual
 '
 
+test_expect_success 'fetch file:// bundle' '
+	git init fetch-file &&
+	git -C fetch-file fetch --bundle-uri="file://$(pwd)/fetch-from/B.bundle" &&
+	git -C fetch-file rev-parse refs/bundles/topic >actual &&
+	git -C fetch-from rev-parse topic >expect &&
+	test_cmp expect actual
+'
+
+#########################################################################
+# HTTP tests begin here
+
+. "$TEST_DIRECTORY"/lib-httpd.sh
+start_httpd
+
+test_expect_success 'fail to fetch from non-existent HTTP URL' '
+	test_must_fail git fetch --bundle-uri="$HTTPD_URL/does-not-exist" 2>err &&
+	grep "failed to download bundle from URI" err
+'
+
+test_expect_success 'fail to fetch from non-bundle HTTP URL' '
+	echo bogus >"$HTTPD_DOCUMENT_ROOT_PATH/bogus" &&
+	test_must_fail git fetch --bundle-uri="$HTTPD_URL/bogus" 2>err &&
+	grep "is not a bundle" err
+'
+
+test_expect_success 'fetch HTTP bundle' '
+	cp fetch-from/B.bundle "$HTTPD_DOCUMENT_ROOT_PATH/B.bundle" &&
+	git init fetch-http &&
+	git -C fetch-http fetch --bundle-uri="$HTTPD_URL/B.bundle" &&
+	git -C fetch-http rev-parse refs/bundles/topic >actual &&
+	git -C fetch-from rev-parse topic >expect &&
+	test_cmp expect actual
+'
+
+# Do not add tests here unless they use the HTTP server, as they will
+# not run unless the HTTP dependencies exist.
+
 test_done
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v2 6/6] fetch: add 'refs/bundle/' to log.excludeDecoration
  2022-06-29 20:40 ` [PATCH v2 0/6] bundle URIs: design doc and initial git fetch --bundle-uri implementation Derrick Stolee via GitGitGadget
                     ` (4 preceding siblings ...)
  2022-06-29 20:40   ` [PATCH v2 5/6] bundle-uri: add support for http(s):// and file:// Derrick Stolee via GitGitGadget
@ 2022-06-29 20:40   ` Derrick Stolee via GitGitGadget
  2022-07-21 21:47     ` Josh Steadmon
  2022-07-21 21:48   ` [PATCH v2 0/6] bundle URIs: design doc and initial git fetch --bundle-uri implementation Josh Steadmon
  2022-07-25 13:53   ` [PATCH v3 0/2] " Derrick Stolee via GitGitGadget
  7 siblings, 1 reply; 64+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-06-29 20:40 UTC (permalink / raw)
  To: git
  Cc: gitster, me, newren, avarab, dyroneteng, Johannes.Schindelin,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

When fetching from a bundle URI, the branches of that bundle are stored
in a different ref namespace: refs/bundles/. This namespace is intended
to assist with later 'git fetch' negotiations with a Git server,
allowing the client to advertise which data it already has from a bundle
URI.

These references can be confusing for a user when they appear as a
decoration in 'git log' output. Add "refs/bundles/" to the multi-valued
log.excludeDecoration config value. This is similar to the way
"refs/prefetch/" is hidden by background prefetch operations in 'git
maintenance' as added by 96eaffebb (maintenance: set
log.excludeDecoration durin prefetch, 2021-01-19).

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 Documentation/fetch-options.txt |  3 ++-
 bundle-uri.c                    |  7 +++++++
 t/t5558-fetch-bundle-uri.sh     | 12 +++++++++---
 3 files changed, 18 insertions(+), 4 deletions(-)

diff --git a/Documentation/fetch-options.txt b/Documentation/fetch-options.txt
index 09bd1feeed8..8b801bcc2f3 100644
--- a/Documentation/fetch-options.txt
+++ b/Documentation/fetch-options.txt
@@ -321,4 +321,5 @@ endif::git-pull[]
 --bundle-uri=<uri>::
 	Instead of fetching from a remote, fetch a bundle from the given
 	`<uri>` and unbundle the data into the local repository. The refs
-	in the bundle will be stored under the `refs/bundle/*` namespace.
+	in the bundle will be stored under the hidden `refs/bundle/*`
+	namespace.
diff --git a/bundle-uri.c b/bundle-uri.c
index 4c793843a2a..6e0f1cb06fd 100644
--- a/bundle-uri.c
+++ b/bundle-uri.c
@@ -1,6 +1,7 @@
 #include "cache.h"
 #include "bundle-uri.h"
 #include "bundle.h"
+#include "config.h"
 #include "object-store.h"
 #include "refs.h"
 #include "run-command.h"
@@ -153,6 +154,12 @@ int fetch_bundle_uri(struct repository *r, const char *uri)
 	if ((result = unbundle_from_file(r, filename.buf)))
 		goto cleanup;
 
+	git_config_set_multivar_gently("log.excludedecoration",
+					"refs/bundle/",
+					"refs/bundle/",
+					CONFIG_FLAGS_FIXED_VALUE |
+					CONFIG_FLAGS_MULTI_REPLACE);
+
 cleanup:
 	unlink(filename.buf);
 	strbuf_release(&filename);
diff --git a/t/t5558-fetch-bundle-uri.sh b/t/t5558-fetch-bundle-uri.sh
index 919db6f4551..bed4cc3e99c 100755
--- a/t/t5558-fetch-bundle-uri.sh
+++ b/t/t5558-fetch-bundle-uri.sh
@@ -28,7 +28,9 @@ test_expect_success 'fetch file bundle' '
 	git -C fetch-to fetch --bundle-uri="$(pwd)/fetch-from/B.bundle" &&
 	git -C fetch-to rev-parse refs/bundles/topic >actual &&
 	git -C fetch-from rev-parse topic >expect &&
-	test_cmp expect actual
+	test_cmp expect actual &&
+
+	test_config -C fetch-to log.excludedecoration refs/bundle/
 '
 
 test_expect_success 'fetch file:// bundle' '
@@ -36,7 +38,9 @@ test_expect_success 'fetch file:// bundle' '
 	git -C fetch-file fetch --bundle-uri="file://$(pwd)/fetch-from/B.bundle" &&
 	git -C fetch-file rev-parse refs/bundles/topic >actual &&
 	git -C fetch-from rev-parse topic >expect &&
-	test_cmp expect actual
+	test_cmp expect actual &&
+
+	test_config -C fetch-file log.excludedecoration refs/bundle/
 '
 
 #########################################################################
@@ -62,7 +66,9 @@ test_expect_success 'fetch HTTP bundle' '
 	git -C fetch-http fetch --bundle-uri="$HTTPD_URL/B.bundle" &&
 	git -C fetch-http rev-parse refs/bundles/topic >actual &&
 	git -C fetch-from rev-parse topic >expect &&
-	test_cmp expect actual
+	test_cmp expect actual &&
+
+	test_config log.excludedecoration refs/bundle/
 '
 
 # Do not add tests here unless they use the HTTP server, as they will
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 1/6] docs: document bundle URI standard
  2022-06-29 20:40   ` [PATCH v2 1/6] docs: document bundle URI standard Derrick Stolee via GitGitGadget
@ 2022-07-18  9:20     ` SZEDER Gábor
  2022-07-21 12:09     ` Matthew John Cheetham
  2022-07-21 21:39     ` Josh Steadmon
  2 siblings, 0 replies; 64+ messages in thread
From: SZEDER Gábor @ 2022-07-18  9:20 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, gitster, me, newren, avarab, dyroneteng, Johannes.Schindelin,
	Derrick Stolee

On Wed, Jun 29, 2022 at 08:40:40PM +0000, Derrick Stolee via GitGitGadget wrote:
> diff --git a/Documentation/technical/bundle-uri.txt b/Documentation/technical/bundle-uri.txt
> new file mode 100644
> index 00000000000..a0230a902a4
> --- /dev/null
> +++ b/Documentation/technical/bundle-uri.txt

> +Implementation Plan
> +-------------------
> +
> +This design document is being submitted on its own as an aspirational
> +document, with the goal of implementing all of the mentioned client
> +features over the course of several patch series. Here is a potential
> +outline for submitting these features:
> +
> +1. Integrate bundle URIs into `git clone` with a `--bundle-uri` option.
> +   This will include a new `git fetch --bundle-uri` mode for use as the
> +   implementation underneath `git clone`. The initial version here will
> +   expect a single bundle at the given URI.
> +
> +2. Implement the ability to parse a bundle list from a bundle URI and
> +   update the `git fetch --bundle-uri` logic to properly distinguish
> +   between `bundle.mode` options. Specifically design the feature so
> +   that the config format parsing feeds a list of key-value pairs into the
> +   bundle list logic.
> +
> +3. Create the `bundle-uri` protocol v2 verb so Git servers can advertise

s/verb/command/ or s/verb/request/

'bundle-uri' is not a verb, but even if it were, the protocol v2
documentation talks only about commands and requests, but never
mentions a single "verb".

> +   bundle URIs using the key-value pairs. Plug into the existing key-value
> +   input to the bundle list logic. Allow `git clone` to discover these
> +   bundle URIs and bootstrap the client repository from the bundle data.
> +   (This choice is an opt-in via a config option and a command-line
> +   option.)

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 1/6] docs: document bundle URI standard
  2022-06-29 20:40   ` [PATCH v2 1/6] docs: document bundle URI standard Derrick Stolee via GitGitGadget
  2022-07-18  9:20     ` SZEDER Gábor
@ 2022-07-21 12:09     ` Matthew John Cheetham
  2022-07-22 13:52       ` Derrick Stolee
  2022-07-22 16:03       ` Derrick Stolee
  2022-07-21 21:39     ` Josh Steadmon
  2 siblings, 2 replies; 64+ messages in thread
From: Matthew John Cheetham @ 2022-07-21 12:09 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: gitster, me, newren, avarab, dyroneteng, Johannes.Schindelin,
	Derrick Stolee, git

I had a few questions and suggestions; below.

On 2022-06-29 21:40, Derrick Stolee via GitGitGadget wrote:> +Assuming a 
`200 OK` response from the server, the content at the URL is
> +inspected. First, Git attempts to parse the file as a bundle file of
> +version 2 or higher. If the file is not a bundle, then the file is parsed
> +as a plain-text file using Git's config parser. The key-value pairs in
> +that config file are expected to describe a list of bundle URIs. If
> +neither of these parse attempts succeed, then Git will report an error to
> +the user that the bundle URI provided erroneous data.
> +
> +Any other data provided by the server is considered erroneous.

I wonder if it may be worth considering adding an optional server 
requirement ("MAY" not "MUST") to provide a `Content-Type` header 
indicating if the response is a bundle list (or bundle) to skip straight 
to parsing the correct type of file? Eg "application/x-git-bundle-list"?

Even the simplest of content servers should be able to set Content-Type. 
If not falling back to 'try parse bundle else try parse bundle list' is 
still OK.

> +bundle.mode::
> +	(Required) This value has one of two values: `all` and `any`. When `all`
> +	is specified, then the client should expect to need all of the listed
> +	bundle URIs that match their repository's requirements. When `any` is
> +	specified, then the client should expect that any one of the bundle URIs
> +	that match their repository's requirements will suffice. Typically, the
> +	`any` option is used to list a number of different bundle servers
> +	located in different geographies.

Do you forsee any future where we'd want or need to specify 'sets' of 
bundles where "all" _of_ "any" particular set would be required?
Eg. there are 3 sets of bundles (A, B, C), and the client would need to 
download all bundles belonging to any of A, B, or C? Where ABC would be 
different geo-distributed sets?

I guess what I'm getting at here is with this design (which I appreciate 
is intentionally flexible), there are several different ways a server 
could direct a client to bundles stored in a nearby geography:

1. Serve an "all" bundle list with geo-located bundle URIs?
2. Serve a global "any" bundle list with each single bundle in a 
different geography (specified by `location`)
3. Serve a single bundle (not a list) with a different

Are any of these going to be preferred over another for potential client 
optimisations?

> +
> +bundle.heuristic::
> +	If this string-valued key exists, then the bundle list is designed to
> +	work well with incremental `git fetch` commands. The heuristic signals
> +	that there are additional keys available for each bundle that help
> +	determine which subset of bundles the client should download.
> +
> +The remaining keys include an `<id>` segment which is a server-designated
> +name for each available bundle.

Case-sensitive ID? A-Za-z0-9 only? "Same as Git config rules"?

> +bundle.<id>.location::
> +	This string value advertises a real-world location from where the bundle
> +	URI is served. This can be used to present the user with an option for
> +	which bundle URI to use or simply as an informative indicator of which
> +	bundle URI was selected by Git. This is only valuable when
> +	`bundle.mode` is `any`.

I assume `location` is just an opaque string that is just used for info 
or display purposes? Does it make sense for other 'display' type strings 
like 'name' or 'message'?

> +Here is an example bundle list using the Git config format:
> +
> +```
> +[bundle]
> +	version = 1
> +	mode = all
> +	heuristic = creationToken
> +
> +[bundle "2022-02-09-1644442601-daily"]
> +	uri = https://bundles.example.com/git/git/2022-02-09-1644442601-daily.bundle
> +	timestamp = 1644442601
> +
> +[bundle "2022-02-02-1643842562"]
> +	uri = https://bundles.example.com/git/git/2022-02-02-1643842562.bundle
> +	timestamp = 1643842562
> +
> +[bundle "2022-02-09-1644442631-daily-blobless"]
> +	uri = 2022-02-09-1644442631-daily-blobless.bundle
> +	timestamp = 1644442631
> +	filter = blob:none
> +
> +[bundle "2022-02-02-1643842568-blobless"]
> +	uri = /git/git/2022-02-02-1643842568-blobless.bundle
> +	timestamp = 1643842568
> +	filter = blob:none
> +```

Do you mean to use `creationToken` in these examples rather than 
`timestamp`?

> +Advertising Bundle URIs
> +-----------------------
> +
...
> +The client could choose an arbitrary bundle URI as an option _or_ select
> +the URI with best performance by some exploratory checks. It is up to the
> +bundle provider to decide if having multiple URIs is preferable to a
> +single URI that is geodistributed through server-side infrastructure.

Would it make sense for the client to pick the first bundle URI rather 
than an arbitrary one? The server could use information about the 
request (origin IP/geography) to provide a sorted list of URIs by 
physical distance to the client.

I guess if arbitrary is 'random' then this provides some client-side 
load balancing over multiple potential servers too. Interested in your 
thoughts behind what would be 'best practice' for a bundle server here.

> +If the bundle provider does not provide a heuristic, then the client
> +should attempt to inspect the bundle headers before downloading the full
> +bundle data in case the bundle tips already exist in the client
> +repository.

Would this default behaviour also be considered another explicit 
heurisitic option? For example: `bundle.heuristic=default` or `inspect`.
Is it ever likely that the default behaviour would change?

> +* The client receives a response other than `200 OK` (such as `404 Not Found`,
> +  `401 Not Authorized`, or `500 Internal Server Error`). The client should
> +  use the `credential.helper` to attempt authentication after the first
> +  `401 Not Authorized` response, but a second such response is a failure.

I'd probably say a 500 is not solvable with a different set of 
credentials, but potentially a retry or just `die`. Do we attempt to 
`credential_fill()` with anything other than a 401 (or maybe 403 or 404) 
elsewhere in Git?

> +* A bundle download during a `git fetch` contains objects already in the
> +  object database. This is probably unavoidable if we are using bundles
> +  for fetches, since the client will almost always be slightly ahead of
> +  the bundle servers after performing its "catch-up" fetch to the remote
> +  server. This extra work is most wasteful when the client is fetching
> +  much more frequently than the server is computing bundles, such as if
> +  the client is using hourly prefetches with background maintenance, but
> +  the server is computing bundles weekly. For this reason, the client
> +  should not use bundle URIs for fetch unless the server has explicitly
> +  recommended it through the `bundle.flags = forFetch` value.

`bundle.flags` is not mentioned elsewhere in this document. Might be 
worth including this, and possible values, with the other key 
definitions above.


> +Implementation Plan
> +-------------------
> +
> +This design document is being submitted on its own as an aspirational
> +document, with the goal of implementing all of the mentioned client
> +features over the course of several patch series. Here is a potential
> +outline for submitting these features:
> +
> +1. Integrate bundle URIs into `git clone` with a `--bundle-uri` option.
> +   This will include a new `git fetch --bundle-uri` mode for use as the
> +   implementation underneath `git clone`. The initial version here will
> +   expect a single bundle at the given URI.
> +
> +2. Implement the ability to parse a bundle list from a bundle URI and
> +   update the `git fetch --bundle-uri` logic to properly distinguish
> +   between `bundle.mode` options. Specifically design the feature so
> +   that the config format parsing feeds a list of key-value pairs into the
> +   bundle list logic.
> +
> +3. Create the `bundle-uri` protocol v2 verb so Git servers can advertise
> +   bundle URIs using the key-value pairs. Plug into the existing key-value
> +   input to the bundle list logic. Allow `git clone` to discover these
> +   bundle URIs and bootstrap the client repository from the bundle data.
> +   (This choice is an opt-in via a config option and a command-line
> +   option.)
> +
> +4. Allow the client to understand the `bundle.flag=forFetch` configuration
> +   and the `bundle.<id>.creationToken` heuristic. When `git clone`
> +   discovers a bundle URI with `bundle.flag=forFetch`, it configures the
> +   client repository to check that bundle URI during later `git fetch <remote>`
> +   commands.
> +
> +5. Allow clients to discover bundle URIs during `git fetch` and configure
> +   a bundle URI for later fetches if `bundle.flag=forFetch`.
> +
> +6. Implement the "inspect headers" heuristic to reduce data downloads when
> +   the `bundle.<id>.creationToken` heuristic is not available.
> +
> +As these features are reviewed, this plan might be updated. We also expect
> +that new designs will be discovered and implemented as this feature
> +matures and becomes used in real-world scenarios.

This plan seems logical to me at least! :-)

--
Thanks,
Matthew

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 1/6] docs: document bundle URI standard
  2022-06-29 20:40   ` [PATCH v2 1/6] docs: document bundle URI standard Derrick Stolee via GitGitGadget
  2022-07-18  9:20     ` SZEDER Gábor
  2022-07-21 12:09     ` Matthew John Cheetham
@ 2022-07-21 21:39     ` Josh Steadmon
  2022-07-22 13:15       ` Derrick Stolee
  2022-07-22 15:01       ` Derrick Stolee
  2 siblings, 2 replies; 64+ messages in thread
From: Josh Steadmon @ 2022-07-21 21:39 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, gitster, me, newren, avarab, dyroneteng, Johannes.Schindelin,
	Derrick Stolee

On 2022.06.29 20:40, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <derrickstolee@github.com>
> 
> Introduce the idea of bundle URIs to the Git codebase through an
> aspirational design document. This document includes the full design
> intended to include the feature in its fully-implemented form. This will
> take several steps as detailed in the Implementation Plan section.
> 
> By committing this document now, it can be used to motivate changes
> necessary to reach these final goals. The design can still be altered as
> new information is discovered.
> 
> Signed-off-by: Derrick Stolee <derrickstolee@github.com>
> ---
>  Documentation/technical/bundle-uri.txt | 464 +++++++++++++++++++++++++
>  1 file changed, 464 insertions(+)
>  create mode 100644 Documentation/technical/bundle-uri.txt
> 
> diff --git a/Documentation/technical/bundle-uri.txt b/Documentation/technical/bundle-uri.txt
> new file mode 100644
> index 00000000000..a0230a902a4
> --- /dev/null
> +++ b/Documentation/technical/bundle-uri.txt
> @@ -0,0 +1,464 @@
> +Bundle URIs
> +===========
> +
> +Bundle URIs are locations where Git can download one or more bundles in
> +order to bootstrap the object database in advance of fetching the remaining
> +objects from a remote.
> +
> +One goal is to speed up clones and fetches for users with poor network
> +connectivity to the origin server. Another benefit is to allow heavy users,
> +such as CI build farms, to use local resources for the majority of Git data
> +and thereby reducing the load on the origin server.
> +
> +To enable the bundle URI feature, users can specify a bundle URI using
> +command-line options or the origin server can advertise one or more URIs
> +via a protocol v2 capability.
> +
> +Design Goals
> +------------
> +
> +The bundle URI standard aims to be flexible enough to satisfy multiple
> +workloads. The bundle provider and the Git client have several choices in
> +how they create and consume bundle URIs.
> +
> +* Bundles can have whatever name the server desires. This name could refer
> +  to immutable data by using a hash of the bundle contents. However, this
> +  means that a new URI will be needed after every update of the content.
> +  This might be acceptable if the server is advertising the URI (and the
> +  server is aware of new bundles being generated) but would not be
> +  ergonomic for users using the command line option.
> +
> +* The bundles could be organized specifically for bootstrapping full
> +  clones, but could also be organized with the intention of bootstrapping
> +  incremental fetches. The bundle provider must decide on one of several
> +  organization schemes to minimize client downloads during incremental
> +  fetches, but the Git client can also choose whether to use bundles for
> +  either of these operations.
> +
> +* The bundle provider can choose to support full clones, partial clones,
> +  or both. The client can detect which bundles are appropriate for the
> +  repository's partial clone filter, if any.
> +
> +* The bundle provider can use a single bundle (for clones only), or a
> +  list of bundles. When using a list of bundles, the provider can specify
> +  whether or not the client needs _all_ of the bundle URIs for a full
> +  clone, or if _any_ one of the bundle URIs is sufficient. This allows the
> +  bundle provider to use different URIs for different geographies.
> +
> +* The bundle provider can organize the bundles using heuristics, such as
> +  creation tokens, to help the client prevent downloading bundles it does
> +  not need. When the bundle provider does not provide these heuristics,
> +  the client can use optimizations to minimize how much of the data is
> +  downloaded.
> +
> +* The bundle provider does not need to be associated with the Git server.
> +  The client can choose to use the bundle provider without it being
> +  advertised by the Git server.
> +
> +* The client can choose to discover bundle providers that are advertised
> +  by the Git server. This could happen during `git clone`, during
> +  `git fetch`, both, or neither. The user can choose which combination
> +  works best for them.
> +
> +* The client can choose to configure a bundle provider manually at any
> +  time. The client can also choose to specify a bundle provider manually
> +  as a command-line option to `git clone`.
> +
> +Each repository is different and every Git server has different needs.
> +Hopefully the bundle URI feature is flexible enough to satisfy all needs.
> +If not, then the feature can be extended through its versioning mechanism.
> +
> +Server requirements
> +-------------------
> +
> +To provide a server-side implementation of bundle servers, no other parts
> +of the Git protocol are required. This allows server maintainers to use
> +static content solutions such as CDNs in order to serve the bundle files.
> +
> +At the current scope of the bundle URI feature, all URIs are expected to
> +be HTTP(S) URLs where content is downloaded to a local file using a `GET`
> +request to that URL. The server could include authentication requirements
> +to those requests with the aim of triggering the configured credential
> +helper for secure access. (Future extensions could use "file://" URIs or
> +SSH URIs.)
> +
> +Assuming a `200 OK` response from the server, the content at the URL is
> +inspected. First, Git attempts to parse the file as a bundle file of
> +version 2 or higher. If the file is not a bundle, then the file is parsed
> +as a plain-text file using Git's config parser. The key-value pairs in
> +that config file are expected to describe a list of bundle URIs. If
> +neither of these parse attempts succeed, then Git will report an error to
> +the user that the bundle URI provided erroneous data.
> +
> +Any other data provided by the server is considered erroneous.
> +
> +Bundle Lists
> +------------
> +
> +The Git server can advertise bundle URIs using a set of `key=value` pairs.
> +A bundle URI can also serve a plain-text file in the Git config format
> +containing these same `key=value` pairs. In both cases, we consider this
> +to be a _bundle list_. The pairs specify information about the bundles
> +that the client can use to make decisions for which bundles to download
> +and which to ignore.
> +
> +A few keys focus on properties of the list itself.
> +
> +bundle.version::
> +	(Required) This value provides a version number for the bundle
> +	list. If a future Git change enables a feature that needs the Git
> +	client to react to a new key in the bundle list file, then this version
> +	will increment. The only current version number is 1, and if any other
> +	value is specified then Git will fail to use this file.
> +
> +bundle.mode::
> +	(Required) This value has one of two values: `all` and `any`. When `all`
> +	is specified, then the client should expect to need all of the listed
> +	bundle URIs that match their repository's requirements. When `any` is
> +	specified, then the client should expect that any one of the bundle URIs
> +	that match their repository's requirements will suffice. Typically, the
> +	`any` option is used to list a number of different bundle servers
> +	located in different geographies.
> +
> +bundle.heuristic::
> +	If this string-valued key exists, then the bundle list is designed to
> +	work well with incremental `git fetch` commands. The heuristic signals
> +	that there are additional keys available for each bundle that help
> +	determine which subset of bundles the client should download.

To be clear, the values of `bundle.heuristic` are keys underneath the
following `bundle.<id>` segments?


> +
> +The remaining keys include an `<id>` segment which is a server-designated
> +name for each available bundle.
> +
> +bundle.<id>.uri::
> +	(Required) This string value is the URI for downloading bundle `<id>`.
> +	If the URI begins with a protocol (`http://` or `https://`) then the URI
> +	is absolute. Otherwise, the URI is interpreted as relative to the URI
> +	used for the bundle list. If the URI begins with `/`, then that relative
> +	path is relative to the domain name used for the bundle list. (This use
> +	of relative paths is intended to make it easier to distribute a set of
> +	bundles across a large number of servers or CDNs with different domain
> +	names.)
> +
> +bundle.<id>.filter::
> +	This string value represents an object filter that should also appear in
> +	the header of this bundle. The server uses this value to differentiate
> +	different kinds of bundles from which the client can choose those that
> +	match their object filters.
> +
> +bundle.<id>.creationToken::
> +	This value is a nonnegative 64-bit integer used for sorting the bundles
> +	the list. This is used to download a subset of bundles during a fetch
> +	when `bundle.heuristic=creationToken`.
> +
> +bundle.<id>.location::
> +	This string value advertises a real-world location from where the bundle
> +	URI is served. This can be used to present the user with an option for
> +	which bundle URI to use or simply as an informative indicator of which
> +	bundle URI was selected by Git. This is only valuable when
> +	`bundle.mode` is `any`.
> +
> +Here is an example bundle list using the Git config format:
> +
> +```
> +[bundle]
> +	version = 1
> +	mode = all
> +	heuristic = creationToken
> +
> +[bundle "2022-02-09-1644442601-daily"]
> +	uri = https://bundles.example.com/git/git/2022-02-09-1644442601-daily.bundle
> +	timestamp = 1644442601
> +
> +[bundle "2022-02-02-1643842562"]
> +	uri = https://bundles.example.com/git/git/2022-02-02-1643842562.bundle
> +	timestamp = 1643842562
> +
> +[bundle "2022-02-09-1644442631-daily-blobless"]
> +	uri = 2022-02-09-1644442631-daily-blobless.bundle
> +	timestamp = 1644442631
> +	filter = blob:none
> +
> +[bundle "2022-02-02-1643842568-blobless"]
> +	uri = /git/git/2022-02-02-1643842568-blobless.bundle
> +	timestamp = 1643842568
> +	filter = blob:none
> +```
> +
> +This example uses `bundle.mode=all` as well as the
> +`bundle.<id>.creationToken` heuristic. It also uses the `bundle.<id>.filter`
> +options to present two parallel sets of bundles: one for full clones and
> +another for blobless partial clones.
> +
> +Suppose that this bundle list was found at the URI
> +`https://bundles.example.com/git/git/` and so the two blobless bundles have
> +the following fully-expanded URIs:
> +
> +* `https://bundles.example.com/git/git/2022-02-09-1644442631-daily-blobless.bundle`
> +* `https://bundles.example.com/git/git/2022-02-02-1643842568-blobless.bundle`
> +
> +Advertising Bundle URIs
> +-----------------------
> +
> +If a user knows a bundle URI for the repository they are cloning, then
> +they can specify that URI manually through a command-line option. However,
> +a Git host may want to advertise bundle URIs during the clone operation,
> +helping users unaware of the feature.
> +
> +The only thing required for this feature is that the server can advertise
> +one or more bundle URIs. This advertisement takes the form of a new
> +protocol v2 capability specifically for discovering bundle URIs.
> +
> +The client could choose an arbitrary bundle URI as an option _or_ select
> +the URI with best performance by some exploratory checks. It is up to the
> +bundle provider to decide if having multiple URIs is preferable to a
> +single URI that is geodistributed through server-side infrastructure.
> +
> +Cloning with Bundle URIs
> +------------------------
> +
> +The primary need for bundle URIs is to speed up clones. The Git client
> +will interact with bundle URIs according to the following flow:
> +
> +1. The user specifies a bundle URI with the `--bundle-uri` command-line
> +   option _or_ the client discovers a bundle list advertised by the
> +   Git server.
> +
> +2. If the downloaded data from a bundle URI is a bundle, then the client
> +   inspects the bundle headers to check that the prerequisite commit OIDs
> +   are present in the client repository. If some are missing, then the
> +   client delays unbundling until other bundles have been unbundled,
> +   making those OIDs present. When all required OIDs are present, the
> +   client unbundles that data using a refspec. The default refspec is
> +   `+refs/heads/*:refs/bundles/*`, but this can be configured. These refs
> +   are stored so that later `git fetch` negotiations can communicate the
> +   bundled refs as `have`s, reducing the size of the fetch over the Git
> +   protocol. To allow pruning refs from this ref namespace, Git may
> +   introduce a numbered namespace (such as `refs/bundles/<i>/*`) such that
> +   stale bundle refs can be deleted.
> +
> +3. If the file is instead a bundle list, then the client inspects the
> +   `bundle.mode` to see if the list is of the `all` or `any` form.
> +
> +   a. If `bundle.mode=all`, then the client considers all bundle
> +      URIs. The list is reduced based on the `bundle.<id>.filter` options
> +      matching the client repository's partial clone filter. Then, all
> +      bundle URIs are requested. If the `bundle.<id>.creationToken`
> +      heuristic is provided, then the bundles are downloaded in decreasing
> +      order by the creation token, stopping when a bundle has all required
> +      OIDs. The bundles can then be unbundled in increasing creation token
> +      order. The client stores the latest creation token as a heuristic
> +      for avoiding future downloads if the bundle list does not advertise
> +      bundles with larger creation tokens.
> +
> +   b. If `bundle.mode=any`, then the client can choose any one of the
> +      bundle URIs to inspect. The client can use a variety of ways to
> +      choose among these URIs. The client can also fallback to another URI
> +      if the initial choice fails to return a result.
> +
> +Note that during a clone we expect that all bundles will be required, and
> +heuristics such as `bundle.<uri>.creationToken` can be used to download
> +bundles in chronological order or in parallel.
> +
> +If a given bundle URI is a bundle list with a `bundle.heuristic`
> +value, then the client can choose to store that URI as its chosen bundle
> +URI. The client can then navigate directly to that URI during later `git
> +fetch` calls.
> +
> +When downloading bundle URIs, the client can choose to inspect the initial
> +content before committing to downloading the entire content. This may
> +provide enough information to determine if the URI is a bundle list or
> +a bundle. In the case of a bundle, the client may inspect the bundle
> +header to determine that all advertised tips are already in the client
> +repository and cancel the remaining download.
> +
> +Fetching with Bundle URIs
> +-------------------------
> +
> +When the client fetches new data, it can decide to fetch from bundle
> +servers before fetching from the origin remote. This could be done via a
> +command-line option, but it is more likely useful to use a config value
> +such as the one specified during the clone.
> +
> +The fetch operation follows the same procedure to download bundles from a
> +bundle list (although we do _not_ want to use parallel downloads here). We
> +expect that the process will end when all prerequisite commit OIDs in a
> +thin bundle are already in the object database.
> +
> +When using the `creationToken` heuristic, the client can avoid downloading
> +any bundles if their creation tokenss are not larger than the stored

Typo: tokenss


> +creation token. After fetching new bundles, Git updates this local
> +creation token.
> +
> +If the bundle provider does not provide a heuristic, then the client
> +should attempt to inspect the bundle headers before downloading the full
> +bundle data in case the bundle tips already exist in the client
> +repository.
> +
> +Error Conditions
> +----------------
> +
> +If the Git client discovers something unexpected while downloading
> +information according to a bundle URI or the bundle list found at that
> +location, then Git can ignore that data and continue as if it was not
> +given a bundle URI. The remote Git server is the ultimate source of truth,
> +not the bundle URI.
> +
> +Here are a few example error conditions:
> +
> +* The client fails to connect with a server at the given URI or a connection
> +  is lost without any chance to recover.
> +
> +* The client receives a response other than `200 OK` (such as `404 Not Found`,
> +  `401 Not Authorized`, or `500 Internal Server Error`). The client should
> +  use the `credential.helper` to attempt authentication after the first
> +  `401 Not Authorized` response, but a second such response is a failure.
> +
> +* The client receives data that is not parsable as a bundle or bundle list.
> +
> +* The bundle list describes a directed cycle in the
> +  `bundle.<id>.requires` links.
> +
> +* A bundle includes a filter that does not match expectations.
> +
> +* The client cannot unbundle the bundles because the prerequisite commit OIDs
> +  are not in the object database and there are no more
> +  `bundle.<id>.requires` links to follow.

The list above still mentions `.requires` fields, which IIUC have been
removed from the V2 design.


> +There are also situations that could be seen as wasteful, but are not
> +error conditions:
> +
> +* The downloaded bundles contain more information than is requested by
> +  the clone or fetch request. A primary example is if the user requests
> +  a clone with `--single-branch` but downloads bundles that store every
> +  reachable commit from all `refs/heads/*` references. This might be
> +  initially wasteful, but perhaps these objects will become reachable by
> +  a later ref update that the client cares about.
> +
> +* A bundle download during a `git fetch` contains objects already in the
> +  object database. This is probably unavoidable if we are using bundles
> +  for fetches, since the client will almost always be slightly ahead of
> +  the bundle servers after performing its "catch-up" fetch to the remote
> +  server. This extra work is most wasteful when the client is fetching
> +  much more frequently than the server is computing bundles, such as if
> +  the client is using hourly prefetches with background maintenance, but
> +  the server is computing bundles weekly. For this reason, the client
> +  should not use bundle URIs for fetch unless the server has explicitly
> +  recommended it through the `bundle.flags = forFetch` value.

The "Bundle Lists" section above doesn't mention `bundle.flags` at all;
has this been replaced by `bundle.heuristic`?


> +
> +Implementation Plan
> +-------------------
> +
> +This design document is being submitted on its own as an aspirational
> +document, with the goal of implementing all of the mentioned client
> +features over the course of several patch series. Here is a potential
> +outline for submitting these features:
> +
> +1. Integrate bundle URIs into `git clone` with a `--bundle-uri` option.
> +   This will include a new `git fetch --bundle-uri` mode for use as the
> +   implementation underneath `git clone`. The initial version here will
> +   expect a single bundle at the given URI.
> +
> +2. Implement the ability to parse a bundle list from a bundle URI and
> +   update the `git fetch --bundle-uri` logic to properly distinguish
> +   between `bundle.mode` options. Specifically design the feature so
> +   that the config format parsing feeds a list of key-value pairs into the
> +   bundle list logic.
> +
> +3. Create the `bundle-uri` protocol v2 verb so Git servers can advertise
> +   bundle URIs using the key-value pairs. Plug into the existing key-value
> +   input to the bundle list logic. Allow `git clone` to discover these
> +   bundle URIs and bootstrap the client repository from the bundle data.
> +   (This choice is an opt-in via a config option and a command-line
> +   option.)
> +
> +4. Allow the client to understand the `bundle.flag=forFetch` configuration
> +   and the `bundle.<id>.creationToken` heuristic. When `git clone`
> +   discovers a bundle URI with `bundle.flag=forFetch`, it configures the
> +   client repository to check that bundle URI during later `git fetch <remote>`
> +   commands.
> +
> +5. Allow clients to discover bundle URIs during `git fetch` and configure
> +   a bundle URI for later fetches if `bundle.flag=forFetch`.
> +
> +6. Implement the "inspect headers" heuristic to reduce data downloads when
> +   the `bundle.<id>.creationToken` heuristic is not available.
> +
> +As these features are reviewed, this plan might be updated. We also expect
> +that new designs will be discovered and implemented as this feature
> +matures and becomes used in real-world scenarios.
> +
> +Related Work: Packfile URIs
> +---------------------------

Thanks for including this section; my very first question before I even
started reading the doc was "How does this compare to Packfile URIs?".


> +
> +The Git protocol already has a capability where the Git server can list
> +a set of URLs along with the packfile response when serving a client
> +request. The client is then expected to download the packfiles at those
> +locations in order to have a complete understanding of the response.
> +
> +This mechanism is used by the Gerrit server (implemented with JGit) and
> +has been effective at reducing CPU load and improving user performance for
> +clones.
> +
> +A major downside to this mechanism is that the origin server needs to know
> +_exactly_ what is in those packfiles, and the packfiles need to be available
> +to the user for some time after the server has responded. This coupling
> +between the origin and the packfile data is difficult to manage.

It was not immediately clear to me why bundle URIs would avoid these two
downsides, but after thinking about it and discussing it, I believe this
is the reasoning (please correct me if I'm wrong):

Bundle URIs are intended to be fully processed before negotiation
happens, so the server can rely solely on the client's reported Haves /
Wants to determine its response, as usual, and therefore doesn't need to
know the bundle contents. Packfile URIs are not processed by the client
before negotiation, so the server needs to be aware of the packfile
contents in order to determine which additional objects to send, and
can't rely solely on the Haves/Wants from the client.

If that's accurate, then it seems fine that when the bundle URI is
provided by the user on the command-line, Git can download and process
the bundle before even attempting to contact the server. But in future
series, when we add the ability for the server to provide a URI via a
capability, the client will have to pause after seeing the server's
advertised URI, fetch and process the bundle, and then proceed with a
fetch command. This seems fine assuming the bundles can be handled in a
reasonable amount of time. And even if the connection breaks before the
fetch command can be issued, presumably the client would not need to
attempt to re-download the bundle a second time when the client makes a
second connection to re-attempt the fetch.

For the point about the packfiles needing to be available for some time,
this makes sense because the server's response is processed by the
client before the referenced packfile is downloaded, but the response is
incomplete without the packfile. But with bundle URIs, the server's
response doesn't depend on whether or not the client actually processed
the referenced bundle, only on what was negotiated. So the server's
response is still valid and useful even if the client is unable to get
the bundle.


> +
> +Further, this implementation is extremely hard to make work with fetches.
> +
> +Related Work: GVFS Cache Servers
> +--------------------------------
> +
> +The GVFS Protocol [2] is a set of HTTP endpoints designed independently of
> +the Git project before Git's partial clone was created. One feature of this
> +protocol is the idea of a "cache server" which can be colocated with build
> +machines or developer offices to transfer Git data without overloading the
> +central server.
> +
> +The endpoint that VFS for Git is famous for is the `GET /gvfs/objects/{oid}`
> +endpoint, which allows downloading an object on-demand. This is a critical
> +piece of the filesystem virtualization of that product.
> +
> +However, a more subtle need is the `GET /gvfs/prefetch?lastPackTimestamp=<t>`
> +endpoint. Given an optional timestamp, the cache server responds with a list
> +of precomputed packfiles containing the commits and trees that were introduced
> +in those time intervals.
> +
> +The cache server computes these "prefetch" packfiles using the following
> +strategy:
> +
> +1. Every hour, an "hourly" pack is generated with a given timestamp.
> +2. Nightly, the previous 24 hourly packs are rolled up into a "daily" pack.
> +3. Nightly, all prefetch packs more than 30 days old are rolled up into
> +   one pack.
> +
> +When a user runs `gvfs clone` or `scalar clone` against a repo with cache
> +servers, the client requests all prefetch packfiles, which is at most
> +`24 + 30 + 1` packfiles downloading only commits and trees. The client
> +then follows with a request to the origin server for the references, and
> +attempts to checkout that tip reference. (There is an extra endpoint that
> +helps get all reachable trees from a given commit, in case that commit
> +was not already in a prefetch packfile.)
> +
> +During a `git fetch`, a hook requests the prefetch endpoint using the
> +most-recent timestamp from a previously-downloaded prefetch packfile.
> +Only the list of packfiles with later timestamps are downloaded. Most
> +users fetch hourly, so they get at most one hourly prefetch pack. Users
> +whose machines have been off or otherwise have not fetched in over 30 days
> +might redownload all prefetch packfiles. This is rare.
> +
> +It is important to note that the clients always contact the origin server
> +for the refs advertisement, so the refs are frequently "ahead" of the
> +prefetched pack data. The missing objects are downloaded on-demand using
> +the `GET gvfs/objects/{oid}` requests, when needed by a command such as
> +`git checkout` or `git log`. Some Git optimizations disable checks that
> +would cause these on-demand downloads to be too aggressive.
> +
> +See Also
> +--------
> +
> +[1] https://lore.kernel.org/git/RFC-cover-00.13-0000000000-20210805T150534Z-avarab@gmail.com/
> +    An earlier RFC for a bundle URI feature.
> +
> +[2] https://github.com/microsoft/VFSForGit/blob/master/Protocol.md
> +    The GVFS Protocol
> -- 
> gitgitgadget
> 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 2/6] remote-curl: add 'get' capability
  2022-06-29 20:40   ` [PATCH v2 2/6] remote-curl: add 'get' capability Derrick Stolee via GitGitGadget
@ 2022-07-21 21:41     ` Josh Steadmon
  0 siblings, 0 replies; 64+ messages in thread
From: Josh Steadmon @ 2022-07-21 21:41 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, gitster, me, newren, avarab, dyroneteng, Johannes.Schindelin,
	Derrick Stolee

On 2022.06.29 20:40, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <derrickstolee@github.com>
> 
> A future change will want a way to download a file over HTTP(S) using
> the simplest of download mechanisms. We do not want to assume that the
> server on the other side understands anything about the Git protocol but
> could be a simple static web server.
> 
> Create the new 'get' capability for the remote helpers which advertises
> that the 'get' command is avalable. A caller can send a line containing
> 'get <url> <path>' to download the file at <url> into the file at
> <path>.
> 
> Signed-off-by: Derrick Stolee <derrickstolee@github.com>
> ---
>  Documentation/gitremote-helpers.txt |  9 +++++++
>  remote-curl.c                       | 33 +++++++++++++++++++++++++
>  t/t5557-http-get.sh                 | 37 +++++++++++++++++++++++++++++
>  transport-helper.c                  |  5 +++-
>  4 files changed, 83 insertions(+), 1 deletion(-)
>  create mode 100755 t/t5557-http-get.sh
> 
> diff --git a/Documentation/gitremote-helpers.txt b/Documentation/gitremote-helpers.txt
> index 6f1e269ae43..ed8da428c98 100644
> --- a/Documentation/gitremote-helpers.txt
> +++ b/Documentation/gitremote-helpers.txt
> @@ -168,6 +168,9 @@ Supported commands: 'list', 'import'.
>  	Can guarantee that when a clone is requested, the received
>  	pack is self contained and is connected.
>  
> +'get'::
> +	Can use the 'get' command to download a file from a given URI.
> +
>  If a helper advertises 'connect', Git will use it if possible and
>  fall back to another capability if the helper requests so when
>  connecting (see the 'connect' command under COMMANDS).
> @@ -418,6 +421,12 @@ Supported if the helper has the "connect" capability.
>  +
>  Supported if the helper has the "stateless-connect" capability.
>  
> +'get' <uri> <path>::
> +	Downloads the file from the given `<uri>` to the given `<path>`. If
> +	`<path>.temp` exists, then Git assumes that the `.temp` file is a
> +	partial download from a previous attempt and will resume the
> +	download from that position.
> +
>  If a fatal error occurs, the program writes the error message to
>  stderr and exits. The caller should expect that a suitable error
>  message has been printed if the child closes the connection without
> diff --git a/remote-curl.c b/remote-curl.c
> index 67f178b1120..f005419f872 100644
> --- a/remote-curl.c
> +++ b/remote-curl.c
> @@ -1276,6 +1276,34 @@ static void parse_fetch(struct strbuf *buf)
>  	strbuf_reset(buf);
>  }
>  
> +static void parse_get(struct strbuf *buf)
> +{
> +	struct http_get_options opts = { 0 };
> +	struct strbuf url = STRBUF_INIT;
> +	struct strbuf path = STRBUF_INIT;
> +	const char *p, *space;
> +
> +	if (!skip_prefix(buf->buf, "get ", &p))
> +		die(_("http transport does not support %s"), buf->buf);
> +
> +	space = strchr(p, ' ');
> +
> +	if (!space)
> +		die(_("protocol error: expected '<url> <path>', missing space"));
> +
> +	strbuf_add(&url, p, space - p);
> +	strbuf_addstr(&path, space + 1);
> +
> +	if (http_get_file(url.buf, path.buf, &opts))
> +		die(_("failed to download file at URL '%s'"), url.buf);
> +
> +	strbuf_release(&url);
> +	strbuf_release(&path);
> +	printf("\n");
> +	fflush(stdout);
> +	strbuf_reset(buf);
> +}
> +
>  static int push_dav(int nr_spec, const char **specs)
>  {
>  	struct child_process child = CHILD_PROCESS_INIT;
> @@ -1549,9 +1577,14 @@ int cmd_main(int argc, const char **argv)
>  				printf("unsupported\n");
>  			fflush(stdout);
>  
> +		} else if (skip_prefix(buf.buf, "get ", &arg)) {
> +			parse_get(&buf);
> +			fflush(stdout);
> +
>  		} else if (!strcmp(buf.buf, "capabilities")) {
>  			printf("stateless-connect\n");
>  			printf("fetch\n");
> +			printf("get\n");
>  			printf("option\n");
>  			printf("push\n");
>  			printf("check-connectivity\n");
> diff --git a/t/t5557-http-get.sh b/t/t5557-http-get.sh
> new file mode 100755
> index 00000000000..1fd4ded3eb1
> --- /dev/null
> +++ b/t/t5557-http-get.sh
> @@ -0,0 +1,37 @@
> +#!/bin/sh
> +
> +test_description='test downloading a file by URL'
> +
> +. ./test-lib.sh
> +
> +. "$TEST_DIRECTORY"/lib-httpd.sh
> +start_httpd
> +
> +test_expect_success 'get by URL: 404' '
> +	url="$HTTPD_URL/none.txt" &&
> +	cat >input <<-EOF &&
> +	capabilities
> +	get $url file1
> +	EOF
> +
> +	test_must_fail git remote-http $url $url <input 2>err &&
> +	test_path_is_missing file1 &&
> +	grep "failed to download file at URL" err &&
> +	rm file1.temp
> +'
> +
> +test_expect_success 'get by URL: 200' '
> +	echo data >"$HTTPD_DOCUMENT_ROOT_PATH/exists.txt" &&
> +
> +	url="$HTTPD_URL/exists.txt" &&
> +	cat >input <<-EOF &&
> +	capabilities
> +	get $url file2
> +
> +	EOF
> +
> +	GIT_TRACE2_PERF=1 git remote-http $url $url <input &&
> +	test_cmp "$HTTPD_DOCUMENT_ROOT_PATH/exists.txt" file2
> +'
> +
> +test_done
> diff --git a/transport-helper.c b/transport-helper.c
> index b4dbbabb0c2..dfbeaebe40c 100644
> --- a/transport-helper.c
> +++ b/transport-helper.c
> @@ -33,7 +33,8 @@ struct helper_data {
>  		check_connectivity : 1,
>  		no_disconnect_req : 1,
>  		no_private_update : 1,
> -		object_format : 1;
> +		object_format : 1,
> +		get : 1;

This new `get` field is set in the hunk below, but it never seems to be
accessed, even in later patches in this series.


>  
>  	/*
>  	 * As an optimization, the transport code may invoke fetch before
> @@ -210,6 +211,8 @@ static struct child_process *get_helper(struct transport *transport)
>  			data->no_private_update = 1;
>  		} else if (starts_with(capname, "object-format")) {
>  			data->object_format = 1;
> +		} else if (!strcmp(capname, "get")) {
> +			data->get = 1;
>  		} else if (mandatory) {
>  			die(_("unknown mandatory capability %s; this remote "
>  			      "helper probably needs newer version of Git"),
> -- 
> gitgitgadget
> 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 3/6] bundle-uri: create basic file-copy logic
  2022-06-29 20:40   ` [PATCH v2 3/6] bundle-uri: create basic file-copy logic Derrick Stolee via GitGitGadget
@ 2022-07-21 21:45     ` Josh Steadmon
  2022-07-22 13:18       ` Derrick Stolee
  0 siblings, 1 reply; 64+ messages in thread
From: Josh Steadmon @ 2022-07-21 21:45 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, gitster, me, newren, avarab, dyroneteng, Johannes.Schindelin,
	Derrick Stolee

On 2022.06.29 20:40, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <derrickstolee@github.com>
> 
> Before implementing a way to fetch bundles into a repository, create the
> basic logic. Assume that the URI is actually a file path. Future logic
> will make this more careful to other protocols.
> 
> For now, we also only succeed if the content at the URI is a bundle
> file, not a bundle list. Bundle lists will be implemented in a future
> change.
> 
> Signed-off-by: Derrick Stolee <derrickstolee@github.com>
> ---
>  Makefile     |  1 +
>  bundle-uri.c | 93 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  bundle-uri.h | 14 ++++++++
>  3 files changed, 108 insertions(+)
>  create mode 100644 bundle-uri.c
>  create mode 100644 bundle-uri.h
> 
> diff --git a/Makefile b/Makefile
> index f8bccfab5e9..8f27310836d 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -887,6 +887,7 @@ LIB_OBJS += blob.o
>  LIB_OBJS += bloom.o
>  LIB_OBJS += branch.o
>  LIB_OBJS += bulk-checkin.o
> +LIB_OBJS += bundle-uri.o
>  LIB_OBJS += bundle.o
>  LIB_OBJS += cache-tree.o
>  LIB_OBJS += cbtree.o
> diff --git a/bundle-uri.c b/bundle-uri.c
> new file mode 100644
> index 00000000000..f0abbf434c8
> --- /dev/null
> +++ b/bundle-uri.c
> @@ -0,0 +1,93 @@
> +#include "cache.h"
> +#include "bundle-uri.h"
> +#include "bundle.h"
> +#include "object-store.h"
> +#include "refs.h"
> +#include "run-command.h"
> +
> +static void find_temp_filename(struct strbuf *name)
> +{
> +	int fd;
> +	/*
> +	 * Find a temporary filename that is available. This is briefly
> +	 * racy, but unlikely to collide.
> +	 */
> +	fd = odb_mkstemp(name, "bundles/tmp_uri_XXXXXX");
> +	if (fd < 0)
> +		die(_("failed to create temporary file"));
> +	close(fd);
> +	unlink(name->buf);

Is there a reason why we unlink() here? If we allow the empty file to
remain on-disk until we write to it, wouldn't that prevent odb_mkstemp()
from being racy?


> +}
> +
> +static int copy_uri_to_file(const char *uri, const char *file)

Nitpick: from a brief glance, it seems that most other copy* functions
take the destination as the first parameter, and the source second. I
don't feel strongly about it, because to me src followed by dst feels
more natural, but perhaps we should be consistent with other functions.


> +{
> +	/* Copy as a file */
> +	return copy_file(file, uri, 0444);
> +}
> +
> +static int unbundle_from_file(struct repository *r, const char *file)
> +{
> +	int result = 0;
> +	int bundle_fd;
> +	struct bundle_header header = BUNDLE_HEADER_INIT;
> +	struct strvec extra_index_pack_args = STRVEC_INIT;
> +	struct string_list_item *refname;
> +	struct strbuf bundle_ref = STRBUF_INIT;
> +	size_t bundle_prefix_len;
> +
> +	if ((bundle_fd = read_bundle_header(file, &header)) < 0)
> +		return 1;
> +
> +	if ((result = unbundle(r, &header, bundle_fd, &extra_index_pack_args)))

Can we just pass NULL here instead of creating an empty
extra_index_pack_args?


> +		return 1;
> +
> +	/*
> +	 * Convert all refs/heads/ from the bundle into refs/bundles/
> +	 * in the local repository.
> +	 */
> +	strbuf_addstr(&bundle_ref, "refs/bundles/");
> +	bundle_prefix_len = bundle_ref.len;
> +
> +	for_each_string_list_item(refname, &header.references) {
> +		struct object_id *oid = refname->util;
> +		struct object_id old_oid;
> +		const char *branch_name;
> +		int has_old;
> +
> +		if (!skip_prefix(refname->string, "refs/heads/", &branch_name))
> +			continue;
> +
> +		strbuf_setlen(&bundle_ref, bundle_prefix_len);
> +		strbuf_addstr(&bundle_ref, branch_name);
> +
> +		has_old = !read_ref(bundle_ref.buf, &old_oid);
> +		update_ref("fetched bundle", bundle_ref.buf, oid,
> +			   has_old ? &old_oid : NULL,
> +			   REF_SKIP_OID_VERIFICATION,
> +			   UPDATE_REFS_MSG_ON_ERR);
> +	}
> +
> +	bundle_header_release(&header);

We also need to release bundle_ref (and extra_index_pack_args if we
decide to keep it).


> +	return result;
> +}
> +
> +int fetch_bundle_uri(struct repository *r, const char *uri)
> +{
> +	int result = 0;
> +	struct strbuf filename = STRBUF_INIT;
> +
> +	find_temp_filename(&filename);
> +	if ((result = copy_uri_to_file(uri, filename.buf)))
> +		goto cleanup;
> +
> +	if ((result = !is_bundle(filename.buf, 0)))
> +		goto cleanup;
> +
> +	if ((result = unbundle_from_file(r, filename.buf)))
> +		goto cleanup;
> +
> +cleanup:
> +	unlink(filename.buf);
> +	strbuf_release(&filename);
> +	return result;
> +}
> diff --git a/bundle-uri.h b/bundle-uri.h
> new file mode 100644
> index 00000000000..8a152f1ef14
> --- /dev/null
> +++ b/bundle-uri.h
> @@ -0,0 +1,14 @@
> +#ifndef BUNDLE_URI_H
> +#define BUNDLE_URI_H
> +
> +struct repository;
> +
> +/**
> + * Fetch data from the given 'uri' and unbundle the bundle data found
> + * based on that information.
> + *
> + * Returns non-zero if no bundle information is found at the given 'uri'.
> + */
> +int fetch_bundle_uri(struct repository *r, const char *uri);
> +
> +#endif
> -- 
> gitgitgadget
> 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 6/6] fetch: add 'refs/bundle/' to log.excludeDecoration
  2022-06-29 20:40   ` [PATCH v2 6/6] fetch: add 'refs/bundle/' to log.excludeDecoration Derrick Stolee via GitGitGadget
@ 2022-07-21 21:47     ` Josh Steadmon
  2022-07-22 13:20       ` Derrick Stolee
  0 siblings, 1 reply; 64+ messages in thread
From: Josh Steadmon @ 2022-07-21 21:47 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, gitster, me, newren, avarab, dyroneteng, Johannes.Schindelin,
	Derrick Stolee

On 2022.06.29 20:40, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <derrickstolee@github.com>
> 
> When fetching from a bundle URI, the branches of that bundle are stored
> in a different ref namespace: refs/bundles/. This namespace is intended
> to assist with later 'git fetch' negotiations with a Git server,
> allowing the client to advertise which data it already has from a bundle
> URI.
> 
> These references can be confusing for a user when they appear as a
> decoration in 'git log' output. Add "refs/bundles/" to the multi-valued
> log.excludeDecoration config value. This is similar to the way
> "refs/prefetch/" is hidden by background prefetch operations in 'git
> maintenance' as added by 96eaffebb (maintenance: set
> log.excludeDecoration durin prefetch, 2021-01-19).
> 
> Signed-off-by: Derrick Stolee <derrickstolee@github.com>
> ---
>  Documentation/fetch-options.txt |  3 ++-
>  bundle-uri.c                    |  7 +++++++
>  t/t5558-fetch-bundle-uri.sh     | 12 +++++++++---
>  3 files changed, 18 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/fetch-options.txt b/Documentation/fetch-options.txt
> index 09bd1feeed8..8b801bcc2f3 100644
> --- a/Documentation/fetch-options.txt
> +++ b/Documentation/fetch-options.txt
> @@ -321,4 +321,5 @@ endif::git-pull[]
>  --bundle-uri=<uri>::
>  	Instead of fetching from a remote, fetch a bundle from the given
>  	`<uri>` and unbundle the data into the local repository. The refs
> -	in the bundle will be stored under the `refs/bundle/*` namespace.
> +	in the bundle will be stored under the hidden `refs/bundle/*`
> +	namespace.
> diff --git a/bundle-uri.c b/bundle-uri.c
> index 4c793843a2a..6e0f1cb06fd 100644
> --- a/bundle-uri.c
> +++ b/bundle-uri.c
> @@ -1,6 +1,7 @@
>  #include "cache.h"
>  #include "bundle-uri.h"
>  #include "bundle.h"
> +#include "config.h"
>  #include "object-store.h"
>  #include "refs.h"
>  #include "run-command.h"
> @@ -153,6 +154,12 @@ int fetch_bundle_uri(struct repository *r, const char *uri)
>  	if ((result = unbundle_from_file(r, filename.buf)))
>  		goto cleanup;
>  
> +	git_config_set_multivar_gently("log.excludedecoration",
> +					"refs/bundle/",
> +					"refs/bundle/",
> +					CONFIG_FLAGS_FIXED_VALUE |
> +					CONFIG_FLAGS_MULTI_REPLACE);
> +

I dislike the idea of modifying the user's config as a side effect here.
Since it's a cosmetic issue, can we drop this patch and figure out
better default values for log.excludedecoration?


>  cleanup:
>  	unlink(filename.buf);
>  	strbuf_release(&filename);
> diff --git a/t/t5558-fetch-bundle-uri.sh b/t/t5558-fetch-bundle-uri.sh
> index 919db6f4551..bed4cc3e99c 100755
> --- a/t/t5558-fetch-bundle-uri.sh
> +++ b/t/t5558-fetch-bundle-uri.sh
> @@ -28,7 +28,9 @@ test_expect_success 'fetch file bundle' '
>  	git -C fetch-to fetch --bundle-uri="$(pwd)/fetch-from/B.bundle" &&
>  	git -C fetch-to rev-parse refs/bundles/topic >actual &&
>  	git -C fetch-from rev-parse topic >expect &&
> -	test_cmp expect actual
> +	test_cmp expect actual &&
> +
> +	test_config -C fetch-to log.excludedecoration refs/bundle/
>  '
>  
>  test_expect_success 'fetch file:// bundle' '
> @@ -36,7 +38,9 @@ test_expect_success 'fetch file:// bundle' '
>  	git -C fetch-file fetch --bundle-uri="file://$(pwd)/fetch-from/B.bundle" &&
>  	git -C fetch-file rev-parse refs/bundles/topic >actual &&
>  	git -C fetch-from rev-parse topic >expect &&
> -	test_cmp expect actual
> +	test_cmp expect actual &&
> +
> +	test_config -C fetch-file log.excludedecoration refs/bundle/
>  '
>  
>  #########################################################################
> @@ -62,7 +66,9 @@ test_expect_success 'fetch HTTP bundle' '
>  	git -C fetch-http fetch --bundle-uri="$HTTPD_URL/B.bundle" &&
>  	git -C fetch-http rev-parse refs/bundles/topic >actual &&
>  	git -C fetch-from rev-parse topic >expect &&
> -	test_cmp expect actual
> +	test_cmp expect actual &&
> +
> +	test_config log.excludedecoration refs/bundle/
>  '
>  
>  # Do not add tests here unless they use the HTTP server, as they will
> -- 
> gitgitgadget

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 0/6] bundle URIs: design doc and initial git fetch --bundle-uri implementation
  2022-06-29 20:40 ` [PATCH v2 0/6] bundle URIs: design doc and initial git fetch --bundle-uri implementation Derrick Stolee via GitGitGadget
                     ` (5 preceding siblings ...)
  2022-06-29 20:40   ` [PATCH v2 6/6] fetch: add 'refs/bundle/' to log.excludeDecoration Derrick Stolee via GitGitGadget
@ 2022-07-21 21:48   ` Josh Steadmon
  2022-07-21 21:56     ` Junio C Hamano
  2022-07-25 13:53   ` [PATCH v3 0/2] " Derrick Stolee via GitGitGadget
  7 siblings, 1 reply; 64+ messages in thread
From: Josh Steadmon @ 2022-07-21 21:48 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, gitster, me, newren, avarab, dyroneteng, Johannes.Schindelin,
	Derrick Stolee

On 2022.06.29 20:40, Derrick Stolee via GitGitGadget wrote:
> This is the first of series towards building the bundle URI feature as
> discussed in previous RFCs, specifically pulled directly out of [5]:
> 
> [1]
> https://lore.kernel.org/git/RFC-cover-00.13-0000000000-20210805T150534Z-avarab@gmail.com/
> [2]
> https://lore.kernel.org/git/cover-0.3-00000000000-20211025T211159Z-avarab@gmail.com/
> [3]
> https://lore.kernel.org/git/pull.1160.git.1645641063.gitgitgadget@gmail.com
> [4]
> https://lore.kernel.org/git/RFC-cover-v2-00.36-00000000000-20220418T165545Z-avarab@gmail.com/
> [5]
> https://lore.kernel.org/git/pull.1234.git.1653072042.gitgitgadget@gmail.com
> 
> The first patch details the long-term design and goals of the bundle URI
> feature, including complicated features such as the bundle-uri protocol v2
> verb and bundle lists with heuristics.
> 
> However, then intention is to start small with the simplest features that
> allow user benefit as soon as possible. In that direction, the rest of this
> series creates the ability to run 'git fetch --bundle-uri=' to skip fetching
> from any remote and instead download the file at the given . Currently, that
> data is expected to be a bundle, which Git will then unbundle and modify the
> refs to be in the 'refs/bundle/' namespace.
> 
> Currently, the can be a literal filename, a file:// URI, or an http[s]://
> URI. Tests are added for both of these cases.
> 
> As outlined in [5], the next steps after this are:
> 
>  1. Add 'git clone --bundle-uri=' to run this 'git fetch --bundle-uri=' step
>     before doing a fetch negotiation with the origin remote. [6]
>  2. Allow parsing a bundle list as a config file at the given URI. The
>     key-value format is unified with the protocol v2 verb (coming in (3)).
>     [7]
>  3. Implement the protocol v2 verb, re-using the bundle list logic from (2).
>     Use this to auto-discover bundle URIs during 'git clone' (behind a
>     config option). [8]
>  4. Implement the 'creationToken' heuristic, allowing incremental 'git
>     fetch' commands to download a bundle list from a configured URI, and
>     only download bundles that are new based on the creation token values.
>     [9]
> 
> I have prepared some of this work as pull requests on my personal fork so
> curious readers can look ahead to where we are going:
> 
> [6] https://github.com/derrickstolee/git/pull/18 [7]
> https://github.com/derrickstolee/git/pull/20 [8]
> https://github.com/derrickstolee/git/pull/21 [9]
> https://github.com/derrickstolee/git/pull/22
> 
> As mentioned in the design document, this is not all that is possible. For
> instance, Ævar's suggestion to download only the bundle headers can be used
> as a second heuristic (and as an augmentation of the timestamp heuristic).
> 
> 
> Updates in v2
> =============
> 
>  * The design document has been updated based on Junio's feedback.
>  * The "bundle.list." keys are now just "bundle.".
>  * The "timestamp" heuristic is now "creationToken".
>  * More clarity on how Git parses data from the bundle URI.
>  * Dropped some unnecessary bundle list keys (*.list, *.requires).
> 
> Thanks, -Stolee
> 
> Derrick Stolee (6):
>   docs: document bundle URI standard
>   remote-curl: add 'get' capability
>   bundle-uri: create basic file-copy logic
>   fetch: add --bundle-uri option
>   bundle-uri: add support for http(s):// and file://
>   fetch: add 'refs/bundle/' to log.excludeDecoration
> 
>  Documentation/fetch-options.txt        |   6 +
>  Documentation/git-fetch.txt            |   1 +
>  Documentation/gitremote-helpers.txt    |   9 +
>  Documentation/technical/bundle-uri.txt | 464 +++++++++++++++++++++++++
>  Makefile                               |   1 +
>  builtin/fetch.c                        |  10 +
>  bundle-uri.c                           | 167 +++++++++
>  bundle-uri.h                           |  14 +
>  remote-curl.c                          |  33 ++
>  t/t5557-http-get.sh                    |  37 ++
>  t/t5558-fetch-bundle-uri.sh            |  77 ++++
>  transport-helper.c                     |   5 +-
>  12 files changed, 823 insertions(+), 1 deletion(-)
>  create mode 100644 Documentation/technical/bundle-uri.txt
>  create mode 100644 bundle-uri.c
>  create mode 100644 bundle-uri.h
>  create mode 100755 t/t5557-http-get.sh
>  create mode 100755 t/t5558-fetch-bundle-uri.sh
> 
> 
> base-commit: 89c6e450fe4a919ecb6fa698005a935531c732cf
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1248%2Fderrickstolee%2Fbundle-redo%2Ffetch-v2
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1248/derrickstolee/bundle-redo/fetch-v2
> Pull-Request: https://github.com/gitgitgadget/git/pull/1248
> 
> Range-diff vs v1:
> 
>  1:  e771b2971d0 ! 1:  d444042dc4d docs: document bundle URI standard
>      @@ Documentation/technical/bundle-uri.txt (new)
>       +  bundle provider to use different URIs for different geographies.
>       +
>       +* The bundle provider can organize the bundles using heuristics, such as
>      -+  timestamps or tokens, to help the client prevent downloading bundles it
>      -+  does not need. When the bundle provider does not provide these
>      -+  heuristics, the client can use optimizations to minimize how much of the
>      -+  data is downloaded.
>      ++  creation tokens, to help the client prevent downloading bundles it does
>      ++  not need. When the bundle provider does not provide these heuristics,
>      ++  the client can use optimizations to minimize how much of the data is
>      ++  downloaded.
>       +
>       +* The bundle provider does not need to be associated with the Git server.
>       +  The client can choose to use the bundle provider without it being
>      @@ Documentation/technical/bundle-uri.txt (new)
>       +SSH URIs.)
>       +
>       +Assuming a `200 OK` response from the server, the content at the URL is
>      -+expected to be of one of two forms:
>      -+
>      -+1. Bundle: A Git bundle file of version 2 or higher.
>      -+
>      -+2. Bundle List: A plain-text file that is parsable using Git's
>      -+   config file parser. This file describes one or more bundles that are
>      -+   accessible from other URIs.
>      ++inspected. First, Git attempts to parse the file as a bundle file of
>      ++version 2 or higher. If the file is not a bundle, then the file is parsed
>      ++as a plain-text file using Git's config parser. The key-value pairs in
>      ++that config file are expected to describe a list of bundle URIs. If
>      ++neither of these parse attempts succeed, then Git will report an error to
>      ++the user that the bundle URI provided erroneous data.
>       +
>       +Any other data provided by the server is considered erroneous.
>       +
>      @@ Documentation/technical/bundle-uri.txt (new)
>       +
>       +A few keys focus on properties of the list itself.
>       +
>      -+bundle.list.version::
>      -+	(Required) This value provides a version number for the table of
>      -+	contents. If a future Git change enables a feature that needs the Git
>      ++bundle.version::
>      ++	(Required) This value provides a version number for the bundle
>      ++	list. If a future Git change enables a feature that needs the Git
>       +	client to react to a new key in the bundle list file, then this version
>       +	will increment. The only current version number is 1, and if any other
>       +	value is specified then Git will fail to use this file.
>       +
>      -+bundle.list.mode::
>      ++bundle.mode::
>       +	(Required) This value has one of two values: `all` and `any`. When `all`
>       +	is specified, then the client should expect to need all of the listed
>       +	bundle URIs that match their repository's requirements. When `any` is
>      @@ Documentation/technical/bundle-uri.txt (new)
>       +	`any` option is used to list a number of different bundle servers
>       +	located in different geographies.
>       +
>      -+bundle.list.heuristic::
>      ++bundle.heuristic::
>       +	If this string-valued key exists, then the bundle list is designed to
>      -+  work well with incremental `git fetch` commands. The heuristic signals
>      -+  that there are additional keys available for each bundle that help
>      -+  determine which subset of bundles the client should download.
>      ++	work well with incremental `git fetch` commands. The heuristic signals
>      ++	that there are additional keys available for each bundle that help
>      ++	determine which subset of bundles the client should download.
>       +
>       +The remaining keys include an `<id>` segment which is a server-designated
>       +name for each available bundle.
>      @@ Documentation/technical/bundle-uri.txt (new)
>       +	bundles across a large number of servers or CDNs with different domain
>       +	names.)
>       +
>      -+bundle.<id>.list::
>      -+	This boolean value indicates whether the client should expect the
>      -+	content from this URI to be a list (if `true`) or a bundle (if `false`).
>      -+	This is typically used when `bundle.list.mode` is `any`.
>      -+
>       +bundle.<id>.filter::
>       +	This string value represents an object filter that should also appear in
>       +	the header of this bundle. The server uses this value to differentiate
>       +	different kinds of bundles from which the client can choose those that
>       +	match their object filters.
>       +
>      -+bundle.<id>.timestamp::
>      -+	This value is the number of seconds since Unix epoch (UTC) that this
>      -+	bundle was created. This is used as an approximation of a point in time
>      -+	that the bundle matches the data available at the origin server. This is
>      -+	used when `bundle.list.heuristic=timestamp`.
>      -+
>      -+bundle.<id>.requires::
>      -+	This string value represents the ID of another bundle. When present, the
>      -+	server is indicating that this bundle contains a thin packfile. If the
>      -+	client does not have all necessary objects to unbundle this packfile,
>      -+	then the client can download the bundle with the `requires` ID and try
>      -+	again. (Note: it may be beneficial to allow the server to specify
>      -+	multiple `requires` bundles.) This is used when
>      -+	`bundle.list.heuristic=timestamp`.
>      ++bundle.<id>.creationToken::
>      ++	This value is a nonnegative 64-bit integer used for sorting the bundles
>      ++	the list. This is used to download a subset of bundles during a fetch
>      ++	when `bundle.heuristic=creationToken`.
>       +
>       +bundle.<id>.location::
>       +	This string value advertises a real-world location from where the bundle
>       +	URI is served. This can be used to present the user with an option for
>      -+	which bundle URI to use. This is only valuable when `bundle.list.mode`
>      -+	is `any`.
>      ++	which bundle URI to use or simply as an informative indicator of which
>      ++	bundle URI was selected by Git. This is only valuable when
>      ++	`bundle.mode` is `any`.
>       +
>       +Here is an example bundle list using the Git config format:
>       +
>       +```
>      -+[bundle "list"]
>      ++[bundle]
>       +	version = 1
>       +	mode = all
>      -+	heuristic = timestamp
>      ++	heuristic = creationToken
>       +
>       +[bundle "2022-02-09-1644442601-daily"]
>      -+	uri = https://bundles.fake.com/git/git/2022-02-09-1644442601-daily.bundle
>      ++	uri = https://bundles.example.com/git/git/2022-02-09-1644442601-daily.bundle
>       +	timestamp = 1644442601
>      -+	requires = 2022-02-02-1643842562
>       +
>       +[bundle "2022-02-02-1643842562"]
>      -+	uri = https://bundles.fake.com/git/git/2022-02-02-1643842562.bundle
>      ++	uri = https://bundles.example.com/git/git/2022-02-02-1643842562.bundle
>       +	timestamp = 1643842562
>       +
>       +[bundle "2022-02-09-1644442631-daily-blobless"]
>       +	uri = 2022-02-09-1644442631-daily-blobless.bundle
>       +	timestamp = 1644442631
>      -+	requires = 2022-02-02-1643842568-blobless
>       +	filter = blob:none
>       +
>       +[bundle "2022-02-02-1643842568-blobless"]
>      @@ Documentation/technical/bundle-uri.txt (new)
>       +	filter = blob:none
>       +```
>       +
>      -+This example uses `bundle.list.mode=all` as well as the
>      -+`bundle.<id>.timestamp` heuristic. It also uses the `bundle.<id>.filter`
>      ++This example uses `bundle.mode=all` as well as the
>      ++`bundle.<id>.creationToken` heuristic. It also uses the `bundle.<id>.filter`
>       +options to present two parallel sets of bundles: one for full clones and
>       +another for blobless partial clones.
>       +
>       +Suppose that this bundle list was found at the URI
>      -+`https://bundles.fake.com/git/git/` and so the two blobless bundles have
>      ++`https://bundles.example.com/git/git/` and so the two blobless bundles have
>       +the following fully-expanded URIs:
>       +
>      -+* `https://bundles.fake.com/git/git/2022-02-09-1644442631-daily-blobless.bundle`
>      -+* `https://bundles.fake.com/git/git/2022-02-02-1643842568-blobless.bundle`
>      ++* `https://bundles.example.com/git/git/2022-02-09-1644442631-daily-blobless.bundle`
>      ++* `https://bundles.example.com/git/git/2022-02-02-1643842568-blobless.bundle`
>       +
>       +Advertising Bundle URIs
>       +-----------------------
>      @@ Documentation/technical/bundle-uri.txt (new)
>       +protocol v2 capability specifically for discovering bundle URIs.
>       +
>       +The client could choose an arbitrary bundle URI as an option _or_ select
>      -+the URI with lowest latency by some exploratory checks. It is up to the
>      ++the URI with best performance by some exploratory checks. It is up to the
>       +bundle provider to decide if having multiple URIs is preferable to a
>       +single URI that is geodistributed through server-side infrastructure.
>       +
>      @@ Documentation/technical/bundle-uri.txt (new)
>       +   Git server.
>       +
>       +2. If the downloaded data from a bundle URI is a bundle, then the client
>      -+   inspects the bundle headers to check that the negative commit OIDs are
>      -+   present in the client repository. If some are missing, then the client
>      -+   delays unbundling until other bundles have been unbundled, making those
>      -+   OIDs present. When all required OIDs are present, the client unbundles
>      -+   that data using a refspec. The default refspec is
>      -+   `+refs/heads/*:refs/bundles/*`, but this can be configured.
>      ++   inspects the bundle headers to check that the prerequisite commit OIDs
>      ++   are present in the client repository. If some are missing, then the
>      ++   client delays unbundling until other bundles have been unbundled,
>      ++   making those OIDs present. When all required OIDs are present, the
>      ++   client unbundles that data using a refspec. The default refspec is
>      ++   `+refs/heads/*:refs/bundles/*`, but this can be configured. These refs
>      ++   are stored so that later `git fetch` negotiations can communicate the
>      ++   bundled refs as `have`s, reducing the size of the fetch over the Git
>      ++   protocol. To allow pruning refs from this ref namespace, Git may
>      ++   introduce a numbered namespace (such as `refs/bundles/<i>/*`) such that
>      ++   stale bundle refs can be deleted.
>       +
>       +3. If the file is instead a bundle list, then the client inspects the
>      -+   `bundle.list.mode` to see if the list is of the `all` or `any` form.
>      ++   `bundle.mode` to see if the list is of the `all` or `any` form.
>       +
>      -+   a. If `bundle.list.mode=all`, then the client considers all bundle
>      ++   a. If `bundle.mode=all`, then the client considers all bundle
>       +      URIs. The list is reduced based on the `bundle.<id>.filter` options
>       +      matching the client repository's partial clone filter. Then, all
>      -+      bundle URIs are requested. If the `bundle.<id>.timestamp` heuristic
>      -+      is provided, then the bundles are downloaded in reverse-
>      -+      chronological order, stopping when a bundle has all required OIDs.
>      -+      The bundles can then be unbundled in chronological order. The client
>      -+      stores the latest timestamp as a heuristic for avoiding future
>      -+      downloads if the bundle list does not advertise newer bundles.
>      -+
>      -+   b. If `bundle.list.mode=any`, then the client can choose any one of the
>      ++      bundle URIs are requested. If the `bundle.<id>.creationToken`
>      ++      heuristic is provided, then the bundles are downloaded in decreasing
>      ++      order by the creation token, stopping when a bundle has all required
>      ++      OIDs. The bundles can then be unbundled in increasing creation token
>      ++      order. The client stores the latest creation token as a heuristic
>      ++      for avoiding future downloads if the bundle list does not advertise
>      ++      bundles with larger creation tokens.
>      ++
>      ++   b. If `bundle.mode=any`, then the client can choose any one of the
>       +      bundle URIs to inspect. The client can use a variety of ways to
>       +      choose among these URIs. The client can also fallback to another URI
>       +      if the initial choice fails to return a result.
>       +
>       +Note that during a clone we expect that all bundles will be required, and
>      -+heuristics such as `bundle.<uri>.timestamp` can be used to download bundles
>      -+in chronological order or in parallel.
>      ++heuristics such as `bundle.<uri>.creationToken` can be used to download
>      ++bundles in chronological order or in parallel.
>       +
>      -+If a given bundle URI is a bundle list with a `bundle.list.heuristic`
>      ++If a given bundle URI is a bundle list with a `bundle.heuristic`
>       +value, then the client can choose to store that URI as its chosen bundle
>       +URI. The client can then navigate directly to that URI during later `git
>       +fetch` calls.
>      @@ Documentation/technical/bundle-uri.txt (new)
>       +
>       +The fetch operation follows the same procedure to download bundles from a
>       +bundle list (although we do _not_ want to use parallel downloads here). We
>      -+expect that the process will end when all negative commit OIDs in a thin
>      -+bundle are already in the object database.
>      ++expect that the process will end when all prerequisite commit OIDs in a
>      ++thin bundle are already in the object database.
>       +
>      -+When using the `timestamp` heuristic, the client can avoid downloading any
>      -+bundles if their timestamps are not larger than the stored timestamp.
>      -+After fetching new bundles, this local timestamp value is updated.
>      ++When using the `creationToken` heuristic, the client can avoid downloading
>      ++any bundles if their creation tokenss are not larger than the stored
>      ++creation token. After fetching new bundles, Git updates this local
>      ++creation token.
>       +
>       +If the bundle provider does not provide a heuristic, then the client
>       +should attempt to inspect the bundle headers before downloading the full
>      @@ Documentation/technical/bundle-uri.txt (new)
>       +  use the `credential.helper` to attempt authentication after the first
>       +  `401 Not Authorized` response, but a second such response is a failure.
>       +
>      -+* The client receives data that is not parsable as a bundle or table of
>      -+  contents.
>      ++* The client receives data that is not parsable as a bundle or bundle list.
>       +
>       +* The bundle list describes a directed cycle in the
>       +  `bundle.<id>.requires` links.
>       +
>       +* A bundle includes a filter that does not match expectations.
>       +
>      -+* The client cannot unbundle the bundles because the negative commit OIDs
>      ++* The client cannot unbundle the bundles because the prerequisite commit OIDs
>       +  are not in the object database and there are no more
>       +  `bundle.<id>.requires` links to follow.
>       +
>      @@ Documentation/technical/bundle-uri.txt (new)
>       +  the client is using hourly prefetches with background maintenance, but
>       +  the server is computing bundles weekly. For this reason, the client
>       +  should not use bundle URIs for fetch unless the server has explicitly
>      -+  recommended it through the `bundle.list.forFetch = true` value.
>      ++  recommended it through the `bundle.flags = forFetch` value.
>       +
>       +Implementation Plan
>       +-------------------
>      @@ Documentation/technical/bundle-uri.txt (new)
>       +
>       +2. Implement the ability to parse a bundle list from a bundle URI and
>       +   update the `git fetch --bundle-uri` logic to properly distinguish
>      -+   between `bundle.list.mode` options. Specifically design the feature so
>      ++   between `bundle.mode` options. Specifically design the feature so
>       +   that the config format parsing feeds a list of key-value pairs into the
>       +   bundle list logic.
>       +
>      @@ Documentation/technical/bundle-uri.txt (new)
>       +   (This choice is an opt-in via a config option and a command-line
>       +   option.)
>       +
>      -+4. Allow the client to understand the `bundle.list.forFetch` configuration
>      -+   and the `bundle.<id>.timestamp` heuristic. When `git clone` discovers a
>      -+   bundle URI with `bundle.list.forFetch=true`, it configures the client
>      -+   repository to check that bundle URI during later `git fetch <remote>`
>      ++4. Allow the client to understand the `bundle.flag=forFetch` configuration
>      ++   and the `bundle.<id>.creationToken` heuristic. When `git clone`
>      ++   discovers a bundle URI with `bundle.flag=forFetch`, it configures the
>      ++   client repository to check that bundle URI during later `git fetch <remote>`
>       +   commands.
>       +
>       +5. Allow clients to discover bundle URIs during `git fetch` and configure
>      -+   a bundle URI for later fetches if `bundle.list.forFetch=true`.
>      ++   a bundle URI for later fetches if `bundle.flag=forFetch`.
>       +
>       +6. Implement the "inspect headers" heuristic to reduce data downloads when
>      -+   the `bundle.<id>.timestamp` heuristic is not available.
>      ++   the `bundle.<id>.creationToken` heuristic is not available.
>       +
>       +As these features are reviewed, this plan might be updated. We also expect
>       +that new designs will be discovered and implemented as this feature
>  2:  977f0af40fc = 2:  0a2cf60437f remote-curl: add 'get' capability
>  3:  8a18acfbf05 ! 3:  abec47564fd bundle-uri: create basic file-copy logic
>      @@ bundle-uri.c (new)
>       +{
>       +	int fd;
>       +	/*
>      -+	 * Find a temporray filename that is available. This is briefly
>      ++	 * Find a temporary filename that is available. This is briefly
>       +	 * racy, but unlikely to collide.
>       +	 */
>       +	fd = odb_mkstemp(name, "bundles/tmp_uri_XXXXXX");
>      @@ bundle-uri.c (new)
>       +	if ((bundle_fd = read_bundle_header(file, &header)) < 0)
>       +		return 1;
>       +
>      -+	result = unbundle(r, &header, bundle_fd, &extra_index_pack_args);
>      ++	if ((result = unbundle(r, &header, bundle_fd, &extra_index_pack_args)))
>      ++		return 1;
>       +
>       +	/*
>       +	 * Convert all refs/heads/ from the bundle into refs/bundles/
>  4:  f9e416164e4 ! 4:  f6255ec5188 fetch: add --bundle-uri option
>      @@ Commit message
>            4. git fetch origin
>            5. git checkout FETCH_HEAD
>       
>      +    Since the unbundle_from_file() in bundle-uri.c stores the bundled refs
>      +    into the refs/bundle/* namespace, those references are useful for the
>      +    fetch negotiation in step 4, reducing the size of the download in that
>      +    step.
>      +
>           Later changes will make this seamless within a 'git clone' command, but
>           this implementation is large enough to delay that integration.
>       
>  5:  923feef84e4 ! 5:  bfbd11b48bf bundle-uri: add support for http(s):// and file://
>      @@ Metadata
>        ## Commit message ##
>           bundle-uri: add support for http(s):// and file://
>       
>      -    The previous change created the logic for copying a file by name. Now,
>      -    first inspect the URI for an HTTP(S) prefix and use git-remote-https as
>      -    the way to download the data at that URI. Otherwise, check to see if
>      -    file:// is present and modify the prefix accordingly.
>      +    The previous change created the 'git fetch --bundle-uri=<uri>' option.
>      +    Currently, <uri> must be a filename.
>      +
>      +    Update copy_uri_to_file() to first inspect the URI for an HTTP(S) prefix
>      +    and use git-remote-https as the way to download the data at that URI.
>      +    Otherwise, check to see if file:// is present and modify the prefix
>      +    accordingly.
>       
>           Signed-off-by: Derrick Stolee <derrickstolee@github.com>
>       
>  6:  5ca02b841db ! 6:  a217e9a0640 fetch: add 'refs/bundle/' to log.excludeDecoration
>      @@ t/t5558-fetch-bundle-uri.sh: test_expect_success 'fetch file bundle' '
>       -	test_cmp expect actual
>       +	test_cmp expect actual &&
>       +
>      -+	test_config log.excludedecoration refs/bundle/
>      ++	test_config -C fetch-to log.excludedecoration refs/bundle/
>        '
>        
>        test_expect_success 'fetch file:// bundle' '
>      @@ t/t5558-fetch-bundle-uri.sh: test_expect_success 'fetch file:// bundle' '
>       -	test_cmp expect actual
>       +	test_cmp expect actual &&
>       +
>      -+	test_config log.excludedecoration refs/bundle/
>      ++	test_config -C fetch-file log.excludedecoration refs/bundle/
>        '
>        
>        #########################################################################
> 
> -- 
> gitgitgadget


Thanks for the series! I think this is a very promising design with just
a few clarifications and minor cleanups needed.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 0/6] bundle URIs: design doc and initial git fetch --bundle-uri implementation
  2022-07-21 21:48   ` [PATCH v2 0/6] bundle URIs: design doc and initial git fetch --bundle-uri implementation Josh Steadmon
@ 2022-07-21 21:56     ` Junio C Hamano
  0 siblings, 0 replies; 64+ messages in thread
From: Junio C Hamano @ 2022-07-21 21:56 UTC (permalink / raw)
  To: Josh Steadmon
  Cc: Derrick Stolee via GitGitGadget, git, me, newren, avarab,
	dyroneteng, Johannes.Schindelin, Derrick Stolee

Josh Steadmon <steadmon@google.com> writes:

> Thanks for the series! I think this is a very promising design with just
> a few clarifications and minor cleanups needed.

Thanks for reviewing.  I agree with the points you raised in your
messages.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 2/6] remote-curl: add 'get' capability
  2022-06-06 19:55 ` [PATCH 2/6] remote-curl: add 'get' capability Derrick Stolee via GitGitGadget
@ 2022-07-21 22:59   ` Junio C Hamano
  0 siblings, 0 replies; 64+ messages in thread
From: Junio C Hamano @ 2022-07-21 22:59 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, me, newren, avarab, dyroneteng, Johannes.Schindelin,
	Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +static void parse_get(struct strbuf *buf)
> +{
> +	struct http_get_options opts = { 0 };
> +	struct strbuf url = STRBUF_INIT;
> +	struct strbuf path = STRBUF_INIT;
> +	const char *p, *space;
> +
> +	if (!skip_prefix(buf->buf, "get ", &p))
> +		die(_("http transport does not support %s"), buf->buf);
> +
> +	space = strchr(p, ' ');
> +
> +	if (!space)
> +		die(_("protocol error: expected '<url> <path>', missing space"));
> +
> +	strbuf_add(&url, p, space - p);
> +	strbuf_addstr(&path, space + 1);
> +
> +	if (http_get_file(url.buf, path.buf, &opts))
> +		die(_("failed to download file at URL '%s'"), url.buf);

Micronit:

Do we plan to add some "interesting" options later to opts?

Otherwise, the third parameter to http_get_file() can safely be
NULL, I think.

> +	test_must_fail git remote-http $url $url <input 2>err &&

It has been a while since I worked with remote-http, but does this
need two $url?

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 1/6] docs: document bundle URI standard
  2022-07-21 21:39     ` Josh Steadmon
@ 2022-07-22 13:15       ` Derrick Stolee
  2022-07-22 15:01       ` Derrick Stolee
  1 sibling, 0 replies; 64+ messages in thread
From: Derrick Stolee @ 2022-07-22 13:15 UTC (permalink / raw)
  To: Josh Steadmon, Derrick Stolee via GitGitGadget, git, gitster, me,
	newren, avarab, dyroneteng, Johannes.Schindelin

On 7/21/2022 5:39 PM, Josh Steadmon wrote:
> On 2022.06.29 20:40, Derrick Stolee via GitGitGadget wrote:
>> From: Derrick Stolee <derrickstolee@github.com>

>> +bundle.heuristic::
>> +	If this string-valued key exists, then the bundle list is designed to
>> +	work well with incremental `git fetch` commands. The heuristic signals
>> +	that there are additional keys available for each bundle that help
>> +	determine which subset of bundles the client should download.
> 
> To be clear, the values of `bundle.heuristic` are keys underneath the
> following `bundle.<id>` segments?

At the moment, the only planned heuristic is associated with a single
key (bundle.<id>.creationToken) but future heuristics could be more
complicated. The general idea is to say "these bundles were organized
with this heuristic in mind, so either take all of them (ignoring the
heuristic) or apply a specific algorithm to download a subset at a
time." The creationToken heuristic uses the total order on the bundles
to do a greedy algorithm for downloading the most-recent bundles until
the required commits are all satisfied.

>> +Related Work: Packfile URIs
>> +---------------------------
> 
> Thanks for including this section; my very first question before I even
> started reading the doc was "How does this compare to Packfile URIs?".
> 
> 
>> +
>> +The Git protocol already has a capability where the Git server can list
>> +a set of URLs along with the packfile response when serving a client
>> +request. The client is then expected to download the packfiles at those
>> +locations in order to have a complete understanding of the response.
>> +
>> +This mechanism is used by the Gerrit server (implemented with JGit) and
>> +has been effective at reducing CPU load and improving user performance for
>> +clones.
>> +
>> +A major downside to this mechanism is that the origin server needs to know
>> +_exactly_ what is in those packfiles, and the packfiles need to be available
>> +to the user for some time after the server has responded. This coupling
>> +between the origin and the packfile data is difficult to manage.
> 
> It was not immediately clear to me why bundle URIs would avoid these two
> downsides, but after thinking about it and discussing it, I believe this
> is the reasoning (please correct me if I'm wrong):
> 
> Bundle URIs are intended to be fully processed before negotiation
> happens, so the server can rely solely on the client's reported Haves /
> Wants to determine its response, as usual, and therefore doesn't need to
> know the bundle contents. Packfile URIs are not processed by the client
> before negotiation, so the server needs to be aware of the packfile
> contents in order to determine which additional objects to send, and
> can't rely solely on the Haves/Wants from the client.
> 
> If that's accurate, then it seems fine that when the bundle URI is
> provided by the user on the command-line, Git can download and process
> the bundle before even attempting to contact the server. But in future
> series, when we add the ability for the server to provide a URI via a
> capability, the client will have to pause after seeing the server's
> advertised URI, fetch and process the bundle, and then proceed with a
> fetch command. This seems fine assuming the bundles can be handled in a
> reasonable amount of time. And even if the connection breaks before the
> fetch command can be issued, presumably the client would not need to
> attempt to re-download the bundle a second time when the client makes a
> second connection to re-attempt the fetch.

After the server has advertised the bundle URIs, the client should
consider disconnecting while downloading bundles, then reconnect to
the server after that is complete.

For stateless connections (https) we need to reconnect every time, so
this isn't a problem.

I anticipate that in the long-term view of this feature, the server
advertised bundle URIs will usually point to an independent bundle
list at a static URI. In that situation, we can store the chosen
bundle list URI in Git config and check that bundle list for new data
before even connecting to the origin Git server. We would only
"rediscover" the URIs when either the bundle URI returns 404 and/or
the origin server lists new URIs.

Of course, there is the option that the origin Git server wants to
track and advertise the individual bundles directly, in which case the
client can't do this opportunistic bundle pre-check.

> For the point about the packfiles needing to be available for some time,
> this makes sense because the server's response is processed by the
> client before the referenced packfile is downloaded, but the response is
> incomplete without the packfile. But with bundle URIs, the server's
> response doesn't depend on whether or not the client actually processed
> the referenced bundle, only on what was negotiated. So the server's
> response is still valid and useful even if the client is unable to get
> the bundle.

You're pointing out the critical feature: the bundles are completely
optional. If the client fails to download the bundles due to something
like this race, then it will just negotiate for the origin Git server
to provide the data.

What's perhaps not entirely fair about this point is that the bundle
provider should keep this race in mind between serving a bundle list
and having the client request the associated bundles. Something like
a 24-hour window between updating the bundle list and deleting an
unreferenced bundle file should satisfy the vast majority of clients.

Thanks for the close look. While I removed them from the context, I've
made note of your smaller comments on things like typos and will fix
them in v3.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 3/6] bundle-uri: create basic file-copy logic
  2022-07-21 21:45     ` Josh Steadmon
@ 2022-07-22 13:18       ` Derrick Stolee
  0 siblings, 0 replies; 64+ messages in thread
From: Derrick Stolee @ 2022-07-22 13:18 UTC (permalink / raw)
  To: Josh Steadmon, Derrick Stolee via GitGitGadget, git, gitster, me,
	newren, avarab, dyroneteng, Johannes.Schindelin

On 7/21/2022 5:45 PM, Josh Steadmon wrote:
> On 2022.06.29 20:40, Derrick Stolee via GitGitGadget wrote:
>> From: Derrick Stolee <derrickstolee@github.com>
>> +static void find_temp_filename(struct strbuf *name)
>> +{
>> +	int fd;
>> +	/*
>> +	 * Find a temporary filename that is available. This is briefly
>> +	 * racy, but unlikely to collide.
>> +	 */
>> +	fd = odb_mkstemp(name, "bundles/tmp_uri_XXXXXX");
>> +	if (fd < 0)
>> +		die(_("failed to create temporary file"));
>> +	close(fd);
>> +	unlink(name->buf);
> 
> Is there a reason why we unlink() here? If we allow the empty file to
> remain on-disk until we write to it, wouldn't that prevent odb_mkstemp()
> from being racy?

I still need to test this, but that should work. Thanks!

>> +static int copy_uri_to_file(const char *uri, const char *file)
> 
> Nitpick: from a brief glance, it seems that most other copy* functions
> take the destination as the first parameter, and the source second. I
> don't feel strongly about it, because to me src followed by dst feels
> more natural, but perhaps we should be consistent with other functions.

Yeah, this is definitely my personal option that (src, dst) feels more
natural to me and I need to do a mental swap whenever dealing with the
standard methods. However, it's best to be consistent, and...
 
> 
>> +{
>> +	/* Copy as a file */
>> +	return copy_file(file, uri, 0444);

...we have exactly that standard usage right here.

>> +	if ((result = unbundle(r, &header, bundle_fd, &extra_index_pack_args)))
> 
> Can we just pass NULL here instead of creating an empty
> extra_index_pack_args?

This isn't the first time I've populated an option instead of just
passing NULL. I'll work on fixing that bad habit.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 6/6] fetch: add 'refs/bundle/' to log.excludeDecoration
  2022-07-21 21:47     ` Josh Steadmon
@ 2022-07-22 13:20       ` Derrick Stolee
  0 siblings, 0 replies; 64+ messages in thread
From: Derrick Stolee @ 2022-07-22 13:20 UTC (permalink / raw)
  To: Josh Steadmon, Derrick Stolee via GitGitGadget, git, gitster, me,
	newren, avarab, dyroneteng, Johannes.Schindelin

On 7/21/2022 5:47 PM, Josh Steadmon wrote:
> On 2022.06.29 20:40, Derrick Stolee via GitGitGadget wrote:
>> From: Derrick Stolee <derrickstolee@github.com>
>>
>> When fetching from a bundle URI, the branches of that bundle are stored
>> in a different ref namespace: refs/bundles/. This namespace is intended
>> to assist with later 'git fetch' negotiations with a Git server,
>> allowing the client to advertise which data it already has from a bundle
>> URI.
>>
>> These references can be confusing for a user when they appear as a
>> decoration in 'git log' output. Add "refs/bundles/" to the multi-valued
>> log.excludeDecoration config value. This is similar to the way
>> "refs/prefetch/" is hidden by background prefetch operations in 'git
>> maintenance' as added by 96eaffebb (maintenance: set
>> log.excludeDecoration durin prefetch, 2021-01-19).

>> +	git_config_set_multivar_gently("log.excludedecoration",
>> +					"refs/bundle/",
>> +					"refs/bundle/",
>> +					CONFIG_FLAGS_FIXED_VALUE |
>> +					CONFIG_FLAGS_MULTI_REPLACE);
>> +
> 
> I dislike the idea of modifying the user's config as a side effect here.
> Since it's a cosmetic issue, can we drop this patch and figure out
> better default values for log.excludedecoration?

You're right.

log.excludeDecoration was initially created precisely for enabling it
within the prefetch maintenance task, but it would be better to change
which decoration set is shown by default. I'll pull this patch out of
this series and bring out another one that rethinks this entire space.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 1/6] docs: document bundle URI standard
  2022-07-21 12:09     ` Matthew John Cheetham
@ 2022-07-22 13:52       ` Derrick Stolee
  2022-07-22 16:03       ` Derrick Stolee
  1 sibling, 0 replies; 64+ messages in thread
From: Derrick Stolee @ 2022-07-22 13:52 UTC (permalink / raw)
  To: Matthew John Cheetham, Derrick Stolee via GitGitGadget
  Cc: gitster, me, newren, avarab, dyroneteng, Johannes.Schindelin, git

On 7/21/2022 8:09 AM, Matthew John Cheetham wrote:> I had a few questions and suggestions; below.
> 
> On 2022-06-29 21:40, Derrick Stolee via GitGitGadget wrote:> +Assuming a `200 OK` response from the server, the content at the URL is
>> +inspected. First, Git attempts to parse the file as a bundle file of
>> +version 2 or higher. If the file is not a bundle, then the file is parsed
>> +as a plain-text file using Git's config parser. The key-value pairs in
>> +that config file are expected to describe a list of bundle URIs. If
>> +neither of these parse attempts succeed, then Git will report an error to
>> +the user that the bundle URI provided erroneous data.
>> +
>> +Any other data provided by the server is considered erroneous.
> 
> I wonder if it may be worth considering adding an optional server requirement ("MAY" not "MUST") to provide a `Content-Type` header indicating if the response is a bundle list (or bundle) to skip straight to parsing the correct type of file? Eg "application/x-git-bundle-list"?
> 
> Even the simplest of content servers should be able to set Content-Type. If not falling back to 'try parse bundle else try parse bundle list' is still OK.
This is an interesting idea. We should keep it in mind for a future
extension, since the Git client needs to do some work to parse the
header and communicate that upwards to the logic that parses the file. 
>> +bundle.mode::
>> +    (Required) This value has one of two values: `all` and `any`. When `all`
>> +    is specified, then the client should expect to need all of the listed
>> +    bundle URIs that match their repository's requirements. When `any` is
>> +    specified, then the client should expect that any one of the bundle URIs
>> +    that match their repository's requirements will suffice. Typically, the
>> +    `any` option is used to list a number of different bundle servers
>> +    located in different geographies.
> 
> Do you forsee any future where we'd want or need to specify 'sets' of bundles where "all" _of_ "any" particular set would be required?
> Eg. there are 3 sets of bundles (A, B, C), and the client would need to download all bundles belonging to any of A, B, or C? Where ABC would be different geo-distributed sets?

The bundle.heuristic space is open-ended to allow for different ways to
scan the bundle list and download a subset, so that might be a way to
extend this in the future.

I think what you're hinting at is a single "global" bundle list that knows
about all of the bundles scattered across the world and it groups bundles
by geography, even though the list knows about multiple geos. Is that what
you mean?

> I guess what I'm getting at here is with this design (which I appreciate is intentionally flexible), there are several different ways a server could direct a client to bundles stored in a nearby geography:
> 
> 1. Serve an "all" bundle list with geo-located bundle URIs?
> 2. Serve a global "any" bundle list with each single bundle in a different geography (specified by `location`)
> 3. Serve a single bundle (not a list) with a different

I think this item 3 is incomplete. What were you going to say?

I'll assume for now that you just mean "3. Serve a single bundle (not a
list)."

> Are any of these going to be preferred over another for potential client optimisations?

I could do better in this document of giving clear examples of how a
bundle provider could organize bundles. As the client becomes more
sophisticated, then different organizations become "unlocked" as something
the client can understand and use efficiently. That also sets the stage
for how to add extensions: the client change can be paired with an example
bundle provider organization.

The bundle provider setup that I personally think will work best is:

 1. The origin Git server advertises a bundle list in "any" mode. Each URI
    is a static URI that corresponds to a different geography. The client
    picks the closest one and stores that URI in local config.

 2. Each static URI provides a bundle list in "all" mode with the
    creationToken heuristic. The bundles are sorted by creation time, and
    new bundles are created on roughly a daily basis, but it could be
    a wider time frame if not enough Git data is added every day. The
    creationTokens are added as appending to this order, but after the
    list has some fixed length, the oldest bundles are merged into a single
    bundle. That merged bundle is then assigned the maximum creationToken
    of the bundles used to create it.

This comes from experience in how the GVFS Cache Server prefetch packfiles
are created. To support those very-active repos, there's another layer of
"hourly" packs that are merged into "daily" packs; those daily packs are
merged into a giant "everything old" pack after 30 days. This means that
there is a maximum of 1 + 30 + 23 packs at any given time.

>> +
>> +bundle.heuristic::
>> +    If this string-valued key exists, then the bundle list is designed to
>> +    work well with incremental `git fetch` commands. The heuristic signals
>> +    that there are additional keys available for each bundle that help
>> +    determine which subset of bundles the client should download.
>> +
>> +The remaining keys include an `<id>` segment which is a server-designated
>> +name for each available bundle.
> 
> Case-sensitive ID? A-Za-z0-9 only? "Same as Git config rules"?

I would say "Same as Git config rules" in general, but we could try to be
more strict here, if we want.

>> +bundle.<id>.location::
>> +    This string value advertises a real-world location from where the bundle
>> +    URI is served. This can be used to present the user with an option for
>> +    which bundle URI to use or simply as an informative indicator of which
>> +    bundle URI was selected by Git. This is only valuable when
>> +    `bundle.mode` is `any`.
> 
> I assume `location` is just an opaque string that is just used for info or display purposes? Does it make sense for other 'display' type strings like 'name' or 'message'?

We should definitely give this key another look when we get around to
building the UX around choosing from a list in "any" mode with human-
readable information like this.

>> +Advertising Bundle URIs
>> +-----------------------
>> +
> ...
>> +The client could choose an arbitrary bundle URI as an option _or_ select
>> +the URI with best performance by some exploratory checks. It is up to the
>> +bundle provider to decide if having multiple URIs is preferable to a
>> +single URI that is geodistributed through server-side infrastructure.
> 
> Would it make sense for the client to pick the first bundle URI rather than an arbitrary one? The server could use information about the request (origin IP/geography) to provide a sorted list of URIs by physical distance to the client.
> 
> I guess if arbitrary is 'random' then this provides some client-side load balancing over multiple potential servers too. Interested in your thoughts behind what would be 'best practice' for a bundle server here.

I tend to think of the client as being the "smartest" participant in this
exchange. The origin Git server is just serving lines from its local config
and isn't thinking at all about the client's network topology. If instead
the bundle provider (not the Git server) organizes to have a single URI
listing all of the geographic options, then that server could do smarter
things, for sure. It might be able to do that reordering to take advantage
of the Git client picking the first option (before the client has the
capability to test the connections to all of them for the fastest link).

>> +If the bundle provider does not provide a heuristic, then the client
>> +should attempt to inspect the bundle headers before downloading the full
>> +bundle data in case the bundle tips already exist in the client
>> +repository.
> 
> Would this default behaviour also be considered another explicit heurisitic option? For example: `bundle.heuristic=default` or `inspect`.
> Is it ever likely that the default behaviour would change?

I think that if there is no explicit heuristic, then this is the only way
the client can avoid downloading all of the bundle content.

>> +* The client receives a response other than `200 OK` (such as `404 Not Found`,
>> +  `401 Not Authorized`, or `500 Internal Server Error`). The client should
>> +  use the `credential.helper` to attempt authentication after the first
>> +  `401 Not Authorized` response, but a second such response is a failure.
> 
> I'd probably say a 500 is not solvable with a different set of credentials, but potentially a retry or just `die`. Do we attempt to `credential_fill()` with anything other than a 401 (or maybe 403 or 404) elsewhere in Git?

This is poorly worded, and I should group 400-level errors in one bullet
and 500 errors as its own category. The implementation uses the same
"retry with credentials" logic that other HTTPS connections make, so the
logic should be shared there. This document should point to another place
that documents that contract, especially in case it changes in the future.

Thanks!
-Stolee

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 1/6] docs: document bundle URI standard
  2022-07-21 21:39     ` Josh Steadmon
  2022-07-22 13:15       ` Derrick Stolee
@ 2022-07-22 15:01       ` Derrick Stolee
  1 sibling, 0 replies; 64+ messages in thread
From: Derrick Stolee @ 2022-07-22 15:01 UTC (permalink / raw)
  To: Josh Steadmon, Derrick Stolee via GitGitGadget, git, gitster, me,
	newren, avarab, dyroneteng, Johannes.Schindelin

On 7/21/2022 5:39 PM, Josh Steadmon wrote:
> On 2022.06.29 20:40, Derrick Stolee via GitGitGadget wrote:
>> From: Derrick Stolee <derrickstolee@github.com>

>> +* A bundle download during a `git fetch` contains objects already in the
>> +  object database. This is probably unavoidable if we are using bundles
>> +  for fetches, since the client will almost always be slightly ahead of
>> +  the bundle servers after performing its "catch-up" fetch to the remote
>> +  server. This extra work is most wasteful when the client is fetching
>> +  much more frequently than the server is computing bundles, such as if
>> +  the client is using hourly prefetches with background maintenance, but
>> +  the server is computing bundles weekly. For this reason, the client
>> +  should not use bundle URIs for fetch unless the server has explicitly
>> +  recommended it through the `bundle.flags = forFetch` value.
> 
> The "Bundle Lists" section above doesn't mention `bundle.flags` at all;
> has this been replaced by `bundle.heuristic`?

You and Matthew both caught this. I first thought that absolutely this
was a replacement I missed. Then I re-checked my future steps and the
very last patch implements `bundle.flags = forFetch`. However, the
reason it exists is for the ability to incrementally fetch even if the
`bundle.heuristic` value is not set. That requires the "download only
the headers" feature, so should be delayed to that implementation.

I'll remove it from the design doc and instead implement the logic to
store the bundle URI for fetching if `bundle.heuristic=creationToken`,
which can be extended later.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 1/6] docs: document bundle URI standard
  2022-07-21 12:09     ` Matthew John Cheetham
  2022-07-22 13:52       ` Derrick Stolee
@ 2022-07-22 16:03       ` Derrick Stolee
  1 sibling, 0 replies; 64+ messages in thread
From: Derrick Stolee @ 2022-07-22 16:03 UTC (permalink / raw)
  To: Matthew John Cheetham, Derrick Stolee via GitGitGadget
  Cc: gitster, me, newren, avarab, dyroneteng, Johannes.Schindelin, git

On 7/21/2022 8:09 AM, Matthew John Cheetham wrote:

>> +The remaining keys include an `<id>` segment which is a server-designated
>> +name for each available bundle.
> 
> Case-sensitive ID? A-Za-z0-9 only? "Same as Git config rules"?

I was thinking "same as Git config rules", but those rules are extremely
flexible, perhaps more than we want them to be. (For instance, the rules
allow specifying "remote.<id>.url" where <id> could be anything except
having a newline or null byte.

Perhaps we should start with "alphanumeric and `-`" like the section
name, just to be extra careful here. We can probably be fine parsing
more flexibly on the client, but better to be restrictive now and
relax later.

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH v3 0/2] bundle URIs: design doc and initial git fetch --bundle-uri implementation
  2022-06-29 20:40 ` [PATCH v2 0/6] bundle URIs: design doc and initial git fetch --bundle-uri implementation Derrick Stolee via GitGitGadget
                     ` (6 preceding siblings ...)
  2022-07-21 21:48   ` [PATCH v2 0/6] bundle URIs: design doc and initial git fetch --bundle-uri implementation Josh Steadmon
@ 2022-07-25 13:53   ` Derrick Stolee via GitGitGadget
  2022-07-25 13:53     ` [PATCH v3 1/2] docs: document bundle URI standard Derrick Stolee via GitGitGadget
                       ` (3 more replies)
  7 siblings, 4 replies; 64+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-07-25 13:53 UTC (permalink / raw)
  To: git
  Cc: gitster, me, newren, avarab, dyroneteng, Johannes.Schindelin,
	SZEDER Gábor, Matthew John Cheetham, Josh Steadmon,
	Derrick Stolee

This is the first of series towards building the bundle URI feature as
discussed in previous RFCs, specifically pulled directly out of [5]:

[1]
https://lore.kernel.org/git/RFC-cover-00.13-0000000000-20210805T150534Z-avarab@gmail.com/
[2]
https://lore.kernel.org/git/cover-0.3-00000000000-20211025T211159Z-avarab@gmail.com/
[3]
https://lore.kernel.org/git/pull.1160.git.1645641063.gitgitgadget@gmail.com
[4]
https://lore.kernel.org/git/RFC-cover-v2-00.36-00000000000-20220418T165545Z-avarab@gmail.com/
[5]
https://lore.kernel.org/git/pull.1234.git.1653072042.gitgitgadget@gmail.com

THIS ONLY INCLUDES THE DESIGN DOCUMENT. See "Updates in v3". There are two
patches:

 1. The main design document that details the bundle URI standard and how
    the client interacts with the bundle data.
 2. An addendum to the design document that details one strategy for
    organizing bundles from the perspective of a bundle provider.

As outlined in [5], the next steps after this are:

 1. Add 'git clone --bundle-uri=' to run a 'git bundle fetch ' step before
    doing a fetch negotiation with the origin remote. [6]
 2. Allow parsing a bundle list as a config file at the given URI. The
    key-value format is unified with the protocol v2 verb (coming in (3)).
    [7]
 3. Implement the protocol v2 verb, re-using the bundle list logic from (2).
    Use this to auto-discover bundle URIs during 'git clone' (behind a
    config option). [8]
 4. Implement the 'creationToken' heuristic, allowing incremental 'git
    fetch' commands to download a bundle list from a configured URI, and
    only download bundles that are new based on the creation token values.
    [9]

I have prepared some of this work as pull requests on my personal fork so
curious readers can look ahead to where we are going:

[6] https://github.com/derrickstolee/git/pull/18 [7]
https://github.com/derrickstolee/git/pull/20 [8]
https://github.com/derrickstolee/git/pull/21 [9]
https://github.com/derrickstolee/git/pull/22

As mentioned in the design document, this is not all that is possible. For
instance, Ævar's suggestion to download only the bundle headers can be used
as a second heuristic (and as an augmentation of the timestamp heuristic).


Updates in v3
=============

 * This version only includes the design document. Thanks to all the
   reviewers for the significant attention that improves the doc a lot.
 * The second patch has an addition to the design document that details a
   potential way to organize bundles from the provider's perspective.
 * Based on some off-list feedback, I was going to switch git fetch
   --bundle-uri into git bundle fetch, but that has a major conflict with
   [10] which was just submitted.
 * I will move the git bundle fetch implementation into [6] which also has
   the git clone --bundle-uri implementation. [10]
   https://lore.kernel.org/git/20220725123857.2773963-1-szeder.dev@gmail.com/


Updates in v2
=============

 * The design document has been updated based on Junio's feedback.
 * The "bundle.list." keys are now just "bundle.".
 * The "timestamp" heuristic is now "creationToken".
 * More clarity on how Git parses data from the bundle URI.
 * Dropped some unnecessary bundle list keys (*.list, *.requires).

Thanks, -Stolee

Derrick Stolee (2):
  docs: document bundle URI standard
  bundle-uri: add example bundle organization

 Documentation/Makefile                 |   1 +
 Documentation/technical/bundle-uri.txt | 573 +++++++++++++++++++++++++
 2 files changed, 574 insertions(+)
 create mode 100644 Documentation/technical/bundle-uri.txt


base-commit: e72d93e88cb20b06e88e6e7d81bd1dc4effe453f
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1248%2Fderrickstolee%2Fbundle-redo%2Ffetch-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1248/derrickstolee/bundle-redo/fetch-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/1248

Range-diff vs v2:

 1:  d444042dc4d ! 1:  e0f003e1b5f docs: document bundle URI standard
     @@ Commit message
      
          Signed-off-by: Derrick Stolee <derrickstolee@github.com>
      
     + ## Documentation/Makefile ##
     +@@ Documentation/Makefile: TECH_DOCS += SubmittingPatches
     + TECH_DOCS += ToolsForGit
     + TECH_DOCS += technical/bitmap-format
     + TECH_DOCS += technical/bundle-format
     ++TECH_DOCS += technical/bundle-uri
     + TECH_DOCS += technical/cruft-packs
     + TECH_DOCS += technical/hash-function-transition
     + TECH_DOCS += technical/http-protocol
     +
       ## Documentation/technical/bundle-uri.txt (new) ##
      @@
      +Bundle URIs
      +===========
      +
     ++Git bundles are files that store a pack-file along with some extra metadata,
     ++including a set of refs and a (possibly empty) set of necessary commits. See
     ++linkgit:git-bundle[1] and link:bundle-format.txt[the bundle format] for more
     ++information.
     ++
      +Bundle URIs are locations where Git can download one or more bundles in
      +order to bootstrap the object database in advance of fetching the remaining
      +objects from a remote.
     @@ Documentation/technical/bundle-uri.txt (new)
      +	If this string-valued key exists, then the bundle list is designed to
      +	work well with incremental `git fetch` commands. The heuristic signals
      +	that there are additional keys available for each bundle that help
     -+	determine which subset of bundles the client should download.
     ++	determine which subset of bundles the client should download. The only
     ++  heuristic currently planned is `creationToken`.
      +
      +The remaining keys include an `<id>` segment which is a server-designated
     -+name for each available bundle.
     ++name for each available bundle. The `<id>` must contain only alphanumeric
     ++and `-` characters.
      +
      +bundle.<id>.uri::
      +	(Required) This string value is the URI for downloading bundle `<id>`.
     @@ Documentation/technical/bundle-uri.txt (new)
      +
      +Here is an example bundle list using the Git config format:
      +
     -+```
     -+[bundle]
     -+	version = 1
     -+	mode = all
     -+	heuristic = creationToken
     ++	[bundle]
     ++		version = 1
     ++		mode = all
     ++		heuristic = creationToken
      +
     -+[bundle "2022-02-09-1644442601-daily"]
     -+	uri = https://bundles.example.com/git/git/2022-02-09-1644442601-daily.bundle
     -+	timestamp = 1644442601
     ++	[bundle "2022-02-09-1644442601-daily"]
     ++		uri = https://bundles.example.com/git/git/2022-02-09-1644442601-daily.bundle
     ++		creationToken = 1644442601
      +
     -+[bundle "2022-02-02-1643842562"]
     -+	uri = https://bundles.example.com/git/git/2022-02-02-1643842562.bundle
     -+	timestamp = 1643842562
     ++	[bundle "2022-02-02-1643842562"]
     ++		uri = https://bundles.example.com/git/git/2022-02-02-1643842562.bundle
     ++		creationToken = 1643842562
      +
     -+[bundle "2022-02-09-1644442631-daily-blobless"]
     -+	uri = 2022-02-09-1644442631-daily-blobless.bundle
     -+	timestamp = 1644442631
     -+	filter = blob:none
     ++	[bundle "2022-02-09-1644442631-daily-blobless"]
     ++		uri = 2022-02-09-1644442631-daily-blobless.bundle
     ++		creationToken = 1644442631
     ++		filter = blob:none
      +
     -+[bundle "2022-02-02-1643842568-blobless"]
     -+	uri = /git/git/2022-02-02-1643842568-blobless.bundle
     -+	timestamp = 1643842568
     -+	filter = blob:none
     -+```
     ++	[bundle "2022-02-02-1643842568-blobless"]
     ++		uri = /git/git/2022-02-02-1643842568-blobless.bundle
     ++		creationToken = 1643842568
     ++		filter = blob:none
      +
      +This example uses `bundle.mode=all` as well as the
      +`bundle.<id>.creationToken` heuristic. It also uses the `bundle.<id>.filter`
     @@ Documentation/technical/bundle-uri.txt (new)
      +* The client fails to connect with a server at the given URI or a connection
      +  is lost without any chance to recover.
      +
     -+* The client receives a response other than `200 OK` (such as `404 Not Found`,
     -+  `401 Not Authorized`, or `500 Internal Server Error`). The client should
     -+  use the `credential.helper` to attempt authentication after the first
     -+  `401 Not Authorized` response, but a second such response is a failure.
     ++* The client receives a 400-level response (such as `404 Not Found` or
     ++  `401 Not Authorized`). The client should use the credential helper to
     ++  find and provide a credential for the URI, but match the semantics of
     ++  Git's other HTTP protocols in terms of handling specific 400-level
     ++  errors.
      +
     -+* The client receives data that is not parsable as a bundle or bundle list.
     ++* The server reports any other failure reponse.
      +
     -+* The bundle list describes a directed cycle in the
     -+  `bundle.<id>.requires` links.
     ++* The client receives data that is not parsable as a bundle or bundle list.
      +
      +* A bundle includes a filter that does not match expectations.
      +
      +* The client cannot unbundle the bundles because the prerequisite commit OIDs
     -+  are not in the object database and there are no more
     -+  `bundle.<id>.requires` links to follow.
     ++  are not in the object database and there are no more bundles to download.
      +
      +There are also situations that could be seen as wasteful, but are not
      +error conditions:
     @@ Documentation/technical/bundle-uri.txt (new)
      +  the client is using hourly prefetches with background maintenance, but
      +  the server is computing bundles weekly. For this reason, the client
      +  should not use bundle URIs for fetch unless the server has explicitly
     -+  recommended it through the `bundle.flags = forFetch` value.
     ++  recommended it through a `bundle.heuristic` value.
      +
      +Implementation Plan
      +-------------------
     @@ Documentation/technical/bundle-uri.txt (new)
      +   that the config format parsing feeds a list of key-value pairs into the
      +   bundle list logic.
      +
     -+3. Create the `bundle-uri` protocol v2 verb so Git servers can advertise
     ++3. Create the `bundle-uri` protocol v2 command so Git servers can advertise
      +   bundle URIs using the key-value pairs. Plug into the existing key-value
      +   input to the bundle list logic. Allow `git clone` to discover these
      +   bundle URIs and bootstrap the client repository from the bundle data.
 2:  0a2cf60437f < -:  ----------- remote-curl: add 'get' capability
 3:  abec47564fd < -:  ----------- bundle-uri: create basic file-copy logic
 4:  f6255ec5188 < -:  ----------- fetch: add --bundle-uri option
 5:  bfbd11b48bf < -:  ----------- bundle-uri: add support for http(s):// and file://
 6:  a217e9a0640 < -:  ----------- fetch: add 'refs/bundle/' to log.excludeDecoration
 -:  ----------- > 2:  a933471c3af bundle-uri: add example bundle organization

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH v3 1/2] docs: document bundle URI standard
  2022-07-25 13:53   ` [PATCH v3 0/2] " Derrick Stolee via GitGitGadget
@ 2022-07-25 13:53     ` Derrick Stolee via GitGitGadget
  2022-07-28  1:23       ` tenglong.tl
  2022-07-25 13:53     ` [PATCH v3 2/2] bundle-uri: add example bundle organization Derrick Stolee via GitGitGadget
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 64+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-07-25 13:53 UTC (permalink / raw)
  To: git
  Cc: gitster, me, newren, avarab, dyroneteng, Johannes.Schindelin,
	SZEDER Gábor, Matthew John Cheetham, Josh Steadmon,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

Introduce the idea of bundle URIs to the Git codebase through an
aspirational design document. This document includes the full design
intended to include the feature in its fully-implemented form. This will
take several steps as detailed in the Implementation Plan section.

By committing this document now, it can be used to motivate changes
necessary to reach these final goals. The design can still be altered as
new information is discovered.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 Documentation/Makefile                 |   1 +
 Documentation/technical/bundle-uri.txt | 468 +++++++++++++++++++++++++
 2 files changed, 469 insertions(+)
 create mode 100644 Documentation/technical/bundle-uri.txt

diff --git a/Documentation/Makefile b/Documentation/Makefile
index 4f801f4e4c9..e71ee915912 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -96,6 +96,7 @@ TECH_DOCS += SubmittingPatches
 TECH_DOCS += ToolsForGit
 TECH_DOCS += technical/bitmap-format
 TECH_DOCS += technical/bundle-format
+TECH_DOCS += technical/bundle-uri
 TECH_DOCS += technical/cruft-packs
 TECH_DOCS += technical/hash-function-transition
 TECH_DOCS += technical/http-protocol
diff --git a/Documentation/technical/bundle-uri.txt b/Documentation/technical/bundle-uri.txt
new file mode 100644
index 00000000000..af195f2b429
--- /dev/null
+++ b/Documentation/technical/bundle-uri.txt
@@ -0,0 +1,468 @@
+Bundle URIs
+===========
+
+Git bundles are files that store a pack-file along with some extra metadata,
+including a set of refs and a (possibly empty) set of necessary commits. See
+linkgit:git-bundle[1] and link:bundle-format.txt[the bundle format] for more
+information.
+
+Bundle URIs are locations where Git can download one or more bundles in
+order to bootstrap the object database in advance of fetching the remaining
+objects from a remote.
+
+One goal is to speed up clones and fetches for users with poor network
+connectivity to the origin server. Another benefit is to allow heavy users,
+such as CI build farms, to use local resources for the majority of Git data
+and thereby reducing the load on the origin server.
+
+To enable the bundle URI feature, users can specify a bundle URI using
+command-line options or the origin server can advertise one or more URIs
+via a protocol v2 capability.
+
+Design Goals
+------------
+
+The bundle URI standard aims to be flexible enough to satisfy multiple
+workloads. The bundle provider and the Git client have several choices in
+how they create and consume bundle URIs.
+
+* Bundles can have whatever name the server desires. This name could refer
+  to immutable data by using a hash of the bundle contents. However, this
+  means that a new URI will be needed after every update of the content.
+  This might be acceptable if the server is advertising the URI (and the
+  server is aware of new bundles being generated) but would not be
+  ergonomic for users using the command line option.
+
+* The bundles could be organized specifically for bootstrapping full
+  clones, but could also be organized with the intention of bootstrapping
+  incremental fetches. The bundle provider must decide on one of several
+  organization schemes to minimize client downloads during incremental
+  fetches, but the Git client can also choose whether to use bundles for
+  either of these operations.
+
+* The bundle provider can choose to support full clones, partial clones,
+  or both. The client can detect which bundles are appropriate for the
+  repository's partial clone filter, if any.
+
+* The bundle provider can use a single bundle (for clones only), or a
+  list of bundles. When using a list of bundles, the provider can specify
+  whether or not the client needs _all_ of the bundle URIs for a full
+  clone, or if _any_ one of the bundle URIs is sufficient. This allows the
+  bundle provider to use different URIs for different geographies.
+
+* The bundle provider can organize the bundles using heuristics, such as
+  creation tokens, to help the client prevent downloading bundles it does
+  not need. When the bundle provider does not provide these heuristics,
+  the client can use optimizations to minimize how much of the data is
+  downloaded.
+
+* The bundle provider does not need to be associated with the Git server.
+  The client can choose to use the bundle provider without it being
+  advertised by the Git server.
+
+* The client can choose to discover bundle providers that are advertised
+  by the Git server. This could happen during `git clone`, during
+  `git fetch`, both, or neither. The user can choose which combination
+  works best for them.
+
+* The client can choose to configure a bundle provider manually at any
+  time. The client can also choose to specify a bundle provider manually
+  as a command-line option to `git clone`.
+
+Each repository is different and every Git server has different needs.
+Hopefully the bundle URI feature is flexible enough to satisfy all needs.
+If not, then the feature can be extended through its versioning mechanism.
+
+Server requirements
+-------------------
+
+To provide a server-side implementation of bundle servers, no other parts
+of the Git protocol are required. This allows server maintainers to use
+static content solutions such as CDNs in order to serve the bundle files.
+
+At the current scope of the bundle URI feature, all URIs are expected to
+be HTTP(S) URLs where content is downloaded to a local file using a `GET`
+request to that URL. The server could include authentication requirements
+to those requests with the aim of triggering the configured credential
+helper for secure access. (Future extensions could use "file://" URIs or
+SSH URIs.)
+
+Assuming a `200 OK` response from the server, the content at the URL is
+inspected. First, Git attempts to parse the file as a bundle file of
+version 2 or higher. If the file is not a bundle, then the file is parsed
+as a plain-text file using Git's config parser. The key-value pairs in
+that config file are expected to describe a list of bundle URIs. If
+neither of these parse attempts succeed, then Git will report an error to
+the user that the bundle URI provided erroneous data.
+
+Any other data provided by the server is considered erroneous.
+
+Bundle Lists
+------------
+
+The Git server can advertise bundle URIs using a set of `key=value` pairs.
+A bundle URI can also serve a plain-text file in the Git config format
+containing these same `key=value` pairs. In both cases, we consider this
+to be a _bundle list_. The pairs specify information about the bundles
+that the client can use to make decisions for which bundles to download
+and which to ignore.
+
+A few keys focus on properties of the list itself.
+
+bundle.version::
+	(Required) This value provides a version number for the bundle
+	list. If a future Git change enables a feature that needs the Git
+	client to react to a new key in the bundle list file, then this version
+	will increment. The only current version number is 1, and if any other
+	value is specified then Git will fail to use this file.
+
+bundle.mode::
+	(Required) This value has one of two values: `all` and `any`. When `all`
+	is specified, then the client should expect to need all of the listed
+	bundle URIs that match their repository's requirements. When `any` is
+	specified, then the client should expect that any one of the bundle URIs
+	that match their repository's requirements will suffice. Typically, the
+	`any` option is used to list a number of different bundle servers
+	located in different geographies.
+
+bundle.heuristic::
+	If this string-valued key exists, then the bundle list is designed to
+	work well with incremental `git fetch` commands. The heuristic signals
+	that there are additional keys available for each bundle that help
+	determine which subset of bundles the client should download. The only
+  heuristic currently planned is `creationToken`.
+
+The remaining keys include an `<id>` segment which is a server-designated
+name for each available bundle. The `<id>` must contain only alphanumeric
+and `-` characters.
+
+bundle.<id>.uri::
+	(Required) This string value is the URI for downloading bundle `<id>`.
+	If the URI begins with a protocol (`http://` or `https://`) then the URI
+	is absolute. Otherwise, the URI is interpreted as relative to the URI
+	used for the bundle list. If the URI begins with `/`, then that relative
+	path is relative to the domain name used for the bundle list. (This use
+	of relative paths is intended to make it easier to distribute a set of
+	bundles across a large number of servers or CDNs with different domain
+	names.)
+
+bundle.<id>.filter::
+	This string value represents an object filter that should also appear in
+	the header of this bundle. The server uses this value to differentiate
+	different kinds of bundles from which the client can choose those that
+	match their object filters.
+
+bundle.<id>.creationToken::
+	This value is a nonnegative 64-bit integer used for sorting the bundles
+	the list. This is used to download a subset of bundles during a fetch
+	when `bundle.heuristic=creationToken`.
+
+bundle.<id>.location::
+	This string value advertises a real-world location from where the bundle
+	URI is served. This can be used to present the user with an option for
+	which bundle URI to use or simply as an informative indicator of which
+	bundle URI was selected by Git. This is only valuable when
+	`bundle.mode` is `any`.
+
+Here is an example bundle list using the Git config format:
+
+	[bundle]
+		version = 1
+		mode = all
+		heuristic = creationToken
+
+	[bundle "2022-02-09-1644442601-daily"]
+		uri = https://bundles.example.com/git/git/2022-02-09-1644442601-daily.bundle
+		creationToken = 1644442601
+
+	[bundle "2022-02-02-1643842562"]
+		uri = https://bundles.example.com/git/git/2022-02-02-1643842562.bundle
+		creationToken = 1643842562
+
+	[bundle "2022-02-09-1644442631-daily-blobless"]
+		uri = 2022-02-09-1644442631-daily-blobless.bundle
+		creationToken = 1644442631
+		filter = blob:none
+
+	[bundle "2022-02-02-1643842568-blobless"]
+		uri = /git/git/2022-02-02-1643842568-blobless.bundle
+		creationToken = 1643842568
+		filter = blob:none
+
+This example uses `bundle.mode=all` as well as the
+`bundle.<id>.creationToken` heuristic. It also uses the `bundle.<id>.filter`
+options to present two parallel sets of bundles: one for full clones and
+another for blobless partial clones.
+
+Suppose that this bundle list was found at the URI
+`https://bundles.example.com/git/git/` and so the two blobless bundles have
+the following fully-expanded URIs:
+
+* `https://bundles.example.com/git/git/2022-02-09-1644442631-daily-blobless.bundle`
+* `https://bundles.example.com/git/git/2022-02-02-1643842568-blobless.bundle`
+
+Advertising Bundle URIs
+-----------------------
+
+If a user knows a bundle URI for the repository they are cloning, then
+they can specify that URI manually through a command-line option. However,
+a Git host may want to advertise bundle URIs during the clone operation,
+helping users unaware of the feature.
+
+The only thing required for this feature is that the server can advertise
+one or more bundle URIs. This advertisement takes the form of a new
+protocol v2 capability specifically for discovering bundle URIs.
+
+The client could choose an arbitrary bundle URI as an option _or_ select
+the URI with best performance by some exploratory checks. It is up to the
+bundle provider to decide if having multiple URIs is preferable to a
+single URI that is geodistributed through server-side infrastructure.
+
+Cloning with Bundle URIs
+------------------------
+
+The primary need for bundle URIs is to speed up clones. The Git client
+will interact with bundle URIs according to the following flow:
+
+1. The user specifies a bundle URI with the `--bundle-uri` command-line
+   option _or_ the client discovers a bundle list advertised by the
+   Git server.
+
+2. If the downloaded data from a bundle URI is a bundle, then the client
+   inspects the bundle headers to check that the prerequisite commit OIDs
+   are present in the client repository. If some are missing, then the
+   client delays unbundling until other bundles have been unbundled,
+   making those OIDs present. When all required OIDs are present, the
+   client unbundles that data using a refspec. The default refspec is
+   `+refs/heads/*:refs/bundles/*`, but this can be configured. These refs
+   are stored so that later `git fetch` negotiations can communicate the
+   bundled refs as `have`s, reducing the size of the fetch over the Git
+   protocol. To allow pruning refs from this ref namespace, Git may
+   introduce a numbered namespace (such as `refs/bundles/<i>/*`) such that
+   stale bundle refs can be deleted.
+
+3. If the file is instead a bundle list, then the client inspects the
+   `bundle.mode` to see if the list is of the `all` or `any` form.
+
+   a. If `bundle.mode=all`, then the client considers all bundle
+      URIs. The list is reduced based on the `bundle.<id>.filter` options
+      matching the client repository's partial clone filter. Then, all
+      bundle URIs are requested. If the `bundle.<id>.creationToken`
+      heuristic is provided, then the bundles are downloaded in decreasing
+      order by the creation token, stopping when a bundle has all required
+      OIDs. The bundles can then be unbundled in increasing creation token
+      order. The client stores the latest creation token as a heuristic
+      for avoiding future downloads if the bundle list does not advertise
+      bundles with larger creation tokens.
+
+   b. If `bundle.mode=any`, then the client can choose any one of the
+      bundle URIs to inspect. The client can use a variety of ways to
+      choose among these URIs. The client can also fallback to another URI
+      if the initial choice fails to return a result.
+
+Note that during a clone we expect that all bundles will be required, and
+heuristics such as `bundle.<uri>.creationToken` can be used to download
+bundles in chronological order or in parallel.
+
+If a given bundle URI is a bundle list with a `bundle.heuristic`
+value, then the client can choose to store that URI as its chosen bundle
+URI. The client can then navigate directly to that URI during later `git
+fetch` calls.
+
+When downloading bundle URIs, the client can choose to inspect the initial
+content before committing to downloading the entire content. This may
+provide enough information to determine if the URI is a bundle list or
+a bundle. In the case of a bundle, the client may inspect the bundle
+header to determine that all advertised tips are already in the client
+repository and cancel the remaining download.
+
+Fetching with Bundle URIs
+-------------------------
+
+When the client fetches new data, it can decide to fetch from bundle
+servers before fetching from the origin remote. This could be done via a
+command-line option, but it is more likely useful to use a config value
+such as the one specified during the clone.
+
+The fetch operation follows the same procedure to download bundles from a
+bundle list (although we do _not_ want to use parallel downloads here). We
+expect that the process will end when all prerequisite commit OIDs in a
+thin bundle are already in the object database.
+
+When using the `creationToken` heuristic, the client can avoid downloading
+any bundles if their creation tokenss are not larger than the stored
+creation token. After fetching new bundles, Git updates this local
+creation token.
+
+If the bundle provider does not provide a heuristic, then the client
+should attempt to inspect the bundle headers before downloading the full
+bundle data in case the bundle tips already exist in the client
+repository.
+
+Error Conditions
+----------------
+
+If the Git client discovers something unexpected while downloading
+information according to a bundle URI or the bundle list found at that
+location, then Git can ignore that data and continue as if it was not
+given a bundle URI. The remote Git server is the ultimate source of truth,
+not the bundle URI.
+
+Here are a few example error conditions:
+
+* The client fails to connect with a server at the given URI or a connection
+  is lost without any chance to recover.
+
+* The client receives a 400-level response (such as `404 Not Found` or
+  `401 Not Authorized`). The client should use the credential helper to
+  find and provide a credential for the URI, but match the semantics of
+  Git's other HTTP protocols in terms of handling specific 400-level
+  errors.
+
+* The server reports any other failure reponse.
+
+* The client receives data that is not parsable as a bundle or bundle list.
+
+* A bundle includes a filter that does not match expectations.
+
+* The client cannot unbundle the bundles because the prerequisite commit OIDs
+  are not in the object database and there are no more bundles to download.
+
+There are also situations that could be seen as wasteful, but are not
+error conditions:
+
+* The downloaded bundles contain more information than is requested by
+  the clone or fetch request. A primary example is if the user requests
+  a clone with `--single-branch` but downloads bundles that store every
+  reachable commit from all `refs/heads/*` references. This might be
+  initially wasteful, but perhaps these objects will become reachable by
+  a later ref update that the client cares about.
+
+* A bundle download during a `git fetch` contains objects already in the
+  object database. This is probably unavoidable if we are using bundles
+  for fetches, since the client will almost always be slightly ahead of
+  the bundle servers after performing its "catch-up" fetch to the remote
+  server. This extra work is most wasteful when the client is fetching
+  much more frequently than the server is computing bundles, such as if
+  the client is using hourly prefetches with background maintenance, but
+  the server is computing bundles weekly. For this reason, the client
+  should not use bundle URIs for fetch unless the server has explicitly
+  recommended it through a `bundle.heuristic` value.
+
+Implementation Plan
+-------------------
+
+This design document is being submitted on its own as an aspirational
+document, with the goal of implementing all of the mentioned client
+features over the course of several patch series. Here is a potential
+outline for submitting these features:
+
+1. Integrate bundle URIs into `git clone` with a `--bundle-uri` option.
+   This will include a new `git fetch --bundle-uri` mode for use as the
+   implementation underneath `git clone`. The initial version here will
+   expect a single bundle at the given URI.
+
+2. Implement the ability to parse a bundle list from a bundle URI and
+   update the `git fetch --bundle-uri` logic to properly distinguish
+   between `bundle.mode` options. Specifically design the feature so
+   that the config format parsing feeds a list of key-value pairs into the
+   bundle list logic.
+
+3. Create the `bundle-uri` protocol v2 command so Git servers can advertise
+   bundle URIs using the key-value pairs. Plug into the existing key-value
+   input to the bundle list logic. Allow `git clone` to discover these
+   bundle URIs and bootstrap the client repository from the bundle data.
+   (This choice is an opt-in via a config option and a command-line
+   option.)
+
+4. Allow the client to understand the `bundle.flag=forFetch` configuration
+   and the `bundle.<id>.creationToken` heuristic. When `git clone`
+   discovers a bundle URI with `bundle.flag=forFetch`, it configures the
+   client repository to check that bundle URI during later `git fetch <remote>`
+   commands.
+
+5. Allow clients to discover bundle URIs during `git fetch` and configure
+   a bundle URI for later fetches if `bundle.flag=forFetch`.
+
+6. Implement the "inspect headers" heuristic to reduce data downloads when
+   the `bundle.<id>.creationToken` heuristic is not available.
+
+As these features are reviewed, this plan might be updated. We also expect
+that new designs will be discovered and implemented as this feature
+matures and becomes used in real-world scenarios.
+
+Related Work: Packfile URIs
+---------------------------
+
+The Git protocol already has a capability where the Git server can list
+a set of URLs along with the packfile response when serving a client
+request. The client is then expected to download the packfiles at those
+locations in order to have a complete understanding of the response.
+
+This mechanism is used by the Gerrit server (implemented with JGit) and
+has been effective at reducing CPU load and improving user performance for
+clones.
+
+A major downside to this mechanism is that the origin server needs to know
+_exactly_ what is in those packfiles, and the packfiles need to be available
+to the user for some time after the server has responded. This coupling
+between the origin and the packfile data is difficult to manage.
+
+Further, this implementation is extremely hard to make work with fetches.
+
+Related Work: GVFS Cache Servers
+--------------------------------
+
+The GVFS Protocol [2] is a set of HTTP endpoints designed independently of
+the Git project before Git's partial clone was created. One feature of this
+protocol is the idea of a "cache server" which can be colocated with build
+machines or developer offices to transfer Git data without overloading the
+central server.
+
+The endpoint that VFS for Git is famous for is the `GET /gvfs/objects/{oid}`
+endpoint, which allows downloading an object on-demand. This is a critical
+piece of the filesystem virtualization of that product.
+
+However, a more subtle need is the `GET /gvfs/prefetch?lastPackTimestamp=<t>`
+endpoint. Given an optional timestamp, the cache server responds with a list
+of precomputed packfiles containing the commits and trees that were introduced
+in those time intervals.
+
+The cache server computes these "prefetch" packfiles using the following
+strategy:
+
+1. Every hour, an "hourly" pack is generated with a given timestamp.
+2. Nightly, the previous 24 hourly packs are rolled up into a "daily" pack.
+3. Nightly, all prefetch packs more than 30 days old are rolled up into
+   one pack.
+
+When a user runs `gvfs clone` or `scalar clone` against a repo with cache
+servers, the client requests all prefetch packfiles, which is at most
+`24 + 30 + 1` packfiles downloading only commits and trees. The client
+then follows with a request to the origin server for the references, and
+attempts to checkout that tip reference. (There is an extra endpoint that
+helps get all reachable trees from a given commit, in case that commit
+was not already in a prefetch packfile.)
+
+During a `git fetch`, a hook requests the prefetch endpoint using the
+most-recent timestamp from a previously-downloaded prefetch packfile.
+Only the list of packfiles with later timestamps are downloaded. Most
+users fetch hourly, so they get at most one hourly prefetch pack. Users
+whose machines have been off or otherwise have not fetched in over 30 days
+might redownload all prefetch packfiles. This is rare.
+
+It is important to note that the clients always contact the origin server
+for the refs advertisement, so the refs are frequently "ahead" of the
+prefetched pack data. The missing objects are downloaded on-demand using
+the `GET gvfs/objects/{oid}` requests, when needed by a command such as
+`git checkout` or `git log`. Some Git optimizations disable checks that
+would cause these on-demand downloads to be too aggressive.
+
+See Also
+--------
+
+[1] https://lore.kernel.org/git/RFC-cover-00.13-0000000000-20210805T150534Z-avarab@gmail.com/
+    An earlier RFC for a bundle URI feature.
+
+[2] https://github.com/microsoft/VFSForGit/blob/master/Protocol.md
+    The GVFS Protocol
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v3 2/2] bundle-uri: add example bundle organization
  2022-07-25 13:53   ` [PATCH v3 0/2] " Derrick Stolee via GitGitGadget
  2022-07-25 13:53     ` [PATCH v3 1/2] docs: document bundle URI standard Derrick Stolee via GitGitGadget
@ 2022-07-25 13:53     ` Derrick Stolee via GitGitGadget
  2022-08-04 16:09       ` Matthew John Cheetham
  2022-07-25 20:05     ` [PATCH v3 0/2] bundle URIs: design doc and initial git fetch --bundle-uri implementation Josh Steadmon
  2022-08-09 13:12     ` [PATCH v4 0/2] bundle URIs: design doc Derrick Stolee via GitGitGadget
  3 siblings, 1 reply; 64+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-07-25 13:53 UTC (permalink / raw)
  To: git
  Cc: gitster, me, newren, avarab, dyroneteng, Johannes.Schindelin,
	SZEDER Gábor, Matthew John Cheetham, Josh Steadmon,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The previous change introduced the bundle URI design document. It
creates a flexible set of options that allow bundle providers many ways
to organize Git object data and speed up clones and fetches. It is
particularly important that we have flexibility so we can apply future
advancements as new ideas for efficiently organizing Git data are
discovered.

However, the design document does not provide even an example of how
bundles could be organized, and that makes it difficult to envision how
the feature should work at the end of the implementation plan.

Add a section that details how a bundle provider could work, including
using the Git server advertisement for multiple geo-distributed servers.
This organization is based on the GVFS Cache Servers which have
successfully used similar ideas to provide fast object access and
reduced server load for very large repositories.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 Documentation/technical/bundle-uri.txt | 105 +++++++++++++++++++++++++
 1 file changed, 105 insertions(+)

diff --git a/Documentation/technical/bundle-uri.txt b/Documentation/technical/bundle-uri.txt
index af195f2b429..5575a005177 100644
--- a/Documentation/technical/bundle-uri.txt
+++ b/Documentation/technical/bundle-uri.txt
@@ -349,6 +349,111 @@ error conditions:
   should not use bundle URIs for fetch unless the server has explicitly
   recommended it through a `bundle.heuristic` value.
 
+Example Bundle Provider organization
+------------------------------------
+
+The bundle URI feature is intentionally designed to be flexible to
+different ways a bundle provider wants to organize the object data.
+However, it can be helpful to have a complete organization model described
+here so providers can start from that base.
+
+This example organization is a simplified model of what is used by the
+GVFS Cache Servers (see section near the end of this document) which have
+been beneficial in speeding up clones and fetches for very large
+repositories, although using extra software outside of Git.
+
+The bundle provider deploys servers across multiple geographies. Each
+server manages its own bundle set. The server can track a number of Git
+repositories, but provides a bundle list for each based on a pattern. For
+example, when mirroring a repository at `https://<domain>/<org>/<repo>`
+the bundle server could have its bundle list available at
+`https://<server-url>/<domain>/<org>/<repo>`. The origin Git server can
+list all of these servers under the "any" mode:
+
+	[bundle]
+		version = 1
+		mode = any
+		
+	[bundle "eastus"]
+		uri = https://eastus.example.com/<domain>/<org>/<repo>
+		
+	[bundle "europe"]
+		uri = https://europe.example.com/<domain>/<org>/<repo>
+		
+	[bundle "apac"]
+		uri = https://apac.example.com/<domain>/<org>/<repo>
+
+This "list of lists" is static and only changes if a bundle server is
+added or removed.
+
+Each bundle server manages its own set of bundles. The initial bundle list
+contains only a single bundle, containing all of the objects received from
+cloning the repository from the origin server. The list uses the
+`creationToken` heuristic and a `creationToken` is made for the bundle
+based on the server's timestamp.
+
+The bundle server runs regularly-scheduled updates for the bundle list,
+such as once a day. During this task, the server fetches the latest
+contents from the origin server and generates a bundle containing the
+objects reachable from the latest origin refs, but not contained in a
+previously-computed bundle. This bundle is added to the list, with care
+that the `creationToken` is strictly greater than the previous maximum
+`creationToken`.
+
+When the bundle list grows too large, say more than 30 bundles, then the
+oldest "_N_ minus 30" bundles are combined into a single bundle. This
+bundle's `creationToken` is equal to the maximum `creationToken` among the
+merged bundles.
+
+An example bundle list is provided here, although it only has two daily
+bundles and not a full list of 30:
+
+	[bundle]
+		version = 1
+		mode = all
+		heuristic = creationToken
+
+	[bundle "2022-02-13-1644770820-daily"]
+		uri = https://eastus.example.com/<domain>/<org>/<repo>/2022-02-09-1644770820-daily.bundle
+		creationToken = 1644770820
+
+	[bundle "2022-02-09-1644442601-daily"]
+		uri = https://eastus.example.com/<domain>/<org>/<repo>/2022-02-09-1644442601-daily.bundle
+		creationToken = 1644442601
+
+	[bundle "2022-02-02-1643842562"]
+		uri = https://eastus.example.com/<domain>/<org>/<repo>/2022-02-02-1643842562.bundle
+		creationToken = 1643842562
+
+To avoid storing and serving object data in perpetuity despite becoming
+unreachable in the origin server, this bundle merge can be more careful.
+Instead of taking an absolute union of the old bundles, instead the bundle
+can be created by looking at the newer bundles and ensuring that their
+necessary commits are all available in this merged bundle (or in another
+one of the newer bundles). This allows "expiring" object data that is not
+being used by new commits in this window of time. That data could be
+reintroduced by a later push.
+
+The intention of this data organization has two main goals. First, initial
+clones of the repository become faster by downloading precomputed object
+data from a closer source. Second, `git fetch` commands can be faster,
+especially if the client has not fetched for a few days. However, if a
+client does not fetch for 30 days, then the bundle list organization would
+cause redownloading a large amount of object data.
+
+One way to make this organization more useful to users who fetch frequently
+is to have more frequent bundle creation. For example, bundles could be
+created every hour, and then once a day those "hourly" bundles could be
+merged into a "daily" bundle. The daily bundles are merged into the
+oldest bundle after 30 days.
+
+It is recommened that this bundle strategy is repeated with the `blob:none`
+filter if clients of this repository are expecting to use blobless partial
+clones. This list of blobless bundles stays in the same list as the full
+bundles, but uses the `bundle.<id>.filter` key to separate the two groups.
+For very large repositories, the bundle provider may want to _only_ provide
+blobless bundles.
+
 Implementation Plan
 -------------------
 
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 0/2] bundle URIs: design doc and initial git fetch --bundle-uri implementation
  2022-07-25 13:53   ` [PATCH v3 0/2] " Derrick Stolee via GitGitGadget
  2022-07-25 13:53     ` [PATCH v3 1/2] docs: document bundle URI standard Derrick Stolee via GitGitGadget
  2022-07-25 13:53     ` [PATCH v3 2/2] bundle-uri: add example bundle organization Derrick Stolee via GitGitGadget
@ 2022-07-25 20:05     ` Josh Steadmon
  2022-08-09 13:12     ` [PATCH v4 0/2] bundle URIs: design doc Derrick Stolee via GitGitGadget
  3 siblings, 0 replies; 64+ messages in thread
From: Josh Steadmon @ 2022-07-25 20:05 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: git, gitster, me, newren, avarab, dyroneteng, Johannes.Schindelin,
	SZEDER Gábor, Matthew John Cheetham, Derrick Stolee

On 2022.07.25 13:53, Derrick Stolee via GitGitGadget wrote:
> This is the first of series towards building the bundle URI feature as
> discussed in previous RFCs, specifically pulled directly out of [5]:
> 
> [1]
> https://lore.kernel.org/git/RFC-cover-00.13-0000000000-20210805T150534Z-avarab@gmail.com/
> [2]
> https://lore.kernel.org/git/cover-0.3-00000000000-20211025T211159Z-avarab@gmail.com/
> [3]
> https://lore.kernel.org/git/pull.1160.git.1645641063.gitgitgadget@gmail.com
> [4]
> https://lore.kernel.org/git/RFC-cover-v2-00.36-00000000000-20220418T165545Z-avarab@gmail.com/
> [5]
> https://lore.kernel.org/git/pull.1234.git.1653072042.gitgitgadget@gmail.com
> 
> THIS ONLY INCLUDES THE DESIGN DOCUMENT. See "Updates in v3". There are two
> patches:
> 
>  1. The main design document that details the bundle URI standard and how
>     the client interacts with the bundle data.
>  2. An addendum to the design document that details one strategy for
>     organizing bundles from the perspective of a bundle provider.
> 
> As outlined in [5], the next steps after this are:
> 
>  1. Add 'git clone --bundle-uri=' to run a 'git bundle fetch ' step before
>     doing a fetch negotiation with the origin remote. [6]
>  2. Allow parsing a bundle list as a config file at the given URI. The
>     key-value format is unified with the protocol v2 verb (coming in (3)).
>     [7]
>  3. Implement the protocol v2 verb, re-using the bundle list logic from (2).
>     Use this to auto-discover bundle URIs during 'git clone' (behind a
>     config option). [8]
>  4. Implement the 'creationToken' heuristic, allowing incremental 'git
>     fetch' commands to download a bundle list from a configured URI, and
>     only download bundles that are new based on the creation token values.
>     [9]
> 
> I have prepared some of this work as pull requests on my personal fork so
> curious readers can look ahead to where we are going:
> 
> [6] https://github.com/derrickstolee/git/pull/18 [7]
> https://github.com/derrickstolee/git/pull/20 [8]
> https://github.com/derrickstolee/git/pull/21 [9]
> https://github.com/derrickstolee/git/pull/22
> 
> As mentioned in the design document, this is not all that is possible. For
> instance, Ævar's suggestion to download only the bundle headers can be used
> as a second heuristic (and as an augmentation of the timestamp heuristic).
> 
> 
> Updates in v3
> =============
> 
>  * This version only includes the design document. Thanks to all the
>    reviewers for the significant attention that improves the doc a lot.
>  * The second patch has an addition to the design document that details a
>    potential way to organize bundles from the provider's perspective.
>  * Based on some off-list feedback, I was going to switch git fetch
>    --bundle-uri into git bundle fetch, but that has a major conflict with
>    [10] which was just submitted.
>  * I will move the git bundle fetch implementation into [6] which also has
>    the git clone --bundle-uri implementation. [10]
>    https://lore.kernel.org/git/20220725123857.2773963-1-szeder.dev@gmail.com/
> 
> 
> Updates in v2
> =============
> 
>  * The design document has been updated based on Junio's feedback.
>  * The "bundle.list." keys are now just "bundle.".
>  * The "timestamp" heuristic is now "creationToken".
>  * More clarity on how Git parses data from the bundle URI.
>  * Dropped some unnecessary bundle list keys (*.list, *.requires).
> 
> Thanks, -Stolee
> 
> Derrick Stolee (2):
>   docs: document bundle URI standard
>   bundle-uri: add example bundle organization
> 
>  Documentation/Makefile                 |   1 +
>  Documentation/technical/bundle-uri.txt | 573 +++++++++++++++++++++++++
>  2 files changed, 574 insertions(+)
>  create mode 100644 Documentation/technical/bundle-uri.txt
> 
> 
> base-commit: e72d93e88cb20b06e88e6e7d81bd1dc4effe453f
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1248%2Fderrickstolee%2Fbundle-redo%2Ffetch-v3
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1248/derrickstolee/bundle-redo/fetch-v3
> Pull-Request: https://github.com/gitgitgadget/git/pull/1248
> 
> Range-diff vs v2:
> 
>  1:  d444042dc4d ! 1:  e0f003e1b5f docs: document bundle URI standard
>      @@ Commit message
>       
>           Signed-off-by: Derrick Stolee <derrickstolee@github.com>
>       
>      + ## Documentation/Makefile ##
>      +@@ Documentation/Makefile: TECH_DOCS += SubmittingPatches
>      + TECH_DOCS += ToolsForGit
>      + TECH_DOCS += technical/bitmap-format
>      + TECH_DOCS += technical/bundle-format
>      ++TECH_DOCS += technical/bundle-uri
>      + TECH_DOCS += technical/cruft-packs
>      + TECH_DOCS += technical/hash-function-transition
>      + TECH_DOCS += technical/http-protocol
>      +
>        ## Documentation/technical/bundle-uri.txt (new) ##
>       @@
>       +Bundle URIs
>       +===========
>       +
>      ++Git bundles are files that store a pack-file along with some extra metadata,
>      ++including a set of refs and a (possibly empty) set of necessary commits. See
>      ++linkgit:git-bundle[1] and link:bundle-format.txt[the bundle format] for more
>      ++information.
>      ++
>       +Bundle URIs are locations where Git can download one or more bundles in
>       +order to bootstrap the object database in advance of fetching the remaining
>       +objects from a remote.
>      @@ Documentation/technical/bundle-uri.txt (new)
>       +	If this string-valued key exists, then the bundle list is designed to
>       +	work well with incremental `git fetch` commands. The heuristic signals
>       +	that there are additional keys available for each bundle that help
>      -+	determine which subset of bundles the client should download.
>      ++	determine which subset of bundles the client should download. The only
>      ++  heuristic currently planned is `creationToken`.
>       +
>       +The remaining keys include an `<id>` segment which is a server-designated
>      -+name for each available bundle.
>      ++name for each available bundle. The `<id>` must contain only alphanumeric
>      ++and `-` characters.
>       +
>       +bundle.<id>.uri::
>       +	(Required) This string value is the URI for downloading bundle `<id>`.
>      @@ Documentation/technical/bundle-uri.txt (new)
>       +
>       +Here is an example bundle list using the Git config format:
>       +
>      -+```
>      -+[bundle]
>      -+	version = 1
>      -+	mode = all
>      -+	heuristic = creationToken
>      ++	[bundle]
>      ++		version = 1
>      ++		mode = all
>      ++		heuristic = creationToken
>       +
>      -+[bundle "2022-02-09-1644442601-daily"]
>      -+	uri = https://bundles.example.com/git/git/2022-02-09-1644442601-daily.bundle
>      -+	timestamp = 1644442601
>      ++	[bundle "2022-02-09-1644442601-daily"]
>      ++		uri = https://bundles.example.com/git/git/2022-02-09-1644442601-daily.bundle
>      ++		creationToken = 1644442601
>       +
>      -+[bundle "2022-02-02-1643842562"]
>      -+	uri = https://bundles.example.com/git/git/2022-02-02-1643842562.bundle
>      -+	timestamp = 1643842562
>      ++	[bundle "2022-02-02-1643842562"]
>      ++		uri = https://bundles.example.com/git/git/2022-02-02-1643842562.bundle
>      ++		creationToken = 1643842562
>       +
>      -+[bundle "2022-02-09-1644442631-daily-blobless"]
>      -+	uri = 2022-02-09-1644442631-daily-blobless.bundle
>      -+	timestamp = 1644442631
>      -+	filter = blob:none
>      ++	[bundle "2022-02-09-1644442631-daily-blobless"]
>      ++		uri = 2022-02-09-1644442631-daily-blobless.bundle
>      ++		creationToken = 1644442631
>      ++		filter = blob:none
>       +
>      -+[bundle "2022-02-02-1643842568-blobless"]
>      -+	uri = /git/git/2022-02-02-1643842568-blobless.bundle
>      -+	timestamp = 1643842568
>      -+	filter = blob:none
>      -+```
>      ++	[bundle "2022-02-02-1643842568-blobless"]
>      ++		uri = /git/git/2022-02-02-1643842568-blobless.bundle
>      ++		creationToken = 1643842568
>      ++		filter = blob:none
>       +
>       +This example uses `bundle.mode=all` as well as the
>       +`bundle.<id>.creationToken` heuristic. It also uses the `bundle.<id>.filter`
>      @@ Documentation/technical/bundle-uri.txt (new)
>       +* The client fails to connect with a server at the given URI or a connection
>       +  is lost without any chance to recover.
>       +
>      -+* The client receives a response other than `200 OK` (such as `404 Not Found`,
>      -+  `401 Not Authorized`, or `500 Internal Server Error`). The client should
>      -+  use the `credential.helper` to attempt authentication after the first
>      -+  `401 Not Authorized` response, but a second such response is a failure.
>      ++* The client receives a 400-level response (such as `404 Not Found` or
>      ++  `401 Not Authorized`). The client should use the credential helper to
>      ++  find and provide a credential for the URI, but match the semantics of
>      ++  Git's other HTTP protocols in terms of handling specific 400-level
>      ++  errors.
>       +
>      -+* The client receives data that is not parsable as a bundle or bundle list.
>      ++* The server reports any other failure reponse.
>       +
>      -+* The bundle list describes a directed cycle in the
>      -+  `bundle.<id>.requires` links.
>      ++* The client receives data that is not parsable as a bundle or bundle list.
>       +
>       +* A bundle includes a filter that does not match expectations.
>       +
>       +* The client cannot unbundle the bundles because the prerequisite commit OIDs
>      -+  are not in the object database and there are no more
>      -+  `bundle.<id>.requires` links to follow.
>      ++  are not in the object database and there are no more bundles to download.
>       +
>       +There are also situations that could be seen as wasteful, but are not
>       +error conditions:
>      @@ Documentation/technical/bundle-uri.txt (new)
>       +  the client is using hourly prefetches with background maintenance, but
>       +  the server is computing bundles weekly. For this reason, the client
>       +  should not use bundle URIs for fetch unless the server has explicitly
>      -+  recommended it through the `bundle.flags = forFetch` value.
>      ++  recommended it through a `bundle.heuristic` value.
>       +
>       +Implementation Plan
>       +-------------------
>      @@ Documentation/technical/bundle-uri.txt (new)
>       +   that the config format parsing feeds a list of key-value pairs into the
>       +   bundle list logic.
>       +
>      -+3. Create the `bundle-uri` protocol v2 verb so Git servers can advertise
>      ++3. Create the `bundle-uri` protocol v2 command so Git servers can advertise
>       +   bundle URIs using the key-value pairs. Plug into the existing key-value
>       +   input to the bundle list logic. Allow `git clone` to discover these
>       +   bundle URIs and bootstrap the client repository from the bundle data.
>  2:  0a2cf60437f < -:  ----------- remote-curl: add 'get' capability
>  3:  abec47564fd < -:  ----------- bundle-uri: create basic file-copy logic
>  4:  f6255ec5188 < -:  ----------- fetch: add --bundle-uri option
>  5:  bfbd11b48bf < -:  ----------- bundle-uri: add support for http(s):// and file://
>  6:  a217e9a0640 < -:  ----------- fetch: add 'refs/bundle/' to log.excludeDecoration
>  -:  ----------- > 2:  a933471c3af bundle-uri: add example bundle organization
> 
> -- 
> gitgitgadget

Looks good to me, thanks for the series!

Reviewed-by: Josh Steadmon <steadmon@google.com>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH v3 1/2] docs: document bundle URI standard
  2022-07-25 13:53     ` [PATCH v3 1/2] docs: document bundle URI standard Derrick Stolee via GitGitGadget
@ 2022-07-28  1:23       ` tenglong.tl
  2022-08-01 13:42         ` Derrick Stolee
  0 siblings, 1 reply; 64+ messages in thread
From: tenglong.tl @ 2022-07-28  1:23 UTC (permalink / raw)
  To: gitgitgadget
  Cc: Johannes.Schindelin, avarab, derrickstolee, dyroneteng, git,
	gitster, me, mjcheetham, newren, steadmon, szeder.dev

Derrick Stolee <derrickstolee@github.com> writes:

> +bundle.heuristic::
> +	If this string-valued key exists, then the bundle list is designed to
> +	work well with incremental `git fetch` commands. The heuristic signals
> +	that there are additional keys available for each bundle that help
> +	determine which subset of bundles the client should download. The only
> +  heuristic currently planned is `creationToken`.

Seem like there is a small indent nit at the last line?

Thanks.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 1/2] docs: document bundle URI standard
  2022-07-28  1:23       ` tenglong.tl
@ 2022-08-01 13:42         ` Derrick Stolee
  0 siblings, 0 replies; 64+ messages in thread
From: Derrick Stolee @ 2022-08-01 13:42 UTC (permalink / raw)
  To: tenglong.tl, gitgitgadget
  Cc: Johannes.Schindelin, avarab, git, gitster, me, mjcheetham, newren,
	steadmon, szeder.dev

On 7/27/2022 9:23 PM, tenglong.tl wrote:
> Derrick Stolee <derrickstolee@github.com> writes:
> 
>> +bundle.heuristic::
>> +	If this string-valued key exists, then the bundle list is designed to
>> +	work well with incremental `git fetch` commands. The heuristic signals
>> +	that there are additional keys available for each bundle that help
>> +	determine which subset of bundles the client should download. The only
>> +  heuristic currently planned is `creationToken`.
> 
> Seem like there is a small indent nit at the last line?

Thanks for noticing. I had a setting malfunction and it is fixed now.

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 2/2] bundle-uri: add example bundle organization
  2022-07-25 13:53     ` [PATCH v3 2/2] bundle-uri: add example bundle organization Derrick Stolee via GitGitGadget
@ 2022-08-04 16:09       ` Matthew John Cheetham
  2022-08-04 17:39         ` Derrick Stolee
  0 siblings, 1 reply; 64+ messages in thread
From: Matthew John Cheetham @ 2022-08-04 16:09 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: gitster, me, newren, avarab, dyroneteng, Johannes.Schindelin,
	SZEDER Gábor, Josh Steadmon, Derrick Stolee, git

On 2022-07-25 14:53, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <derrickstolee@github.com>
> 
> The previous change introduced the bundle URI design document. It
> creates a flexible set of options that allow bundle providers many ways
> to organize Git object data and speed up clones and fetches. It is
> particularly important that we have flexibility so we can apply future
> advancements as new ideas for efficiently organizing Git data are
> discovered.
> 
> However, the design document does not provide even an example of how
> bundles could be organized, and that makes it difficult to envision how
> the feature should work at the end of the implementation plan.
> 
> Add a section that details how a bundle provider could work, including
> using the Git server advertisement for multiple geo-distributed servers.
> This organization is based on the GVFS Cache Servers which have
> successfully used similar ideas to provide fast object access and
> reduced server load for very large repositories.
Thanks! This patch is helpful guidance for bundle server implementors.
> +This example organization is a simplified model of what is used by the
> +GVFS Cache Servers (see section near the end of this document) which have
> +been beneficial in speeding up clones and fetches for very large
> +repositories, although using extra software outside of Git.

Nit: might be a good idea to use "VFS for Git" rather than the old name
"GVFS" [1].

> +The bundle provider deploys servers across multiple geographies. Each
> +server manages its own bundle set. The server can track a number of Git
> +repositories, but provides a bundle list for each based on a pattern. For
> +example, when mirroring a repository at `https://<domain>/<org>/<repo>`
> +the bundle server could have its bundle list available at
> +`https://<server-url>/<domain>/<org>/<repo>`. The origin Git server can
> +list all of these servers under the "any" mode:
> +
> +	[bundle]
> +		version = 1
> +		mode = any
> +		
> +	[bundle "eastus"]
> +		uri = https://eastus.example.com/<domain>/<org>/<repo>
> +		
> +	[bundle "europe"]
> +		uri = https://europe.example.com/<domain>/<org>/<repo>
> +		
> +	[bundle "apac"]
> +		uri = https://apac.example.com/<domain>/<org>/<repo>
> +
> +This "list of lists" is static and only changes if a bundle server is
> +added or removed.
> +
> +Each bundle server manages its own set of bundles. The initial bundle list
> +contains only a single bundle, containing all of the objects received from
> +cloning the repository from the origin server. The list uses the
> +`creationToken` heuristic and a `creationToken` is made for the bundle
> +based on the server's timestamp.

Just to confirm, in this example the origin server advertises a single
URL (over v2 protocol) that points to this example "list of lists"?

Remote -> 1 URL -> List(any/split by geo) -> List(all/split by time)

> +The bundle server runs regularly-scheduled updates for the bundle list,
> +such as once a day. During this task, the server fetches the latest
> +contents from the origin server and generates a bundle containing the
> +objects reachable from the latest origin refs, but not contained in a
> +previously-computed bundle. This bundle is added to the list, with care
> +that the `creationToken` is strictly greater than the previous maximum
> +`creationToken`.
> +
> +When the bundle list grows too large, say more than 30 bundles, then the
> +oldest "_N_ minus 30" bundles are combined into a single bundle. This
> +bundle's `creationToken` is equal to the maximum `creationToken` among the
> +merged bundles.
> +
> +An example bundle list is provided here, although it only has two daily
> +bundles and not a full list of 30:
> +
> +	[bundle]
> +		version = 1
> +		mode = all
> +		heuristic = creationToken
> +
> +	[bundle "2022-02-13-1644770820-daily"]
> +		uri = https://eastus.example.com/<domain>/<org>/<repo>/2022-02-09-1644770820-daily.bundle
> +		creationToken = 1644770820
> +
> +	[bundle "2022-02-09-1644442601-daily"]
> +		uri = https://eastus.example.com/<domain>/<org>/<repo>/2022-02-09-1644442601-daily.bundle
> +		creationToken = 1644442601
> +
> +	[bundle "2022-02-02-1643842562"]
> +		uri = https://eastus.example.com/<domain>/<org>/<repo>/2022-02-02-1643842562.bundle
> +		creationToken = 1643842562
> +
> +To avoid storing and serving object data in perpetuity despite becoming
> +unreachable in the origin server, this bundle merge can be more careful.
> +Instead of taking an absolute union of the old bundles, instead the bundle
> +can be created by looking at the newer bundles and ensuring that their
> +necessary commits are all available in this merged bundle (or in another
> +one of the newer bundles). This allows "expiring" object data that is not
> +being used by new commits in this window of time. That data could be
> +reintroduced by a later push.
> +
> +The intention of this data organization has two main goals. First, initial
> +clones of the repository become faster by downloading precomputed object
> +data from a closer source. Second, `git fetch` commands can be faster,
> +especially if the client has not fetched for a few days. However, if a
> +client does not fetch for 30 days, then the bundle list organization would
> +cause redownloading a large amount of object data.
> +
> +One way to make this organization more useful to users who fetch frequently
> +is to have more frequent bundle creation. For example, bundles could be
> +created every hour, and then once a day those "hourly" bundles could be
> +merged into a "daily" bundle. The daily bundles are merged into the
> +oldest bundle after 30 days.
> +
> +It is recommened that this bundle strategy is repeated with the `blob:none`
> +filter if clients of this repository are expecting to use blobless partial
> +clones. This list of blobless bundles stays in the same list as the full
> +bundles, but uses the `bundle.<id>.filter` key to separate the two groups.
> +For very large repositories, the bundle provider may want to _only_ provide
> +blobless bundles.
> +
>  Implementation Plan
>  -------------------
>  
In general this looks good to me!

[1] https://github.com/microsoft/VFSForGit/issues/72

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 2/2] bundle-uri: add example bundle organization
  2022-08-04 16:09       ` Matthew John Cheetham
@ 2022-08-04 17:39         ` Derrick Stolee
  2022-08-04 20:29           ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 64+ messages in thread
From: Derrick Stolee @ 2022-08-04 17:39 UTC (permalink / raw)
  To: Matthew John Cheetham, Derrick Stolee via GitGitGadget
  Cc: gitster, me, newren, avarab, dyroneteng, Johannes.Schindelin,
	SZEDER Gábor, Josh Steadmon, git

On 8/4/2022 12:09 PM, Matthew John Cheetham wrote:
> On 2022-07-25 14:53, Derrick Stolee via GitGitGadget wrote:
>> From: Derrick Stolee <derrickstolee@github.com>
>>
>> The previous change introduced the bundle URI design document. It
>> creates a flexible set of options that allow bundle providers many ways
>> to organize Git object data and speed up clones and fetches. It is
>> particularly important that we have flexibility so we can apply future
>> advancements as new ideas for efficiently organizing Git data are
>> discovered.
>>
>> However, the design document does not provide even an example of how
>> bundles could be organized, and that makes it difficult to envision how
>> the feature should work at the end of the implementation plan.
>>
>> Add a section that details how a bundle provider could work, including
>> using the Git server advertisement for multiple geo-distributed servers.
>> This organization is based on the GVFS Cache Servers which have
>> successfully used similar ideas to provide fast object access and
>> reduced server load for very large repositories.
> Thanks! This patch is helpful guidance for bundle server implementors.
>> +This example organization is a simplified model of what is used by the
>> +GVFS Cache Servers (see section near the end of this document) which have
>> +been beneficial in speeding up clones and fetches for very large
>> +repositories, although using extra software outside of Git.
> 
> Nit: might be a good idea to use "VFS for Git" rather than the old name
> "GVFS" [1].

The rename from "GVFS" to "VFS for Git" is made even more confusing
because "the GVFS Protocol" keeps the name since it is independent of
the virtual filesystem part (and has "gvfs" in the API routes). In
particular, "the GVFS Cache Servers" provide a repository mirror using
the GVFS protocol and can be used by things like Scalar (when using
the microsoft/git fork).
 
>> +The bundle provider deploys servers across multiple geographies. Each
>> +server manages its own bundle set. The server can track a number of Git
>> +repositories, but provides a bundle list for each based on a pattern. For
>> +example, when mirroring a repository at `https://<domain>/<org>/<repo>`
>> +the bundle server could have its bundle list available at
>> +`https://<server-url>/<domain>/<org>/<repo>`. The origin Git server can
>> +list all of these servers under the "any" mode:
>> +
>> +	[bundle]
>> +		version = 1
>> +		mode = any
>> +		
>> +	[bundle "eastus"]
>> +		uri = https://eastus.example.com/<domain>/<org>/<repo>
>> +		
>> +	[bundle "europe"]
>> +		uri = https://europe.example.com/<domain>/<org>/<repo>
>> +		
>> +	[bundle "apac"]
>> +		uri = https://apac.example.com/<domain>/<org>/<repo>
>> +
>> +This "list of lists" is static and only changes if a bundle server is
>> +added or removed.
>> +
>> +Each bundle server manages its own set of bundles. The initial bundle list
>> +contains only a single bundle, containing all of the objects received from
>> +cloning the repository from the origin server. The list uses the
>> +`creationToken` heuristic and a `creationToken` is made for the bundle
>> +based on the server's timestamp.
> 
> Just to confirm, in this example the origin server advertises a single
> URL (over v2 protocol) that points to this example "list of lists"?

No, here the origin server provides the list of lists using the 'bundle-uri'
protocol v2 command. Using the config file format was an unfortunate choice
on my part because that actually uses "key=value" lines.

This could be more clear by using that format:

  bundle.version=1
  bundle.mode=any
  bundle.eastus.uri=https://eastus.example.com/<domain>/<org>/<repo>
  bundle.europe.uri=https://europe.example.com/<domain>/<org>/<repo>
  bundle.apac.uri=https://apac.example.com/<domain>/<org>/<repo>

> Remote -> 1 URL -> List(any/split by geo) -> List(all/split by time)

Adding the final link to each individual bundle, it would actually be:

  Remote (any/split by geo) -> List(all/split by time) -> Bundle

I can definitely update this part of the doc to be more clear.

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 2/2] bundle-uri: add example bundle organization
  2022-08-04 17:39         ` Derrick Stolee
@ 2022-08-04 20:29           ` Ævar Arnfjörð Bjarmason
  2022-08-05 18:29             ` Derrick Stolee
  0 siblings, 1 reply; 64+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2022-08-04 20:29 UTC (permalink / raw)
  To: Derrick Stolee
  Cc: Matthew John Cheetham, Derrick Stolee via GitGitGadget, gitster,
	me, newren, dyroneteng, Johannes.Schindelin, SZEDER Gábor,
	Josh Steadmon, git


On Thu, Aug 04 2022, Derrick Stolee wrote:

> On 8/4/2022 12:09 PM, Matthew John Cheetham wrote:
>> On 2022-07-25 14:53, Derrick Stolee via GitGitGadget wrote:
>>> From: Derrick Stolee <derrickstolee@github.com>
>>>
>>> The previous change introduced the bundle URI design document. It
>>> creates a flexible set of options that allow bundle providers many ways
>>> to organize Git object data and speed up clones and fetches. It is
>>> particularly important that we have flexibility so we can apply future
>>> advancements as new ideas for efficiently organizing Git data are
>>> discovered.
>>>
>>> However, the design document does not provide even an example of how
>>> bundles could be organized, and that makes it difficult to envision how
>>> the feature should work at the end of the implementation plan.
>>>
>>> Add a section that details how a bundle provider could work, including
>>> using the Git server advertisement for multiple geo-distributed servers.
>>> This organization is based on the GVFS Cache Servers which have
>>> successfully used similar ideas to provide fast object access and
>>> reduced server load for very large repositories.
>> Thanks! This patch is helpful guidance for bundle server implementors.
>>> +This example organization is a simplified model of what is used by the
>>> +GVFS Cache Servers (see section near the end of this document) which have
>>> +been beneficial in speeding up clones and fetches for very large
>>> +repositories, although using extra software outside of Git.
>> 
>> Nit: might be a good idea to use "VFS for Git" rather than the old name
>> "GVFS" [1].
>
> The rename from "GVFS" to "VFS for Git" is made even more confusing
> because "the GVFS Protocol" keeps the name since it is independent of
> the virtual filesystem part (and has "gvfs" in the API routes). In
> particular, "the GVFS Cache Servers" provide a repository mirror using
> the GVFS protocol and can be used by things like Scalar (when using
> the microsoft/git fork).
>  
>>> +The bundle provider deploys servers across multiple geographies. Each
>>> +server manages its own bundle set. The server can track a number of Git
>>> +repositories, but provides a bundle list for each based on a pattern. For
>>> +example, when mirroring a repository at `https://<domain>/<org>/<repo>`
>>> +the bundle server could have its bundle list available at
>>> +`https://<server-url>/<domain>/<org>/<repo>`. The origin Git server can
>>> +list all of these servers under the "any" mode:
>>> +
>>> +	[bundle]
>>> +		version = 1
>>> +		mode = any
>>> +		
>>> +	[bundle "eastus"]
>>> +		uri = https://eastus.example.com/<domain>/<org>/<repo>
>>> +		
>>> +	[bundle "europe"]
>>> +		uri = https://europe.example.com/<domain>/<org>/<repo>
>>> +		
>>> +	[bundle "apac"]
>>> +		uri = https://apac.example.com/<domain>/<org>/<repo>
>>> +
>>> +This "list of lists" is static and only changes if a bundle server is
>>> +added or removed.
>>> +
>>> +Each bundle server manages its own set of bundles. The initial bundle list
>>> +contains only a single bundle, containing all of the objects received from
>>> +cloning the repository from the origin server. The list uses the
>>> +`creationToken` heuristic and a `creationToken` is made for the bundle
>>> +based on the server's timestamp.
>> 
>> Just to confirm, in this example the origin server advertises a single
>> URL (over v2 protocol) that points to this example "list of lists"?
>
> No, here the origin server provides the list of lists using the 'bundle-uri'
> protocol v2 command. Using the config file format was an unfortunate choice
> on my part because that actually uses "key=value" lines.
>
> This could be more clear by using that format:
>
>   bundle.version=1
>   bundle.mode=any
>   bundle.eastus.uri=https://eastus.example.com/<domain>/<org>/<repo>
>   bundle.europe.uri=https://europe.example.com/<domain>/<org>/<repo>
>   bundle.apac.uri=https://apac.example.com/<domain>/<org>/<repo>

[I've tried to stay away from the bundle-uri topic for a while, to give
others some space to comment]

On it generally: Your CL goes into some of the saga of it, but briefly
the design I put forward initially assumed that these sort of things
would be offloaded to other protocols.

So, just to take an example of a prominent URL from your "From"
address. AFAICT there isn't a eastus.api.github.com, or
europe.api.github.com, instead it just uses DNS load-balancing for
api.github.com.

See the different IPs you'll get e.g. at
https://www.whatsmydns.net/#A/api.github.com (or from many other such
geoloc-inspecting DNS lookup tools). You can also do the same with
multicast etc.

 We've had some back & fourths on that before. You clearly think this
sort of thing is needed in (some version of) a bundle-uri. I don't
really see why. This sort of load spreading by different DNS naming
hasn't been common in serious production use for a decade or two.

But let's leave that aside, and other things I think we've had diverging
ideas about before (e.g. your spec's explicit cache management, which I
imagined offloading to standard HTTP features).

I do think that:

1) This proposed version would be much stronger if it generally tried to
   justify the features it's putting forward. E.g. just in this case
   (but it applies more generally) it seems to be taken as a given that
   {eastus,europe,apac}.<domain> etc. is the natural way to do that sort
   of load-balancing.

   But the spec doesn't really go into it. Why would someone use that
   instead of setting up GeoDNS (or similar), why does it need to be in
   git's protocol, and not in DNS?

2) I'd really like it clarified in the doc whether it considers itself a
   "living document" amenable to change, or a "spec" that we have to
   stick to.

   I'd like it to be the former, and I think it should be prominently
   noted there (e.g. that it's "EXPERIMENTAL" or whatever).

   I don't think it's a good time investment to argue over every little
   detail of how and why some aspects of bundle-uri should look before
   any of it is in-tree, we can just start with some base functionality,
   and tweak it.

   So if e.g. we find (from real-world benchmarking etc.) that some
   feature of the current spec isn't required (such as this
   GeoDNS-alike) we should be able to remove it, and not say "we can't,
   that's part of 'the spec'".

   Or maybe we keep that (and change something else). The point is that
   I don't think we have the full overview *right now*, and it would be
   regrettable if we prematurely decreed certain things "stable" or
   "specc'd".

I suspect we agree on #2, since your CL mentions a PR that integrates
basically the docs had as part of [1]. So presumably you're aiming for
getting to the end of those PRs, followed by some phase where we attempt
to unify the two, which might mean stripping out some feature(s) from
one or both, and adding others.

Thanks again for pushing this forward!

1. https://lore.kernel.org/git/RFC-cover-v2-00.36-00000000000-20220418T165545Z-avarab@gmail.com/


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v3 2/2] bundle-uri: add example bundle organization
  2022-08-04 20:29           ` Ævar Arnfjörð Bjarmason
@ 2022-08-05 18:29             ` Derrick Stolee
  0 siblings, 0 replies; 64+ messages in thread
From: Derrick Stolee @ 2022-08-05 18:29 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Matthew John Cheetham, Derrick Stolee via GitGitGadget, gitster,
	me, newren, dyroneteng, Johannes.Schindelin, SZEDER Gábor,
	Josh Steadmon, git

On 8/4/2022 4:29 PM, Ævar Arnfjörð Bjarmason wrote:
> 
> On Thu, Aug 04 2022, Derrick Stolee wrote:
> 
>> On 8/4/2022 12:09 PM, Matthew John Cheetham wrote:

>>> Just to confirm, in this example the origin server advertises a single
>>> URL (over v2 protocol) that points to this example "list of lists"?
>>
>> No, here the origin server provides the list of lists using the 'bundle-uri'
>> protocol v2 command. Using the config file format was an unfortunate choice
>> on my part because that actually uses "key=value" lines.
>>
>> This could be more clear by using that format:
>>
>>   bundle.version=1
>>   bundle.mode=any
>>   bundle.eastus.uri=https://eastus.example.com/<domain>/<org>/<repo>
>>   bundle.europe.uri=https://europe.example.com/<domain>/<org>/<repo>
>>   bundle.apac.uri=https://apac.example.com/<domain>/<org>/<repo>
> 
> [I've tried to stay away from the bundle-uri topic for a while, to give
> others some space to comment]
> 
> On it generally: Your CL goes into some of the saga of it, but briefly
> the design I put forward initially assumed that these sort of things
> would be offloaded to other protocols.
> 
> So, just to take an example of a prominent URL from your "From"
> address. AFAICT there isn't a eastus.api.github.com, or
> europe.api.github.com, instead it just uses DNS load-balancing for
> api.github.com.
...
>  We've had some back & fourths on that before. You clearly think this
> sort of thing is needed in (some version of) a bundle-uri. I don't
> really see why. This sort of load spreading by different DNS naming
> hasn't been common in serious production use for a decade or two.

The difference in our proposals is that yours _requires_ using DNS
load-balancing while mine does not require it.

The places where I have deviated from your design are almost entirely
because your design forces certain things on the side of the bundle
provider that I found unacceptable requirements.

> But let's leave that aside, and other things I think we've had diverging
> ideas about before (e.g. your spec's explicit cache management, which I
> imagined offloading to standard HTTP features).
> 
> I do think that:
> 
> 1) This proposed version would be much stronger if it generally tried to
>    justify the features it's putting forward. E.g. just in this case
>    (but it applies more generally) it seems to be taken as a given that
>    {eastus,europe,apac}.<domain> etc. is the natural way to do that sort
>    of load-balancing.
> 
>    But the spec doesn't really go into it. Why would someone use that
>    instead of setting up GeoDNS (or similar), why does it need to be in
>    git's protocol, and not in DNS?

Again, just because you _could_ do something using another technology
doesn't mean that we should require all users to do that to solve
their problems.

Everything is about trade-offs, and one of the most important things
that your proposal lacked was the ability to provide a bundle list that
updates independently of the origin Git server. That requires that the
Git server can advertise a location that contains a bundle list. Once
we have a way to parse a bundle list outside of the Git protocol and a
way to advertise bundle lists in the Git protocol, it is incredibly
easy to provide this ability to advertise multiple geographies in a
list of URIs. Here, the complexity of providing the flexibility is small
compared to the flexibility available to users.

> 2) I'd really like it clarified in the doc whether it considers itself a
>    "living document" amenable to change, or a "spec" that we have to
>    stick to.
> 
>    I'd like it to be the former, and I think it should be prominently
>    noted there (e.g. that it's "EXPERIMENTAL" or whatever).

The living document portion is because the code is not implemented at
the same time as this design document is merged. Having actual code
solidifies the spec a lot more than a design document does on its own.
Having the code implemented provides a way to test things, but also a
released version of the spec becomes something that others depend on.

I'd be fine starting off with "EXPERIMENTAL" just like the sparse-
checkout builtin has, but the longer the code is out in the wild, the
less that EXPERIMENTAL tag offers us to shift things.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH v4 0/2] bundle URIs: design doc
  2022-07-25 13:53   ` [PATCH v3 0/2] " Derrick Stolee via GitGitGadget
                       ` (2 preceding siblings ...)
  2022-07-25 20:05     ` [PATCH v3 0/2] bundle URIs: design doc and initial git fetch --bundle-uri implementation Josh Steadmon
@ 2022-08-09 13:12     ` Derrick Stolee via GitGitGadget
  2022-08-09 13:12       ` [PATCH v4 1/2] docs: document bundle URI standard Derrick Stolee via GitGitGadget
                         ` (2 more replies)
  3 siblings, 3 replies; 64+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-08-09 13:12 UTC (permalink / raw)
  To: git
  Cc: gitster, me, newren, avarab, dyroneteng, Johannes.Schindelin,
	SZEDER Gábor, Matthew John Cheetham, Josh Steadmon,
	Derrick Stolee

This is the first of series towards building the bundle URI feature as
discussed in previous RFCs, specifically pulled directly out of [5]:

[1]
https://lore.kernel.org/git/RFC-cover-00.13-0000000000-20210805T150534Z-avarab@gmail.com/

[2]
https://lore.kernel.org/git/cover-0.3-00000000000-20211025T211159Z-avarab@gmail.com/

[3]
https://lore.kernel.org/git/pull.1160.git.1645641063.gitgitgadget@gmail.com

[4]
https://lore.kernel.org/git/RFC-cover-v2-00.36-00000000000-20220418T165545Z-avarab@gmail.com/

[5]
https://lore.kernel.org/git/pull.1234.git.1653072042.gitgitgadget@gmail.com

THIS ONLY INCLUDES THE DESIGN DOCUMENT. See "Updates in v3". There are two
patches:

 1. The main design document that details the bundle URI standard and how
    the client interacts with the bundle data.
 2. An addendum to the design document that details one strategy for
    organizing bundles from the perspective of a bundle provider.

As outlined in [5], the next steps after this are:

 1. Add 'git clone --bundle-uri=' to run a 'git bundle fetch ' step before
    doing a fetch negotiation with the origin remote. [6]
 2. Allow parsing a bundle list as a config file at the given URI. The
    key-value format is unified with the protocol v2 verb (coming in (3)).
    [7]
 3. Implement the protocol v2 verb, re-using the bundle list logic from (2).
    Use this to auto-discover bundle URIs during 'git clone' (behind a
    config option). [8]
 4. Implement the 'creationToken' heuristic, allowing incremental 'git
    fetch' commands to download a bundle list from a configured URI, and
    only download bundles that are new based on the creation token values.
    [9]

I have prepared some of this work as pull requests on my personal fork so
curious readers can look ahead to where we are going:

[6] https://github.com/derrickstolee/git/pull/18

[7] https://github.com/derrickstolee/git/pull/20

[8] https://github.com/derrickstolee/git/pull/21

[9] https://github.com/derrickstolee/git/pull/22

As mentioned in the design document, this is not all that is possible. For
instance, Ævar's suggestion to download only the bundle headers can be used
as a second heuristic (and as an augmentation of the timestamp heuristic).


Updates in v4
=============

 * Whitespace issue resolved.
 * Example bundle provider setup now uses the 'bundle-uri' protocol v2
   format when describing how the origin Git server advertises the static
   bundle servers.


Updates in v3
=============

 * This version only includes the design document. Thanks to all the
   reviewers for the significant attention that improves the doc a lot.
 * The second patch has an addition to the design document that details a
   potential way to organize bundles from the provider's perspective.
 * Based on some off-list feedback, I was going to switch git fetch
   --bundle-uri into git bundle fetch, but that has a major conflict with
   [10] which was just submitted.
 * I will move the git bundle fetch implementation into [6] which also has
   the git clone --bundle-uri implementation.

[10]
https://lore.kernel.org/git/20220725123857.2773963-1-szeder.dev@gmail.com/


Updates in v2
=============

 * The design document has been updated based on Junio's feedback.
 * The "bundle.list." keys are now just "bundle.".
 * The "timestamp" heuristic is now "creationToken".
 * More clarity on how Git parses data from the bundle URI.
 * Dropped some unnecessary bundle list keys (*.list, *.requires).

Thanks, -Stolee

Derrick Stolee (2):
  docs: document bundle URI standard
  bundle-uri: add example bundle organization

 Documentation/Makefile                 |   1 +
 Documentation/technical/bundle-uri.txt | 573 +++++++++++++++++++++++++
 2 files changed, 574 insertions(+)
 create mode 100644 Documentation/technical/bundle-uri.txt


base-commit: e72d93e88cb20b06e88e6e7d81bd1dc4effe453f
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1248%2Fderrickstolee%2Fbundle-redo%2Ffetch-v4
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1248/derrickstolee/bundle-redo/fetch-v4
Pull-Request: https://github.com/gitgitgadget/git/pull/1248

Range-diff vs v3:

 1:  e0f003e1b5f ! 1:  1bfac1f492a docs: document bundle URI standard
     @@ Documentation/technical/bundle-uri.txt (new)
      +	work well with incremental `git fetch` commands. The heuristic signals
      +	that there are additional keys available for each bundle that help
      +	determine which subset of bundles the client should download. The only
     -+  heuristic currently planned is `creationToken`.
     ++	heuristic currently planned is `creationToken`.
      +
      +The remaining keys include an `<id>` segment which is a server-designated
      +name for each available bundle. The `<id>` must contain only alphanumeric
 2:  a933471c3af ! 2:  a22c24aa85a bundle-uri: add example bundle organization
     @@ Documentation/technical/bundle-uri.txt: error conditions:
      +	[bundle]
      +		version = 1
      +		mode = any
     -+		
     ++
      +	[bundle "eastus"]
      +		uri = https://eastus.example.com/<domain>/<org>/<repo>
     -+		
     ++
      +	[bundle "europe"]
      +		uri = https://europe.example.com/<domain>/<org>/<repo>
     -+		
     ++
      +	[bundle "apac"]
      +		uri = https://apac.example.com/<domain>/<org>/<repo>
      +

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH v4 1/2] docs: document bundle URI standard
  2022-08-09 13:12     ` [PATCH v4 0/2] bundle URIs: design doc Derrick Stolee via GitGitGadget
@ 2022-08-09 13:12       ` Derrick Stolee via GitGitGadget
  2022-10-04 19:48         ` Philip Oakley
  2022-08-09 13:12       ` [PATCH v4 2/2] bundle-uri: add example bundle organization Derrick Stolee via GitGitGadget
  2022-08-09 13:49       ` [PATCH v4 0/2] bundle URIs: design doc Phillip Wood
  2 siblings, 1 reply; 64+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-08-09 13:12 UTC (permalink / raw)
  To: git
  Cc: gitster, me, newren, avarab, dyroneteng, Johannes.Schindelin,
	SZEDER Gábor, Matthew John Cheetham, Josh Steadmon,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

Introduce the idea of bundle URIs to the Git codebase through an
aspirational design document. This document includes the full design
intended to include the feature in its fully-implemented form. This will
take several steps as detailed in the Implementation Plan section.

By committing this document now, it can be used to motivate changes
necessary to reach these final goals. The design can still be altered as
new information is discovered.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 Documentation/Makefile                 |   1 +
 Documentation/technical/bundle-uri.txt | 468 +++++++++++++++++++++++++
 2 files changed, 469 insertions(+)
 create mode 100644 Documentation/technical/bundle-uri.txt

diff --git a/Documentation/Makefile b/Documentation/Makefile
index 4f801f4e4c9..e71ee915912 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -96,6 +96,7 @@ TECH_DOCS += SubmittingPatches
 TECH_DOCS += ToolsForGit
 TECH_DOCS += technical/bitmap-format
 TECH_DOCS += technical/bundle-format
+TECH_DOCS += technical/bundle-uri
 TECH_DOCS += technical/cruft-packs
 TECH_DOCS += technical/hash-function-transition
 TECH_DOCS += technical/http-protocol
diff --git a/Documentation/technical/bundle-uri.txt b/Documentation/technical/bundle-uri.txt
new file mode 100644
index 00000000000..e6d63b868b8
--- /dev/null
+++ b/Documentation/technical/bundle-uri.txt
@@ -0,0 +1,468 @@
+Bundle URIs
+===========
+
+Git bundles are files that store a pack-file along with some extra metadata,
+including a set of refs and a (possibly empty) set of necessary commits. See
+linkgit:git-bundle[1] and link:bundle-format.txt[the bundle format] for more
+information.
+
+Bundle URIs are locations where Git can download one or more bundles in
+order to bootstrap the object database in advance of fetching the remaining
+objects from a remote.
+
+One goal is to speed up clones and fetches for users with poor network
+connectivity to the origin server. Another benefit is to allow heavy users,
+such as CI build farms, to use local resources for the majority of Git data
+and thereby reducing the load on the origin server.
+
+To enable the bundle URI feature, users can specify a bundle URI using
+command-line options or the origin server can advertise one or more URIs
+via a protocol v2 capability.
+
+Design Goals
+------------
+
+The bundle URI standard aims to be flexible enough to satisfy multiple
+workloads. The bundle provider and the Git client have several choices in
+how they create and consume bundle URIs.
+
+* Bundles can have whatever name the server desires. This name could refer
+  to immutable data by using a hash of the bundle contents. However, this
+  means that a new URI will be needed after every update of the content.
+  This might be acceptable if the server is advertising the URI (and the
+  server is aware of new bundles being generated) but would not be
+  ergonomic for users using the command line option.
+
+* The bundles could be organized specifically for bootstrapping full
+  clones, but could also be organized with the intention of bootstrapping
+  incremental fetches. The bundle provider must decide on one of several
+  organization schemes to minimize client downloads during incremental
+  fetches, but the Git client can also choose whether to use bundles for
+  either of these operations.
+
+* The bundle provider can choose to support full clones, partial clones,
+  or both. The client can detect which bundles are appropriate for the
+  repository's partial clone filter, if any.
+
+* The bundle provider can use a single bundle (for clones only), or a
+  list of bundles. When using a list of bundles, the provider can specify
+  whether or not the client needs _all_ of the bundle URIs for a full
+  clone, or if _any_ one of the bundle URIs is sufficient. This allows the
+  bundle provider to use different URIs for different geographies.
+
+* The bundle provider can organize the bundles using heuristics, such as
+  creation tokens, to help the client prevent downloading bundles it does
+  not need. When the bundle provider does not provide these heuristics,
+  the client can use optimizations to minimize how much of the data is
+  downloaded.
+
+* The bundle provider does not need to be associated with the Git server.
+  The client can choose to use the bundle provider without it being
+  advertised by the Git server.
+
+* The client can choose to discover bundle providers that are advertised
+  by the Git server. This could happen during `git clone`, during
+  `git fetch`, both, or neither. The user can choose which combination
+  works best for them.
+
+* The client can choose to configure a bundle provider manually at any
+  time. The client can also choose to specify a bundle provider manually
+  as a command-line option to `git clone`.
+
+Each repository is different and every Git server has different needs.
+Hopefully the bundle URI feature is flexible enough to satisfy all needs.
+If not, then the feature can be extended through its versioning mechanism.
+
+Server requirements
+-------------------
+
+To provide a server-side implementation of bundle servers, no other parts
+of the Git protocol are required. This allows server maintainers to use
+static content solutions such as CDNs in order to serve the bundle files.
+
+At the current scope of the bundle URI feature, all URIs are expected to
+be HTTP(S) URLs where content is downloaded to a local file using a `GET`
+request to that URL. The server could include authentication requirements
+to those requests with the aim of triggering the configured credential
+helper for secure access. (Future extensions could use "file://" URIs or
+SSH URIs.)
+
+Assuming a `200 OK` response from the server, the content at the URL is
+inspected. First, Git attempts to parse the file as a bundle file of
+version 2 or higher. If the file is not a bundle, then the file is parsed
+as a plain-text file using Git's config parser. The key-value pairs in
+that config file are expected to describe a list of bundle URIs. If
+neither of these parse attempts succeed, then Git will report an error to
+the user that the bundle URI provided erroneous data.
+
+Any other data provided by the server is considered erroneous.
+
+Bundle Lists
+------------
+
+The Git server can advertise bundle URIs using a set of `key=value` pairs.
+A bundle URI can also serve a plain-text file in the Git config format
+containing these same `key=value` pairs. In both cases, we consider this
+to be a _bundle list_. The pairs specify information about the bundles
+that the client can use to make decisions for which bundles to download
+and which to ignore.
+
+A few keys focus on properties of the list itself.
+
+bundle.version::
+	(Required) This value provides a version number for the bundle
+	list. If a future Git change enables a feature that needs the Git
+	client to react to a new key in the bundle list file, then this version
+	will increment. The only current version number is 1, and if any other
+	value is specified then Git will fail to use this file.
+
+bundle.mode::
+	(Required) This value has one of two values: `all` and `any`. When `all`
+	is specified, then the client should expect to need all of the listed
+	bundle URIs that match their repository's requirements. When `any` is
+	specified, then the client should expect that any one of the bundle URIs
+	that match their repository's requirements will suffice. Typically, the
+	`any` option is used to list a number of different bundle servers
+	located in different geographies.
+
+bundle.heuristic::
+	If this string-valued key exists, then the bundle list is designed to
+	work well with incremental `git fetch` commands. The heuristic signals
+	that there are additional keys available for each bundle that help
+	determine which subset of bundles the client should download. The only
+	heuristic currently planned is `creationToken`.
+
+The remaining keys include an `<id>` segment which is a server-designated
+name for each available bundle. The `<id>` must contain only alphanumeric
+and `-` characters.
+
+bundle.<id>.uri::
+	(Required) This string value is the URI for downloading bundle `<id>`.
+	If the URI begins with a protocol (`http://` or `https://`) then the URI
+	is absolute. Otherwise, the URI is interpreted as relative to the URI
+	used for the bundle list. If the URI begins with `/`, then that relative
+	path is relative to the domain name used for the bundle list. (This use
+	of relative paths is intended to make it easier to distribute a set of
+	bundles across a large number of servers or CDNs with different domain
+	names.)
+
+bundle.<id>.filter::
+	This string value represents an object filter that should also appear in
+	the header of this bundle. The server uses this value to differentiate
+	different kinds of bundles from which the client can choose those that
+	match their object filters.
+
+bundle.<id>.creationToken::
+	This value is a nonnegative 64-bit integer used for sorting the bundles
+	the list. This is used to download a subset of bundles during a fetch
+	when `bundle.heuristic=creationToken`.
+
+bundle.<id>.location::
+	This string value advertises a real-world location from where the bundle
+	URI is served. This can be used to present the user with an option for
+	which bundle URI to use or simply as an informative indicator of which
+	bundle URI was selected by Git. This is only valuable when
+	`bundle.mode` is `any`.
+
+Here is an example bundle list using the Git config format:
+
+	[bundle]
+		version = 1
+		mode = all
+		heuristic = creationToken
+
+	[bundle "2022-02-09-1644442601-daily"]
+		uri = https://bundles.example.com/git/git/2022-02-09-1644442601-daily.bundle
+		creationToken = 1644442601
+
+	[bundle "2022-02-02-1643842562"]
+		uri = https://bundles.example.com/git/git/2022-02-02-1643842562.bundle
+		creationToken = 1643842562
+
+	[bundle "2022-02-09-1644442631-daily-blobless"]
+		uri = 2022-02-09-1644442631-daily-blobless.bundle
+		creationToken = 1644442631
+		filter = blob:none
+
+	[bundle "2022-02-02-1643842568-blobless"]
+		uri = /git/git/2022-02-02-1643842568-blobless.bundle
+		creationToken = 1643842568
+		filter = blob:none
+
+This example uses `bundle.mode=all` as well as the
+`bundle.<id>.creationToken` heuristic. It also uses the `bundle.<id>.filter`
+options to present two parallel sets of bundles: one for full clones and
+another for blobless partial clones.
+
+Suppose that this bundle list was found at the URI
+`https://bundles.example.com/git/git/` and so the two blobless bundles have
+the following fully-expanded URIs:
+
+* `https://bundles.example.com/git/git/2022-02-09-1644442631-daily-blobless.bundle`
+* `https://bundles.example.com/git/git/2022-02-02-1643842568-blobless.bundle`
+
+Advertising Bundle URIs
+-----------------------
+
+If a user knows a bundle URI for the repository they are cloning, then
+they can specify that URI manually through a command-line option. However,
+a Git host may want to advertise bundle URIs during the clone operation,
+helping users unaware of the feature.
+
+The only thing required for this feature is that the server can advertise
+one or more bundle URIs. This advertisement takes the form of a new
+protocol v2 capability specifically for discovering bundle URIs.
+
+The client could choose an arbitrary bundle URI as an option _or_ select
+the URI with best performance by some exploratory checks. It is up to the
+bundle provider to decide if having multiple URIs is preferable to a
+single URI that is geodistributed through server-side infrastructure.
+
+Cloning with Bundle URIs
+------------------------
+
+The primary need for bundle URIs is to speed up clones. The Git client
+will interact with bundle URIs according to the following flow:
+
+1. The user specifies a bundle URI with the `--bundle-uri` command-line
+   option _or_ the client discovers a bundle list advertised by the
+   Git server.
+
+2. If the downloaded data from a bundle URI is a bundle, then the client
+   inspects the bundle headers to check that the prerequisite commit OIDs
+   are present in the client repository. If some are missing, then the
+   client delays unbundling until other bundles have been unbundled,
+   making those OIDs present. When all required OIDs are present, the
+   client unbundles that data using a refspec. The default refspec is
+   `+refs/heads/*:refs/bundles/*`, but this can be configured. These refs
+   are stored so that later `git fetch` negotiations can communicate the
+   bundled refs as `have`s, reducing the size of the fetch over the Git
+   protocol. To allow pruning refs from this ref namespace, Git may
+   introduce a numbered namespace (such as `refs/bundles/<i>/*`) such that
+   stale bundle refs can be deleted.
+
+3. If the file is instead a bundle list, then the client inspects the
+   `bundle.mode` to see if the list is of the `all` or `any` form.
+
+   a. If `bundle.mode=all`, then the client considers all bundle
+      URIs. The list is reduced based on the `bundle.<id>.filter` options
+      matching the client repository's partial clone filter. Then, all
+      bundle URIs are requested. If the `bundle.<id>.creationToken`
+      heuristic is provided, then the bundles are downloaded in decreasing
+      order by the creation token, stopping when a bundle has all required
+      OIDs. The bundles can then be unbundled in increasing creation token
+      order. The client stores the latest creation token as a heuristic
+      for avoiding future downloads if the bundle list does not advertise
+      bundles with larger creation tokens.
+
+   b. If `bundle.mode=any`, then the client can choose any one of the
+      bundle URIs to inspect. The client can use a variety of ways to
+      choose among these URIs. The client can also fallback to another URI
+      if the initial choice fails to return a result.
+
+Note that during a clone we expect that all bundles will be required, and
+heuristics such as `bundle.<uri>.creationToken` can be used to download
+bundles in chronological order or in parallel.
+
+If a given bundle URI is a bundle list with a `bundle.heuristic`
+value, then the client can choose to store that URI as its chosen bundle
+URI. The client can then navigate directly to that URI during later `git
+fetch` calls.
+
+When downloading bundle URIs, the client can choose to inspect the initial
+content before committing to downloading the entire content. This may
+provide enough information to determine if the URI is a bundle list or
+a bundle. In the case of a bundle, the client may inspect the bundle
+header to determine that all advertised tips are already in the client
+repository and cancel the remaining download.
+
+Fetching with Bundle URIs
+-------------------------
+
+When the client fetches new data, it can decide to fetch from bundle
+servers before fetching from the origin remote. This could be done via a
+command-line option, but it is more likely useful to use a config value
+such as the one specified during the clone.
+
+The fetch operation follows the same procedure to download bundles from a
+bundle list (although we do _not_ want to use parallel downloads here). We
+expect that the process will end when all prerequisite commit OIDs in a
+thin bundle are already in the object database.
+
+When using the `creationToken` heuristic, the client can avoid downloading
+any bundles if their creation tokenss are not larger than the stored
+creation token. After fetching new bundles, Git updates this local
+creation token.
+
+If the bundle provider does not provide a heuristic, then the client
+should attempt to inspect the bundle headers before downloading the full
+bundle data in case the bundle tips already exist in the client
+repository.
+
+Error Conditions
+----------------
+
+If the Git client discovers something unexpected while downloading
+information according to a bundle URI or the bundle list found at that
+location, then Git can ignore that data and continue as if it was not
+given a bundle URI. The remote Git server is the ultimate source of truth,
+not the bundle URI.
+
+Here are a few example error conditions:
+
+* The client fails to connect with a server at the given URI or a connection
+  is lost without any chance to recover.
+
+* The client receives a 400-level response (such as `404 Not Found` or
+  `401 Not Authorized`). The client should use the credential helper to
+  find and provide a credential for the URI, but match the semantics of
+  Git's other HTTP protocols in terms of handling specific 400-level
+  errors.
+
+* The server reports any other failure reponse.
+
+* The client receives data that is not parsable as a bundle or bundle list.
+
+* A bundle includes a filter that does not match expectations.
+
+* The client cannot unbundle the bundles because the prerequisite commit OIDs
+  are not in the object database and there are no more bundles to download.
+
+There are also situations that could be seen as wasteful, but are not
+error conditions:
+
+* The downloaded bundles contain more information than is requested by
+  the clone or fetch request. A primary example is if the user requests
+  a clone with `--single-branch` but downloads bundles that store every
+  reachable commit from all `refs/heads/*` references. This might be
+  initially wasteful, but perhaps these objects will become reachable by
+  a later ref update that the client cares about.
+
+* A bundle download during a `git fetch` contains objects already in the
+  object database. This is probably unavoidable if we are using bundles
+  for fetches, since the client will almost always be slightly ahead of
+  the bundle servers after performing its "catch-up" fetch to the remote
+  server. This extra work is most wasteful when the client is fetching
+  much more frequently than the server is computing bundles, such as if
+  the client is using hourly prefetches with background maintenance, but
+  the server is computing bundles weekly. For this reason, the client
+  should not use bundle URIs for fetch unless the server has explicitly
+  recommended it through a `bundle.heuristic` value.
+
+Implementation Plan
+-------------------
+
+This design document is being submitted on its own as an aspirational
+document, with the goal of implementing all of the mentioned client
+features over the course of several patch series. Here is a potential
+outline for submitting these features:
+
+1. Integrate bundle URIs into `git clone` with a `--bundle-uri` option.
+   This will include a new `git fetch --bundle-uri` mode for use as the
+   implementation underneath `git clone`. The initial version here will
+   expect a single bundle at the given URI.
+
+2. Implement the ability to parse a bundle list from a bundle URI and
+   update the `git fetch --bundle-uri` logic to properly distinguish
+   between `bundle.mode` options. Specifically design the feature so
+   that the config format parsing feeds a list of key-value pairs into the
+   bundle list logic.
+
+3. Create the `bundle-uri` protocol v2 command so Git servers can advertise
+   bundle URIs using the key-value pairs. Plug into the existing key-value
+   input to the bundle list logic. Allow `git clone` to discover these
+   bundle URIs and bootstrap the client repository from the bundle data.
+   (This choice is an opt-in via a config option and a command-line
+   option.)
+
+4. Allow the client to understand the `bundle.flag=forFetch` configuration
+   and the `bundle.<id>.creationToken` heuristic. When `git clone`
+   discovers a bundle URI with `bundle.flag=forFetch`, it configures the
+   client repository to check that bundle URI during later `git fetch <remote>`
+   commands.
+
+5. Allow clients to discover bundle URIs during `git fetch` and configure
+   a bundle URI for later fetches if `bundle.flag=forFetch`.
+
+6. Implement the "inspect headers" heuristic to reduce data downloads when
+   the `bundle.<id>.creationToken` heuristic is not available.
+
+As these features are reviewed, this plan might be updated. We also expect
+that new designs will be discovered and implemented as this feature
+matures and becomes used in real-world scenarios.
+
+Related Work: Packfile URIs
+---------------------------
+
+The Git protocol already has a capability where the Git server can list
+a set of URLs along with the packfile response when serving a client
+request. The client is then expected to download the packfiles at those
+locations in order to have a complete understanding of the response.
+
+This mechanism is used by the Gerrit server (implemented with JGit) and
+has been effective at reducing CPU load and improving user performance for
+clones.
+
+A major downside to this mechanism is that the origin server needs to know
+_exactly_ what is in those packfiles, and the packfiles need to be available
+to the user for some time after the server has responded. This coupling
+between the origin and the packfile data is difficult to manage.
+
+Further, this implementation is extremely hard to make work with fetches.
+
+Related Work: GVFS Cache Servers
+--------------------------------
+
+The GVFS Protocol [2] is a set of HTTP endpoints designed independently of
+the Git project before Git's partial clone was created. One feature of this
+protocol is the idea of a "cache server" which can be colocated with build
+machines or developer offices to transfer Git data without overloading the
+central server.
+
+The endpoint that VFS for Git is famous for is the `GET /gvfs/objects/{oid}`
+endpoint, which allows downloading an object on-demand. This is a critical
+piece of the filesystem virtualization of that product.
+
+However, a more subtle need is the `GET /gvfs/prefetch?lastPackTimestamp=<t>`
+endpoint. Given an optional timestamp, the cache server responds with a list
+of precomputed packfiles containing the commits and trees that were introduced
+in those time intervals.
+
+The cache server computes these "prefetch" packfiles using the following
+strategy:
+
+1. Every hour, an "hourly" pack is generated with a given timestamp.
+2. Nightly, the previous 24 hourly packs are rolled up into a "daily" pack.
+3. Nightly, all prefetch packs more than 30 days old are rolled up into
+   one pack.
+
+When a user runs `gvfs clone` or `scalar clone` against a repo with cache
+servers, the client requests all prefetch packfiles, which is at most
+`24 + 30 + 1` packfiles downloading only commits and trees. The client
+then follows with a request to the origin server for the references, and
+attempts to checkout that tip reference. (There is an extra endpoint that
+helps get all reachable trees from a given commit, in case that commit
+was not already in a prefetch packfile.)
+
+During a `git fetch`, a hook requests the prefetch endpoint using the
+most-recent timestamp from a previously-downloaded prefetch packfile.
+Only the list of packfiles with later timestamps are downloaded. Most
+users fetch hourly, so they get at most one hourly prefetch pack. Users
+whose machines have been off or otherwise have not fetched in over 30 days
+might redownload all prefetch packfiles. This is rare.
+
+It is important to note that the clients always contact the origin server
+for the refs advertisement, so the refs are frequently "ahead" of the
+prefetched pack data. The missing objects are downloaded on-demand using
+the `GET gvfs/objects/{oid}` requests, when needed by a command such as
+`git checkout` or `git log`. Some Git optimizations disable checks that
+would cause these on-demand downloads to be too aggressive.
+
+See Also
+--------
+
+[1] https://lore.kernel.org/git/RFC-cover-00.13-0000000000-20210805T150534Z-avarab@gmail.com/
+    An earlier RFC for a bundle URI feature.
+
+[2] https://github.com/microsoft/VFSForGit/blob/master/Protocol.md
+    The GVFS Protocol
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH v4 2/2] bundle-uri: add example bundle organization
  2022-08-09 13:12     ` [PATCH v4 0/2] bundle URIs: design doc Derrick Stolee via GitGitGadget
  2022-08-09 13:12       ` [PATCH v4 1/2] docs: document bundle URI standard Derrick Stolee via GitGitGadget
@ 2022-08-09 13:12       ` Derrick Stolee via GitGitGadget
  2022-08-09 13:49       ` [PATCH v4 0/2] bundle URIs: design doc Phillip Wood
  2 siblings, 0 replies; 64+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2022-08-09 13:12 UTC (permalink / raw)
  To: git
  Cc: gitster, me, newren, avarab, dyroneteng, Johannes.Schindelin,
	SZEDER Gábor, Matthew John Cheetham, Josh Steadmon,
	Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

The previous change introduced the bundle URI design document. It
creates a flexible set of options that allow bundle providers many ways
to organize Git object data and speed up clones and fetches. It is
particularly important that we have flexibility so we can apply future
advancements as new ideas for efficiently organizing Git data are
discovered.

However, the design document does not provide even an example of how
bundles could be organized, and that makes it difficult to envision how
the feature should work at the end of the implementation plan.

Add a section that details how a bundle provider could work, including
using the Git server advertisement for multiple geo-distributed servers.
This organization is based on the GVFS Cache Servers which have
successfully used similar ideas to provide fast object access and
reduced server load for very large repositories.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 Documentation/technical/bundle-uri.txt | 105 +++++++++++++++++++++++++
 1 file changed, 105 insertions(+)

diff --git a/Documentation/technical/bundle-uri.txt b/Documentation/technical/bundle-uri.txt
index e6d63b868b8..c25c42378ab 100644
--- a/Documentation/technical/bundle-uri.txt
+++ b/Documentation/technical/bundle-uri.txt
@@ -349,6 +349,111 @@ error conditions:
   should not use bundle URIs for fetch unless the server has explicitly
   recommended it through a `bundle.heuristic` value.
 
+Example Bundle Provider organization
+------------------------------------
+
+The bundle URI feature is intentionally designed to be flexible to
+different ways a bundle provider wants to organize the object data.
+However, it can be helpful to have a complete organization model described
+here so providers can start from that base.
+
+This example organization is a simplified model of what is used by the
+GVFS Cache Servers (see section near the end of this document) which have
+been beneficial in speeding up clones and fetches for very large
+repositories, although using extra software outside of Git.
+
+The bundle provider deploys servers across multiple geographies. Each
+server manages its own bundle set. The server can track a number of Git
+repositories, but provides a bundle list for each based on a pattern. For
+example, when mirroring a repository at `https://<domain>/<org>/<repo>`
+the bundle server could have its bundle list available at
+`https://<server-url>/<domain>/<org>/<repo>`. The origin Git server can
+list all of these servers under the "any" mode:
+
+	[bundle]
+		version = 1
+		mode = any
+
+	[bundle "eastus"]
+		uri = https://eastus.example.com/<domain>/<org>/<repo>
+
+	[bundle "europe"]
+		uri = https://europe.example.com/<domain>/<org>/<repo>
+
+	[bundle "apac"]
+		uri = https://apac.example.com/<domain>/<org>/<repo>
+
+This "list of lists" is static and only changes if a bundle server is
+added or removed.
+
+Each bundle server manages its own set of bundles. The initial bundle list
+contains only a single bundle, containing all of the objects received from
+cloning the repository from the origin server. The list uses the
+`creationToken` heuristic and a `creationToken` is made for the bundle
+based on the server's timestamp.
+
+The bundle server runs regularly-scheduled updates for the bundle list,
+such as once a day. During this task, the server fetches the latest
+contents from the origin server and generates a bundle containing the
+objects reachable from the latest origin refs, but not contained in a
+previously-computed bundle. This bundle is added to the list, with care
+that the `creationToken` is strictly greater than the previous maximum
+`creationToken`.
+
+When the bundle list grows too large, say more than 30 bundles, then the
+oldest "_N_ minus 30" bundles are combined into a single bundle. This
+bundle's `creationToken` is equal to the maximum `creationToken` among the
+merged bundles.
+
+An example bundle list is provided here, although it only has two daily
+bundles and not a full list of 30:
+
+	[bundle]
+		version = 1
+		mode = all
+		heuristic = creationToken
+
+	[bundle "2022-02-13-1644770820-daily"]
+		uri = https://eastus.example.com/<domain>/<org>/<repo>/2022-02-09-1644770820-daily.bundle
+		creationToken = 1644770820
+
+	[bundle "2022-02-09-1644442601-daily"]
+		uri = https://eastus.example.com/<domain>/<org>/<repo>/2022-02-09-1644442601-daily.bundle
+		creationToken = 1644442601
+
+	[bundle "2022-02-02-1643842562"]
+		uri = https://eastus.example.com/<domain>/<org>/<repo>/2022-02-02-1643842562.bundle
+		creationToken = 1643842562
+
+To avoid storing and serving object data in perpetuity despite becoming
+unreachable in the origin server, this bundle merge can be more careful.
+Instead of taking an absolute union of the old bundles, instead the bundle
+can be created by looking at the newer bundles and ensuring that their
+necessary commits are all available in this merged bundle (or in another
+one of the newer bundles). This allows "expiring" object data that is not
+being used by new commits in this window of time. That data could be
+reintroduced by a later push.
+
+The intention of this data organization has two main goals. First, initial
+clones of the repository become faster by downloading precomputed object
+data from a closer source. Second, `git fetch` commands can be faster,
+especially if the client has not fetched for a few days. However, if a
+client does not fetch for 30 days, then the bundle list organization would
+cause redownloading a large amount of object data.
+
+One way to make this organization more useful to users who fetch frequently
+is to have more frequent bundle creation. For example, bundles could be
+created every hour, and then once a day those "hourly" bundles could be
+merged into a "daily" bundle. The daily bundles are merged into the
+oldest bundle after 30 days.
+
+It is recommened that this bundle strategy is repeated with the `blob:none`
+filter if clients of this repository are expecting to use blobless partial
+clones. This list of blobless bundles stays in the same list as the full
+bundles, but uses the `bundle.<id>.filter` key to separate the two groups.
+For very large repositories, the bundle provider may want to _only_ provide
+blobless bundles.
+
 Implementation Plan
 -------------------
 
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [PATCH v4 0/2] bundle URIs: design doc
  2022-08-09 13:12     ` [PATCH v4 0/2] bundle URIs: design doc Derrick Stolee via GitGitGadget
  2022-08-09 13:12       ` [PATCH v4 1/2] docs: document bundle URI standard Derrick Stolee via GitGitGadget
  2022-08-09 13:12       ` [PATCH v4 2/2] bundle-uri: add example bundle organization Derrick Stolee via GitGitGadget
@ 2022-08-09 13:49       ` Phillip Wood
  2022-08-09 15:50         ` Derrick Stolee
  2 siblings, 1 reply; 64+ messages in thread
From: Phillip Wood @ 2022-08-09 13:49 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget, git
  Cc: gitster, me, newren, avarab, dyroneteng, Johannes.Schindelin,
	SZEDER Gábor, Matthew John Cheetham, Josh Steadmon,
	Derrick Stolee

Hi Stolee

On 09/08/2022 14:12, Derrick Stolee via GitGitGadget wrote:
> This is the first of series towards building the bundle URI feature as
> discussed in previous RFCs, specifically pulled directly out of [5]:
> 
> [1]
> https://lore.kernel.org/git/RFC-cover-00.13-0000000000-20210805T150534Z-avarab@gmail.com/
> 
> [2]
> https://lore.kernel.org/git/cover-0.3-00000000000-20211025T211159Z-avarab@gmail.com/
> 
> [3]
> https://lore.kernel.org/git/pull.1160.git.1645641063.gitgitgadget@gmail.com
> 
> [4]
> https://lore.kernel.org/git/RFC-cover-v2-00.36-00000000000-20220418T165545Z-avarab@gmail.com/
> 
> [5]
> https://lore.kernel.org/git/pull.1234.git.1653072042.gitgitgadget@gmail.com
> 
> THIS ONLY INCLUDES THE DESIGN DOCUMENT. See "Updates in v3". There are two
> patches:
> 
>   1. The main design document that details the bundle URI standard and how
>      the client interacts with the bundle data.
>   2. An addendum to the design document that details one strategy for
>      organizing bundles from the perspective of a bundle provider.

I thought the document was well written and left me with a good 
understanding of both the problem being addressed and the rationale for 
the solution. One small query - the document mentions CI farms as 
benefiting from this work but my impression is that those commonly use 
shallow clones which are (quite reasonably) not supported in this proposal.

Best Wishes

Phillip

> As outlined in [5], the next steps after this are:
> 
>   1. Add 'git clone --bundle-uri=' to run a 'git bundle fetch ' step before
>      doing a fetch negotiation with the origin remote. [6]
>   2. Allow parsing a bundle list as a config file at the given URI. The
>      key-value format is unified with the protocol v2 verb (coming in (3)).
>      [7]
>   3. Implement the protocol v2 verb, re-using the bundle list logic from (2).
>      Use this to auto-discover bundle URIs during 'git clone' (behind a
>      config option). [8]
>   4. Implement the 'creationToken' heuristic, allowing incremental 'git
>      fetch' commands to download a bundle list from a configured URI, and
>      only download bundles that are new based on the creation token values.
>      [9]
> 
> I have prepared some of this work as pull requests on my personal fork so
> curious readers can look ahead to where we are going:
> 
> [6] https://github.com/derrickstolee/git/pull/18
> 
> [7] https://github.com/derrickstolee/git/pull/20
> 
> [8] https://github.com/derrickstolee/git/pull/21
> 
> [9] https://github.com/derrickstolee/git/pull/22
> 
> As mentioned in the design document, this is not all that is possible. For
> instance, Ævar's suggestion to download only the bundle headers can be used
> as a second heuristic (and as an augmentation of the timestamp heuristic).
> 
> 
> Updates in v4
> =============
> 
>   * Whitespace issue resolved.
>   * Example bundle provider setup now uses the 'bundle-uri' protocol v2
>     format when describing how the origin Git server advertises the static
>     bundle servers.
> 
> 
> Updates in v3
> =============
> 
>   * This version only includes the design document. Thanks to all the
>     reviewers for the significant attention that improves the doc a lot.
>   * The second patch has an addition to the design document that details a
>     potential way to organize bundles from the provider's perspective.
>   * Based on some off-list feedback, I was going to switch git fetch
>     --bundle-uri into git bundle fetch, but that has a major conflict with
>     [10] which was just submitted.
>   * I will move the git bundle fetch implementation into [6] which also has
>     the git clone --bundle-uri implementation.
> 
> [10]
> https://lore.kernel.org/git/20220725123857.2773963-1-szeder.dev@gmail.com/
> 
> 
> Updates in v2
> =============
> 
>   * The design document has been updated based on Junio's feedback.
>   * The "bundle.list." keys are now just "bundle.".
>   * The "timestamp" heuristic is now "creationToken".
>   * More clarity on how Git parses data from the bundle URI.
>   * Dropped some unnecessary bundle list keys (*.list, *.requires).
> 
> Thanks, -Stolee
> 
> Derrick Stolee (2):
>    docs: document bundle URI standard
>    bundle-uri: add example bundle organization
> 
>   Documentation/Makefile                 |   1 +
>   Documentation/technical/bundle-uri.txt | 573 +++++++++++++++++++++++++
>   2 files changed, 574 insertions(+)
>   create mode 100644 Documentation/technical/bundle-uri.txt
> 
> 
> base-commit: e72d93e88cb20b06e88e6e7d81bd1dc4effe453f
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1248%2Fderrickstolee%2Fbundle-redo%2Ffetch-v4
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1248/derrickstolee/bundle-redo/fetch-v4
> Pull-Request: https://github.com/gitgitgadget/git/pull/1248
> 
> Range-diff vs v3:
> 
>   1:  e0f003e1b5f ! 1:  1bfac1f492a docs: document bundle URI standard
>       @@ Documentation/technical/bundle-uri.txt (new)
>        +	work well with incremental `git fetch` commands. The heuristic signals
>        +	that there are additional keys available for each bundle that help
>        +	determine which subset of bundles the client should download. The only
>       -+  heuristic currently planned is `creationToken`.
>       ++	heuristic currently planned is `creationToken`.
>        +
>        +The remaining keys include an `<id>` segment which is a server-designated
>        +name for each available bundle. The `<id>` must contain only alphanumeric
>   2:  a933471c3af ! 2:  a22c24aa85a bundle-uri: add example bundle organization
>       @@ Documentation/technical/bundle-uri.txt: error conditions:
>        +	[bundle]
>        +		version = 1
>        +		mode = any
>       -+		
>       ++
>        +	[bundle "eastus"]
>        +		uri = https://eastus.example.com/<domain>/<org>/<repo>
>       -+		
>       ++
>        +	[bundle "europe"]
>        +		uri = https://europe.example.com/<domain>/<org>/<repo>
>       -+		
>       ++
>        +	[bundle "apac"]
>        +		uri = https://apac.example.com/<domain>/<org>/<repo>
>        +
> 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v4 0/2] bundle URIs: design doc
  2022-08-09 13:49       ` [PATCH v4 0/2] bundle URIs: design doc Phillip Wood
@ 2022-08-09 15:50         ` Derrick Stolee
  2022-08-11 15:42           ` Phillip Wood
  0 siblings, 1 reply; 64+ messages in thread
From: Derrick Stolee @ 2022-08-09 15:50 UTC (permalink / raw)
  To: phillip.wood, Derrick Stolee via GitGitGadget, git
  Cc: gitster, me, newren, avarab, dyroneteng, Johannes.Schindelin,
	SZEDER Gábor, Matthew John Cheetham, Josh Steadmon

On 8/9/2022 9:49 AM, Phillip Wood wrote:
> Hi Stolee
> 
> On 09/08/2022 14:12, Derrick Stolee via GitGitGadget wrote:
>> This is the first of series towards building the bundle URI feature as
>> discussed in previous RFCs, specifically pulled directly out of [5]:
>>
>> [1]
>> https://lore.kernel.org/git/RFC-cover-00.13-0000000000-20210805T150534Z-avarab@gmail.com/
>>
>> [2]
>> https://lore.kernel.org/git/cover-0.3-00000000000-20211025T211159Z-avarab@gmail.com/
>>
>> [3]
>> https://lore.kernel.org/git/pull.1160.git.1645641063.gitgitgadget@gmail.com
>>
>> [4]
>> https://lore.kernel.org/git/RFC-cover-v2-00.36-00000000000-20220418T165545Z-avarab@gmail.com/
>>
>> [5]
>> https://lore.kernel.org/git/pull.1234.git.1653072042.gitgitgadget@gmail.com
>>
>> THIS ONLY INCLUDES THE DESIGN DOCUMENT. See "Updates in v3". There are two
>> patches:
>>
>>   1. The main design document that details the bundle URI standard and how
>>      the client interacts with the bundle data.
>>   2. An addendum to the design document that details one strategy for
>>      organizing bundles from the perspective of a bundle provider.
> 
> I thought the document was well written and left me with a good understanding
> of both the problem being addressed and the rationale for the solution.

Thanks for the kind words!

> One small query - the document mentions CI farms as benefiting from this work
> but my impression is that those commonly use shallow clones which are (quite
> reasonably) not supported in this proposal.

There are two different kinds of CI farms.

The most common one is a SaaS CI system that provides machines on-demand,
but each run starts from some "clean" state. For example, GitHub Actions
runs CI builds of the Git project across a number of platforms. These
machines need the source at HEAD, but do not need the full history. Further,
they will erase the repository entirely at the end of the build, never
fetching from those repositories. Thus, a shallow clone makes sense to
minimize the data transfer. Bundles don't make sense here for multiple
reasons, including that bundles must be closed under reachability and do
not work for representing a shallow clone (see [1]). The other reason is
that CI builds typically are triggered immediately after the commit appears
on the origin Git server, so there is no time for a bundle provider to
create a bundle representing that shallow clone.

[1] https://github.com/git/git/blob/c50926e1f48891e2671e1830dbcd2912a4563450/Documentation/technical/bundle-format.txt#L65-L69

The less common one is a private build farm. These machines are long-lived
and controlled by the repository owner. They come pre-loaded with all of
the software needed to build the repository. The best practice in this
type of build farm is to keep a full clone of the repository in a well-
known location and use incremental fetches to update the client repositories
to download the commit necessary for the build. This type of build farm is
typically self-hosted, but could also be hosted by a cloud provider. The
bundle URI design allows ways to quickly bootstrap new build machines using
a bundle provider (probably co-located with the build machines) as well as
improving fetch times by creating frequent incremental bundles. The new
commit being built is unlikely to exist immediately in the bundles, but it
is unlikely to be too far ahead of any of the bundles.

While private build farms are less common, they do become necessary for
large projects. Engineering teams that have the resources to self-host a
build farm are likely to also have the resources to self-host a bundle
server. They may not have the connections or desire to advertise those
bundle server URIs from the origin Git server.

I hope this helps clarify my perspective as to why build farms using long-
lived copies of the repository could take advantage of bundle URIs.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v4 0/2] bundle URIs: design doc
  2022-08-09 15:50         ` Derrick Stolee
@ 2022-08-11 15:42           ` Phillip Wood
  0 siblings, 0 replies; 64+ messages in thread
From: Phillip Wood @ 2022-08-11 15:42 UTC (permalink / raw)
  To: Derrick Stolee, phillip.wood, Derrick Stolee via GitGitGadget,
	git
  Cc: gitster, me, newren, avarab, dyroneteng, Johannes.Schindelin,
	SZEDER Gábor, Matthew John Cheetham, Josh Steadmon

Hi Stolee

On 09/08/2022 16:50, Derrick Stolee wrote:
>> One small query - the document mentions CI farms as benefiting from this work
>> but my impression is that those commonly use shallow clones which are (quite
>> reasonably) not supported in this proposal.
> 
> There are two different kinds of CI farms.
> 
> The most common one is a SaaS CI system that provides machines on-demand,
> but each run starts from some "clean" state. For example, GitHub Actions
> runs CI builds of the Git project across a number of platforms. These
> machines need the source at HEAD, but do not need the full history. Further,
> they will erase the repository entirely at the end of the build, never
> fetching from those repositories. Thus, a shallow clone makes sense to
> minimize the data transfer. Bundles don't make sense here for multiple
> reasons, including that bundles must be closed under reachability and do
> not work for representing a shallow clone (see [1]). The other reason is
> that CI builds typically are triggered immediately after the commit appears
> on the origin Git server, so there is no time for a bundle provider to
> create a bundle representing that shallow clone.
> 
> [1] https://github.com/git/git/blob/c50926e1f48891e2671e1830dbcd2912a4563450/Documentation/technical/bundle-format.txt#L65-L69
> 
> The less common one is a private build farm. These machines are long-lived
> and controlled by the repository owner. They come pre-loaded with all of
> the software needed to build the repository. The best practice in this
> type of build farm is to keep a full clone of the repository in a well-
> known location and use incremental fetches to update the client repositories
> to download the commit necessary for the build. This type of build farm is
> typically self-hosted, but could also be hosted by a cloud provider. The
> bundle URI design allows ways to quickly bootstrap new build machines using
> a bundle provider (probably co-located with the build machines) as well as
> improving fetch times by creating frequent incremental bundles. The new
> commit being built is unlikely to exist immediately in the bundles, but it
> is unlikely to be too far ahead of any of the bundles.
> 
> While private build farms are less common, they do become necessary for
> large projects. Engineering teams that have the resources to self-host a
> build farm are likely to also have the resources to self-host a bundle
> server. They may not have the connections or desire to advertise those
> bundle server URIs from the origin Git server.
> 
> I hope this helps clarify my perspective as to why build farms using long-
> lived copies of the repository could take advantage of bundle URIs.

Yes it does, thanks for taking the time to explain that.

Phillip

> Thanks,
> -Stolee

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v4 1/2] docs: document bundle URI standard
  2022-08-09 13:12       ` [PATCH v4 1/2] docs: document bundle URI standard Derrick Stolee via GitGitGadget
@ 2022-10-04 19:48         ` Philip Oakley
  0 siblings, 0 replies; 64+ messages in thread
From: Philip Oakley @ 2022-10-04 19:48 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget, git
  Cc: gitster, me, newren, avarab, dyroneteng, Johannes.Schindelin,
	SZEDER Gábor, Matthew John Cheetham, Josh Steadmon,
	Derrick Stolee

Hi Stolee,

While catching up with stuff I missed, I saw two minor nits for future
updates, when reading the docs on GfW.

On 09/08/2022 14:12, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <derrickstolee@github.com>
>
> Introduce the idea of bundle URIs to the Git codebase through an
> aspirational design document. This document includes the full design
> intended to include the feature in its fully-implemented form. This will
> take several steps as detailed in the Implementation Plan section.
>
> By committing this document now, it can be used to motivate changes
> necessary to reach these final goals. The design can still be altered as
> new information is discovered.
>
> Signed-off-by: Derrick Stolee <derrickstolee@github.com>
> ---
>  [...]
> diff --git a/Documentation/technical/bundle-uri.txt b/Documentation/technical/bundle-uri.txt
> new file mode 100644
> index 00000000000..e6d63b868b8
> --- /dev/null
> +++ b/Documentation/technical/bundle-uri.txt
> @@ -0,0 +1,468 @@
> +Bundle URIs
> +===========
> +

[...]


> +
> +bundle.<id>.creationToken::
> +	This value is a nonnegative 64-bit integer used for sorting the bundles
> +	the list. 

s/the //   .. an extraneous 'the' ?

> This is used to download a subset of bundles during a fetch
> +	when `bundle.heuristic=creationToken`.
> +
> +bundle.<id>.location::
> +	This string value advertises a real-world location from where the bundle
> +	URI is served. This can be used to present the user with an option for
> +	which bundle URI to use or simply as an informative indicator of which
> +	bundle URI was selected by Git. This is only valuable when
> +	`bundle.mode` is `any`.
> +

[...]
> +Cloning with Bundle URIs
> +------------------------
> +
> +The primary need for bundle URIs is to speed up clones. The Git client
> +will interact with bundle URIs according to the following flow:
> +
> +1. The user specifies a bundle URI with the `--bundle-uri` command-line
> +   option _or_ the client discovers a bundle list advertised by the
> +   Git server.
> +
> +2. If the downloaded data from a bundle URI is a bundle, then the client
> +   inspects the bundle headers to check that the prerequisite commit OIDs
> +   are present in the client repository. If some are missing, then the
> +   client delays unbundling until other bundles have been unbundled,
> +   making those OIDs present. When all required OIDs are present, the
> +   client unbundles that data using a refspec. The default refspec is
> +   `+refs/heads/*:refs/bundles/*`, but this can be configured. These refs
> +   are stored so that later `git fetch` negotiations can communicate the
> +   bundled refs as `have`s, reducing the size of the fetch over the Git

s/ave`s/ave` 's/  .. 
The backtick is mistaken for an apostrophe and then is misformatted for
the Git for Windows HTML versions, e.g.
https://github.com/gitster/git-htmldocs/blob/master/technical/bundle-uri.html#L1056

> +   protocol. To allow pruning refs from this ref namespace, Git may
> +   introduce a numbered namespace (such as `refs/bundles/<i>/*`) such that
> +   stale bundle refs can be deleted.
> +

[...]

Quite readable. Thanks.

Philip

^ permalink raw reply	[flat|nested] 64+ messages in thread

end of thread, other threads:[~2022-10-04 19:48 UTC | newest]

Thread overview: 64+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-06 19:55 [PATCH 0/6] bundle URIs: design doc and initial git fetch --bundle-uri implementation Derrick Stolee via GitGitGadget
2022-06-06 19:55 ` [PATCH 1/6] docs: document bundle URI standard Derrick Stolee via GitGitGadget
2022-06-06 22:18   ` Junio C Hamano
2022-06-08 19:20     ` Derrick Stolee
2022-06-08 19:27       ` Junio C Hamano
2022-06-08 20:44         ` Junio C Hamano
2022-06-08 20:39       ` Junio C Hamano
2022-06-08 20:52         ` Derrick Stolee
2022-06-07  0:33   ` Junio C Hamano
2022-06-08 19:46     ` Derrick Stolee
2022-06-08 21:01       ` Junio C Hamano
2022-06-09 16:00         ` Derrick Stolee
2022-06-09 17:56           ` Junio C Hamano
2022-06-09 18:27             ` Ævar Arnfjörð Bjarmason
2022-06-09 19:39             ` Derrick Stolee
2022-06-09 20:13               ` Junio C Hamano
2022-06-21 19:34       ` Derrick Stolee
2022-06-21 20:16         ` Junio C Hamano
2022-06-21 21:10           ` Derrick Stolee
2022-06-21 21:33             ` Junio C Hamano
2022-06-06 19:55 ` [PATCH 2/6] remote-curl: add 'get' capability Derrick Stolee via GitGitGadget
2022-07-21 22:59   ` Junio C Hamano
2022-06-06 19:55 ` [PATCH 3/6] bundle-uri: create basic file-copy logic Derrick Stolee via GitGitGadget
2022-06-06 19:55 ` [PATCH 4/6] fetch: add --bundle-uri option Derrick Stolee via GitGitGadget
2022-06-06 19:55 ` [PATCH 5/6] bundle-uri: add support for http(s):// and file:// Derrick Stolee via GitGitGadget
2022-06-06 19:55 ` [PATCH 6/6] fetch: add 'refs/bundle/' to log.excludeDecoration Derrick Stolee via GitGitGadget
2022-06-29 20:40 ` [PATCH v2 0/6] bundle URIs: design doc and initial git fetch --bundle-uri implementation Derrick Stolee via GitGitGadget
2022-06-29 20:40   ` [PATCH v2 1/6] docs: document bundle URI standard Derrick Stolee via GitGitGadget
2022-07-18  9:20     ` SZEDER Gábor
2022-07-21 12:09     ` Matthew John Cheetham
2022-07-22 13:52       ` Derrick Stolee
2022-07-22 16:03       ` Derrick Stolee
2022-07-21 21:39     ` Josh Steadmon
2022-07-22 13:15       ` Derrick Stolee
2022-07-22 15:01       ` Derrick Stolee
2022-06-29 20:40   ` [PATCH v2 2/6] remote-curl: add 'get' capability Derrick Stolee via GitGitGadget
2022-07-21 21:41     ` Josh Steadmon
2022-06-29 20:40   ` [PATCH v2 3/6] bundle-uri: create basic file-copy logic Derrick Stolee via GitGitGadget
2022-07-21 21:45     ` Josh Steadmon
2022-07-22 13:18       ` Derrick Stolee
2022-06-29 20:40   ` [PATCH v2 4/6] fetch: add --bundle-uri option Derrick Stolee via GitGitGadget
2022-06-29 20:40   ` [PATCH v2 5/6] bundle-uri: add support for http(s):// and file:// Derrick Stolee via GitGitGadget
2022-06-29 20:40   ` [PATCH v2 6/6] fetch: add 'refs/bundle/' to log.excludeDecoration Derrick Stolee via GitGitGadget
2022-07-21 21:47     ` Josh Steadmon
2022-07-22 13:20       ` Derrick Stolee
2022-07-21 21:48   ` [PATCH v2 0/6] bundle URIs: design doc and initial git fetch --bundle-uri implementation Josh Steadmon
2022-07-21 21:56     ` Junio C Hamano
2022-07-25 13:53   ` [PATCH v3 0/2] " Derrick Stolee via GitGitGadget
2022-07-25 13:53     ` [PATCH v3 1/2] docs: document bundle URI standard Derrick Stolee via GitGitGadget
2022-07-28  1:23       ` tenglong.tl
2022-08-01 13:42         ` Derrick Stolee
2022-07-25 13:53     ` [PATCH v3 2/2] bundle-uri: add example bundle organization Derrick Stolee via GitGitGadget
2022-08-04 16:09       ` Matthew John Cheetham
2022-08-04 17:39         ` Derrick Stolee
2022-08-04 20:29           ` Ævar Arnfjörð Bjarmason
2022-08-05 18:29             ` Derrick Stolee
2022-07-25 20:05     ` [PATCH v3 0/2] bundle URIs: design doc and initial git fetch --bundle-uri implementation Josh Steadmon
2022-08-09 13:12     ` [PATCH v4 0/2] bundle URIs: design doc Derrick Stolee via GitGitGadget
2022-08-09 13:12       ` [PATCH v4 1/2] docs: document bundle URI standard Derrick Stolee via GitGitGadget
2022-10-04 19:48         ` Philip Oakley
2022-08-09 13:12       ` [PATCH v4 2/2] bundle-uri: add example bundle organization Derrick Stolee via GitGitGadget
2022-08-09 13:49       ` [PATCH v4 0/2] bundle URIs: design doc Phillip Wood
2022-08-09 15:50         ` Derrick Stolee
2022-08-11 15:42           ` Phillip Wood

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).