git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
From: Junio C Hamano <gitster@pobox.com>
To: "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org, me@ttaylorr.com, newren@gmail.com,
	avarab@gmail.com, dyroneteng@gmail.com,
	Johannes.Schindelin@gmx.de,
	Derrick Stolee <derrickstolee@github.com>
Subject: Re: [PATCH 1/6] docs: document bundle URI standard
Date: Mon, 06 Jun 2022 17:33:59 -0700	[thread overview]
Message-ID: <xmqqtu8x1fd4.fsf@gitster.g> (raw)
In-Reply-To: <e771b2971d092af5ea8a47eb708d03e34b284a0f.1654545325.git.gitgitgadget@gmail.com> (Derrick Stolee via GitGitGadget's message of "Mon, 06 Jun 2022 19:55:20 +0000")

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> +Cloning with Bundle URIs
> +------------------------
> +
> +The primary need for bundle URIs is to speed up clones. The Git client
> +will interact with bundle URIs according to the following flow:
> +
> +1. The user specifies a bundle URI with the `--bundle-uri` command-line
> +   option _or_ the client discovers a bundle list advertised by the
> +   Git server.
> +
> +2. If the downloaded data from a bundle URI is a bundle, then the client
> +   inspects the bundle headers to check that the negative commit OIDs are

Although "negative" would be understandable to pros, the commits
required to unbundle a bundle file are officially called
"prerequisite commits" (cf. "git bundle --help"), so that may be
easier to understand by ordinary readers.

> +   present in the client repository. If some are missing, then the client
> +   delays unbundling until other bundles have been unbundled, making those
> +   OIDs present. When all required OIDs are present, the client unbundles
> +   that data using a refspec. The default refspec is
> +   `+refs/heads/*:refs/bundles/*`, but this can be configured.

The refs/bundles/ appear in the document only here, and it is
unclear why we even want it (I am assuming this is against gc while
"git clone" is still running) or how we are going to retire it, if
ever.  If there are multiple bundle files involved in this "git clone",
to anchor objects that are necessary against "gc", don't we need to use
refs/bundles/<i>/* or something like that, where <i> is uniquely assigned
number locally?

> +3. If the file is instead a bundle list, then the client inspects the
> +   `bundle.list.mode` to see if the list is of the `all` or `any` form.

If the downloaded file is not a bundle (e.g. "git bundle list-heads"
barfs on it) and it is not parseable with our configuration parser,
do we error out, or do we pretend as if that bundle file or the TOC
did not exist (if the bundle list with mode=any at the higher level
has appropriate alternatives)?

> +   a. If `bundle.list.mode=all`, then the client considers all bundle
> +      URIs. The list is reduced based on the `bundle.<id>.filter` options
> +      matching the client repository's partial clone filter.

OK, this answers my earlier question nicely.  It probably means that
either the presentation order needs a bit of rethinking, or "we
group by .filter" needs to be mentioned a lot earlier.

> Then, all
> +      bundle URIs are requested. If the `bundle.<id>.timestamp` heuristic
> +      is provided, then the bundles are downloaded in reverse-
> +      chronological order, stopping when a bundle has all required OIDs.

Stop as soon as just one bundle has all the prerequisite objects, or
should we keep going until all bundles have their prerequisites
satisfied?  I presume it is the latter.

> +      The bundles can then be unbundled in chronological order. The client
> +      stores the latest timestamp as a heuristic for avoiding future
> +      downloads if the bundle list does not advertise newer bundles.

So we see a list, we start grabbing from new to old.  Newer ones
that are based on older ones may have dependencies, so we do not
unbndle until we have all the prerequisites for them.  The bundles
that satisfy their prerequisites are unbundled---that would give us
enough objects to play with.  What happens to the refs recorded in
them, though?

Is the timestamp per the serving host, or per the CDN host that
serve us bundle files, or...?  I guess it is premature to discuss it
here. "git clone" bootstraps from the advertisement made only by a
single serving host, so the single newest timestamp among the
bundles used from the bundle list is what we store here.  How that
timestamp is used is primarily of interest in future fetching, which
would be discussed later.

> +Fetching with Bundle URIs
> +-------------------------
> +
> +When the client fetches new data, it can decide to fetch from bundle
> +servers before fetching from the origin remote. This could be done via a
> +command-line option, but it is more likely useful to use a config value
> +such as the one specified during the clone.
> +
> +The fetch operation follows the same procedure to download bundles from a
> +bundle list (although we do _not_ want to use parallel downloads here). We
> +expect that the process will end when all negative commit OIDs in a thin
> +bundle are already in the object database.

I do not see why we do not want to use parallel download, though.
If our last bundle download was last month, and they have two newer
bundles since then, don't we want to grab both at the same time?
Wasn't that the point of recording the newest timestamp when "git
clone" grabbed bundles?

> +Error Conditions
> +----------------
> +
> +If the Git client discovers something unexpected while downloading
> +information according to a bundle URI or the bundle list found at that
> +location, then Git can ignore that data and continue as if it was not
> +given a bundle URI. The remote Git server is the ultimate source of truth,
> +not the bundle URI.
> +
> +Here are a few example error conditions:
> +
> +* The client fails to connect with a server at the given URI or a connection
> +  is lost without any chance to recover.
> +
> +* The client receives a response other than `200 OK` (such as `404 Not Found`,
> +  `401 Not Authorized`, or `500 Internal Server Error`). The client should
> +  use the `credential.helper` to attempt authentication after the first
> +  `401 Not Authorized` response, but a second such response is a failure.
> +
> +* The client receives data that is not parsable as a bundle or table of
> +  contents.

Is it an error if bundle.<id>.list and the contents disagree?

It is fine to call the possibility other than "a bundle file" "table
of contents", but then let's do so consistently throughout the document.
When we explain bundle.<id>.list, we should not call the other
possibility "list" but "table of contents", for example.

> +* The bundle list describes a directed cycle in the
> +  `bundle.<id>.requires` links.
> +
> +* A bundle includes a filter that does not match expectations.

Does this refer to a mismatch between the filter recorded in a
bundle and bundle.<id>.filter entry that described the bundle?

> +* The client cannot unbundle the bundles because the negative commit OIDs
> +  are not in the object database and there are no more
> +  `bundle.<id>.requires` links to follow.

Is a .requires link mandatory?  In a mode=all table of contents, we
should not have to have .requires at all.  In the above description
on how bundle files are downloaded and in what order in Clone and
Fetch operations, I didn't see any mention of .requires at all, but
I think there should be.  For example, the timestamp heuristics may
say the bundle A is the latest.  In a mode=any table of contents,
shouldn't bundles that contain prerequisite commits of the bundle A
be pointed by A's .requires fields?

> +4. Allow the client to understand the `bundle.list.forFetch` configuration
> +   and the `bundle.<id>.timestamp` heuristic. When `git clone` discovers a
> +   bundle URI with `bundle.list.forFetch=true`, it configures the client
> +   repository to check that bundle URI during later `git fetch <remote>`
> +   commands.

So bundle.list.forFetch is, unlike everything else we saw that
looked like a configuration variable in this document, a
configuration variable whose value is boolean?

Ah, no.  You mean the "git clone" sees a bundle URI, grabs it and
sees a table of contents, and in it, finds "bundle.forFetch" is set
to true?  Then "git fetch <remote>" is configured to also use bundle
URI?

It is unclear to me (with the information given here so far), why we
want this.  Isn't this something the responder to "git fetch" can
advertise over the wire?  If we leave a permanent record in the
resulting repository to do the bundle URI during 'fetch', wouldn't
it become more cumbersome to cancel (iow "no, you no longer want to
talk to me with bundle URI feature") from the server side?

> +5. Allow clients to discover bundle URIs during `git fetch` and configure
> +   a bundle URI for later fetches if `bundle.list.forFetch=true`.
> +
> +6. Implement the "inspect headers" heuristic to reduce data downloads when
> +   the `bundle.<id>.timestamp` heuristic is not available.

Sounds sensible, even though I do not offhand see why the "peek
header and stop" is any less useful when the timestamp heurisitc is
available.

> +A major downside to this mechanism is that the origin server needs to know
> +_exactly_ what is in those packfiles, and the packfiles need to be available
> +to the user for some time after the server has responded. This coupling
> +between the origin and the packfile data is difficult to manage.

Hmph.  I strongly suspect that there are Googlers on the list who
have been managing such JGit server installations.  Has this
"coupling" been difficult to manage for you guys in the real world?

> +Further, this implementation is extremely hard to make work with fetches.

IOW, they do this only for clones and not fetches?

> +Related Work: GVFS Cache Servers
> +--------------------------------
> ...
> +During a `git fetch`, a hook requests the prefetch endpoint using the
> +most-recent timestamp from a previously-downloaded prefetch packfile.
> +Only the list of packfiles with later timestamps are downloaded.

That sounds quite straight-forward.  Do you envision that their
incremental snapshot packfile chains can somehow be shared with the
bundle URI implementations?  Doesn't it make it more cumbersome that
this proposal uses the bundles as the encapsulation format, rather
than packfiles?  As you are sending extra pieces of information on
top of the payload in the form of table-of-contents already, I
wonder if bundle.<id>.uri should point at a bare packfile (instead
of a bundle), while multi-valued bundle.<id>.prerequisite give the
prerequisite objects?  The machinery that is already generating the
prefetch packfiles already know which packfile has what
prerequisites in it, so it rather looks simpler if the solution did
not involve bundles.

Thanks.  Here ends my two-part review on the doc.

  parent reply	other threads:[~2022-06-07  0:34 UTC|newest]

Thread overview: 64+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-06-06 19:55 [PATCH 0/6] bundle URIs: design doc and initial git fetch --bundle-uri implementation Derrick Stolee via GitGitGadget
2022-06-06 19:55 ` [PATCH 1/6] docs: document bundle URI standard Derrick Stolee via GitGitGadget
2022-06-06 22:18   ` Junio C Hamano
2022-06-08 19:20     ` Derrick Stolee
2022-06-08 19:27       ` Junio C Hamano
2022-06-08 20:44         ` Junio C Hamano
2022-06-08 20:39       ` Junio C Hamano
2022-06-08 20:52         ` Derrick Stolee
2022-06-07  0:33   ` Junio C Hamano [this message]
2022-06-08 19:46     ` Derrick Stolee
2022-06-08 21:01       ` Junio C Hamano
2022-06-09 16:00         ` Derrick Stolee
2022-06-09 17:56           ` Junio C Hamano
2022-06-09 18:27             ` Ævar Arnfjörð Bjarmason
2022-06-09 19:39             ` Derrick Stolee
2022-06-09 20:13               ` Junio C Hamano
2022-06-21 19:34       ` Derrick Stolee
2022-06-21 20:16         ` Junio C Hamano
2022-06-21 21:10           ` Derrick Stolee
2022-06-21 21:33             ` Junio C Hamano
2022-06-06 19:55 ` [PATCH 2/6] remote-curl: add 'get' capability Derrick Stolee via GitGitGadget
2022-07-21 22:59   ` Junio C Hamano
2022-06-06 19:55 ` [PATCH 3/6] bundle-uri: create basic file-copy logic Derrick Stolee via GitGitGadget
2022-06-06 19:55 ` [PATCH 4/6] fetch: add --bundle-uri option Derrick Stolee via GitGitGadget
2022-06-06 19:55 ` [PATCH 5/6] bundle-uri: add support for http(s):// and file:// Derrick Stolee via GitGitGadget
2022-06-06 19:55 ` [PATCH 6/6] fetch: add 'refs/bundle/' to log.excludeDecoration Derrick Stolee via GitGitGadget
2022-06-29 20:40 ` [PATCH v2 0/6] bundle URIs: design doc and initial git fetch --bundle-uri implementation Derrick Stolee via GitGitGadget
2022-06-29 20:40   ` [PATCH v2 1/6] docs: document bundle URI standard Derrick Stolee via GitGitGadget
2022-07-18  9:20     ` SZEDER Gábor
2022-07-21 12:09     ` Matthew John Cheetham
2022-07-22 13:52       ` Derrick Stolee
2022-07-22 16:03       ` Derrick Stolee
2022-07-21 21:39     ` Josh Steadmon
2022-07-22 13:15       ` Derrick Stolee
2022-07-22 15:01       ` Derrick Stolee
2022-06-29 20:40   ` [PATCH v2 2/6] remote-curl: add 'get' capability Derrick Stolee via GitGitGadget
2022-07-21 21:41     ` Josh Steadmon
2022-06-29 20:40   ` [PATCH v2 3/6] bundle-uri: create basic file-copy logic Derrick Stolee via GitGitGadget
2022-07-21 21:45     ` Josh Steadmon
2022-07-22 13:18       ` Derrick Stolee
2022-06-29 20:40   ` [PATCH v2 4/6] fetch: add --bundle-uri option Derrick Stolee via GitGitGadget
2022-06-29 20:40   ` [PATCH v2 5/6] bundle-uri: add support for http(s):// and file:// Derrick Stolee via GitGitGadget
2022-06-29 20:40   ` [PATCH v2 6/6] fetch: add 'refs/bundle/' to log.excludeDecoration Derrick Stolee via GitGitGadget
2022-07-21 21:47     ` Josh Steadmon
2022-07-22 13:20       ` Derrick Stolee
2022-07-21 21:48   ` [PATCH v2 0/6] bundle URIs: design doc and initial git fetch --bundle-uri implementation Josh Steadmon
2022-07-21 21:56     ` Junio C Hamano
2022-07-25 13:53   ` [PATCH v3 0/2] " Derrick Stolee via GitGitGadget
2022-07-25 13:53     ` [PATCH v3 1/2] docs: document bundle URI standard Derrick Stolee via GitGitGadget
2022-07-28  1:23       ` tenglong.tl
2022-08-01 13:42         ` Derrick Stolee
2022-07-25 13:53     ` [PATCH v3 2/2] bundle-uri: add example bundle organization Derrick Stolee via GitGitGadget
2022-08-04 16:09       ` Matthew John Cheetham
2022-08-04 17:39         ` Derrick Stolee
2022-08-04 20:29           ` Ævar Arnfjörð Bjarmason
2022-08-05 18:29             ` Derrick Stolee
2022-07-25 20:05     ` [PATCH v3 0/2] bundle URIs: design doc and initial git fetch --bundle-uri implementation Josh Steadmon
2022-08-09 13:12     ` [PATCH v4 0/2] bundle URIs: design doc Derrick Stolee via GitGitGadget
2022-08-09 13:12       ` [PATCH v4 1/2] docs: document bundle URI standard Derrick Stolee via GitGitGadget
2022-10-04 19:48         ` Philip Oakley
2022-08-09 13:12       ` [PATCH v4 2/2] bundle-uri: add example bundle organization Derrick Stolee via GitGitGadget
2022-08-09 13:49       ` [PATCH v4 0/2] bundle URIs: design doc Phillip Wood
2022-08-09 15:50         ` Derrick Stolee
2022-08-11 15:42           ` Phillip Wood

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: http://vger.kernel.org/majordomo-info.html

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=xmqqtu8x1fd4.fsf@gitster.g \
    --to=gitster@pobox.com \
    --cc=Johannes.Schindelin@gmx.de \
    --cc=avarab@gmail.com \
    --cc=derrickstolee@github.com \
    --cc=dyroneteng@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=me@ttaylorr.com \
    --cc=newren@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).