git@vger.kernel.org mailing list mirror (one of many)
From: Matthew John Cheetham <mjcheetham@outlook.com>
To: Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>
Cc: gitster@pobox.com, me@ttaylorr.com, newren@gmail.com,
	avarab@gmail.com, dyroneteng@gmail.com,
	Johannes.Schindelin@gmx.de, "SZEDER Gábor" <szeder.dev@gmail.com>,
	"Josh Steadmon" <steadmon@google.com>,
	"Derrick Stolee" <derrickstolee@github.com>,
	git@vger.kernel.org
Subject: Re: [PATCH v3 2/2] bundle-uri: add example bundle organization
Date: Thu, 4 Aug 2022 17:09:18 +0100
Message-ID: <AS8PR03MB86898A2F7156918A390296CAC09F9@AS8PR03MB8689.eurprd03.prod.outlook.com>
In-Reply-To: <a933471c3afdd2c95d4115719c24d79e5e430b4d.1658757188.git.gitgitgadget@gmail.com>

On 2022-07-25 14:53, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <derrickstolee@github.com>
> 
> The previous change introduced the bundle URI design document. It
> creates a flexible set of options that allow bundle providers many ways
> to organize Git object data and speed up clones and fetches. It is
> particularly important that we have flexibility so we can apply future
> advancements as new ideas for efficiently organizing Git data are
> discovered.
> 
> However, the design document does not provide even an example of how
> bundles could be organized, and that makes it difficult to envision how
> the feature should work at the end of the implementation plan.
> 
> Add a section that details how a bundle provider could work, including
> using the Git server advertisement for multiple geo-distributed servers.
> This organization is based on the GVFS Cache Servers which have
> successfully used similar ideas to provide fast object access and
> reduced server load for very large repositories.

Thanks! This patch is helpful guidance for bundle server implementors.

> +This example organization is a simplified model of what is used by the
> +GVFS Cache Servers (see section near the end of this document) which have
> +been beneficial in speeding up clones and fetches for very large
> +repositories, although using extra software outside of Git.

Nit: might be a good idea to use "VFS for Git" rather than the old name
"GVFS" [1].

> +The bundle provider deploys servers across multiple geographies. Each
> +server manages its own bundle set. The server can track a number of Git
> +repositories, but provides a bundle list for each based on a pattern. For
> +example, when mirroring a repository at `https://<domain>/<org>/<repo>`
> +the bundle server could have its bundle list available at
> +`https://<server-url>/<domain>/<org>/<repo>`. The origin Git server can
> +list all of these servers under the "any" mode:
> +
> +	[bundle]
> +		version = 1
> +		mode = any
> +		
> +	[bundle "eastus"]
> +		uri = https://eastus.example.com/<domain>/<org>/<repo>
> +		
> +	[bundle "europe"]
> +		uri = https://europe.example.com/<domain>/<org>/<repo>
> +		
> +	[bundle "apac"]
> +		uri = https://apac.example.com/<domain>/<org>/<repo>
> +
> +This "list of lists" is static and only changes if a bundle server is
> +added or removed.
> +
> +Each bundle server manages its own set of bundles. The initial bundle list
> +contains only a single bundle, containing all of the objects received from
> +cloning the repository from the origin server. The list uses the
> +`creationToken` heuristic and a `creationToken` is made for the bundle
> +based on the server's timestamp.

Just to confirm, in this example the origin server advertises a single
URL (over v2 protocol) that points to this example "list of lists"?

Remote -> 1 URL -> List(any/split by geo) -> List(all/split by time)
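
To make the chain above concrete, here is a minimal sketch of how a client
could walk it. This is only my reading of the example, not Git's actual
implementation: the dict shapes and the names `GEO_LIST`, `plan_downloads`,
`fetch_list` and `choose` are all hypothetical, and returning bundles
oldest-first is an assumption made for simplicity.

```python
# Hypothetical in-memory model of the two-level layout. The dict shapes and
# function names are illustrative only and are not part of Git.
GEO_LIST = {
    "mode": "any",
    "bundles": {
        "eastus": "https://eastus.example.com/<domain>/<org>/<repo>",
        "europe": "https://europe.example.com/<domain>/<org>/<repo>",
        "apac": "https://apac.example.com/<domain>/<org>/<repo>",
    },
}


def plan_downloads(geo_list, fetch_list, choose):
    """Return bundle URIs for a Remote -> any-list -> all-list layout."""
    if geo_list["mode"] != "any":
        raise ValueError("expected the outer list to use mode=any")
    # mode=any: one server is enough, so pick exactly one entry.
    server_id = choose(sorted(geo_list["bundles"]))
    time_list = fetch_list(geo_list["bundles"][server_id])
    if time_list["mode"] != "all":
        raise ValueError("expected the inner list to use mode=all")
    # mode=all: every bundle is needed; order by creationToken, oldest first.
    ordered = sorted(time_list["bundles"], key=lambda b: b["creationToken"])
    return [b["uri"] for b in ordered]
```

Here `fetch_list` stands in for whatever downloads and parses the per-server
list, and `choose` for the client's server preference (for example, lowest
latency).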

> +The bundle server runs regularly-scheduled updates for the bundle list,
> +such as once a day. During this task, the server fetches the latest
> +contents from the origin server and generates a bundle containing the
> +objects reachable from the latest origin refs, but not contained in a
> +previously-computed bundle. This bundle is added to the list, with care
> +that the `creationToken` is strictly greater than the previous maximum
> +`creationToken`.
> +
> +When the bundle list grows too large, say more than 30 bundles, then the
> +oldest "_N_ minus 30" bundles are combined into a single bundle. This
> +bundle's `creationToken` is equal to the maximum `creationToken` among the
> +merged bundles.
> +
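
The token bookkeeping in the two paragraphs above can be sketched as
follows. This is a rough model under my own assumptions: bundles are
plain (name, creationToken) pairs, the helper names `next_creation_token`
and `compact` are hypothetical, and only the list maintenance is shown,
not the combining of object data.

```python
def next_creation_token(existing_tokens, now):
    """Pick a token strictly greater than every existing one."""
    if not existing_tokens:
        return now
    # Fall back to max + 1 if the clock has not advanced past the last token.
    return max(now, max(existing_tokens) + 1)


def compact(bundles, keep=30):
    """Merge the oldest N - keep bundles into one entry.

    bundles is a list of (name, creation_token) pairs. Only the list
    bookkeeping is modeled; combining the object data is out of scope.
    """
    ordered = sorted(bundles, key=lambda b: b[1])
    if len(ordered) <= keep:
        return ordered
    oldest, newest = ordered[:-keep], ordered[-keep:]
    # The merged bundle inherits the maximum token among its inputs.
    merged_token = max(token for _, token in oldest)
    return [("merged-%d" % merged_token, merged_token)] + newest
```

Read literally, merging the oldest "N minus 30" bundles leaves 31 entries
(one merged bundle plus the 30 newest), which is what this sketch produces.
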
> +An example bundle list is provided here, although it only has two daily
> +bundles and not a full list of 30:
> +
> +	[bundle]
> +		version = 1
> +		mode = all
> +		heuristic = creationToken
> +
> +	[bundle "2022-02-13-1644770820-daily"]
> +		uri = https://eastus.example.com/<domain>/<org>/<repo>/2022-02-13-1644770820-daily.bundle
> +		creationToken = 1644770820
> +
> +	[bundle "2022-02-09-1644442601-daily"]
> +		uri = https://eastus.example.com/<domain>/<org>/<repo>/2022-02-09-1644442601-daily.bundle
> +		creationToken = 1644442601
> +
> +	[bundle "2022-02-02-1643842562"]
> +		uri = https://eastus.example.com/<domain>/<org>/<repo>/2022-02-02-1643842562.bundle
> +		creationToken = 1643842562
> +
> +To avoid storing and serving object data in perpetuity despite becoming
> +unreachable in the origin server, this bundle merge can be more careful.
> +Rather than taking an absolute union of the old bundles, the bundle
> +can be created by looking at the newer bundles and ensuring that their
> +necessary commits are all available in this merged bundle (or in another
> +one of the newer bundles). This allows "expiring" object data that is not
> +being used by new commits in this window of time. That data could be
> +reintroduced by a later push.
> +
> +This data organization has two main goals. First, initial
> +clones of the repository become faster by downloading precomputed object
> +data from a closer source. Second, `git fetch` commands can be faster,
> +especially if the client has not fetched for a few days. However, if a
> +client does not fetch for 30 days, then the bundle list organization would
> +cause redownloading a large amount of object data.
> +
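
The fetch trade-off described above can be illustrated with a small
sketch. It assumes the client remembers the largest creationToken it has
already applied (`last_seen_token`, a hypothetical name) and downloads
every bundle with a greater token. The tokens are taken from the example
list; the function name is made up, and Git's real selection logic may
well differ.

```python
# Tokens copied from the example bundle list; ids match the section names.
BUNDLE_LIST = [
    {"id": "2022-02-13-1644770820-daily", "creationToken": 1644770820},
    {"id": "2022-02-09-1644442601-daily", "creationToken": 1644442601},
    {"id": "2022-02-02-1643842562", "creationToken": 1643842562},
]


def bundles_to_fetch(bundle_list, last_seen_token):
    """Return ids of bundles newer than the client's last seen token."""
    newer = [b for b in bundle_list if b["creationToken"] > last_seen_token]
    # Oldest first, so earlier bundles are applied before later ones.
    return [b["id"] for b in sorted(newer, key=lambda b: b["creationToken"])]
```

A client whose last token is 1644442601 only needs the newest daily
bundle. A client whose last token predates 1643842562 also has to
download the large merged bundle again, which is the redownload cost the
paragraph above warns about.
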
> +One way to make this organization more useful to users who fetch frequently
> +is to have more frequent bundle creation. For example, bundles could be
> +created every hour, and then once a day those "hourly" bundles could be
> +merged into a "daily" bundle. The daily bundles are merged into the
> +oldest bundle after 30 days.
> +
> +It is recommended that this bundle strategy is repeated with the `blob:none`
> +filter if clients of this repository are expecting to use blobless partial
> +clones. This list of blobless bundles stays in the same list as the full
> +bundles, but uses the `bundle.<id>.filter` key to separate the two groups.
> +For very large repositories, the bundle provider may want to _only_ provide
> +blobless bundles.
> +
>  Implementation Plan
>  -------------------
>  

In general this looks good to me!

[1] https://github.com/microsoft/VFSForGit/issues/72
