From: Jeff Hostetler <>
Subject: [PATCH] partial-clone: design doc
Date: Fri,  8 Dec 2017 19:26:36 +0000
Message-ID: <>
In-Reply-To: <>

From: Jeff Hostetler <>

First draft of design document for partial clone feature.

Signed-off-by: Jeff Hostetler <>
Signed-off-by: Jonathan Tan <>
 Documentation/technical/partial-clone.txt | 240 ++++++++++++++++++++++++++++++
 1 file changed, 240 insertions(+)
 create mode 100644 Documentation/technical/partial-clone.txt

diff --git a/Documentation/technical/partial-clone.txt b/Documentation/technical/partial-clone.txt
new file mode 100644
index 0000000..7ab39d8
--- /dev/null
+++ b/Documentation/technical/partial-clone.txt
@@ -0,0 +1,240 @@
+Partial Clone Design Notes
+The "Partial Clone" feature is a performance optimization for git that
+allows git to function without having a complete copy of the repository.
+During clone and fetch operations, git normally downloads the complete
+contents and history of the repository.  That is, during clone the client
+receives all of the commits, trees, and blobs in the repository into a
+local ODB.  Subsequent fetches extend the local ODB with any new objects.
+For large repositories, this can take significant time to download and
+large amounts of disk space to store.
+The goal of this work is to allow git to better handle extremely large
+repositories.  Often in these repositories there are many files that the
+user does not need such as ancient versions of source files, files in
+portions of the worktree outside of the user's work area, or large binary
+assets.  If we can avoid downloading such unneeded objects *in advance*
+during clone and fetch operations, we can decrease download times and
+reduce ODB disk usage.
+Partial clone is independent of and not intended to conflict with
+shallow-clone, refspec, or limited-ref mechanisms since these all operate
+at the DAG level whereas partial clone and fetch work *within* the set
+of commits already chosen for download.
+Design Overview
+Partial clone logically consists of the following parts:
+- A mechanism for the client to describe unneeded or unwanted objects to
+  the server.
+- A mechanism for the server to omit such unwanted objects from packfiles
+  sent to the client.
+- A mechanism for the client to gracefully handle missing objects (that
+  were previously omitted by the server).
+- A mechanism for the client to backfill missing objects as needed.
+Design Details
+- A new pack-protocol capability "filter" is added to the fetch-pack and
+  upload-pack negotiation.
+  This uses the existing capability discovery mechanism.
+  See "filter" in Documentation/technical/pack-protocol.txt.
+- Clients pass a "filter-spec" to clone and fetch which is passed to the
+  server to request filtering during packfile construction.
+  There are various filters available to accommodate different situations.
+  See "--filter=<filter-spec>" in Documentation/rev-list-options.txt.
+- On the server, pack-objects applies the requested filter-spec as it
+  creates "filtered" packfiles for the client.
+  These filtered packfiles are incomplete in the traditional sense because
+  they may contain trees that reference blobs that the client does not have.
+==== How the local repository gracefully handles missing objects
+With partial clone, the fact that objects can be missing makes such
+repositories incompatible with older versions of Git, necessitating a
+repository extension (see the documentation of "extensions.partialClone"
+for more information).
+An object may be missing due to a partial clone or fetch, or missing due
+to repository corruption. To differentiate these cases, the local
+repository specially indicates packfiles obtained from the promisor
+remote. These "promisor packfiles" consist of a "<name>.promisor" file
+with arbitrary contents (like the "<name>.keep" files), in addition to
+their "<name>.pack" and "<name>.idx" files. (In the future, this ability
+may be extended to loose objects[a].)
+The local repository considers a "promisor object" to be an object that
+it knows (to the best of its ability) that the promisor remote has,
+either because the local repository has that object in one of its
+promisor packfiles, or because another promisor object refers to it. Git
+can then check whether a missing object is a promisor object; if it is,
+the absence is common and expected rather than a sign of corruption. This
+also means that there is no need to explicitly maintain an
+expensive-to-modify list of missing objects on the client.
+Almost all Git code currently expects any objects referred to by other
+objects to be present. Therefore, a fallback mechanism is added:
+whenever Git attempts to read an object that is found to be missing, it
+will attempt to fetch it from the promisor remote, expanding the subset
+of objects available locally, then reattempt the read. This allows
+objects to be "faulted in" from the promisor remote without complicated
+prediction algorithms. For efficiency reasons, no check as to whether
+the missing object is a promisor object is performed. This fallback
+tends to be slow because objects are fetched one at a time.
+The fallback mechanism can be turned off and on through a global
+variable (fetch_if_missing).
+checkout (and any other command using unpack-trees) has been taught to
+batch blob fetching. rev-list has been taught to be able to print
+filtered or missing objects and can be used with more general batch
+fetch scripts. In the future, Git commands will be updated to batch such
+fetches or otherwise handle missing objects more efficiently.
+Fsck has been updated to be fully aware of promisor objects. The repack
+in GC has been updated to not touch promisor packfiles at all, and to
+only repack other objects.
+The global variable fetch_if_missing is used to control whether an
+object lookup will attempt to dynamically fetch a missing object or
+report an error.
+===== Fetching missing objects
+Fetching of objects is done with the existing transport mechanism via
+transport_fetch_refs(), setting a new transport option
+TRANS_OPT_NO_DEPENDENTS to indicate that only the objects themselves are
+desired, not any object that they refer to. Because some transports
+invoke fetch_pack() in the same process, fetch_pack() has been updated
+to not use any object flags when the corresponding argument
+(no_dependents) is set.
+The local repository sends a request with the hashes of all requested
+objects as "want" lines, and does not perform any packfile negotiation.
+It then receives a packfile.
+Because we are reusing the existing fetch-pack mechanism, fetching
+currently fetches all objects referred to by the requested objects, even
+though they are not necessary.
+Footnotes
+[a] Remembering that loose objects are promisor objects is mainly
+    important for trees, since they may refer to promisor blobs that
+    the user does not have.  We do not need to mark loose blobs as
+    promisor because they do not refer to other objects.
+Current Limitations
+- The remote used for a partial clone (or the first partial fetch
+  following a regular clone) is marked as the "promisor remote".
+  We are currently limited to a single promisor remote and only that
+  remote may be used for subsequent partial fetches.
+- Dynamic object fetching will only ask the promisor remote for missing
+  objects.  We assume that the promisor remote has a complete view of the
+  repository and can satisfy all such requests.
+  Future work may lift this restriction when we figure out how to route
+  such requests.  The current assumption is that partial clone will not be
+  used for triangular workflows that would need that (at least initially).
+- Repack essentially treats promisor and non-promisor packfiles as 2
+  distinct partitions and does not mix them.  Repack currently only works
+  on non-promisor packfiles and loose objects.
+  Future work may teach repack to handle promisor packfiles as well (while
+  keeping them in a different partition from the others).
+- The current object filtering mechanism does not make use of packfile
+  bitmaps (when present).
+  We should allow this for filters that are not pathname-based.
+- Currently, dynamic object fetching invokes fetch-pack for each item
+  because most algorithms stumble upon a missing object and need to have
+  it resolved before continuing their work.  This may incur significant
+  overhead -- and multiple authentication requests -- if many objects are
+  needed.
+  We need to investigate use of a long-running process, such as proposed
+  in [5,6] to reduce process startup and overhead costs.
+  It would also be nice to use pack protocol V2 to allow that long-running
+  process to make a series of requests over a single long-running connection.
+- We currently only support promisor packfiles.  We need to add support for
+  promisor loose objects as described earlier.
+- Dynamic object fetching currently uses the existing pack protocol V0
+  which means that each object is requested via fetch-pack.  The server
+  will send a full set of info/refs when the connection is established.
+  If there are a large number of refs, this may incur significant overhead.
+  We expect that protocol V2 will allow us to avoid this cost.
+- Every time the subject of "demand loading blobs" comes up it seems
+  that someone suggests that the server be allowed to "guess" and send
+  additional objects that may be related to the requested objects.
+  No work has gone into actually doing that; we're just documenting that
+  it is a common suggestion for a future enhancement.
+Related Links
+[0] Chromium work item for: Partial Clone
+[1] Subject: [RFC] Add support for downloading blobs on demand
+    Date: Fri, 13 Jan 2017 10:52:53 -0500
+[2] Subject: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches)
+    Date: Fri, 29 Sep 2017 13:11:36 -0700
+[3] Subject: Proposal for missing blob support in Git repos
+    Date: Wed, 26 Apr 2017 15:13:46 -0700
+[4] Subject: [PATCH 00/10] RFC Partial Clone and Fetch
+    Date: Wed,  8 Mar 2017 18:50:29 +0000
+[5] Subject: [PATCH v7 00/10] refactor the filter process code into a reusable module
+    Date: Fri,  5 May 2017 11:27:52 -0400
+[6] Subject: [RFC/PATCH v2 0/1] Add support for downloading blobs on demand
+    Date: Fri, 14 Jul 2017 09:26:50 -0400
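
As an illustration of the "promisor object" rule and the fault-in
fallback described in the section on how the local repository handles
missing objects, here is a minimal Python sketch. It is a toy model and
not git's implementation: the names Obj, PartialRepo, MissingObjectError
and fetches_attempted are hypothetical, the promisor remote is stood in
for by a plain dict, and only the fetch_if_missing name comes from the
design doc.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Obj:
    oid: str
    refs: tuple = ()  # OIDs this object refers to (e.g. a tree's blobs)


class MissingObjectError(KeyError):
    """Raised when an object is absent and cannot (or may not) be fetched."""


class PartialRepo:
    """Toy model: one promisor pack plus a dict standing in for the remote."""

    def __init__(self, promisor_pack, remote):
        self.promisor_pack = dict(promisor_pack)  # oid -> Obj held locally
        self.remote = remote                      # oid -> Obj on the promisor remote
        self.fetch_if_missing = True              # mirrors the global switch in the doc
        self.fetches_attempted = 0

    def is_promisor_object(self, oid):
        # Rule 1: the object is stored in a promisor packfile.
        if oid in self.promisor_pack:
            return True
        # Rule 2: an object in a promisor packfile refers to it.
        return any(oid in obj.refs for obj in self.promisor_pack.values())

    def read_object(self, oid):
        obj = self.promisor_pack.get(oid)
        if obj is not None:
            return obj
        if not self.fetch_if_missing:
            raise MissingObjectError(oid)
        # As in the doc, no promisor check is made before fetching;
        # each miss costs one round trip to the remote.
        self.fetches_attempted += 1
        if oid not in self.remote:
            raise MissingObjectError(oid)
        self.promisor_pack[oid] = self.remote[oid]
        return self.promisor_pack[oid]


remote = {
    "tree1": Obj("tree1", ("blobA", "blobB")),
    "blobA": Obj("blobA"),
    "blobB": Obj("blobB"),
}
# A filtered clone delivered the tree but omitted both blobs.
repo = PartialRepo({"tree1": remote["tree1"]}, remote)
```

In this model, repo.is_promisor_object("blobA") is true before the blob
has ever been fetched, because tree1 refers to it. That is how a missing
promisor object is told apart from corruption without keeping an
explicit list of missing objects.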

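The server-side step in "Design Details", where pack-objects applies the
requested filter-spec while building the packfile, can be sketched in
the same spirit. This Python toy is not how pack-objects is written:
PackObject, parse_filter_spec and build_filtered_pack are hypothetical
names, and only the "blob:none" and "blob:limit=<n>" specs are modelled,
with <n> as a plain byte count. It assumes that blob:limit omits blobs
of at least <n> bytes.

```python
from typing import NamedTuple


class PackObject(NamedTuple):
    oid: str
    kind: str  # "commit", "tree", or "blob"
    size: int  # size in bytes


def parse_filter_spec(spec):
    """Return a predicate that says whether an object stays in the pack."""
    if spec == "blob:none":
        return lambda obj: obj.kind != "blob"
    prefix = "blob:limit="
    if spec.startswith(prefix):
        limit = int(spec[len(prefix):])  # plain byte count only in this sketch
        # Assumption: blobs of at least `limit` bytes are omitted.
        return lambda obj: obj.kind != "blob" or obj.size < limit
    raise ValueError(f"unsupported filter-spec: {spec}")


def build_filtered_pack(objects, spec):
    """Keep commits and trees; drop only the blobs the filter rejects."""
    keep = parse_filter_spec(spec)
    return [obj for obj in objects if keep(obj)]


objects = [
    PackObject("c1", "commit", 250),
    PackObject("t1", "tree", 70),
    PackObject("small", "blob", 100),
    PackObject("large", "blob", 5_000_000),
]
filtered = build_filtered_pack(objects, "blob:limit=1024")
```

The tree t1 stays in the filtered pack even when a blob it refers to is
dropped, which is why the resulting packfile is "incomplete in the
traditional sense".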