Partial Clone Design Notes
==========================

The "Partial Clone" feature is a performance optimization for git that
allows git to function without having a complete copy of the repository.

During clone and fetch operations, git normally downloads the complete
contents and history of the repository.  That is, during clone the client
receives all of the commits, trees, and blobs in the repository into a
local ODB.  Subsequent fetches extend the local ODB with any new objects.
For large repositories, this can take significant time to download and
large amounts of disk space to store.

The goal of this work is to allow git to better handle extremely large
repositories.  Often in these repositories there are many files that the
user does not need, such as ancient versions of source files, files in
portions of the worktree outside of the user's work area, or large binary
assets.  If we can avoid downloading such unneeded objects *in advance*
during clone and fetch operations, we can decrease download times and
reduce ODB disk usage.


Non-Goals
---------

Partial clone is independent of, and not intended to conflict with,
shallow-clone, refspec, or limited-ref mechanisms, since those all operate
at the DAG level whereas partial clone and fetch work *within* the set
of commits already chosen for download.


Design Overview
---------------

Partial clone logically consists of the following parts:

- A mechanism for the client to describe unneeded or unwanted objects to
  the server.

- A mechanism for the server to omit such unwanted objects from packfiles
  sent to the client.

- A mechanism for the client to gracefully handle missing objects (that
  were previously omitted by the server).

- A mechanism for the client to backfill missing objects as needed.


Design Details
--------------

- A new pack-protocol capability "filter" is added to the fetch-pack and
  upload-pack negotiation.

  This uses the existing capability discovery mechanism.
  See "filter" in Documentation/technical/pack-protocol.txt.

- Clients pass a "filter-spec" to clone and fetch which is passed to the
  server to request filtering during packfile construction.

  There are various filters available to accommodate different situations;
  see "--filter=<filter-spec>" in Documentation/rev-list-options.txt and
  the illustrative command lines after this list.

- On the server, pack-objects applies the requested filter-spec as it
  creates "filtered" packfiles for the client.

  These filtered packfiles are incomplete in the traditional sense because
  they may contain trees that reference blobs that the client does not have.
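
For illustration, here are a couple of possible command lines using
filter-specs described in Documentation/rev-list-options.txt (the
repository URL is only a placeholder):

    # Omit all blobs; they are fetched on demand when first needed.
    git clone --filter=blob:none https://example.com/big-repo.git

    # Omit only blobs larger than one megabyte.
    git fetch --filter=blob:limit=1m origin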


==== How the local repository gracefully handles missing objects

With partial clone, the fact that objects can be missing makes such
repositories incompatible with older versions of Git, necessitating a
repository extension (see the documentation of "extensions.partialClone"
for more information).
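
As a sketch (the remote name and exact keys shown are illustrative, not
authoritative), the config of such a partial clone might contain:

    [core]
        repositoryformatversion = 1
    [extensions]
        partialClone = origin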

An object may be missing due to a partial clone or fetch, or missing due
to repository corruption. To differentiate these cases, the local
repository specially indicates packfiles obtained from the promisor
remote. These "promisor packfiles" consist of a "<name>.promisor" file
with arbitrary contents (like the "<name>.keep" files), in addition to
their "<name>.pack" and "<name>.idx" files. (In the future, this ability
may be extended to loose objects[a].)
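
For example, the object store of a partial clone might contain entries
like these (the pack name here is made up):

    .git/objects/pack/pack-0a1b2c....pack      # objects received from the promisor remote
    .git/objects/pack/pack-0a1b2c....idx       # its index
    .git/objects/pack/pack-0a1b2c....promisor  # marks the pack as a promisor packfile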

The local repository considers a "promisor object" to be an object that
it knows (to the best of its ability) that the promisor remote has,
either because the local repository has that object in one of its
promisor packfiles, or because another promisor object refers to it. Git
can then check whether a missing object is a promisor object and, if so,
treat its absence as expected rather than as corruption. This also means
that there is no need to explicitly maintain an expensive-to-modify list
of missing objects on the client.

Almost all Git code currently expects any objects referred to by other
objects to be present. Therefore, a fallback mechanism is added:
whenever Git attempts to read an object that is found to be missing, it
will attempt to fetch it from the promisor remote, expanding the subset
of objects available locally, then reattempt the read. This allows
objects to be "faulted in" from the promisor remote without complicated
prediction algorithms. For efficiency reasons, no check as to whether
the missing object is actually a promisor object is performed. Dynamic
object fetching tends to be slow because objects are fetched one at a time.
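
As a concrete illustration (the object id below is hypothetical), merely
reading an omitted blob faults it in from the promisor remote:

    # The blob was omitted by the filter; reading it triggers a dynamic
    # fetch from the promisor remote before its contents are printed.
    git cat-file blob 1234567890abcdef1234567890abcdef12345678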

The fallback mechanism can be turned on and off through a global
variable (fetch_if_missing, described below).

checkout (and any other command using unpack-trees) has been taught to
batch blob fetches. rev-list has been taught to print filtered or
missing objects, which can be used by more general batch-fetch scripts.
In the future, Git commands will be updated to batch such fetches or
otherwise handle missing objects more efficiently.
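
For example, a script could collect the missing objects in one pass and
hand them to a batch-fetch mechanism (with --missing=print, missing
objects are listed with a leading '?'):

    # List all reachable objects, printing any that are missing locally.
    git rev-list --objects --all --missing=print | grep '^?'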

Fsck has been updated to be fully aware of promisor objects. The repack
in GC has been updated to not touch promisor packfiles at all, and to
only repack other objects.

The global variable fetch_if_missing is used to control whether an
object lookup will attempt to dynamically fetch a missing object or
report an error.


===== Fetching missing objects

Fetching of missing objects is done through the existing transport
mechanism via transport_fetch_refs(), setting a new transport option
TRANS_OPT_NO_DEPENDENTS to indicate that only the objects themselves are
desired, not any objects that they refer to. Because some transports
invoke fetch_pack() in the same process, fetch_pack() has been updated
to not use any object flags when the corresponding argument
(no_dependents) is set.

The local repository sends a request with the hashes of all requested
objects as "want" lines, and does not perform any packfile negotiation.
It then receives a packfile.
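
Conceptually (pkt-line framing and capability advertisements omitted),
such a request is simply a list of wants with no haves:

    want <oid of first missing object>
    want <oid of second missing object>
    done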

Because we are reusing the existing fetch-pack mechanism, fetching
currently fetches all objects referred to by the requested objects, even
though they are not necessary.



Footnotes
---------

[a] Remembering that loose objects are promisor objects is mainly
    important for trees, since they may refer to promisor blobs that
    the user does not have.  We do not need to mark loose blobs as
    promisor because they do not refer to other objects.



Current Limitations
-------------------

- The remote used for a partial clone (or the first partial fetch
  following a regular clone) is marked as the "promisor remote".

  We are currently limited to a single promisor remote and only that
  remote may be used for subsequent partial fetches.

- Dynamic object fetching will only ask the promisor remote for missing
  objects.  We assume that the promisor remote has a complete view of the
  repository and can satisfy all such requests.

  Future work may lift this restriction when we figure out how to route
  such requests.  The current assumption is that partial clone will not be
  used for triangular workflows that would need that (at least initially).

- Repack essentially treats promisor and non-promisor packfiles as 2
  distinct partitions and does not mix them.  Repack currently only works
  on non-promisor packfiles and loose objects.

  Future work may let repack work to repack promisor packfiles (while
  keeping them in a different partition from the others).

- The current object filtering mechanism does not make use of packfile
  bitmaps (when present).

  We should allow this for filters that are not pathname-based.

- Currently, dynamic object fetching invokes fetch-pack for each item
  because most algorithms stumble upon a missing object and need to have
  it resolved before continuing their work.  This may incur significant
  overhead -- and multiple authentication requests -- if many objects are
  needed.

  We need to investigate the use of a long-running process, such as the
  one proposed in [5,6], to reduce process startup and overhead costs.

  It would also be nice to use pack protocol V2 to allow that long-running
  process to make a series of requests over a single long-running connection.

- We currently only support promisor packfiles.  We need to add support
  for promisor loose objects as described earlier.

- Dynamic object fetching currently uses the existing pack protocol V0
  which means that each object is requested via fetch-pack.  The server
  will send a full set of info/refs when the connection is established.
  If there are a large number of refs, this may incur significant overhead.

  We expect that protocol V2 will allow us to avoid this cost.

- Every time the subject of "demand loading blobs" comes up, it seems
  that someone suggests that the server be allowed to "guess" and send
  additional objects that may be related to the requested objects.

  No work has gone into actually doing that; we're just documenting that
  it is a common suggestion for a future enhancement.


Related Links
-------------
[0] https://bugs.chromium.org/p/git/issues/detail?id=2
    Chromium work item for: Partial Clone

[1] https://public-inbox.org/git/20170113155253.1644-1-benpeart@microsoft.com/
    Subject: [RFC] Add support for downloading blobs on demand
    Date: Fri, 13 Jan 2017 10:52:53 -0500

[2] https://public-inbox.org/git/cover.1506714999.git.jonathantanmy@google.com/
    Subject: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches)
    Date: Fri, 29 Sep 2017 13:11:36 -0700

[3] https://public-inbox.org/git/20170426221346.25337-1-jonathantanmy@google.com/
    Subject: Proposal for missing blob support in Git repos
    Date: Wed, 26 Apr 2017 15:13:46 -0700

[4] https://public-inbox.org/git/1488999039-37631-1-git-send-email-git@jeffhostetler.com/
    Subject: [PATCH 00/10] RFC Partial Clone and Fetch
    Date: Wed,  8 Mar 2017 18:50:29 +0000

[5] https://public-inbox.org/git/20170505152802.6724-1-benpeart@microsoft.com/
    Subject: [PATCH v7 00/10] refactor the filter process code into a reusable module
    Date: Fri,  5 May 2017 11:27:52 -0400

[6] https://public-inbox.org/git/20170714132651.170708-1-benpeart@microsoft.com/
    Subject: [RFC/PATCH v2 0/1] Add support for downloading blobs on demand
    Date: Fri, 14 Jul 2017 09:26:50 -0400
