git@vger.kernel.org mailing list mirror (one of many)
* [PATCH] Partial clone design document
@ 2017-12-08 19:26 Jeff Hostetler
  2017-12-08 19:26 ` [PATCH] partial-clone: design doc Jeff Hostetler
  0 siblings, 1 reply; 9+ messages in thread
From: Jeff Hostetler @ 2017-12-08 19:26 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, jonathantanmy, Jeff Hostetler

From: Jeff Hostetler <jeffhost@microsoft.com>

This patch contains a design document that Jonathan Tan and I have
been working on that describes the partial clone feature currently
under development.

Since edits to this document are independent of the code, I did not
include it in the part 1, 2, 3 patch series.

Please let us know if there are any major sections we should add
or any areas that need clarification.

Thanks!


Jeff Hostetler (1):
  partial-clone: design doc

 Documentation/technical/partial-clone.txt | 240 ++++++++++++++++++++++++++++++
 1 file changed, 240 insertions(+)
 create mode 100644 Documentation/technical/partial-clone.txt

-- 
2.9.3



* [PATCH] partial-clone: design doc
  2017-12-08 19:26 [PATCH] Partial clone design document Jeff Hostetler
@ 2017-12-08 19:26 ` Jeff Hostetler
  2017-12-08 20:14   ` Junio C Hamano
  2017-12-12 23:31   ` Philip Oakley
  0 siblings, 2 replies; 9+ messages in thread
From: Jeff Hostetler @ 2017-12-08 19:26 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, jonathantanmy, Jeff Hostetler

From: Jeff Hostetler <jeffhost@microsoft.com>

First draft of design document for partial clone feature.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 Documentation/technical/partial-clone.txt | 240 ++++++++++++++++++++++++++++++
 1 file changed, 240 insertions(+)
 create mode 100644 Documentation/technical/partial-clone.txt

diff --git a/Documentation/technical/partial-clone.txt b/Documentation/technical/partial-clone.txt
new file mode 100644
index 0000000..7ab39d8
--- /dev/null
+++ b/Documentation/technical/partial-clone.txt
@@ -0,0 +1,240 @@
+Partial Clone Design Notes
+==========================
+
+The "Partial Clone" feature is a performance optimization for git that
+allows git to function without having a complete copy of the repository.
+
+During clone and fetch operations, git normally downloads the complete
+contents and history of the repository.  That is, during clone the client
+receives all of the commits, trees, and blobs in the repository into a
+local ODB.  Subsequent fetches extend the local ODB with any new objects.
+For large repositories, this can take significant time to download and
+large amounts of diskspace to store.
+
+The goal of this work is to allow git to better handle extremely large
+repositories.  Often in these repositories there are many files that the
+user does not need such as ancient versions of source files, files in
+portions of the worktree outside of the user's work area, or large binary
+assets.  If we can avoid downloading such unneeded objects *in advance*
+during clone and fetch operations, we can decrease download times and
+reduce ODB disk usage.
+
+
+Non-Goals
+---------
+
+Partial clone is independent of and not intended to conflict with
+shallow-clone, refspec, or limited-ref mechanisms since these all operate
+at the DAG level whereas partial clone and fetch works *within* the set
+of commits already chosen for download.
+
+
+Design Overview
+---------------
+
+Partial clone logically consists of the following parts:
+
+- A mechanism for the client to describe unneeded or unwanted objects to
+  the server.
+
+- A mechanism for the server to omit such unwanted objects from packfiles
+  sent to the client.
+
+- A mechanism for the client to gracefully handle missing objects (that
+  were previously omitted by the server).
+
+- A mechanism for the client to backfill missing objects as needed.
+
+
+Design Details
+--------------
+
+- A new pack-protocol capability "filter" is added to the fetch-pack and
+  upload-pack negotiation.
+
+  This uses the existing capability discovery mechanism.
+  See "filter" in Documentation/technical/pack-protocol.txt.
+
+- Clients pass a "filter-spec" to clone and fetch which is passed to the
+  server to request filtering during packfile construction.
+
+  There are various filters available to accommodate different situations.
+  See "--filter=<filter-spec>" in Documentation/rev-list-options.txt.
+
+- On the server pack-objects applies the requested filter-spec as it
+  creates "filtered" packfiles for the client.
+
+  These filtered packfiles are incomplete in the traditional sense because
+  they may contain trees that reference blobs that the client does not have.
+
+
+==== How the local repository gracefully handles missing objects
+
+With partial clone, the fact that objects can be missing makes such
+repositories incompatible with older versions of Git, necessitating a
+repository extension (see the documentation of "extensions.partialClone"
+for more information).
+
+An object may be missing due to a partial clone or fetch, or missing due
+to repository corruption. To differentiate these cases, the local
+repository specially indicates packfiles obtained from the promisor
+remote. These "promisor packfiles" consist of a "<name>.promisor" file
+with arbitrary contents (like the "<name>.keep" files), in addition to
+their "<name>.pack" and "<name>.idx" files. (In the future, this ability
+may be extended to loose objects[a].)
+
+The local repository considers a "promisor object" to be an object that
+it knows (to the best of its ability) that the promisor remote has,
+either because the local repository has that object in one of its
+promisor packfiles, or because another promisor object refers to it. Git
+can then check if the missing object is a promisor object, and if yes,
+this situation is common and expected. This also means that there is no
+need to explicitly maintain an expensive-to-modify list of missing
+objects on the client.
+
+Almost all Git code currently expects any objects referred to by other
+objects to be present. Therefore, a fallback mechanism is added:
+whenever Git attempts to read an object that is found to be missing, it
+will attempt to fetch it from the promisor remote, expanding the subset
+of objects available locally, then reattempt the read. This allows
+objects to be "faulted in" from the promisor remote without complicated
+prediction algorithms. For efficiency reasons, no check as to whether
+the missing object is a promisor object is performed. Fetching objects
+this way tends to be slow, as they are fetched one at a time.
+
+The fallback mechanism can be turned off and on through a global
+variable.
+
+checkout (and any other command using unpack-trees) has been taught to
+batch blob fetching. rev-list has been taught to be able to print
+filtered or missing objects and can be used with more general batch
+fetch scripts. In the future, Git commands will be updated to batch such
+fetches or otherwise handle missing objects more efficiently.
+
+Fsck has been updated to be fully aware of promisor objects. The repack
+in GC has been updated to not touch promisor packfiles at all, and to
+only repack other objects.
+
+The global variable fetch_if_missing is used to control whether an
+object lookup will attempt to dynamically fetch a missing object or
+report an error.
+
+
+===== Fetching missing objects
+
+Fetching of objects is done with the existing transport mechanism, using
+transport_fetch_refs() and setting a new transport option
+TRANS_OPT_NO_DEPENDENTS to indicate that only the objects themselves are
+desired, not any object that they refer to. Because some transports
+invoke fetch_pack() in the same process, fetch_pack() has been updated
+to not use any object flags when the corresponding argument
+(no_dependents) is set.
+
+The local repository sends a request with the hashes of all requested
+objects as "want" lines, and does not perform any packfile negotiation.
+It then receives a packfile.
+
+Because we are reusing the existing fetch-pack mechanism, fetching
+currently fetches all objects referred to by the requested objects, even
+though they are not necessary.
+
+
+
+Foot Notes
+----------
+
+[a] Remembering that loose objects are promisor objects is mainly
+    important for trees, since they may refer to promisor blobs that
+    the user does not have.  We do not need to mark loose blobs as
+    promisor because they do not refer to other objects.
+
+
+
+Current Limitations
+-------------------
+
+- The remote used for a partial clone (or the first partial fetch
+  following a regular clone) is marked as the "promisor remote".
+
+  We are currently limited to a single promisor remote and only that
+  remote may be used for subsequent partial fetches.
+
+- Dynamic object fetching will only ask the promisor remote for missing
+  objects.  We assume that the promisor remote has a complete view of the
+  repository and can satisfy all such requests.
+
+  Future work may lift this restriction when we figure out how to route
+  such requests.  The current assumption is that partial clone will not be
+  used for triangular workflows that would need that (at least initially).
+
+- Repack essentially treats promisor and non-promisor packfiles as 2
+  distinct partitions and does not mix them.  Repack currently only works
+  on non-promisor packfiles and loose objects.
+
+  Future work may teach repack to also repack promisor packfiles (while
+  keeping them in a different partition from the others).
+
+- The current object filtering mechanism does not make use of packfile
+  bitmaps (when present).
+
+  We should allow this for filters that are not pathname-based.
+
+- Currently, dynamic object fetching invokes fetch-pack for each item
+  because most algorithms stumble upon a missing object and need to have
+  it resolved before continuing their work.  This may incur significant
+  overhead -- and multiple authentication requests -- if many objects are
+  needed.
+
+  We need to investigate use of a long-running process, such as proposed
+  in [5,6] to reduce process startup and overhead costs.
+
+  It would also be nice to use pack protocol V2 to allow that long-running
+  process to make a series of requests over a single long-running connection.
+
+- We currently only promisor packfiles.  We need to add support for
+  promisor loose objects as described earlier.
+
+- Dynamic object fetching currently uses the existing pack protocol V0
+  which means that each object is requested via fetch-pack.  The server
+  will send a full set of info/refs when the connection is established.
+  If there are a large number of refs, this may incur significant overhead.
+
+  We expect that protocol V2 will allow us to avoid this cost.
+
+- Every time the subject of "demand loading blobs" comes up it seems
+  that someone suggests that the server be allowed to "guess" and send
+  additional objects that may be related to the requested objects.
+
+  No work has gone into actually doing that; we're just documenting that
+  it is a common suggestion for a future enhancement.
+
+
+Related Links
+-------------
+[0] https://bugs.chromium.org/p/git/issues/detail?id=2
+    Chromium work item for: Partial Clone 
+
+[1] https://public-inbox.org/git/20170113155253.1644-1-benpeart@microsoft.com/
+    Subject: [RFC] Add support for downloading blobs on demand
+    Date: Fri, 13 Jan 2017 10:52:53 -0500
+
+[2] https://public-inbox.org/git/cover.1506714999.git.jonathantanmy@google.com/
+    Subject: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches)
+    Date: Fri, 29 Sep 2017 13:11:36 -0700
+
+[3] https://public-inbox.org/git/20170426221346.25337-1-jonathantanmy@google.com/
+    Subject: Proposal for missing blob support in Git repos
+    Date: Wed, 26 Apr 2017 15:13:46 -0700
+
+[4] https://public-inbox.org/git/1488999039-37631-1-git-send-email-git@jeffhostetler.com/
+    Subject: [PATCH 00/10] RFC Partial Clone and Fetch
+    Date: Wed,  8 Mar 2017 18:50:29 +0000
+
+
+[5] https://public-inbox.org/git/20170505152802.6724-1-benpeart@microsoft.com/
+    Subject: [PATCH v7 00/10] refactor the filter process code into a reusable module
+    Date: Fri,  5 May 2017 11:27:52 -0400
+
+[6] https://public-inbox.org/git/20170714132651.170708-1-benpeart@microsoft.com/
+    Subject: [RFC/PATCH v2 0/1] Add support for downloading blobs on demand
+    Date: Fri, 14 Jul 2017 09:26:50 -0400
-- 
2.9.3



* Re: [PATCH] partial-clone: design doc
  2017-12-08 19:26 ` [PATCH] partial-clone: design doc Jeff Hostetler
@ 2017-12-08 20:14   ` Junio C Hamano
  2017-12-13 22:34     ` Jeff Hostetler
  2017-12-12 23:31   ` Philip Oakley
  1 sibling, 1 reply; 9+ messages in thread
From: Junio C Hamano @ 2017-12-08 20:14 UTC (permalink / raw)
  To: Jeff Hostetler; +Cc: git, peff, jonathantanmy, Jeff Hostetler

Jeff Hostetler <git@jeffhostetler.com> writes:

> From: Jeff Hostetler <jeffhost@microsoft.com>
>
> First draft of design document for partial clone feature.
>
> Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---

Thanks.

> +Non-Goals
> +---------
> +
> +Partial clone is independent of and not intended to conflict with
> +shallow-clone, refspec, or limited-ref mechanisms since these all operate
> +at the DAG level whereas partial clone and fetch works *within* the set
> +of commits already chosen for download.

It probably is not a huge deal (simply because it is about
"Non-Goals") but I have no idea what "refspec" and "limited-ref
mechanism" refer to in the above sentence, and I suspect many others
share the same puzzlement.

> +An object may be missing due to a partial clone or fetch, or missing due
> +to repository corruption. To differentiate these cases, the local
> +repository specially indicates packfiles obtained from the promisor
> +remote. These "promisor packfiles" consist of a "<name>.promisor" file
> +with arbitrary contents (like the "<name>.keep" files), in addition to
> +their "<name>.pack" and "<name>.idx" files. (In the future, this ability
> +may be extended to loose objects[a].)
> + ...
> +Foot Notes
> +----------
> +
> +[a] Remembering that loose objects are promisor objects is mainly
> +    important for trees, since they may refer to promisor blobs that
> +    the user does not have.  We do not need to mark loose blobs as
> +    promisor because they do not refer to other objects.

I fail to see any logical link between the "loose" and "tree".
Putting it differently, I do not see why "tree" is so special.

A promisor pack that contains a tree but lacks blobs the tree refers
to would be sufficient to let us remember that these missing blobs
are not corruption.  A loose commit or a tag that is somehow marked
as obtained from a promisor, if it can serve just like a commit or a
tag in a promisor pack to promise its direct pointee, would equally
be useful (if very inefficient).

In any case, I suspect "since they may refer to promisor blobs" is a
typo of "since they may refer to promised blobs".

> +- Currently, dynamic object fetching invokes fetch-pack for each item
> +  because most algorithms stumble upon a missing object and need to have
> +  it resolved before continuing their work.  This may incur significant
> +  overhead -- and multiple authentication requests -- if many objects are
> +  needed.
> +
> +  We need to investigate use of a long-running process, such as proposed
> +  in [5,6] to reduce process startup and overhead costs.

Also perhaps in some operations we can enumerate the objects we will
need upfront and ask for them in one go (e.g. "git log -p A..B" may
internally want to do "rev-list --objects A..B" to enumerate trees
and blobs that we may lack upfront).  I do not think having the
other side guess is a good idea, though.
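
The enumerate-upfront idea might look roughly like this (a rough sketch
only; it assumes the "--missing=print" output from this series, which
prefixes the ids of missing objects with '?', and a server that allows
unadvertised objects to be requested by id, e.g. via
uploadpack.allowAnySHA1InWant; $url stands for the promisor remote):

    # enumerate the objects the operation would need, keeping only the
    # ones we do not have locally
    git rev-list --objects --missing=print A..B |
        sed -n 's/^?//p' >prefetch-list
    # ask the promisor remote for all of them in a single request
    git fetch-pack --stdin "$url" <prefetch-list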

> +- We currently only promisor packfiles.  We need to add support for
> +  promisor loose objects as described earlier.

The earlier description was not convincing enough to feel the need
to me; at least not yet.


* Re: [PATCH] partial-clone: design doc
  2017-12-08 19:26 ` [PATCH] partial-clone: design doc Jeff Hostetler
  2017-12-08 20:14   ` Junio C Hamano
@ 2017-12-12 23:31   ` Philip Oakley
  2017-12-12 23:57     ` Junio C Hamano
  2017-12-14 20:32     ` Jeff Hostetler
  1 sibling, 2 replies; 9+ messages in thread
From: Philip Oakley @ 2017-12-12 23:31 UTC (permalink / raw)
  To: Jeff Hostetler, git; +Cc: gitster, peff, jonathantanmy, Jeff Hostetler

From: "Jeff Hostetler" <git@jeffhostetler.com>
> From: Jeff Hostetler <jeffhost@microsoft.com>
>
> First draft of design document for partial clone feature.
>
> Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
> ---
> Documentation/technical/partial-clone.txt | 240 
> ++++++++++++++++++++++++++++++
> 1 file changed, 240 insertions(+)
> create mode 100644 Documentation/technical/partial-clone.txt
>
> diff --git a/Documentation/technical/partial-clone.txt 
> b/Documentation/technical/partial-clone.txt
> new file mode 100644
> index 0000000..7ab39d8
> --- /dev/null
> +++ b/Documentation/technical/partial-clone.txt
> @@ -0,0 +1,240 @@
> +Partial Clone Design Notes
> +==========================
> +
> +The "Partial Clone" feature is a performance optimization for git that
> +allows git to function without having a complete copy of the repository.
> +

I think it would be worthwhile at least listing the issues that make the 
'optimisation' necessary, and then the available factors that make the 
optimisation possible. This helps for future adjustments when those issues 
and factors change.

I think the issues are:
* the size of the repository that is being cloned, both in the width of a 
commit (you mentioned 100M trees) and the time (hours to days) / size to 
clone over the connection.

While the supporting factor is:
* the remote is always on-line and available for on-demand object fetching 
(seconds)

The solution choice then should fall out fairly obviously, and we can 
separate out the other optimisations that are based on other views about the 
issues. E.g. my desire for a solution in the off-line case.

In fact the current design, apart from some terminology, does look well 
matched, with only a couple of places that would be affected.

The airplane-mode expectations of a partial clone should also be stated.


> +During clone and fetch operations, git normally downloads the complete
> +contents and history of the repository.  That is, during clone the client
> +receives all of the commits, trees, and blobs in the repository into a
> +local ODB.  Subsequent fetches extend the local ODB with any new objects.
> +For large repositories, this can take significant time to download and
> +large amounts of diskspace to store.
> +
> +The goal of this work is to allow git better handle extremely large
> +repositories.

Shouldn't this goal be nearer the top?

>        Often in these repositories there are many files that the
> +user does not need such as ancient versions of source files, files in
> +portions of the worktree outside of the user's work area, or large binary
> +assets.  If we can avoid downloading such unneeded objects *in advance*
> +during clone and fetch operations, we can decrease download times and
> +reduce ODB disk usage.
> +

Does this need to distinguish between the shallow clone mechanism for 
reducing the cloning of old history from the desire for a width wise partial 
clone of only the users narrow work area, and/or without large files/blobs?

> +
> +Non-Goals
> +---------
> +
> +Partial clone is independent of and not intended to conflict with
> +shallow-clone, refspec, or limited-ref mechanisms since these all operate
> +at the DAG level whereas partial clone and fetch works *within* the set
> +of commits already chosen for download.
> +
> +
> +Design Overview
> +---------------
> +
> +Partial clone logically consists of the following parts:
> +
> +- A mechanism for the client to describe unneeded or unwanted objects to
> +  the server.
> +
> +- A mechanism for the server to omit such unwanted objects from packfiles
> +  sent to the client.
> +
> +- A mechanism for the client to gracefully handle missing objects (that
> +  were previously omitted by the server).
> +
> +- A mechanism for the client to backfill missing objects as needed.
> +
> +
> +Design Details
> +--------------
> +
> +- A new pack-protocol capability "filter" is added to the fetch-pack and
> +  upload-pack negotiation.
> +
> +  This uses the existing capability discovery mechanism.
> +  See "filter" in Documentation/technical/pack-protocol.txt.
> +
> +- Clients pass a "filter-spec" to clone and fetch which is passed to the
> +  server to request filtering during packfile construction.
> +
> +  There are various filters available to accomodate different situations.
> +  See "--filter=<filter-spec>" in Documentation/rev-list-options.txt.
> +
> +- On the server pack-objects applies the requested filter-spec as it
> +  creates "filtered" packfiles for the client.
> +
> +  These filtered packfiles are incomplete in the traditional sense 
> because
> +  they may contain trees that reference blobs that the client does not 
> have.

Is a comment needed here noting that currently, IIUC, the complete trees are 
fetched in the packfiles, it's just the un-necessary blobs that are omitted 
?

> +
> +
> +==== How the local repository gracefully handles missing objects
> +
> +With partial clone, the fact that objects can be missing makes such
> +repositories incompatible with older versions of Git, necessitating a
> +repository extension (see the documentation of "extensions.partialClone"
> +for more information).
> +
> +An object may be missing due to a partial clone or fetch, or missing due
> +to repository corruption. To differentiate these cases, the local
> +repository specially indicates packfiles obtained from the promisor

s/packfiles/filtered packfiles/ ?

> +remote. These "promisor packfiles" consist of a "<name>.promisor" file
> +with arbitrary contents (like the "<name>.keep" files), in addition to
> +their "<name>.pack" and "<name>.idx" files. (In the future, this ability
> +may be extended to loose objects[a].)
> +
> +The local repository considers a "promisor object" to be an object that
> +it knows (to the best of its ability) that the promisor remote has,

s/has/is expected to have/  or /has promised it has/ ?

> +either because the local repository has that object in one of its
> +promisor packfiles, or because another promisor object refers to it. Git
> +can then check if the missing object is a promisor object, and if yes,
> +this situation is common and expected. This also means that there is no
> +need to explicitly maintain an expensive-to-modify list of missing

I didn't get what part the "expensive-to-modify" was referring to. Or why 
whatever it is is expensive?

> +objects on the client.
> +
> +Almost all Git code currently expects any objects referred to by other
> +objects to be present. Therefore, a fallback mechanism is added:
> +whenever Git attempts to read an object that is found to be missing, it
> +will attempt to fetch it from the promisor remote, expanding the subset
> +of objects available locally, then reattempt the read. This allows
> +objects to be "faulted in" from the promisor remote without complicated
> +prediction algorithms. For efficiency reasons, no check as to whether
> +the missing object is a promisor object is performed. This tends to be
> +slow as objects are fetched one at a time.
> +
> +The fallback mechanism can be turned off and on through a global
> +variable.

Perhaps name the variable?

> +
> +checkout (and any other command using unpack-trees) has been taught to

s/checkout/`git-checkout`/  to show that a proper sentence has started, 
maybe?

> +batch blob fetching. rev-list has been taught to be able to print
> +filtered or missing objects and can be used with more general batch
> +fetch scripts. In the future, Git commands will be updated to batch such
> +fetches or otherwise handle missing objects more efficiently.
> +
> +Fsck has been updated to be fully aware of promisor objects. The repack
> +in GC has been updated to not touch promisor packfiles at all, and to
> +only repack other objects.
> +
> +The global variable fetch_if_missing is used to control whether an
> +object lookup will attempt to dynamically fetch a missing object or
> +report an error.

Is this also the airplane mode control?

> +
> +
> +===== Fetching missing objects
> +
> +Fetching of objects is done using the existing transport mechanism using
> +transport_fetch_refs(), setting a new transport option
> +TRANS_OPT_NO_DEPENDENTS to indicate that only the objects themselves are
> +desired, not any object that they refer to. Because some transports
> +invoke fetch_pack() in the same process, fetch_pack() has been updated
> +to not use any object flags when the corresponding argument
> +(no_dependents) is set.
> +
> +The local repository sends a request with the hashes of all requested
> +objects as "want" lines, and does not perform any packfile negotiation.
> +It then receives a packfile.
> +
> +Because we are reusing the existing fetch-pack mechanism, fetching
> +currently fetches all objects referred to by the requested objects, even
> +though they are not necessary.
> +
> +
> +
> +Foot Notes
> +----------
> +
> +[a] Remembering that loose objects are promisor objects is mainly
> +    important for trees, since they may refer to promisor blobs that
> +    the user does not have.  We do not need to mark loose blobs as
> +    promisor because they do not refer to other objects.
> +
> +
> +
> +Current Limitations
> +-------------------
> +
> +- The remote used for a partial clone (or the first partial fetch
> +  following a regular clone) is marked as the "promisor remote".
> +
> +  We are currently limited to a single promisor remote and only that
> +  remote may be used for subsequent partial fetches.
> +
> +- Dynamic object fetching will only ask the promisor remote for missing
> +  objects.  We assume that the promisor remote has a complete view of the
> +  repository and can satisfy all such requests.
> +
> +  Future work may lift this restriction when we figure out how to route

Could the "Future Work: " items have a consistent style? It makes it easier 
to see the expectation of the likely development.

> +  such requests.  The current assumption is that partial clone will not 
> be
> +  used for triangular workflows that would need that (at least 
> initially).
> +
> +- Repack essentially treats promisor and non-promisor packfiles as 2
> +  distinct partitions and does not mix them.  Repack currently only works
> +  on non-promisor packfiles and loose objects.
> +
> +  Future work may let repack work to repack promisor packfiles (while
> +  keeping them in a different partition from the others).
> +
> +- The current object filtering mechanism does not make use of packfile
> +  bitmaps (when present).
> +
> +  We should allow this for filters that are not pathname-based.
> +
> +- Currently, dynamic object fetching invokes fetch-pack for each item
> +  because most algorithms stumble upon a missing object and need to have
> +  it resolved before continuing their work.  This may incur significant
> +  overhead -- and multiple authentication requests -- if many objects are
> +  needed.

I think this is one of the points of distinction between the always 
connected partial clone and the potential 'airplane mode' narrow clone where 
missing objects are not [cannot be] fetched on-the-fly.

[my 'solution' is, when requested, to expand the oid to a short distinct 
stub file, and let them stand in the place of the real (missing) 
file/directories, and then let all the regular commands act on those stubs, 
e.g. diffs just show the changed oid embedded in the stub, etc. However that 
is all orthogonal to this design doc.]

> +
> +  We need to investigate use of a long-running process, such as proposed
> +  in [5,6] to reduce process startup and overhead costs.
> +
> +  It would also be nice to use pack protocol V2 to also allow that 
> long-running
> +  process to make a series of requests over a single long-running 
> connection.
> +
> +- We currently only promisor packfiles.

Is there a missing word or phrase, I couldn't parse this.

>                   We need to add support for
> +  promisor loose objects as described earlier.
> +
> +- Dynamic object fetching currently uses the existing pack protocol V0
> +  which means that each object is requested via fetch-pack.  The server
> +  will send a full set of info/refs when the connection is established.
> +  If there are large number of refs, this may incur significant overhead.
> +
> +  We expect that protocol V2 will allow us to avoid this cost.
> +
> +- Every time the subject of "demand loading blobs" comes up it seems
> +  that someone suggest that the server be allowed to "guess" and send
> +  additional objects that may be related to the requested objects.
> +
> +  No work has gone into actually doing that; we're just documenting that
> +  it is a common suggestion for a future enhancement.
> +

Thanks for the write up.
--
Philip

> +
> +Related Links
> +-------------
> +[0] https://bugs.chromium.org/p/git/issues/detail?id=2
> +    Chromium work item for: Partial Clone
> +
> +[1] 
> https://public-inbox.org/git/20170113155253.1644-1-benpeart@microsoft.com/
> +    Subject: [RFC] Add support for downloading blobs on demand
> +    Date: Fri, 13 Jan 2017 10:52:53 -0500
> +
> +[2] 
> https://public-inbox.org/git/cover.1506714999.git.jonathantanmy@google.com/
> +    Subject: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 
> patches)
> +    Date: Fri, 29 Sep 2017 13:11:36 -0700
> +
> +[3] 
> https://public-inbox.org/git/20170426221346.25337-1-jonathantanmy@google.com/
> +    Subject: Proposal for missing blob support in Git repos
> +    Date: Wed, 26 Apr 2017 15:13:46 -0700
> +
> +[4] 
> https://public-inbox.org/git/1488999039-37631-1-git-send-email-git@jeffhostetler.com/
> +    Subject: [PATCH 00/10] RFC Partial Clone and Fetch
> +    Date: Wed,  8 Mar 2017 18:50:29 +0000
> +
> +
> +[5] 
> https://public-inbox.org/git/20170505152802.6724-1-benpeart@microsoft.com/
> +    Subject: [PATCH v7 00/10] refactor the filter process code into a 
> reusable module
> +    Date: Fri,  5 May 2017 11:27:52 -0400
> +
> +[6] 
> https://public-inbox.org/git/20170714132651.170708-1-benpeart@microsoft.com/
> +    Subject: [RFC/PATCH v2 0/1] Add support for downloading blobs on 
> demand
> +    Date: Fri, 14 Jul 2017 09:26:50 -0400
> -- 
> 2.9.3
> 



* Re: [PATCH] partial-clone: design doc
  2017-12-12 23:31   ` Philip Oakley
@ 2017-12-12 23:57     ` Junio C Hamano
  2017-12-13 13:17       ` Philip Oakley
  2017-12-14 20:32     ` Jeff Hostetler
  1 sibling, 1 reply; 9+ messages in thread
From: Junio C Hamano @ 2017-12-12 23:57 UTC (permalink / raw)
  To: Philip Oakley; +Cc: Jeff Hostetler, git, peff, jonathantanmy, Jeff Hostetler

"Philip Oakley" <philipoakley@iee.org> writes:

>> +  These filtered packfiles are incomplete in the traditional sense
>> because
>> +  they may contain trees that reference blobs that the client does
>> not have.
>
> Is a comment needed here noting that currently, IIUC, the complete
> trees are fetched in the packfiles, it's just the un-necessary blobs
> that are omitted ?

I probably am misreading what you meant to say, but the above
statement with "currently" taken literally to mean the system
without JeffH's changes, is false.

When the receiver says it has commit A and the sender wants to send
a commit B (because the receiver said it does not have it, and it
wants it), trees in A are not sent in the pack the sender sends to
give objects sufficient to complete B, which the receiver wanted to
have, even if B also has those trees.  If you fetch from me twice
and between that time Documentation/ directory did not change, the
second fetch will not have the tree object that corresponds to that
hierarchy (and of course no blobs and sub trees inside it).

So "the complete trees are fetched" is not true.  What is true (and
what matters more in JeffH's document) is that fetching is done in
such a way that objects resulting in the receiving repository are
complete in the current system that does not allow promised objects.
If some objects resulting in the receiving repository are incomplete,
the current system considers that we corrupted the repository.

The promise mechanism says that it is fine for the receiving end to
lack blobs, trees or commits, as long as the promisor repository
tells it that these "missing" objects can be obtained from it later.
The way the receiving end, which notices that it does not have an
otherwise required blob, tree or commit, decides that the object is
one promised by the promisor repository is to see if it is referenced
by a pack that came from such a promisor repository.
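
Concretely, the on-disk mark in JeffH's series is just a sibling file
next to the pack, so a repository that did a partial clone might contain
something like this (the pack name below is made up):

    $ ls .git/objects/pack/
    pack-5d7a9e6a224e1a6e4d88b8e8d7c1a3340f2b7c11.idx
    pack-5d7a9e6a224e1a6e4d88b8e8d7c1a3340f2b7c11.pack
    pack-5d7a9e6a224e1a6e4d88b8e8d7c1a3340f2b7c11.promisor

The .promisor file's contents are arbitrary; only its presence matters.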




* Re: [PATCH] partial-clone: design doc
  2017-12-12 23:57     ` Junio C Hamano
@ 2017-12-13 13:17       ` Philip Oakley
  2017-12-14 20:46         ` Jeff Hostetler
  0 siblings, 1 reply; 9+ messages in thread
From: Philip Oakley @ 2017-12-13 13:17 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff Hostetler, git, peff, jonathantanmy, Jeff Hostetler

From: "Junio C Hamano" <gitster@pobox.com>
> "Philip Oakley" <philipoakley@iee.org> writes:
>
>>> +  These filtered packfiles are incomplete in the traditional sense
>>> because
>>> +  they may contain trees that reference blobs that the client does
>>> not have.
>>
>> Is a comment needed here noting that currently, IIUC, the complete
>> trees are fetched in the packfiles, it's just the un-necessary blobs
>> that are omitted ?
>
> I probably am misreading what you meant to say, but the above
> statement with "currently" taken literally to mean the system
> without JeffH's changes, is false.

I was meaning JeffH's current V6 series, rather than the last Git
release.

In one of the previous discussions Jeff had noted that (at that time) his 
partial design would provide a full set of trees for the selected commits 
(excluding the trees already available locally), but only a few of the file 
blobs (based on the filter spec).

So yes, I should have been clearer to avoid talking at cross purposes.

>
> When the receiver says it has commit A and the sender wants to send
> a commit B (because the receiver said it does not have it, and it
> wants it), trees in A are not sent in the pack the sender sends to
> give objects sufficient to complete B, which the receiver wanted to
> have, even if B also has those trees.  If you fetch from me twice
> and between that time Documentation/ directory did not change, the
> second fetch will not have the tree object that corresponds to that
> hierarchy (and of course no blobs and sub trees inside it).

Though, after the fetch has completed (v2.15 Git), the receiver will have 
the 'full set of trees and blobs'. In Jeff's design (V6) the receiver would 
still have a full set of trees, but only a partial set of the blobs. So my 
viewpoint was not of the pack file but of the receiver's object store after 
the fetch.

>
> So "the complete trees are fetched" is not true.  What is true (and
> what matters more in JeffH's document) is that fetching is done in
> such a way that objects resulting in the receiving repository are
> complete in the current system that does not allow promised objects.
> If some objects resulting in the receiving repository are incomplete,
> the current system considers that we corrupted the repository.
>
> The promise mechanism says that it is fine for the receiving end to
> lack blobs, trees or commits, as long as the promisor repository
> tells it that these "missing" objects can be obtained from it later.

True. (though I'm not sure exactly how Jeff decides about commits - I 
thought they were not part of this optimisation)

> The way the receiving end which notices that it does not have an
> otherwise required blob, tree or commit is one promised by the
> promisor repository is to see if it is referenced by a pack that
> came from such a promisor repository.

.. and marked as such with the ".promisor" extension.
>
>
Thanks. 



* Re: [PATCH] partial-clone: design doc
  2017-12-08 20:14   ` Junio C Hamano
@ 2017-12-13 22:34     ` Jeff Hostetler
  0 siblings, 0 replies; 9+ messages in thread
From: Jeff Hostetler @ 2017-12-13 22:34 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, peff, jonathantanmy, Jeff Hostetler



On 12/8/2017 3:14 PM, Junio C Hamano wrote:
> Jeff Hostetler <git@jeffhostetler.com> writes:
> 
>> From: Jeff Hostetler <jeffhost@microsoft.com>
>>
>> First draft of design document for partial clone feature.
>>
>> Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
>> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
>> ---
> 
> Thanks.
> 
>> +Non-Goals
>> +---------
>> +
>> +Partial clone is independent of and not intended to conflict with
>> +shallow-clone, refspec, or limited-ref mechanisms since these all operate
>> +at the DAG level whereas partial clone and fetch works *within* the set
>> +of commits already chosen for download.
> 
> It probably is not a huge deal (simply because it is about
> "Non-Goals") but I have no idea what "refspec" and "limited-ref
> mechanism" refer to in the above sentence, and I suspect many others
> share the same puzzlement.

I'll reword this.  There was a question on the list earlier about
having a filter for commits in addition to ones for blobs and trees.

I just wanted to emphasize that we already have ways to filter or
limit commits using --shallow-* or --single-branch in clone and 1 or
more '<refspec>' args in fetch.
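
Roughly, the distinction I have in mind (the --filter option is the new
piece from this series; everything else is existing machinery):

    # existing ways to limit *which commits* are downloaded:
    git clone --depth=1 <url>                        # shallow clone
    git clone --single-branch --branch=master <url>
    git fetch origin master                          # refspec limits the request

    # the new filtering works *within* the commits already chosen,
    # e.g. omitting all blobs until they are actually needed:
    git clone --filter=blob:none <url>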

  
>> +An object may be missing due to a partial clone or fetch, or missing due
>> +to repository corruption. To differentiate these cases, the local
>> +repository specially indicates packfiles obtained from the promisor
>> +remote. These "promisor packfiles" consist of a "<name>.promisor" file
>> +with arbitrary contents (like the "<name>.keep" files), in addition to
>> +their "<name>.pack" and "<name>.idx" files. (In the future, this ability
>> +may be extended to loose objects[a].)
>> + ...
>> +Foot Notes
>> +----------
>> +
>> +[a] Remembering that loose objects are promisor objects is mainly
>> +    important for trees, since they may refer to promisor blobs that
>> +    the user does not have.  We do not need to mark loose blobs as
>> +    promisor because they do not refer to other objects.
> 
> I fail to see any logical link between the "loose" and "tree".
> Putting it differently, I do not see why "tree" is so special.
> 
> A promisor pack that contains a tree but lacks blobs the tree refers
> to would be sufficient to let us remember that these missing blobs
> are not corruption.  A loose commit or a tag that is somehow marked
> as obtained from a promisor, if it can serve just like a commit or a
> tag in a promisor pack to promise its direct pointee, would equally
> be useful (if very inefficient).
> 
> In any case, I suspect "since they may refer to promisor blobs" is a
> typo of "since they may refer to promised blobs".

right. good point. i was only thinking about the tree==>blob
relationship.


> 
>> +- Currently, dynamic object fetching invokes fetch-pack for each item
>> +  because most algorithms stumble upon a missing object and need to have
>> +  it resolved before continuing their work.  This may incur significant
>> +  overhead -- and multiple authentication requests -- if many objects are
>> +  needed.
>> +
>> +  We need to investigate use of a long-running process, such as proposed
>> +  in [5,6] to reduce process startup and overhead costs.
> 
> Also perhaps in some operations we can enumerate the objects we will
> need upfront and ask for them in one go (e.g. "git log -p A..B" may
> internally want to do "rev-list --objects A..B" to enumerate trees
> and blobs that we may lack upfront).  I do not think having the
> other side guess is a good idea, though.

right.

> 
>> +- We currently only promisor packfiles.  We need to add support for
>> +  promisor loose objects as described earlier.
> 
> The earlier description was not convincing enough to feel the need
> to me; at least not yet.

It seems like we need it if a promisor packfile gets unpacked for any
reason.  But right, I'm not sure how urgent it is.


Thanks
Jeff




* Re: [PATCH] partial-clone: design doc
  2017-12-12 23:31   ` Philip Oakley
  2017-12-12 23:57     ` Junio C Hamano
@ 2017-12-14 20:32     ` Jeff Hostetler
  1 sibling, 0 replies; 9+ messages in thread
From: Jeff Hostetler @ 2017-12-14 20:32 UTC (permalink / raw)
  To: Philip Oakley, git; +Cc: gitster, peff, jonathantanmy, Jeff Hostetler


Sorry, I didn't see this message in my inbox when I posted V2 of the
design doc.  I'll address questions here and update the doc as necessary.


On 12/12/2017 6:31 PM, Philip Oakley wrote:
> From: "Jeff Hostetler" <git@jeffhostetler.com>
>> From: Jeff Hostetler <jeffhost@microsoft.com>
>>
>> First draft of design document for partial clone feature.
>>
>> Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
>> Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
>> ---
>> Documentation/technical/partial-clone.txt | 240 ++++++++++++++++++++++++++++++
>> 1 file changed, 240 insertions(+)
>> create mode 100644 Documentation/technical/partial-clone.txt
>>
>> diff --git a/Documentation/technical/partial-clone.txt b/Documentation/technical/partial-clone.txt
>> new file mode 100644
>> index 0000000..7ab39d8
>> --- /dev/null
>> +++ b/Documentation/technical/partial-clone.txt
>> @@ -0,0 +1,240 @@
>> +Partial Clone Design Notes
>> +==========================
>> +
>> +The "Partial Clone" feature is a performance optimization for git that
>> +allows git to function without having a complete copy of the repository.
>> +
> 
> I think it would be worthwhile at least listing the issues that make the 'optimisation' necessary, and then the available factors that make the optimisation possible. This helps for future adjustments when those issues and factors change.
> 
> I think the issues are:
> * the size of the repository that is being cloned, both in the width of a commit (you mentioned 100M trees) and the time (hours to days) / size to clone over the connection.
> 
> While the supporting factor is:
> * the remote is always on-line and available for on-demand object fetching (seconds)
> 
> The solution choice then should fall out fairly obviously, and we can separate out the other optimisations that are based on other views about the issues. E.g. my desire for a solution in the off-line case.
> 
> In fact the current design, apart from some terminology, does look well matched, with only a couple of places that would be affected.
> 
> The airplane-mode expectations of a partial clone should also be stated.

Good points.  I'll try to work these into V3.
  
  
>> +During clone and fetch operations, git normally downloads the complete
>> +contents and history of the repository.  That is, during clone the client
>> +receives all of the commits, trees, and blobs in the repository into a
>> +local ODB.  Subsequent fetches extend the local ODB with any new objects.
>> +For large repositories, this can take significant time to download and
>> +large amounts of diskspace to store.
>> +
>> +The goal of this work is to allow git better handle extremely large
>> +repositories.
> 
> Shouln't this goal be nearer the top?

maybe. i'll see about reordering the paragraphs in the introduction.


> 
>>        Often in these repositories there are many files that the
>> +user does not need such as ancient versions of source files, files in
>> +portions of the worktree outside of the user's work area, or large binary
>> +assets.  If we can avoid downloading such unneeded objects *in advance*
>> +during clone and fetch operations, we can decrease download times and
>> +reduce ODB disk usage.
>> +
> 
> Does this need to distinguish between the shallow clone mechanism for reducing the cloning of old history from the desire for a width wise partial clone of only the users narrow work area, and/or without large files/blobs?

I tried to state in the next section that partial clone is independent of
shallow clone.  That is, our stuff works on filtering *within* the
set of commits received.  The existing shallow clone and have/wants
commit limiting features still apply.  I didn't go into detail on the
specific filters, because they are documented elsewhere and I view them
as an expandable set.  The primary goal here is to describe how we
handle missing objects without regard to why an object is missing.
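
For reference, the filter-spec forms currently in the series look
roughly like the following (rev-list-options.txt is authoritative, the
set is expected to grow, and the sparse spec name below is made up):

    git rev-list --objects --filter=blob:none HEAD        # omit all blobs
    git rev-list --objects --filter=blob:limit=1m HEAD    # omit blobs larger than ~1MB
    git rev-list --objects --filter=sparse:oid=master:sparse-spec HEAD
                                   # omit blobs outside a sparse spec stored in the repo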

  
>> +
>> +Non-Goals
>> +---------
>> +
>> +Partial clone is independent of and not intended to conflict with
>> +shallow-clone, refspec, or limited-ref mechanisms since these all operate
>> +at the DAG level whereas partial clone and fetch works *within* the set
>> +of commits already chosen for download.
>> +
[...]
>> +Design Details
>> +--------------
[...]
>> +  These filtered packfiles are incomplete in the traditional sense because
>> +  they may contain trees that reference blobs that the client does not have.
> 
> Is a comment needed here noting that currently, IIUC, the complete trees are fetched in the packfiles, it's just the un-necessary blobs that are omitted ?

Currently, we have filters to omit unwanted blobs.  Later, we hope to
add other filters to omit trees too.  My point was that the packfiles
are incomplete (have missing objects).  I'll reword the above statement
a little.


>> +
>> +
>> +==== How the local repository gracefully handles missing objects
>> +
>> +With partial clone, the fact that objects can be missing makes such
>> +repositories incompatible with older versions of Git, necessitating a
>> +repository extension (see the documentation of "extensions.partialClone"
>> +for more information).
>> +
>> +An object may be missing due to a partial clone or fetch, or missing due
>> +to repository corruption. To differentiate these cases, the local
>> +repository specially indicates packfiles obtained from the promisor
> 
> s/packfiles/filtered packfiles/ ?

got it.

> 
>> +remote. These "promisor packfiles" consist of a "<name>.promisor" file
>> +with arbitrary contents (like the "<name>.keep" files), in addition to
>> +their "<name>.pack" and "<name>.idx" files. (In the future, this ability
>> +may be extended to loose objects[a].)
>> +
>> +The local repository considers a "promisor object" to be an object that
>> +it knows (to the best of its ability) that the promisor remote has,
> 
> s/has/is expected to have/  or /has promised it has/ ?

got it.

> 
>> +either because the local repository has that object in one of its
>> +promisor packfiles, or because another promisor object refers to it. Git
>> +can then check if the missing object is a promisor object, and if yes,
>> +this situation is common and expected. This also means that there is no
>> +need to explicitly maintain an expensive-to-modify list of missing
> 
> I didn't get what part the "expensive-to-modify" was referring to. Or why whatever it is is expensive?

I'll add a paragraph to explain that comment.

  
>> +objects on the client.
>> +
>> +Almost all Git code currently expects any objects referred to by other
>> +objects to be present. Therefore, a fallback mechanism is added:
>> +whenever Git attempts to read an object that is found to be missing, it
>> +will attempt to fetch it from the promisor remote, expanding the subset
>> +of objects available locally, then reattempt the read. This allows
>> +objects to be "faulted in" from the promisor remote without complicated
>> +prediction algorithms. For efficiency reasons, no check as to whether
>> +the missing object is a promisor object is performed. This tends to be
>> +slow as objects are fetched one at a time.
>> +
>> +The fallback mechanism can be turned off and on through a global
>> +variable.
> 
> Perhaps name the variable?

got it.

  
>> +
>> +checkout (and any other command using unpack-trees) has been taught to
> 
> s/checkout/`git-checkout`/  to show that a proper sentence has started, maybe?
> 
>> +batch blob fetching. rev-list has been taught to be able to print
>> +filtered or missing objects and can be used with more general batch
>> +fetch scripts. In the future, Git commands will be updated to batch such
>> +fetches or otherwise handle missing objects more efficiently.
>> +
>> +Fsck has been updated to be fully aware of promisor objects. The repack
>> +in GC has been updated to not touch promisor packfiles at all, and to
>> +only repack other objects.
>> +
>> +The global variable fetch_if_missing is used to control whether an
>> +object lookup will attempt to dynamically fetch a missing object or
>> +report an error.
> 
> Is this also the airplane mode control?

This can be used by fsck or other commands to gently try to load an object
and get an error rather than implicitly attempting to fetch it.
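
So, in a blob:none partial clone, the observable difference is roughly
this (just a sketch of the intended behavior):

    git fsck          # fetch_if_missing is turned off internally; missing
                      # promisor objects are noted, not fetched, and not
                      # reported as corruption
    git log -p -1     # needs blob contents, so each missing blob is
                      # dynamically fetched ("faulted in") before the
                      # diff is shown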

  
>> +Current Limitations
>> +-------------------
>> +
>> +- The remote used for a partial clone (or the first partial fetch
>> +  following a regular clone) is marked as the "promisor remote".
>> +
>> +  We are currently limited to a single promisor remote and only that
>> +  remote may be used for subsequent partial fetches.
>> +
>> +- Dynamic object fetching will only ask the promisor remote for missing
>> +  objects.  We assume that the promisor remote has a complete view of the
>> +  repository and can satisfy all such requests.
>> +
>> +  Future work may lift this restriction when we figure out how to route
> 
> Could the "Future Work: " items have a consistent style? It makes it easier to see the expectation of the likely development.

sure. i'll break this into 2 sections.

  
>> +  such requests.  The current assumption is that partial clone will not be
>> +  used for triangular workflows that would need that (at least initially).
>> +
>> +- Repack essentially treats promisor and non-promisor packfiles as 2
>> +  distinct partitions and does not mix them.  Repack currently only works
>> +  on non-promisor packfiles and loose objects.
>> +
>> +  Future work may let repack work to repack promisor packfiles (while
>> +  keeping them in a different partition from the others).
>> +
>> +- The current object filtering mechanism does not make use of packfile
>> +  bitmaps (when present).
>> +
>> +  We should allow this for filters that are not pathname-based.
>> +
>> +- Currently, dynamic object fetching invokes fetch-pack for each item
>> +  because most algorithms stumble upon a missing object and need to have
>> +  it resolved before continuing their work.  This may incur significant
>> +  overhead -- and multiple authentication requests -- if many objects are
>> +  needed.
> 
> I think this is one of the points of distinction between the always connected partial clone and the potential 'airplane mode' narrow clone where missing objects are not [cannot be] fetched on-the-fly.
> 
> [my 'solution' is, when requested, to expand the oid to a short distinct stub file, and let them stand in the place of the real (missing) file/directories, and then let all the regular commands act on those stubs, e.g. diffs just show the changed oid embedded in the stub, etc. However that is all orthogonal to this design doc.]

I'm not really prepared to discuss your proposal (no offense, I just haven't
kept up with it).  Also, it might be fine for git-diff to print stub file
meta-data, but commands that actually need the file contents (like GCC) will
still fail, so I'm not sure how useful this is.  But again, I've not kept up
with your proposal, so I might be missing something here.

[...]
> Thanks for the write up.
> -- 
> Philip

Thanks for the comments,
Jeff



* Re: [PATCH] partial-clone: design doc
  2017-12-13 13:17       ` Philip Oakley
@ 2017-12-14 20:46         ` Jeff Hostetler
  0 siblings, 0 replies; 9+ messages in thread
From: Jeff Hostetler @ 2017-12-14 20:46 UTC (permalink / raw)
  To: Philip Oakley, Junio C Hamano; +Cc: git, peff, jonathantanmy, Jeff Hostetler



On 12/13/2017 8:17 AM, Philip Oakley wrote:
> From: "Junio C Hamano" <gitster@pobox.com>
>> "Philip Oakley" <philipoakley@iee.org> writes:
>>
>>>> +  These filtered packfiles are incomplete in the traditional sense
>>>> because
>>>> +  they may contain trees that reference blobs that the client does
>>>> not have.
>>>
>>> Is a comment needed here noting that currently, IIUC, the complete
>>> trees are fetched in the packfiles, it's just the un-necessary blobs
>>> that are omitted ?
>>
>> I probably am misreading what you meant to say, but the above
>> statement with "currently" taken literally to mean the system
>> without JeffH's changes, is false.
> 
> I was meaning the current JeffH's V6 series, rather than the last Git release.
> 
> In one of the previous discussions Jeff had noted that (at that time) his partial design would provide a full set of trees for the selected commits (excluding the trees already available locally), but only a few of the file blobs (based on the filter spec).
> 
> So yes, I should have been clearer to avoid talking at cross purposes.

Right, we build upon the existing thin-pack capabilities such that a
fetch following a clone gets a packfile that assumes the client already
has all of the objects in the "edge".  So a fetch would not need to
receive trees and blobs that are already present in the edge commits.

What we are adding here is a way to filter/restrict even further the
set of objects sent to the client.

> 
>>
>> When the receiver says it has commit A and the sender wants to send
>> a commit B (because the receiver said it does not have it, and it
>> wants it), trees in A are not sent in the pack the sender sends to
>> give objects sufficient to complete B, which the receiver wanted to
>> have, even if B also has those trees.  If you fetch from me twice
>> and between that time Documentation/ directory did not change, the
>> second fetch will not have the tree object that corresponds to that
>> hierarchy (and of course no blobs and sub trees inside it).
> 
> Though, after the fetch has completed (v2.15 Git), the receiver will have the 'full set of trees and blobs'. In Jeff's design (V6) the reciever would still have a full set of trees, but only a partial set of the blobs. So my viewpoint was not of the pack file but of the receiver's object store after the fetch.

Currently (with our changes) the receiver will have all of the trees
and only some of the blobs.  If we later add another filter that can
filter trees, the client will also have missing but referenced trees too.

  
>>
>> So "the complete trees are fetched" is not true.  What is true (and
>> what matters more in JeffH's document) is that fetching is done in
>> such a way that objects resulting in the receiving repository are
>> complete in the current system that does not allow promised objects.
>> If some objects resulting in the receiving repository are incomplete,
>> the current system considers that we corrupted the repository.
>>
>> The promise mechanism says that it is fine for the receiving end to
>> lack blobs, trees or commits, as long as the promisor repository
>> tells it that these "missing" objects can be obtained from it later.
> 
> True. (though I'm not sure exactly how Jeff decides about commits - I thought theye were not part of this optimisation)

I've not talked about commit filtering -- mainly because we already
have such machinery in shallow-clone -- and I did not want to mess
with the haves/wants computations.

But it will work with missing commits: because of the way object lookup
happens, a missing commit will trigger the fetch-object code just like it
does for missing blobs.  The ODB layer doesn't really care what type of
object it is -- just that it is missing and needs to be dynamically fetched.
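
In other words (a sketch; the id below is a placeholder for some commit
object the client does not have locally):

    git cat-file -p <missing-commit-id>
    # the ODB lookup misses, fetch_if_missing triggers a one-object fetch
    # from the promisor remote, and the lookup is then retried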
  
Thanks
Jeff


end of thread

Thread overview: 9+ messages
2017-12-08 19:26 [PATCH] Partial clone design document Jeff Hostetler
2017-12-08 19:26 ` [PATCH] partial-clone: design doc Jeff Hostetler
2017-12-08 20:14   ` Junio C Hamano
2017-12-13 22:34     ` Jeff Hostetler
2017-12-12 23:31   ` Philip Oakley
2017-12-12 23:57     ` Junio C Hamano
2017-12-13 13:17       ` Philip Oakley
2017-12-14 20:46         ` Jeff Hostetler
2017-12-14 20:32     ` Jeff Hostetler
