* [PATCH 00/10] RFC Partial Clone and Fetch
@ 2017-03-08 17:37 Jeff Hostetler
  0 siblings, 0 replies; 25+ messages in thread
From: Jeff Hostetler @ 2017-03-08 17:37 UTC (permalink / raw)
  To: git
  Cc: jeffhost, peff, gitster, markbt, benpeart, jonathantanmy,
	Jeff Hostetler

From: Jeff Hostetler <git@jeffhostetler.com>


[RFC] Partial Clone and Fetch
=============================


This is a WIP RFC for a partial clone and fetch feature wherein the client
can request that the server omit various blobs from the packfile during
clone and fetch.  Clients can later request omitted blobs (either from a
modified upload-pack-like request to the server or via a completely
independent mechanism).

The purpose here is to reduce the size of packfile downloads and help
git scale to extremely large repos.

I use the term "partial" here to refer to a portion of one or more commits
and to avoid use of loaded terms like "sparse", "lazy", "narrow", and "skeleton".

The concept of a partial clone/fetch is independent of and can complement
the existing shallow-clone, refspec, and limited-ref filtering mechanisms
since these all filter at the DAG level whereas the work described here
works *within* the set of commits already chosen for download.


A. Requesting a Partial Clone/Fetch
===================================

Clone, fetch, and fetch-pack will accept one or more new "partial"
command line arguments as described below.  The fetch-pack/upload-pack
protocol will be extended to include these new arguments.  Upload-pack
and pack-objects will be updated accordingly.  Pack-objects will filter
out the unwanted blobs as it is building the packfile.  Rev-list and
index-pack will be updated to not complain when missing blobs are
detected in the received packfile.

[1] "--partial-by-size=<n>[kmg]"
Where <n> is a non-negative integer with an optional unit.

Request that only blobs smaller than this be included in the packfile.
The client might use this to implement an alternate LFS or ODB mechanism
for large blobs, such as suggested in:
    https://public-inbox.org/git/20161130210420.15982-1-chriscool@tuxfamily.org/

A special case of size zero would omit all blobs and is similar to the
commits-and-trees-only feature described in:
    https://public-inbox.org/git/20170113155253.1644-1-benpeart@microsoft.com/
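For concreteness, the "<n>[kmg]" magnitude and the size test could be sketched as below. The helper names `parse_magnitude` and `keep_blob_by_size` are hypothetical; the actual series reuses git's existing option parsing (OPT_MAGNITUDE) rather than hand-rolling this:

```c
#include <stdlib.h>

/* Hypothetical sketch: parse "<n>[kmg]" into a byte count, or -1 on error. */
static long parse_magnitude(const char *arg)
{
	char *end;
	long n = strtol(arg, &end, 10);
	if (n < 0 || end == arg)
		return -1;
	switch (*end) {
	case 'k': n <<= 10; end++; break;
	case 'm': n <<= 20; end++; break;
	case 'g': n <<= 30; end++; break;
	}
	return *end ? -1 : n;
}

/*
 * A blob is kept only when a positive threshold exceeds its size;
 * a threshold of zero omits every blob.  (Callers skip this filter
 * entirely when no --partial-by-size was given.)
 */
static int keep_blob_by_size(long threshold, unsigned long size)
{
	return threshold > 0 && size < (unsigned long)threshold;
}
```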

[2] "--partial-special"
Request that special files, such as ".gitignore" and ".gitattributes",
be included.
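The check itself is simple; a minimal sketch (the helper name is illustrative, not the actual code), matching the final path component against the known special filenames:

```c
#include <string.h>

/* Sketch: does this tree entry name a git special file? */
static int is_special_path(const char *name)
{
	const char *base = strrchr(name, '/');
	base = base ? base + 1 : name;
	return !strcmp(base, ".gitattributes") || !strcmp(base, ".gitignore");
}
```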

[3] *TODO* "--partial-by-profile=<sparse-checkout-path>"
Where <sparse-checkout-path> is a version-controlled file in the repository
(either present in the requested commit or in the default HEAD on the server).

    [I envision a ".gitsparse/<path>" hierarchy where teams can store
     common sparse-checkout profiles.  And then they can reference
     them from their private ".git/info/sparse-checkout" files.]

Pack-objects will use this file and the sparse-checkout rules to only
include blobs in the packfile that would be needed to do the corresponding
sparse-checkout (and let the client avoid having to demand-load their
entire enlistment).


When multiple "partial" options are given, they are treated as a simple OR
giving the union of the blobs selected.
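A self-contained sketch of that union rule (names are illustrative, not the actual pack-objects code): once any partial option is active, a blob is omitted unless at least one option selects it.

```c
#include <string.h>

static int want_blob(long by_size, int special,
		     unsigned long size, const char *path)
{
	const char *base = strrchr(path, '/');
	base = base ? base + 1 : path;
	if (by_size > 0 && size < (unsigned long)by_size)
		return 1;	/* selected by --partial-by-size */
	if (special && (!strcmp(base, ".gitignore") ||
			!strcmp(base, ".gitattributes")))
		return 1;	/* selected by --partial-special */
	return 0;		/* omitted from the packfile */
}
```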

The patch series describes the changes to the fetch-pack/upload-pack
protocol:
    Documentation/technical/pack-protocol.txt
    Documentation/technical/protocol-capabilities.txt


B. Issues Backfilling Omitted Blobs
===================================

Ideally, if the client only performs "--partial-by-profile" fetches, it
should not need to fetch individual missing blobs, but we have to allow
for the other options and for unexpected situations.

There are three orthogonal questions here: when, how, and where?


[1] When:
(1a) a pre-command or hook to identify needed blobs and pre-fetch them
before allowing the actual command to start;
(1b) a dry-run mode for the command to likewise pre-fetch them; or
(1c) "fault" them in as necessary in read_object() while the command is
running and without any pre-fetch (either synchronously or asynchronously
and with/without a helper process).

Ideas for (1c) are being addressed in the following threads:
    https://public-inbox.org/git/20170113155253.1644-1-benpeart@microsoft.com/
    https://public-inbox.org/git/20170117184258.sd7h2hkv27w52gzt@sigill.intra.peff.net/
    https://public-inbox.org/git/20161130210420.15982-1-chriscool@tuxfamily.org/
so I won't consider them here.

Ideas (1a) and (1b) have the advantage that they try to obtain all
required blobs before allowing an operation to start, so there is
less opportunity to leave the user in a weird state.

The best solution may be a combination of (1a) and (1b) and may depend
on the individual command.  However, (1b) will further complicate the
source in the existing commands, so in some cases it may be simpler to
just take the ideas and implement stand-alone pre-commands.

For now, I'm going to limit this RFC to (1a).


[2] How:
(2a) augment the existing git protocols to include blob requests;
(2b) local external process (such as a database client or a local bulk
fetch daemon);

Ideas for (2b) are being addressed in the above threads, so I won't
consider them here.

So I'm going to limit this RFC to (2a).


[3] Where:
(3a) the same remote server used for the partial clone/fetch;
(3b) anywhere else, such as a proxy server or Azure or S3 blob store.

There's no reason that the client should be limited to going back to
the same server, but I'm not going to consider it here, so I'm going
to limit this RFC to (3a).



C. New Blob-Fetch Protocol (2a)
===============================

*TODO* A new pair of commands, such as fetch-blob-pack and upload-blob-pack,
will be created to let the client request a batch of blobs and receive a
packfile.  A protocol similar to the fetch-pack/upload-pack protocol will
be spoken between them.  (This avoids complicating the existing protocol
and the work of enumerating the refs.)  Upload-blob-pack will use
pack-objects to build the packfile.

This is also more efficient than requesting a single blob at a time using
the existing fetch-pack/upload-pack mechanism (with the various
allow-unreachable options).

*TODO* The new request protocol will be defined in the patch series.
It will include a list of the desired blob SHAs, possibly along with the
commit SHA, branch name, and pathname of each blob (or whatever is
necessary to let the server address the reachability concerns), and
possibly the last known SHA for each blob to allow for deltification
in the packfile.
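Whatever fields end up in the request, it would presumably be framed with git's pkt-line convention: a 4-hex-digit length that counts its own four bytes, followed by the payload. A hedged sketch of that framing, not the pkt-line API from the git tree:

```c
#include <stdio.h>
#include <string.h>

/* Format one pkt-line into out; returns the total length or -1 on error. */
static int pkt_line(char *out, size_t outsz, const char *payload)
{
	size_t len = strlen(payload) + 4;	/* length prefix counts itself */
	if (len > 0xffff || len + 1 > outsz)
		return -1;
	snprintf(out, outsz, "%04x%s", (unsigned)len, payload);
	return (int)len;
}
```

A batch request might then be a series of "want <sha>" pkt-lines terminated by a flush-pkt ("0000"), mirroring the existing upload-request grammar.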


D. Pre-fetching Blobs (1a)
==========================

On the client side, one or more special commands will be created to assemble
the list of blobs needed for an operation and pass it to fetch-blob-pack.


Checkout Example:  After running a command like:
    'clone --partial-by-size=1m --no-checkout'

and before doing an actual checkout, we need a command to essentially do:
    (1) "ls-tree -r <tree-ish>",
    (2) filter that by the sparse-checkout currently in effect,
    (3) filter that for missing blobs,
    (4) and pass the resulting list to fetch-blob-pack.

Afterwards, checkout should complete without faulting.

A new "git ls-partial <tree-ish>" command has been created to do
steps 1 through 3 and print the resulting list of SHAs on stdout.


E. Unresolved Thoughts
======================

*TODO* The server should optionally return (in a side-band?) a list
of the blobs that it omitted from the packfile (and possibly their
sizes or sha1_object_info() data) during the fetch-pack/upload-pack
operation.  This would allow the client to distinguish invalid SHAs
from missing ones.  Size information might also let the client choose
between various servers.

*TODO* The partial clone arguments should be recorded in ".git/info/"
so that subsequent fetch commands can inherit them and rev-list/index-pack
know not to complain by default.

*TODO* Update GC, like rev-list, not to complain when there are missing blobs.

*TODO* Extend ls-partial to accept the "-m" and three tree-ish arguments
like read-tree, so we can pre-fetch for merges that may require file
merges (which may or may not be within our sparse-checkout).

*TODO* I also need to review the RFC that Mark Thomas submitted over
the weekend:
    https://public-inbox.org/git/20170304191901.9622-1-markbt%40efaref.net/t





Jeff Hostetler (10):
  pack-objects: eat CR in addition to LF after fgets.
  pack-objects: add --partial-by-size=n --partial-special
  pack-objects: test for --partial-by-size --partial-special
  upload-pack: add partial (sparse) fetch
  fetch-pack: add partial-by-size and partial-special
  rev-list: add --allow-partial option to relax connectivity checks
  index-pack: add --allow-partial option to relax blob existence checks
  fetch: add partial-by-size and partial-special arguments
  clone: add partial-by-size and partial-special arguments
  ls-partial: created command to list missing blobs

 Documentation/technical/pack-protocol.txt         |  14 ++
 Documentation/technical/protocol-capabilities.txt |   7 +
 Makefile                                          |   2 +
 builtin.h                                         |   1 +
 builtin/clone.c                                   |  26 ++
 builtin/fetch-pack.c                              |   9 +
 builtin/fetch.c                                   |  26 +-
 builtin/index-pack.c                              |  20 +-
 builtin/ls-partial.c                              | 110 +++++++++
 builtin/pack-objects.c                            |  64 ++++-
 builtin/rev-list.c                                |  22 +-
 connected.c                                       |   3 +
 connected.h                                       |   3 +
 fetch-pack.c                                      |  17 ++
 fetch-pack.h                                      |   2 +
 git.c                                             |   1 +
 partial-utils.c                                   | 279 ++++++++++++++++++++++
 partial-utils.h                                   |  93 ++++++++
 t/5316-pack-objects-partial.sh                    |  72 ++++++
 transport.c                                       |   8 +
 transport.h                                       |   8 +
 upload-pack.c                                     |  32 ++-
 22 files changed, 813 insertions(+), 6 deletions(-)
 create mode 100644 builtin/ls-partial.c
 create mode 100644 partial-utils.c
 create mode 100644 partial-utils.h
 create mode 100644 t/5316-pack-objects-partial.sh

-- 
2.7.4



* [PATCH 01/10] pack-objects: eat CR in addition to LF after fgets.
  2017-03-08 18:50 [PATCH 00/10] RFC Partial Clone and Fetch git
@ 2017-03-08 18:50 ` git
  2017-03-08 18:50 ` [PATCH 02/10] pack-objects: add --partial-by-size=n --partial-special git
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 25+ messages in thread
From: git @ 2017-03-08 18:50 UTC (permalink / raw)
  To: git
  Cc: jeffhost, peff, gitster, markbt, benpeart, jonathantanmy,
	Jeff Hostetler

From: Jeff Hostetler <git@jeffhostetler.com>

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 builtin/pack-objects.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index f294dcf..7e052bb 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2764,6 +2764,8 @@ static void get_object_list(int ac, const char **av)
 		int len = strlen(line);
 		if (len && line[len - 1] == '\n')
 			line[--len] = 0;
+		if (len && line[len - 1] == '\r')
+			line[--len] = 0;
 		if (!len)
 			break;
 		if (*line == '-') {
-- 
2.7.4



* [PATCH 02/10] pack-objects: add --partial-by-size=n --partial-special
  2017-03-08 18:50 [PATCH 00/10] RFC Partial Clone and Fetch git
  2017-03-08 18:50 ` [PATCH 01/10] pack-objects: eat CR in addition to LF after fgets git
@ 2017-03-08 18:50 ` git
  2017-03-08 18:50 ` [PATCH 03/10] pack-objects: test for --partial-by-size --partial-special git
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 25+ messages in thread
From: git @ 2017-03-08 18:50 UTC (permalink / raw)
  To: git
  Cc: jeffhost, peff, gitster, markbt, benpeart, jonathantanmy,
	Jeff Hostetler

From: Jeff Hostetler <git@jeffhostetler.com>

Teach pack-objects to omit blobs from the generated packfile.

When the --partial-by-size=n[kmg] argument is used, only blobs
smaller than the requested size are included.  When n is zero,
no blobs are included.

When the --partial-special argument is used, git special files,
such as ".gitattributes" and ".gitignore", are included.

When both are given, the union of the two is included.

This is intended to be used in a partial clone or fetch.
(This has also been called sparse- or lazy-clone.)

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 builtin/pack-objects.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 61 insertions(+), 1 deletion(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 7e052bb..2df2f49 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -77,6 +77,10 @@ static unsigned long cache_max_small_delta_size = 1000;
 
 static unsigned long window_memory_limit = 0;
 
+static signed long partial_by_size = -1;
+static int partial_special = 0;
+static struct trace_key trace_partial = TRACE_KEY_INIT(PARTIAL);
+
 /*
  * stats
  */
@@ -2532,6 +2536,54 @@ static void show_object(struct object *obj, const char *name, void *data)
 	obj->flags |= OBJECT_ADDED;
 }
 
+/*
+ * If ANY --partial-* option was given, we want to OMIT all
+ * blobs UNLESS they match one of our patterns.  We treat
+ * the options as OR's so that we get the resulting UNION.
+ */
+static void show_object_partial(struct object *obj, const char *name, void *data)
+{
+	unsigned long s = 0;
+
+	if (obj->type != OBJ_BLOB)
+		goto include_it;
+
+	/*
+	 * When (partial_by_size == 0), we want to OMIT all blobs.
+	 * When (partial_by_size >  0), we want blobs smaller than that.
+	 */
+	if (partial_by_size > 0) {
+		enum object_type t = sha1_object_info(obj->oid.hash, &s);
+		assert(t == OBJ_BLOB);
+		if (s < partial_by_size)
+			goto include_it;
+	}
+
+	/*
+	 * When (partial_special), we want the .git* special files.
+	 */
+	if (partial_special) {
+		if (strcmp(name, GITATTRIBUTES_FILE) == 0 ||
+			strcmp(name, ".gitignore") == 0)
+			goto include_it;
+		else {
+			const char *last_slash = strrchr(name, '/');
+			if (last_slash)
+				if (strcmp(last_slash+1, GITATTRIBUTES_FILE) == 0 ||
+					strcmp(last_slash+1, ".gitignore") == 0)
+					goto include_it;
+		}
+	}
+
+	trace_printf_key(
+		&trace_partial, "omitting blob '%s' %"PRIuMAX" '%s'\n",
+		oid_to_hex(&obj->oid), (uintmax_t)s, name);
+	return;
+
+include_it:
+	show_object(obj, name, data);
+}
+
 static void show_edge(struct commit *commit)
 {
 	add_preferred_base(commit->object.oid.hash);
@@ -2794,7 +2846,11 @@ static void get_object_list(int ac, const char **av)
 	if (prepare_revision_walk(&revs))
 		die("revision walk setup failed");
 	mark_edges_uninteresting(&revs, show_edge);
-	traverse_commit_list(&revs, show_commit, show_object, NULL);
+
+	if (partial_by_size >= 0 || partial_special)
+		traverse_commit_list(&revs, show_commit, show_object_partial, NULL);
+	else
+		traverse_commit_list(&revs, show_commit, show_object, NULL);
 
 	if (unpack_unreachable_expiration) {
 		revs.ignore_missing_links = 1;
@@ -2930,6 +2986,10 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 			 N_("use a bitmap index if available to speed up counting objects")),
 		OPT_BOOL(0, "write-bitmap-index", &write_bitmap_index,
 			 N_("write a bitmap index together with the pack index")),
+		OPT_MAGNITUDE(0, "partial-by-size", (unsigned long *)&partial_by_size,
+			 N_("only include blobs smaller than size in result")),
+		OPT_BOOL(0, "partial-special", &partial_special,
+			 N_("only include blobs for git special files")),
 		OPT_END(),
 	};
 
-- 
2.7.4



* [PATCH 03/10] pack-objects: test for --partial-by-size --partial-special
  2017-03-08 18:50 [PATCH 00/10] RFC Partial Clone and Fetch git
  2017-03-08 18:50 ` [PATCH 01/10] pack-objects: eat CR in addition to LF after fgets git
  2017-03-08 18:50 ` [PATCH 02/10] pack-objects: add --partial-by-size=n --partial-special git
@ 2017-03-08 18:50 ` git
  2017-03-08 18:50 ` [PATCH 04/10] upload-pack: add partial (sparse) fetch git
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 25+ messages in thread
From: git @ 2017-03-08 18:50 UTC (permalink / raw)
  To: git
  Cc: jeffhost, peff, gitster, markbt, benpeart, jonathantanmy,
	Jeff Hostetler

From: Jeff Hostetler <git@jeffhostetler.com>

Some simple tests for pack-objects with the new --partial-by-size
and --partial-special options.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 t/5316-pack-objects-partial.sh | 72 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 72 insertions(+)
 create mode 100644 t/5316-pack-objects-partial.sh

diff --git a/t/5316-pack-objects-partial.sh b/t/5316-pack-objects-partial.sh
new file mode 100644
index 0000000..352de34
--- /dev/null
+++ b/t/5316-pack-objects-partial.sh
@@ -0,0 +1,72 @@
+#!/bin/sh
+
+test_description='pack-object partial'
+
+. ./test-lib.sh
+
+test_expect_success 'setup' '
+	perl -e "print \"a\" x 11;"      > a &&
+	perl -e "print \"a\" x 1100;"    > b &&
+	perl -e "print \"a\" x 1100000;" > c &&
+	echo "ignored"                   > .gitignore &&
+	git add a b c .gitignore &&
+	git commit -m test
+	'
+
+test_expect_success 'all blobs' '
+	git pack-objects --revs --thin --stdout >all.pack <<-EOF &&
+	master
+	
+	EOF
+	git index-pack all.pack &&
+	test 4 = $(git verify-pack -v all.pack | grep blob | wc -l)
+	'
+
+test_expect_success 'no blobs' '
+	git pack-objects --revs --thin --stdout --partial-by-size=0 >none.pack <<-EOF &&
+	master
+	
+	EOF
+	git index-pack none.pack &&
+	test 0 = $(git verify-pack -v none.pack | grep blob | wc -l)
+	'
+
+test_expect_success 'small blobs' '
+	git pack-objects --revs --thin --stdout --partial-by-size=1M >small.pack <<-EOF &&
+	master
+	
+	EOF
+	git index-pack small.pack &&
+	test 3 = $(git verify-pack -v small.pack | grep blob | wc -l)
+	'
+
+test_expect_success 'tiny blobs' '
+	git pack-objects --revs --thin --stdout --partial-by-size=100 >tiny.pack <<-EOF &&
+	master
+	
+	EOF
+	git index-pack tiny.pack &&
+	test 2 = $(git verify-pack -v tiny.pack | grep blob | wc -l)
+	'
+
+test_expect_success 'special' '
+	git pack-objects --revs --thin --stdout --partial-special >spec.pack <<-EOF &&
+	master
+	
+	EOF
+	git index-pack spec.pack &&
+	test 1 = $(git verify-pack -v spec.pack | grep blob | wc -l)
+	'
+
+test_expect_success 'union' '
+	git pack-objects --revs --thin --stdout --partial-by-size=0 --partial-special >union.pack <<-EOF &&
+	master
+	
+	EOF
+	git index-pack union.pack &&
+	test 1 = $(git verify-pack -v union.pack | grep blob | wc -l)
+	'
+
+test_done
+
+
-- 
2.7.4



* [PATCH 04/10] upload-pack: add partial (sparse) fetch
  2017-03-08 18:50 [PATCH 00/10] RFC Partial Clone and Fetch git
                   ` (2 preceding siblings ...)
  2017-03-08 18:50 ` [PATCH 03/10] pack-objects: test for --partial-by-size --partial-special git
@ 2017-03-08 18:50 ` git
  2017-03-08 18:50 ` [PATCH 05/10] fetch-pack: add partial-by-size and partial-special git
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 25+ messages in thread
From: git @ 2017-03-08 18:50 UTC (permalink / raw)
  To: git
  Cc: jeffhost, peff, gitster, markbt, benpeart, jonathantanmy,
	Jeff Hostetler

From: Jeff Hostetler <git@jeffhostetler.com>

Teach upload-pack to advertise the "partial" capability
in the fetch-pack/upload-pack protocol header and to pass
the value of partial-by-size and partial-special on to
pack-objects.

Update protocol documentation.

This might be used in conjunction with a partial (sparse) clone
or fetch to omit various blobs from the generated packfile.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 Documentation/technical/pack-protocol.txt         | 14 ++++++++++
 Documentation/technical/protocol-capabilities.txt |  7 +++++
 upload-pack.c                                     | 32 ++++++++++++++++++++++-
 3 files changed, 52 insertions(+), 1 deletion(-)

diff --git a/Documentation/technical/pack-protocol.txt b/Documentation/technical/pack-protocol.txt
index c59ac99..0032729 100644
--- a/Documentation/technical/pack-protocol.txt
+++ b/Documentation/technical/pack-protocol.txt
@@ -212,6 +212,7 @@ out of what the server said it could do with the first 'want' line.
   upload-request    =  want-list
 		       *shallow-line
 		       *1depth-request
+		       *partial
 		       flush-pkt
 
   want-list         =  first-want
@@ -223,10 +224,15 @@ out of what the server said it could do with the first 'want' line.
 		       PKT-LINE("deepen-since" SP timestamp) /
 		       PKT-LINE("deepen-not" SP ref)
 
+  partial           =  PKT-LINE("partial-by-size" SP magnitude) /
+		       PKT-LINE("partial-special")
+
   first-want        =  PKT-LINE("want" SP obj-id SP capability-list)
   additional-want   =  PKT-LINE("want" SP obj-id)
 
   depth             =  1*DIGIT
+
+  magnitude         =  1*DIGIT [ "k" | "m" | "g" ]
 ----
 
 Clients MUST send all the obj-ids it wants from the reference
@@ -249,6 +255,14 @@ complete those commits. Commits whose parents are not received as a
 result are defined as shallow and marked as such in the server. This
 information is sent back to the client in the next step.
 
+The client can optionally request a partial packfile that omits
+various blobs.  The value of "partial-by-size" is a non-negative
+integer with optional units and requests blobs smaller than this
+value.  The "partial-special" command requests git-special files,
+such as ".gitignore".  Requesting both yields the union of the two.
+These requests are only valid if the server advertises the "partial"
+capability.
+
 Once all the 'want's and 'shallow's (and optional 'deepen') are
 transferred, clients MUST send a flush-pkt, to tell the server side
 that it is done sending the list.
diff --git a/Documentation/technical/protocol-capabilities.txt b/Documentation/technical/protocol-capabilities.txt
index 26dcc6f..9aa2123 100644
--- a/Documentation/technical/protocol-capabilities.txt
+++ b/Documentation/technical/protocol-capabilities.txt
@@ -309,3 +309,10 @@ to accept a signed push certificate, and asks the <nonce> to be
 included in the push certificate.  A send-pack client MUST NOT
 send a push-cert packet unless the receive-pack server advertises
 this capability.
+
+partial
+-------
+
+If the upload-pack server advertises this capability, fetch-pack
+may send various "partial-*" commands to request a partial clone
+or fetch where the server omits certain blobs from the packfile.
diff --git a/upload-pack.c b/upload-pack.c
index 7597ba3..74f9dfa 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -63,6 +63,11 @@ static int advertise_refs;
 static int stateless_rpc;
 static const char *pack_objects_hook;
 
+static struct strbuf partial_by_size = STRBUF_INIT;
+static int client_requested_partial_capability;
+static int have_partial_by_size;
+static int have_partial_special;
+
 static void reset_timeout(void)
 {
 	alarm(timeout);
@@ -130,6 +135,10 @@ static void create_pack_file(void)
 		argv_array_push(&pack_objects.args, "--delta-base-offset");
 	if (use_include_tag)
 		argv_array_push(&pack_objects.args, "--include-tag");
+	if (have_partial_by_size)
+		argv_array_push(&pack_objects.args, partial_by_size.buf);
+	if (have_partial_special)
+		argv_array_push(&pack_objects.args, "--partial-special");
 
 	pack_objects.in = -1;
 	pack_objects.out = -1;
@@ -793,6 +802,23 @@ static void receive_needs(void)
 			deepen_rev_list = 1;
 			continue;
 		}
+		if (skip_prefix(line, "partial-by-size ", &arg)) {
+			unsigned long s;
+			if (!client_requested_partial_capability)
+				die("git upload-pack: 'partial-by-size' option requires 'partial' capability");
+			if (!git_parse_ulong(arg, &s))
+				die("git upload-pack: invalid partial-by-size value: %s", line);
+			strbuf_addstr(&partial_by_size, "--partial-by-size=");
+			strbuf_addstr(&partial_by_size, arg);
+			have_partial_by_size = 1;
+			continue;
+		}
+		if (skip_prefix(line, "partial-special", &arg)) {
+			if (!client_requested_partial_capability)
+				die("git upload-pack: 'partial-special' option requires 'partial' capability");
+			have_partial_special = 1;
+			continue;
+		}
 		if (!skip_prefix(line, "want ", &arg) ||
 		    get_sha1_hex(arg, sha1_buf))
 			die("git upload-pack: protocol error, "
@@ -820,6 +846,8 @@ static void receive_needs(void)
 			no_progress = 1;
 		if (parse_feature_request(features, "include-tag"))
 			use_include_tag = 1;
+		if (parse_feature_request(features, "partial"))
+			client_requested_partial_capability = 1;
 
 		o = parse_object(sha1_buf);
 		if (!o)
@@ -924,7 +952,9 @@ static int send_ref(const char *refname, const struct object_id *oid,
 {
 	static const char *capabilities = "multi_ack thin-pack side-band"
 		" side-band-64k ofs-delta shallow deepen-since deepen-not"
-		" deepen-relative no-progress include-tag multi_ack_detailed";
+		" deepen-relative no-progress include-tag multi_ack_detailed"
+		" partial"
+		;
 	const char *refname_nons = strip_namespace(refname);
 	struct object_id peeled;
 
-- 
2.7.4



* [PATCH 05/10] fetch-pack: add partial-by-size and partial-special
From: git @ 2017-03-08 18:50 UTC (permalink / raw)
  To: git
  Cc: jeffhost, peff, gitster, markbt, benpeart, jonathantanmy,
	Jeff Hostetler

From: Jeff Hostetler <git@jeffhostetler.com>

Teach fetch-pack to take --partial-by-size and --partial-special
arguments and pass them via the transport to upload-pack to
request that certain blobs be omitted from the resulting packfile.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 builtin/fetch-pack.c |  9 +++++++++
 fetch-pack.c         | 17 +++++++++++++++++
 fetch-pack.h         |  2 ++
 transport.c          |  8 ++++++++
 transport.h          |  8 ++++++++
 5 files changed, 44 insertions(+)

diff --git a/builtin/fetch-pack.c b/builtin/fetch-pack.c
index cfe9e44..324d7b2 100644
--- a/builtin/fetch-pack.c
+++ b/builtin/fetch-pack.c
@@ -8,6 +8,7 @@
 static const char fetch_pack_usage[] =
 "git fetch-pack [--all] [--stdin] [--quiet | -q] [--keep | -k] [--thin] "
 "[--include-tag] [--upload-pack=<git-upload-pack>] [--depth=<n>] "
+"[--partial-by-size=<n>] [--partial-special] "
 "[--no-progress] [--diag-url] [-v] [<host>:]<directory> [<refs>...]";
 
 static void add_sought_entry(struct ref ***sought, int *nr, int *alloc,
@@ -143,6 +144,14 @@ int cmd_fetch_pack(int argc, const char **argv, const char *prefix)
 			args.update_shallow = 1;
 			continue;
 		}
+		if (skip_prefix(arg, "--partial-by-size=", &arg)) {
+			args.partial_by_size = xstrdup(arg);
+			continue;
+		}
+		if (!strcmp("--partial-special", arg)) {
+			args.partial_special = 1;
+			continue;
+		}
 		usage(fetch_pack_usage);
 	}
 	if (deepen_not.nr)
diff --git a/fetch-pack.c b/fetch-pack.c
index e0f5d5c..e355c38 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -372,6 +372,8 @@ static int find_common(struct fetch_pack_args *args,
 			if (prefer_ofs_delta)   strbuf_addstr(&c, " ofs-delta");
 			if (deepen_since_ok)    strbuf_addstr(&c, " deepen-since");
 			if (deepen_not_ok)      strbuf_addstr(&c, " deepen-not");
+			if (args->partial_by_size || args->partial_special)
+				strbuf_addstr(&c, " partial");
 			if (agent_supported)    strbuf_addf(&c, " agent=%s",
 							    git_user_agent_sanitized());
 			packet_buf_write(&req_buf, "want %s%s\n", remote_hex, c.buf);
@@ -402,6 +404,12 @@ static int find_common(struct fetch_pack_args *args,
 			packet_buf_write(&req_buf, "deepen-not %s", s->string);
 		}
 	}
+
+	if (args->partial_by_size)
+		packet_buf_write(&req_buf, "partial-by-size %s", args->partial_by_size);
+	if (args->partial_special)
+		packet_buf_write(&req_buf, "partial-special");
+
 	packet_buf_flush(&req_buf);
 	state_len = req_buf.len;
 
@@ -807,6 +815,10 @@ static int get_pack(struct fetch_pack_args *args,
 					"--keep=fetch-pack %"PRIuMAX " on %s",
 					(uintmax_t)getpid(), hostname);
 		}
+
+		if (args->partial_by_size || args->partial_special)
+			argv_array_push(&cmd.args, "--allow-partial");
+
 		if (args->check_self_contained_and_connected)
 			argv_array_push(&cmd.args, "--check-self-contained-and-connected");
 	}
@@ -920,6 +932,11 @@ static struct ref *do_fetch_pack(struct fetch_pack_args *args,
 	else
 		prefer_ofs_delta = 0;
 
+	if (server_supports("partial"))
+		print_verbose(args, _("Server supports partial"));
+	else if (args->partial_by_size || args->partial_special)
+		die(_("Server does not support 'partial'"));
+
 	if ((agent_feature = server_feature_value("agent", &agent_len))) {
 		agent_supported = 1;
 		if (agent_len)
diff --git a/fetch-pack.h b/fetch-pack.h
index c912e3d..b8a26e0 100644
--- a/fetch-pack.h
+++ b/fetch-pack.h
@@ -12,6 +12,7 @@ struct fetch_pack_args {
 	int depth;
 	const char *deepen_since;
 	const struct string_list *deepen_not;
+	const char *partial_by_size;
 	unsigned deepen_relative:1;
 	unsigned quiet:1;
 	unsigned keep_pack:1;
@@ -29,6 +30,7 @@ struct fetch_pack_args {
 	unsigned cloning:1;
 	unsigned update_shallow:1;
 	unsigned deepen:1;
+	unsigned partial_special:1;
 };
 
 /*
diff --git a/transport.c b/transport.c
index 5828e06..45f35a4 100644
--- a/transport.c
+++ b/transport.c
@@ -160,6 +160,12 @@ static int set_git_option(struct git_transport_options *opts,
 	} else if (!strcmp(name, TRANS_OPT_DEEPEN_RELATIVE)) {
 		opts->deepen_relative = !!value;
 		return 0;
+	} else if (!strcmp(name, TRANS_OPT_PARTIAL_BY_SIZE)) {
+		opts->partial_by_size = xstrdup(value);
+		return 0;
+	} else if (!strcmp(name, TRANS_OPT_PARTIAL_SPECIAL)) {
+		opts->partial_special = !!value;
+		return 0;
 	}
 	return 1;
 }
@@ -227,6 +233,8 @@ static int fetch_refs_via_pack(struct transport *transport,
 		data->options.check_self_contained_and_connected;
 	args.cloning = transport->cloning;
 	args.update_shallow = data->options.update_shallow;
+	args.partial_by_size = data->options.partial_by_size;
+	args.partial_special = data->options.partial_special;
 
 	if (!data->got_remote_heads) {
 		connect_setup(transport, 0);
diff --git a/transport.h b/transport.h
index bc55715..c3f2d52 100644
--- a/transport.h
+++ b/transport.h
@@ -15,12 +15,14 @@ struct git_transport_options {
 	unsigned self_contained_and_connected : 1;
 	unsigned update_shallow : 1;
 	unsigned deepen_relative : 1;
+	unsigned partial_special : 1;
 	int depth;
 	const char *deepen_since;
 	const struct string_list *deepen_not;
 	const char *uploadpack;
 	const char *receivepack;
 	struct push_cas_option *cas;
+	const char *partial_by_size;
 };
 
 enum transport_family {
@@ -210,6 +212,12 @@ void transport_check_allowed(const char *type);
 /* Send push certificates */
 #define TRANS_OPT_PUSH_CERT "pushcert"
 
+/* Partial fetch to only include small files */
+#define TRANS_OPT_PARTIAL_BY_SIZE "partial-by-size"
+
+/* Partial fetch to only include special files, like ".gitignore" */
+#define TRANS_OPT_PARTIAL_SPECIAL "partial-special"
+
 /**
  * Returns 0 if the option was used, non-zero otherwise. Prints a
  * message to stderr if the option is not used.
-- 
2.7.4



* [PATCH 06/10] rev-list: add --allow-partial option to relax connectivity checks
From: git @ 2017-03-08 18:50 UTC (permalink / raw)
  To: git
  Cc: jeffhost, peff, gitster, markbt, benpeart, jonathantanmy,
	Jeff Hostetler

From: Jeff Hostetler <git@jeffhostetler.com>

Teach rev-list to optionally not complain when there are missing
blobs.  This is for use following a partial clone or fetch when
the server omitted certain blobs.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 builtin/rev-list.c | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index 0aa93d5..50c49ba 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -45,6 +45,7 @@ static const char rev_list_usage[] =
 "    --left-right\n"
 "    --count\n"
 "  special purpose:\n"
+"    --allow-partial\n"
 "    --bisect\n"
 "    --bisect-vars\n"
 "    --bisect-all"
@@ -53,6 +54,9 @@ static const char rev_list_usage[] =
 static struct progress *progress;
 static unsigned progress_counter;
 
+static int allow_partial;
+static struct trace_key trace_partial = TRACE_KEY_INIT(PARTIAL);
+
 static void finish_commit(struct commit *commit, void *data);
 static void show_commit(struct commit *commit, void *data)
 {
@@ -178,8 +182,16 @@ static void finish_commit(struct commit *commit, void *data)
 static void finish_object(struct object *obj, const char *name, void *cb_data)
 {
 	struct rev_list_info *info = cb_data;
-	if (obj->type == OBJ_BLOB && !has_object_file(&obj->oid))
+	if (obj->type == OBJ_BLOB && !has_object_file(&obj->oid)) {
+		if (allow_partial) {
+			/* Assume a previous partial clone/fetch omitted it. */
+			trace_printf_key(
+				&trace_partial, "omitted blob '%s' '%s'\n",
+				oid_to_hex(&obj->oid), name);
+			return;
+		}
 		die("missing blob object '%s'", oid_to_hex(&obj->oid));
+	}
 	if (info->revs->verify_objects && !obj->parsed && obj->type != OBJ_COMMIT)
 		parse_object(obj->oid.hash);
 }
@@ -329,6 +341,14 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix)
 			show_progress = arg;
 			continue;
 		}
+		if (!strcmp(arg, "--allow-partial")) {
+			allow_partial = 1;
+			continue;
+		}
+		if (!strcmp(arg, "--no-allow-partial")) {
+			allow_partial = 0;
+			continue;
+		}
 		usage(rev_list_usage);
 
 	}
-- 
2.7.4



* [PATCH 07/10] index-pack: add --allow-partial option to relax blob existence checks
From: git @ 2017-03-08 18:50 UTC (permalink / raw)
  To: git
  Cc: jeffhost, peff, gitster, markbt, benpeart, jonathantanmy,
	Jeff Hostetler

From: Jeff Hostetler <git@jeffhostetler.com>

Teach index-pack to optionally not complain when there are missing
blobs.  This is for use following a partial clone or fetch when
the server omitted certain blobs.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 builtin/index-pack.c | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index f4b87c6..8f99408 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -13,7 +13,7 @@
 #include "thread-utils.h"
 
 static const char index_pack_usage[] =
-"git index-pack [-v] [-o <index-file>] [--keep | --keep=<msg>] [--verify] [--strict] (<pack-file> | --stdin [--fix-thin] [<pack-file>])";
+"git index-pack [-v] [-o <index-file>] [--keep | --keep=<msg>] [--verify] [--allow-partial] [--strict] (<pack-file> | --stdin [--fix-thin] [<pack-file>])";
 
 struct object_entry {
 	struct pack_idx_entry idx;
@@ -81,6 +81,9 @@ static int show_resolving_progress;
 static int show_stat;
 static int check_self_contained_and_connected;
 
+static int allow_partial;
+static struct trace_key trace_partial = TRACE_KEY_INIT(PARTIAL);
+
 static struct progress *progress;
 
 /* We always read in 4kB chunks. */
@@ -220,9 +223,18 @@ static unsigned check_object(struct object *obj)
 	if (!(obj->flags & FLAG_CHECKED)) {
 		unsigned long size;
 		int type = sha1_object_info(obj->oid.hash, &size);
-		if (type <= 0)
+		if (type <= 0) {
+			if (allow_partial > 0 && obj->type == OBJ_BLOB) {
+				/* Assume a previous partial clone/fetch omitted it. */
+				trace_printf_key(
+					&trace_partial, "omitted blob '%s'\n",
+					oid_to_hex(&obj->oid));
+				obj->flags |= FLAG_CHECKED;
+				return 0;
+			}
 			die(_("did not receive expected object %s"),
 			      oid_to_hex(&obj->oid));
+		}
 		if (type != obj->type)
 			die(_("object %s: expected type %s, found %s"),
 			    oid_to_hex(&obj->oid),
@@ -1718,6 +1730,10 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 					die(_("bad %s"), arg);
 			} else if (skip_prefix(arg, "--max-input-size=", &arg)) {
 				max_input_size = strtoumax(arg, NULL, 10);
+			} else if (!strcmp(arg, "--allow-partial")) {
+				allow_partial = 1;
+			} else if (!strcmp(arg, "--no-allow-partial")) {
+				allow_partial = 0;
 			} else
 				usage(index_pack_usage);
 			continue;
-- 
2.7.4



* [PATCH 08/10] fetch: add partial-by-size and partial-special arguments
From: git @ 2017-03-08 18:50 UTC (permalink / raw)
  To: git
  Cc: jeffhost, peff, gitster, markbt, benpeart, jonathantanmy,
	Jeff Hostetler

From: Jeff Hostetler <git@jeffhostetler.com>

Teach fetch to accept --partial-by-size=n and --partial-special
arguments and pass them to fetch-pack to request that the
server omit certain blobs from the generated packfile.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 builtin/fetch.c | 26 +++++++++++++++++++++++++-
 connected.c     |  3 +++
 connected.h     |  3 +++
 3 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/builtin/fetch.c b/builtin/fetch.c
index b5ad09d..3d47107 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -52,6 +52,8 @@ static const char *recurse_submodules_default;
 static int shown_url = 0;
 static int refmap_alloc, refmap_nr;
 static const char **refmap_array;
+static const char *partial_by_size;
+static int partial_special;
 
 static int option_parse_recurse_submodules(const struct option *opt,
 				   const char *arg, int unset)
@@ -141,6 +143,11 @@ static struct option builtin_fetch_options[] = {
 			TRANSPORT_FAMILY_IPV4),
 	OPT_SET_INT('6', "ipv6", &family, N_("use IPv6 addresses only"),
 			TRANSPORT_FAMILY_IPV6),
+	OPT_STRING(0, "partial-by-size", &partial_by_size,
+			   N_("size"),
+			   N_("only include blobs smaller than this")),
+	OPT_BOOL(0, "partial-special", &partial_special,
+			 N_("only include blobs for git special files")),
 	OPT_END()
 };
 
@@ -731,6 +738,10 @@ static int store_updated_refs(const char *raw_url, const char *remote_name,
 	const char *filename = dry_run ? "/dev/null" : git_path_fetch_head();
 	int want_status;
 	int summary_width = transport_summary_width(ref_map);
+	struct check_connected_options opt = CHECK_CONNECTED_INIT;
+
+	if (partial_by_size || partial_special)
+		opt.allow_partial = 1;
 
 	fp = fopen(filename, "a");
 	if (!fp)
@@ -742,7 +753,7 @@ static int store_updated_refs(const char *raw_url, const char *remote_name,
 		url = xstrdup("foreign");
 
 	rm = ref_map;
-	if (check_connected(iterate_ref_map, &rm, NULL)) {
+	if (check_connected(iterate_ref_map, &rm, &opt)) {
 		rc = error(_("%s did not send all necessary objects\n"), url);
 		goto abort;
 	}
@@ -882,6 +893,9 @@ static int quickfetch(struct ref *ref_map)
 	struct ref *rm = ref_map;
 	struct check_connected_options opt = CHECK_CONNECTED_INIT;
 
+	if (partial_by_size || partial_special)
+		opt.allow_partial = 1;
+
 	/*
 	 * If we are deepening a shallow clone we already have these
 	 * objects reachable.  Running rev-list here will return with
@@ -1020,6 +1034,10 @@ static struct transport *prepare_transport(struct remote *remote, int deepen)
 		set_option(transport, TRANS_OPT_DEEPEN_RELATIVE, "yes");
 	if (update_shallow)
 		set_option(transport, TRANS_OPT_UPDATE_SHALLOW, "yes");
+	if (partial_by_size)
+		set_option(transport, TRANS_OPT_PARTIAL_BY_SIZE, partial_by_size);
+	if (partial_special)
+		set_option(transport, TRANS_OPT_PARTIAL_SPECIAL, "yes");
 	return transport;
 }
 
@@ -1314,6 +1332,12 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
 	argc = parse_options(argc, argv, prefix,
 			     builtin_fetch_options, builtin_fetch_usage, 0);
 
+	if (partial_by_size) {
+		unsigned long s;
+		if (!git_parse_ulong(partial_by_size, &s))
+			die(_("invalid partial-by-size value"));
+	}
+
 	if (deepen_relative) {
 		if (deepen_relative < 0)
 			die(_("Negative depth in --deepen is not supported"));
diff --git a/connected.c b/connected.c
index 136c2ac..b07cbb5 100644
--- a/connected.c
+++ b/connected.c
@@ -62,6 +62,9 @@ int check_connected(sha1_iterate_fn fn, void *cb_data,
 		argv_array_pushf(&rev_list.args, "--progress=%s",
 				 _("Checking connectivity"));
 
+	if (opt->allow_partial)
+		argv_array_push(&rev_list.args, "--allow-partial");
+
 	rev_list.git_cmd = 1;
 	rev_list.env = opt->env;
 	rev_list.in = -1;
diff --git a/connected.h b/connected.h
index 4ca325f..756259e 100644
--- a/connected.h
+++ b/connected.h
@@ -34,6 +34,9 @@ struct check_connected_options {
 	/* If non-zero, show progress as we traverse the objects. */
 	int progress;
 
+	/* A previous partial clone/fetch may have omitted some blobs. */
+	int allow_partial;
+
 	/*
 	 * Insert these variables into the environment of the child process.
 	 */
-- 
2.7.4



* [PATCH 09/10] clone: add partial-by-size and partial-special arguments
From: git @ 2017-03-08 18:50 UTC (permalink / raw)
  To: git
  Cc: jeffhost, peff, gitster, markbt, benpeart, jonathantanmy,
	Jeff Hostetler

From: Jeff Hostetler <git@jeffhostetler.com>

Teach clone to accept --partial-by-size=n and --partial-special
arguments to request that the server omit certain blobs from
the generated packfile.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 builtin/clone.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/builtin/clone.c b/builtin/clone.c
index 3f63edb..e5a5904 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -56,6 +56,8 @@ static struct string_list option_required_reference = STRING_LIST_INIT_NODUP;
 static struct string_list option_optional_reference = STRING_LIST_INIT_NODUP;
 static int option_dissociate;
 static int max_jobs = -1;
+static const char *partial_by_size;
+static int partial_special;
 
 static struct option builtin_clone_options[] = {
 	OPT__VERBOSITY(&option_verbosity),
@@ -112,6 +114,11 @@ static struct option builtin_clone_options[] = {
 			TRANSPORT_FAMILY_IPV4),
 	OPT_SET_INT('6', "ipv6", &family, N_("use IPv6 addresses only"),
 			TRANSPORT_FAMILY_IPV6),
+	OPT_STRING(0, "partial-by-size", &partial_by_size,
+			   N_("size"),
+			   N_("only include blobs smaller than this")),
+	OPT_BOOL(0, "partial-special", &partial_special,
+			 N_("only include blobs for git special files")),
 	OPT_END()
 };
 
@@ -625,6 +632,9 @@ static void update_remote_refs(const struct ref *refs,
 	if (check_connectivity) {
 		struct check_connected_options opt = CHECK_CONNECTED_INIT;
 
+		if (partial_by_size || partial_special)
+			opt.allow_partial = 1;
+
 		opt.transport = transport;
 		opt.progress = transport->progress;
 
@@ -1021,6 +1031,10 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 			warning(_("--shallow-since is ignored in local clones; use file:// instead."));
 		if (option_not.nr)
 			warning(_("--shallow-exclude is ignored in local clones; use file:// instead."));
+		if (partial_by_size)
+			warning(_("--partial-by-size is ignored in local clones; use file:// instead."));
+		if (partial_special)
+			warning(_("--partial-special is ignored in local clones; use file:// instead."));
 		if (!access(mkpath("%s/shallow", path), F_OK)) {
 			if (option_local > 0)
 				warning(_("source repository is shallow, ignoring --local"));
@@ -1052,6 +1066,18 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 		transport_set_option(transport, TRANS_OPT_UPLOADPACK,
 				     option_upload_pack);
 
+	if (partial_by_size) {
+		transport_set_option(transport, TRANS_OPT_PARTIAL_BY_SIZE,
+				     partial_by_size);
+		if (transport->smart_options)
+			transport->smart_options->partial_by_size = partial_by_size;
+	}
+	if (partial_special) {
+		transport_set_option(transport, TRANS_OPT_PARTIAL_SPECIAL, "yes");
+		if (transport->smart_options)
+			transport->smart_options->partial_special = 1;
+	}
+
 	if (transport->smart_options && !deepen)
 		transport->smart_options->check_self_contained_and_connected = 1;
 
-- 
2.7.4



* [PATCH 10/10] ls-partial: created command to list missing blobs
From: git @ 2017-03-08 18:50 UTC (permalink / raw)
  To: git
  Cc: jeffhost, peff, gitster, markbt, benpeart, jonathantanmy,
	Jeff Hostetler

From: Jeff Hostetler <git@jeffhostetler.com>

Add a command to list the missing blobs for a commit.
This can be used after a partial clone or fetch to list
the omitted blobs that the client would need to check out
the given commit/branch, optionally respecting or ignoring
the current sparse-checkout definition.

This command prints a simple list of blob SHAs.  It is
expected that this would be piped into another command
with knowledge of the transport and/or blob store.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 Makefile             |   2 +
 builtin.h            |   1 +
 builtin/ls-partial.c | 110 ++++++++++++++++++++
 git.c                |   1 +
 partial-utils.c      | 279 +++++++++++++++++++++++++++++++++++++++++++++++++++
 partial-utils.h      |  93 +++++++++++++++++
 6 files changed, 486 insertions(+)
 create mode 100644 builtin/ls-partial.c
 create mode 100644 partial-utils.c
 create mode 100644 partial-utils.h

diff --git a/Makefile b/Makefile
index 9ec6065..96e9e1e 100644
--- a/Makefile
+++ b/Makefile
@@ -791,6 +791,7 @@ LIB_OBJS += pack-write.o
 LIB_OBJS += pager.o
 LIB_OBJS += parse-options.o
 LIB_OBJS += parse-options-cb.o
+LIB_OBJS += partial-utils.o
 LIB_OBJS += patch-delta.o
 LIB_OBJS += patch-ids.o
 LIB_OBJS += path.o
@@ -908,6 +909,7 @@ BUILTIN_OBJS += builtin/init-db.o
 BUILTIN_OBJS += builtin/interpret-trailers.o
 BUILTIN_OBJS += builtin/log.o
 BUILTIN_OBJS += builtin/ls-files.o
+BUILTIN_OBJS += builtin/ls-partial.o
 BUILTIN_OBJS += builtin/ls-remote.o
 BUILTIN_OBJS += builtin/ls-tree.o
 BUILTIN_OBJS += builtin/mailinfo.o
diff --git a/builtin.h b/builtin.h
index 9e4a898..df00c4b 100644
--- a/builtin.h
+++ b/builtin.h
@@ -79,6 +79,7 @@ extern int cmd_interpret_trailers(int argc, const char **argv, const char *prefi
 extern int cmd_log(int argc, const char **argv, const char *prefix);
 extern int cmd_log_reflog(int argc, const char **argv, const char *prefix);
 extern int cmd_ls_files(int argc, const char **argv, const char *prefix);
+extern int cmd_ls_partial(int argc, const char **argv, const char *prefix);
 extern int cmd_ls_tree(int argc, const char **argv, const char *prefix);
 extern int cmd_ls_remote(int argc, const char **argv, const char *prefix);
 extern int cmd_mailinfo(int argc, const char **argv, const char *prefix);
diff --git a/builtin/ls-partial.c b/builtin/ls-partial.c
new file mode 100644
index 0000000..8ebf045
--- /dev/null
+++ b/builtin/ls-partial.c
@@ -0,0 +1,110 @@
+#include "cache.h"
+#include "blob.h"
+#include "tree.h"
+#include "commit.h"
+#include "quote.h"
+#include "builtin.h"
+#include "parse-options.h"
+#include "pathspec.h"
+#include "dir.h"
+#include "partial-utils.h"
+
+static struct trace_key trace_partial = TRACE_KEY_INIT(PARTIAL);
+
+static int verbose;
+static int ignore_sparse;
+
+
+static const char * const ls_partial_usage[] = {
+	N_("git ls-partial [<options>] <tree-ish>"),
+	NULL
+};
+
+/*
+ * map <tree-ish> arg into SHA1 and get the root treenode.
+ */
+static struct tree *lookup_tree_from_treeish(const char *arg)
+{
+	unsigned char sha1[20];
+	struct tree *tree;
+
+	if (get_sha1(arg, sha1))
+		die("not a valid object name '%s'", arg);
+
+	trace_printf_key(
+		&trace_partial,
+		"ls-partial: treeish '%s' '%s'\n",
+		arg, sha1_to_hex(sha1));
+
+	if (verbose) {
+		printf("commit\t%s\n", sha1_to_hex(sha1));
+		printf("branch\t%s\n", arg);
+	}
+	
+	tree = parse_tree_indirect(sha1);
+	if (!tree)
+		die("not a tree object '%s'", arg);
+
+	return tree;
+}
+
+static void print_results(const struct pu_vec *vec)
+{
+	int k;
+
+	for (k = 0; k < vec->data_nr; k++)
+		printf("%s\n", oid_to_hex(&vec->data[k]->oid));
+}
+
+static void print_results_verbose(const struct pu_vec *vec)
+{
+	int k;
+
+	/* TODO Consider -z version */
+
+	for (k = 0; k < vec->data_nr; k++)
+		printf("%s\t%s\n", oid_to_hex(&vec->data[k]->oid), vec->data[k]->fullpath.buf);
+}
+
+int cmd_ls_partial(int argc, const char **argv, const char *prefix)
+{
+	struct exclude_list el;
+	struct tree *tree;
+	struct pu_vec *vec;
+	struct pu_vec *vec_all = NULL;
+	struct pu_vec *vec_sparse = NULL;
+	struct pu_vec *vec_missing = NULL;
+	
+	const struct option ls_partial_options[] = {
+		OPT__VERBOSE(&verbose, N_("show verbose blob details")),
+		OPT_BOOL(0, "ignore-sparse", &ignore_sparse,
+				 N_("ignore sparse-checkout settings (scan whole tree)")),
+		OPT_END()
+	};
+
+	git_config(git_default_config, NULL);
+	argc = parse_options(argc, argv, prefix,
+						 ls_partial_options, ls_partial_usage, 0);
+	if (argc < 1)
+		usage_with_options(ls_partial_usage, ls_partial_options);
+
+	tree = lookup_tree_from_treeish(argv[0]);
+
+	vec_all = pu_vec_ls_tree(tree, prefix, argv + 1);
+	if (ignore_sparse || pu_load_sparse_definitions("info/sparse-checkout", &el) < 0)
+		vec = vec_all;
+	else {
+		vec_sparse = pu_vec_filter_sparse(vec_all, &el);
+		vec = vec_sparse;
+	}
+
+	vec_missing = pu_vec_filter_missing(vec);
+	vec = vec_missing;
+
+	if (verbose)
+		print_results_verbose(vec);
+	else
+		print_results(vec);
+
+	return 0;
+}
diff --git a/git.c b/git.c
index 33f52ac..ef1e019 100644
--- a/git.c
+++ b/git.c
@@ -444,6 +444,7 @@ static struct cmd_struct commands[] = {
 	{ "interpret-trailers", cmd_interpret_trailers, RUN_SETUP_GENTLY },
 	{ "log", cmd_log, RUN_SETUP },
 	{ "ls-files", cmd_ls_files, RUN_SETUP | SUPPORT_SUPER_PREFIX },
+	{ "ls-partial", cmd_ls_partial, RUN_SETUP },
 	{ "ls-remote", cmd_ls_remote, RUN_SETUP_GENTLY },
 	{ "ls-tree", cmd_ls_tree, RUN_SETUP },
 	{ "mailinfo", cmd_mailinfo, RUN_SETUP_GENTLY },
diff --git a/partial-utils.c b/partial-utils.c
new file mode 100644
index 0000000..b75e91e
--- /dev/null
+++ b/partial-utils.c
@@ -0,0 +1,279 @@
+#include "cache.h"
+#include "blob.h"
+#include "tree.h"
+#include "commit.h"
+#include "quote.h"
+#include "builtin.h"
+#include "parse-options.h"
+#include "pathspec.h"
+#include "dir.h"
+#include "partial-utils.h"
+
+static struct trace_key trace_partial_utils = TRACE_KEY_INIT(PARTIAL_UTILS);
+
+void pu_row_trace(
+	const struct pu_row *row,
+	const char *label)
+{
+	trace_printf_key(
+		&trace_partial_utils,
+		"%s: %06o %s %.*s\n",
+		label,
+		row->mode,
+		oid_to_hex(&row->oid),
+		(int)row->fullpath.len,
+		row->fullpath.buf);
+}
+
+struct pu_row *pu_row_alloc(
+	const unsigned char *sha1,
+	const struct strbuf *base,
+	const char *entryname,
+	unsigned mode)
+{
+	struct pu_row *row = xcalloc(1, sizeof(struct pu_row));
+
+	hashcpy(row->oid.hash, sha1);
+	strbuf_init(&row->fullpath, base->len + strlen(entryname) + 1);
+	if (base->len)
+		strbuf_addbuf(&row->fullpath, base);
+	strbuf_addstr(&row->fullpath, entryname);
+	row->mode = mode;
+	row->entryname_offset = base->len;
+
+	pu_row_trace(row, "alloc");
+
+	return row;
+}
+
+struct pu_vec *pu_vec_alloc(
+	unsigned int nr_pre_alloc)
+{
+	struct pu_vec *vec = xcalloc(1, sizeof(struct pu_vec));
+
+	vec->data = xcalloc(nr_pre_alloc, sizeof(struct pu_row *));
+	vec->data_alloc = nr_pre_alloc;
+
+	return vec;
+}
+
+void pu_vec_append(
+	struct pu_vec *vec,
+	struct pu_row *row)
+{
+	ALLOC_GROW(vec->data, vec->data_nr + 1, vec->data_alloc);
+	vec->data[vec->data_nr++] = row;
+}
+
+static int ls_tree_cb(
+	const unsigned char *sha1,
+	struct strbuf *base,
+	const char *pathname,
+	unsigned mode,
+	int stage,
+	void *context)
+{
+	struct pu_vec *vec = (struct pu_vec *)context;
+
+	/* omit submodules */
+	if (S_ISGITLINK(mode))
+		return 0;
+
+	pu_vec_append(vec, pu_row_alloc(sha1, base, pathname, mode));
+
+	if (S_ISDIR(mode))
+		return READ_TREE_RECURSIVE;
+
+	return 0;
+}
+
+struct pu_vec *pu_vec_ls_tree(
+	struct tree *tree,
+	const char *prefix,
+	const char **argv)
+{
+	struct pu_vec *vec;
+	struct pathspec pathspec;
+	int k;
+
+	vec = pu_vec_alloc(PU_VEC_DEFAULT_SIZE);
+
+	parse_pathspec(
+		&pathspec, PATHSPEC_GLOB | PATHSPEC_ICASE | PATHSPEC_EXCLUDE,
+		PATHSPEC_PREFER_CWD, prefix, argv);
+	for (k = 0; k < pathspec.nr; k++)
+		pathspec.items[k].nowildcard_len = pathspec.items[k].len;
+	pathspec.has_wildcard = 0;
+
+	if (read_tree_recursive(tree, "", 0, 0, &pathspec, ls_tree_cb, vec) != 0)
+		die("Could not read tree");
+
+	return vec;
+}
+
+int pu_load_sparse_definitions(
+	const char *path,
+	struct exclude_list *pel)
+{
+	int result;
+	char *sparse = git_pathdup("%s", path);
+	memset(pel, 0, sizeof(*pel));
+	result = add_excludes_from_file_to_list(sparse, "", 0, pel, 0);
+	free(sparse);
+	return result;
+}
+
+static int mode_to_dtype(unsigned mode)
+{
+	if (S_ISREG(mode))
+		return DT_REG;
+	if (S_ISDIR(mode) || S_ISGITLINK(mode))
+		return DT_DIR;
+	if (S_ISLNK(mode))
+		return DT_LNK;
+	return DT_UNKNOWN;
+}
+
+static int apply_excludes_1(
+	struct pu_row **subset,
+	unsigned int nr,
+	struct strbuf *prefix,
+	struct exclude_list *pel,
+	int defval,
+	struct pu_vec *vec_out);
+
+/* apply directory rules. based on clear_ce_flags_dir() */
+static int apply_excludes_dir(
+	struct pu_row **subset,
+	unsigned int nr,
+	struct strbuf *prefix,
+	char *basename,
+	struct exclude_list *pel,
+	int defval,
+	struct pu_vec *vec_out)
+{
+	struct pu_row **subset_end;
+	int dtype = DT_DIR;
+	int ret = is_excluded_from_list(
+		prefix->buf, prefix->len, basename, &dtype, pel);
+	int rc;
+
+	strbuf_addch(prefix, '/');
+
+	if (ret < 0)
+		ret = defval;
+
+	for (subset_end = subset; subset_end != subset + nr; subset_end++) {
+		struct pu_row *row = *subset_end;
+		if (strncmp(row->fullpath.buf, prefix->buf, prefix->len))
+			break;
+	}
+
+	rc = apply_excludes_1(
+		subset, subset_end - subset,
+		prefix, pel, ret,
+		vec_out);
+	strbuf_setlen(prefix, prefix->len - 1);
+	return rc;
+}
+
+/* apply sparse rules to subset[0..nr). based on clear_ce_flags_1() */
+static int apply_excludes_1(
+	struct pu_row **subset,
+	unsigned int nr,
+	struct strbuf *prefix,
+	struct exclude_list *pel,
+	int defval,
+	struct pu_vec *vec_out)
+{
+	struct pu_row **subset_end = subset + nr;
+
+	while (subset != subset_end) {
+		struct pu_row *row = *subset;
+		const char *name, *slash;
+		int len, dtype, val;
+
+		if (prefix->len && strncmp(row->fullpath.buf, prefix->buf, prefix->len))
+			break;
+
+		name = row->fullpath.buf + prefix->len;
+		slash = strchr(name, '/');
+
+		if (slash) {
+			int processed;
+
+			len = slash - name;
+			strbuf_add(prefix, name, len);
+
+			processed = apply_excludes_dir(
+				subset, subset_end - subset,
+				prefix, prefix->buf + prefix->len - len,
+				pel, defval,
+				vec_out);
+
+			if (processed) {
+				subset += processed;
+				strbuf_setlen(prefix, prefix->len - len);
+				continue;
+			}
+
+			strbuf_addch(prefix, '/');
+			subset += apply_excludes_1(
+				subset, subset_end - subset,
+				prefix, pel, defval,
+				vec_out);
+			strbuf_setlen(prefix, prefix->len - len - 1);
+			continue;
+		}
+
+		dtype = mode_to_dtype(row->mode);
+		val = is_excluded_from_list(
+			row->fullpath.buf, row->fullpath.len, name, &dtype, pel);
+		if (val < 0)
+			val = defval;
+		if (val > 0) {
+			pu_row_trace(row, "sparse");
+			pu_vec_append(vec_out, row);
+		}
+		subset++;
+	}
+
+	return nr - (subset_end - subset);
+}
+
+struct pu_vec *pu_vec_filter_sparse(
+	const struct pu_vec *vec_in,
+	struct exclude_list *pel)
+{
+	struct pu_vec *vec_out;
+	struct strbuf prefix = STRBUF_INIT;
+	int defval = 0;
+
+	vec_out = pu_vec_alloc(vec_in->data_nr);
+
+	apply_excludes_1(
+		vec_in->data, vec_in->data_nr,
+		&prefix, pel, defval,
+		vec_out);
+
+	return vec_out;
+}
+
+struct pu_vec *pu_vec_filter_missing(
+	const struct pu_vec *vec_in)
+{
+	struct pu_vec *vec_out;
+	int k;
+
+	vec_out = pu_vec_alloc(vec_in->data_nr);
+
+	for (k = 0; k < vec_in->data_nr; k++) {
+		struct pu_row *row = vec_in->data[k];
+		if (!has_sha1_file(row->oid.hash)) {
+			pu_row_trace(row, "missing");
+			pu_vec_append(vec_out, row);
+		}
+	}
+
+	return vec_out;
+}
diff --git a/partial-utils.h b/partial-utils.h
new file mode 100644
index 0000000..3bdf2e4
--- /dev/null
+++ b/partial-utils.h
@@ -0,0 +1,93 @@
+#ifndef PARTIAL_UTILS_H
+#define PARTIAL_UTILS_H
+
+/*
+ * A 'partial-utils row' represents a single item in the tree.
+ * This is conceptually equivalent to a cache_entry, but does
+ * not require an index_state and lets us operate on any commit
+ * and not be tied to the current worktree.
+ */
+struct pu_row
+{
+	struct strbuf fullpath;
+	struct object_id oid;
+	unsigned mode;
+	unsigned entryname_offset;
+};
+
+/*
+ * A 'partial-utils vec' represents a vector of 'pu row'
+ * values using the normal vector machinery.
+ */
+struct pu_vec
+{
+	struct pu_row **data;
+	unsigned int data_nr;
+	unsigned int data_alloc;
+};
+
+#define PU_VEC_DEFAULT_SIZE (1024*1024)
+
+
+void pu_row_trace(
+	const struct pu_row *row,
+	const char *label);
+
+struct pu_row *pu_row_alloc(
+	const unsigned char *sha1,
+	const struct strbuf *base,
+	const char *entryname,
+	unsigned mode);
+
+struct pu_vec *pu_vec_alloc(
+	unsigned int nr_pre_alloc);
+
+/*
+ * Append the given row onto the vector WITHOUT
+ * assuming ownership of the pointer.
+ */
+void pu_vec_append(
+	struct pu_vec *vec,
+	struct pu_row *row);
+
+/*
+ * Enumerate the contents of the tree (recursively) into
+ * a vector of rows.  This is essentially "ls-tree -r -t"
+ * into a vector.
+ */
+struct pu_vec *pu_vec_ls_tree(
+	struct tree *tree,
+	const char *prefix,
+	const char **argv);
+
+/*
+ * Load a sparse-checkout file into (*pel).
+ * Returns -1 if none or error.
+ */
+int pu_load_sparse_definitions(
+	const char *path,
+	struct exclude_list *pel);
+
+/*
+ * Filter the given vector using the sparse-checkout
+ * definitions and return new vector of just the paths
+ * that WOULD BE populated.
+ *
+ * The returned vector BORROWS rows from the input vector.
+ *
+ * This is loosely based upon clear_ce_flags() in unpack-trees.c
+ */
+struct pu_vec *pu_vec_filter_sparse(
+	const struct pu_vec *vec_in,
+	struct exclude_list *pel);
+
+/*
+ * Filter the given vector and return the list of blobs
+ * missing from the local ODB.
+ *
+ * The returned vector BORROWS rows from the input vector.
+ */
+struct pu_vec *pu_vec_filter_missing(
+	const struct pu_vec *vec_in);
+
+#endif /* PARTIAL_UTILS_H */
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH 00/10] RFC Partial Clone and Fetch
  2017-03-08 18:50 [PATCH 00/10] RFC Partial Clone and Fetch git
                   ` (9 preceding siblings ...)
  2017-03-08 18:50 ` [PATCH 10/10] ls-partial: created command to list missing blobs git
@ 2017-03-09 20:18 ` Jonathan Tan
  2017-03-16 21:43   ` Jeff Hostetler
  2017-03-22 15:16 ` ankostis
  2017-05-03 16:38 ` Jeff Hostetler
  12 siblings, 1 reply; 25+ messages in thread
From: Jonathan Tan @ 2017-03-09 20:18 UTC (permalink / raw)
  To: git, git; +Cc: jeffhost, peff, gitster, markbt, benpeart

Overall, this fetch/clone approach seems reasonable to me, except 
perhaps some unanswered questions (some of which are also being 
discussed elsewhere):
  - does the server need to tell us of missing blobs?
  - if yes, does the server need to tell us their file sizes?
  - do we need to store the list of missing blobs somewhere (whether the
    server told it to us or whether we inferred it from the fetched
    trees)?

The answers to this probably depend on the answers in "B. Issues 
Backfilling Omitted Blobs" (especially the additional concepts I listed 
below).

Also, do you have any plans to implement other functionality, e.g. "git 
checkout" (which will allow fetches and clones to repositories with a 
working directory)? (I don't know what the mailing list consensus would 
be for the "acceptance criteria" for this patch set, but I would at 
least include "checkout".)

On 03/08/2017 10:50 AM, git@jeffhostetler.com wrote:
> B. Issues Backfilling Omitted Blobs
> ===================================
>
> Ideally, if the client only does "--partial-by-profile" fetches, it
> should not need to fetch individual missing blobs, but we have to allow
> for it to handle the other commands and other unexpected issues.
>
> There are 3 orthogonal concepts here:  when, how and where?

Another concept is "how to determine if a blob is really omitted" - do 
we store a list somewhere or do we assume that all missing blobs are 
purposely omitted (like in this patch set)?

Yet another concept is "whether to fetch" - for example, a checkout 
should almost certainly fetch, but a rev-list used by a connectivity 
check (like in patch 6 of this set) should not.

For example, for historical-blob-searching commands like "git log -S", 
should we:
  a) fetch everything missing (so users should use date-limiting
     arguments)
  b) fetch nothing missing
  c) use the file size to automatically exclude big files, but fetch
     everything else

For a) and b), we wouldn't need file size information for missing blobs, 
but for c), we do. This might determine if we need file size information 
in the fetch-pack/upload-pack protocol.

> C. New Blob-Fetch Protocol (2a)
> ===============================
>
> *TODO* A new pair of commands, such as fetch-blob-pack and upload-blob-pack,
> will be created to let the client request a batch of blobs and receive a
> packfile.  A protocol similar to the fetch-pack/upload-pack will be spoken
> between them.  (This avoids complicating the existing protocol and the work
> of enumerating the refs.)  Upload-blob-pack will use pack-objects to build
> the packfile.
>
> It is also more efficient than requesting a single blob at a time using
> the existing fetch-pack/upload-pack mechanism (with the various allow
> unreachable options).
>
> *TODO* The new request protocol will be defined in the patch series.
> It will include: a list of the desired blob SHAs.  Possibly also the commit
> SHA, branch name, and pathname of each blob (or whatever is necessary to let
> the server address the reachability concerns).  Possibly also the last
> known SHA for each blob to allow for deltafication in the packfile.

Context (like the commit SHA-1) would help in reachability checks, but 
I'm not sure if we can rely on that. It is true that I can't think of a 
way that the client would dissociate a blob that is missing from its 
tree or commit (because it would first need to "fault-in" that blob to 
do its operation). But clients operating on non-contextual SHA-1s (e.g. 
"git cat-file") and servers manipulating commits (so that the commit 
SHA-1 that the client had in its context is no longer reachable) are not 
uncommon, I think.

Having said that, it might be useful to include the context in the 
protocol anyway as an optional "hint".

I'm not sure what you mean by "last known SHA for each blob".

(If we do store the file size of a blob somewhere, we could also store 
some context there. I'm not sure how useful this is, though.)

> E. Unresolved Thoughts
> ======================

<snip>

> *TODO* The partial clone arguments should be recorded in ".git/info/"
> so that subsequent fetch commands can inherit them and rev-list/index-pack
> know to not complain by default.
>
> *TODO* Update GC like rev-list to not complain when there are missing blobs.

These 2 points would be part of "whether to fetch" above.


* Re: [PATCH 00/10] RFC Partial Clone and Fetch
  2017-03-09 20:18 ` [PATCH 00/10] RFC Partial Clone and Fetch Jonathan Tan
@ 2017-03-16 21:43   ` Jeff Hostetler
  2017-03-17 14:13     ` Jeff Hostetler
  0 siblings, 1 reply; 25+ messages in thread
From: Jeff Hostetler @ 2017-03-16 21:43 UTC (permalink / raw)
  To: Jonathan Tan, git; +Cc: jeffhost, peff, gitster, markbt, benpeart



On 3/9/2017 3:18 PM, Jonathan Tan wrote:
> Overall, this fetch/clone approach seems reasonable to me, except
> perhaps some unanswered questions (some of which are also being
> discussed elsewhere):
>  - does the server need to tell us of missing blobs?
>  - if yes, does the server need to tell us their file sizes?

File sizes are a nice addition.  For example, with a virtual
file system, a "ls -l" can lie and tell you the sizes of the
yet-to-be-populated files.  Or if the client wants to distinguish
between going back to the original remote or going to S3 for the
blob, it could use the size to choose.  (I'm not saying we actually
build that yet, but others on the mailing list have spoken about
parking large blobs in S3.)

So, not necessary, but might be nice to have.

>  - do we need to store the list of missing blobs somewhere (whether the
>    server told it to us or whether we inferred it from the fetched
>    trees)

We should be able to infer the list of missing blobs; I hadn't
considered that.  However, by doing so we will need to disable
some of the integrity checking (as I had to do with the
"--allow-partial" option) and some concerns about that were
discussed earlier in the thread.  But if we do that inference
during clone/fetch and record it somewhere, we could get back
the integrity checking.

>
> The answers to this probably depend on the answers in "B. Issues
> Backfilling Omitted Blobs" (especially the additional concepts I listed
> below).
>
> Also, do you have any plans to implement other functionality, e.g. "git
> checkout" (which will allow fetches and clones to repositories with a
> working directory)? (I don't know what the mailing list consensus would
> be for the "acceptance criteria" for this patch set, but I would at
> least include "checkout".)

Yes, supporting "checkout" is essential. Commands like "merge", "diff",
etc. will come later.  In Ben's RFC, he has been investigating
demand-loading blobs in read_object().  I've been focusing on
pre-fetching the missing blobs for a particular command.  I need
to make more progress on this topic.

>
> On 03/08/2017 10:50 AM, git@jeffhostetler.com wrote:
>> B. Issues Backfilling Omitted Blobs
>> ===================================
>>
>> Ideally, if the client only does "--partial-by-profile" fetches, it
>> should not need to fetch individual missing blobs, but we have to allow
>> for it to handle the other commands and other unexpected issues.
>>
>> There are 3 orthogonal concepts here:  when, how and where?
>
> Another concept is "how to determine if a blob is really omitted" - do
> we store a list somewhere or do we assume that all missing blobs are
> purposely omitted (like in this patch set)?
>
> Yet another concept is "whether to fetch" - for example, a checkout
> should almost certainly fetch, but a rev-list used by a connectivity
> check (like in patch 6 of this set) should not.
>
> For example, for historical-blob-searching commands like "git log -S",
> should we:
>  a) fetch everything missing (so users should use date-limiting
>     arguments)
>  b) fetch nothing missing
>  c) use the file size to automatically exclude big files, but fetch
>     everything else
>
> For a) and b), we wouldn't need file size information for missing blobs,
> but for c), we do. This might determine if we need file size information
> in the fetch-pack/upload-pack protocol.

good points.

>
>> C. New Blob-Fetch Protocol (2a)
>> ===============================
>>
>> *TODO* A new pair of commands, such as fetch-blob-pack and
>> upload-blob-pack,
>> will be created to let the client request a batch of blobs and receive a
>> packfile.  A protocol similar to the fetch-pack/upload-pack will be
>> spoken
>> between them.  (This avoids complicating the existing protocol and the
>> work
>> of enumerating the refs.)  Upload-blob-pack will use pack-objects to
>> build
>> the packfile.
>>
>> It is also more efficient than requesting a single blob at a time using
>> the existing fetch-pack/upload-pack mechanism (with the various allow
>> unreachable options).
>>
>> *TODO* The new request protocol will be defined in the patch series.
>> It will include: a list of the desired blob SHAs.  Possibly also the
>> commit
>> SHA, branch name, and pathname of each blob (or whatever is necessary
>> to let
>> the server address the reachability concerns).  Possibly also the last
>> known SHA for each blob to allow for deltafication in the packfile.
>
> Context (like the commit SHA-1) would help in reachability checks, but
> I'm not sure if we can rely on that. It is true that I can't think of a
> way that the client would dissociate a blob that is missing from its
> tree or commit (because it would first need to "fault-in" that blob to
> do its operation). But clients operating on non-contextual SHA-1s (e.g.
> "git cat-file") and servers manipulating commits (so that the commit
> SHA-1 that the client had in its context is no longer reachable) are not
> uncommon, I think.
>
> Having said that, it might be useful to include the context in the
> protocol anyway as an optional "hint".

That is what I was thinking. A hint of the branch or commit the
client is referencing (wants to checkout, for example).  Frankly,
I'm not sure how secure some of the proposed/existing reachability
constraints are, so I'm hedging here a bit.

>
> I'm not sure what you mean by "last known SHA for each blob".

What I meant was: if I'm missing the blob for a particular file
in the commit I want to checkout, but I have the blob for the
same file in a recent ancestor commit, I could send the SHA-1 of
the file in the parent commit. The server would be free to deltafy
the missing blob relative to it, rather than sending the entire
blob.

>
> (If we do store the file size of a blob somewhere, we could also store
> some context there. I'm not sure how useful this is, though.)
>
>> E. Unresolved Thoughts
>> ======================
>
> <snip>
>
>> *TODO* The partial clone arguments should be recorded in ".git/info/"
>> so that subsequent fetch commands can inherit them and
>> rev-list/index-pack
>> know to not complain by default.
>>
>> *TODO* Update GC like rev-list to not complain when there are missing
>> blobs.
>
> These 2 points would be part of "whether to fetch" above.

Thanks for all of the comments.
Jeff


* Re: [PATCH 00/10] RFC Partial Clone and Fetch
  2017-03-16 21:43   ` Jeff Hostetler
@ 2017-03-17 14:13     ` Jeff Hostetler
  0 siblings, 0 replies; 25+ messages in thread
From: Jeff Hostetler @ 2017-03-17 14:13 UTC (permalink / raw)
  To: Jonathan Tan, git; +Cc: jeffhost, peff, gitster, markbt, benpeart



On 3/16/2017 5:43 PM, Jeff Hostetler wrote:
>
>
> On 3/9/2017 3:18 PM, Jonathan Tan wrote:
>> Overall, this fetch/clone approach seems reasonable to me, except
>> perhaps some unanswered questions (some of which are also being
>> discussed elsewhere):
>>  - does the server need to tell us of missing blobs?
>>  - if yes, does the server need to tell us their file sizes?
>
> File sizes are a nice addition.  For example, with a virtual
> file system, a "ls -l" can lie and tell you the sizes of the
> yet-to-be-populated files.

Never mind the "ls -l" case; I forgot about the need for the
client to display the size of the (possibly) smudged file,
rather than the actual blob size.


* Re: [PATCH 00/10] RFC Partial Clone and Fetch
  2017-03-08 18:50 [PATCH 00/10] RFC Partial Clone and Fetch git
                   ` (10 preceding siblings ...)
  2017-03-09 20:18 ` [PATCH 00/10] RFC Partial Clone and Fetch Jonathan Tan
@ 2017-03-22 15:16 ` ankostis
  2017-03-22 16:21   ` Johannes Schindelin
  2017-05-03 16:38 ` Jeff Hostetler
  12 siblings, 1 reply; 25+ messages in thread
From: ankostis @ 2017-03-22 15:16 UTC (permalink / raw)
  To: git
  Cc: Git Mailing List, jeffhost, Jeff King, Junio C Hamano, markbt,
	benpeart, Jonathan Tan

Dear Jeff

I read most of the valuable references you provided,
but could not find anything along the lines of what I describe inline.


On 8 March 2017 at 19:50,  <git@jeffhostetler.com> wrote:
> From: Jeff Hostetler <jeffhost@microsoft.com>
>
>
> [RFC] Partial Clone and Fetch
> =============================
>
> This is a WIP RFC for a partial clone and fetch feature wherein the client
> can request that the server omit various blobs from the packfile during
> clone and fetch.  Clients can later request omitted blobs (either from a
> modified upload-pack-like request to the server or via a completely
> independent mechanism).

Is it foreseen for the server to *decide* which partial objects to serve,
and for the cloning client to still work OK?

My case in mind is storing confidential files in Git (server)
that I want to publicize them to partial-cloning clients,
for non-repudiation, by sending out trees and commits alone
(or any non-sensitive blobs).

A possible UI would be to rely on a `.gitattributes` to specify
which objects are to be withheld.


Apologies if I'm intruding with an unrelated feature request.
  Kostis


* Re: [PATCH 00/10] RFC Partial Clone and Fetch
  2017-03-22 15:16 ` ankostis
@ 2017-03-22 16:21   ` Johannes Schindelin
  2017-03-22 17:51     ` Jeff Hostetler
  0 siblings, 1 reply; 25+ messages in thread
From: Johannes Schindelin @ 2017-03-22 16:21 UTC (permalink / raw)
  To: ankostis
  Cc: git, Git Mailing List, jeffhost, Jeff King, Junio C Hamano,
	markbt, benpeart, Jonathan Tan

Hi Kostis,

On Wed, 22 Mar 2017, ankostis wrote:

> On 8 March 2017 at 19:50,  <git@jeffhostetler.com> wrote:
> > From: Jeff Hostetler <jeffhost@microsoft.com>
> >
> > [RFC] Partial Clone and Fetch
> > =============================
> >
> > This is a WIP RFC for a partial clone and fetch feature wherein the
> > client can request that the server omit various blobs from the
> > packfile during clone and fetch.  Clients can later request omitted
> > blobs (either from a modified upload-pack-like request to the server
> > or via a completely independent mechanism).
> 
> Is it foreseen for the server to *decide* which partial objects to serve,
> and for the cloning client to still work OK?

The foreseeable use case will be to speed up clones of insanely large
repositories by omitting blobs that are not immediately required, and let
the client fetch them later on demand.

That is all, no additional permission model or anything. In fact, we do
not even need to ensure that blobs are reachable in our use case, as only
trusted parties are allowed to access the server to begin with.

That does not mean, of course, that there should not be an option to limit
access to objects that are reachable.

> My case in mind is storing confidential files in Git (server)
> whose existence I want to publicize to partial-cloning clients,
> for non-repudiation, by sending out trees and commits alone
> (or any non-sensitive blobs).
> 
> A possible UI would be to rely on a `.gitattributes` to specify
> which objects are to be withheld.
> 
> 
> Apologies if I'm intruding with an unrelated feature request.

I think this is a valid use case, and Jeff's design certainly does not
prevent future patches to that end.

However, given that Jeff's use case does not require any such feature, I
would expect the people who want those features to do the heavy lifting on
top of his work. It is too different from the intended use case to
reasonably ask of Jeff.

Ciao,
Johannes


* Re: [PATCH 00/10] RFC Partial Clone and Fetch
  2017-03-22 16:21   ` Johannes Schindelin
@ 2017-03-22 17:51     ` Jeff Hostetler
  0 siblings, 0 replies; 25+ messages in thread
From: Jeff Hostetler @ 2017-03-22 17:51 UTC (permalink / raw)
  To: Johannes Schindelin, ankostis
  Cc: Git Mailing List, jeffhost, Jeff King, Junio C Hamano, markbt,
	benpeart, Jonathan Tan



On 3/22/2017 12:21 PM, Johannes Schindelin wrote:
> Hi Kostis,
>
> On Wed, 22 Mar 2017, ankostis wrote:
>
>> On 8 March 2017 at 19:50,  <git@jeffhostetler.com> wrote:
>>> From: Jeff Hostetler <jeffhost@microsoft.com>
>>>
>>> [RFC] Partial Clone and Fetch
>>> =============================
>>>
>>> This is a WIP RFC for a partial clone and fetch feature wherein the
>>> client can request that the server omit various blobs from the
>>> packfile during clone and fetch.  Clients can later request omitted
>>> blobs (either from a modified upload-pack-like request to the server
>>> or via a completely independent mechanism).
>>
>> Is it foreseen for the server to *decide* which partial objects to serve,
>> and for the cloning client to still work OK?
>
> The foreseeable use case will be to speed up clones of insanely large
> repositories by omitting blobs that are not immediately required, and let
> the client fetch them later on demand.
>
> That is all, no additional permission model or anything. In fact, we do
> not even need to ensure that blobs are reachable in our use case, as only
> trusted parties are allowed to access the server to begin with.
>
> That does not mean, of course, that there should not be an option to limit
> access to objects that are reachable.
>
>> My case in mind is storing confidential files in Git (server)
>> whose existence I want to publicize to partial-cloning clients,
>> for non-repudiation, by sending out trees and commits alone
>> (or any non-sensitive blobs).
>>
>> A possible UI would be to rely on a `.gitattributes` to specify
>> which objects are to be withheld.
>>
>>
>> Apologies if I'm intruding with an unrelated feature request.
>
> I think this is a valid use case, and Jeff's design certainly does not
> prevent future patches to that end.
>
> However, given that Jeff's use case does not require any such feature, I
> would expect the people who want those features to do the heavy lifting on
> top of his work. It is too different from the intended use case to
> reasonably ask of Jeff.

As Johannes said, all I'm proposing is a way to limit the amount of
data the client receives to help git scale to extremely large
repositories.  For example, I probably don't need 20 years of history
or the entire source tree if I'm only working in a narrow subset of
the tree.

I'm not sure how you would achieve the confidential file scenario
that you describe, but you might try to build on it and see if you
can make it work.

Jeff





* Re: [PATCH 00/10] RFC Partial Clone and Fetch
  2017-03-08 18:50 [PATCH 00/10] RFC Partial Clone and Fetch git
                   ` (11 preceding siblings ...)
  2017-03-22 15:16 ` ankostis
@ 2017-05-03 16:38 ` Jeff Hostetler
  2017-05-03 18:27   ` Jonathan Nieder
  2017-05-03 20:40   ` Jonathan Tan
  12 siblings, 2 replies; 25+ messages in thread
From: Jeff Hostetler @ 2017-05-03 16:38 UTC (permalink / raw)
  To: git; +Cc: jeffhost, peff, gitster, markbt, benpeart, jonathantanmy



On 3/8/2017 1:50 PM, git@jeffhostetler.com wrote:
> From: Jeff Hostetler <jeffhost@microsoft.com>
>
>
> [RFC] Partial Clone and Fetch
> =============================
> [...]
> E. Unresolved Thoughts
> ======================
>
> *TODO* The server should optionally return (in a side-band?) a list
> of the blobs that it omitted from the packfile (and possibly the sizes
> or sha1_object_info() data for them) during the fetch-pack/upload-pack
> operation.  This would allow the client to distinguish from invalid
> SHAs and missing ones.  Size information would allow the client to
> maybe choose between various servers.

Since I first posted this, Jonathan Tan has started a related
discussion on missing blob support.
https://public-inbox.org/git/CAGf8dgK05+f4uX-8+iMFvQd0n2JP6YxJ18ag8uDaEH6qc6SgVQ@mail.gmail.com/T/

I want to respond to both of these threads here.
-------------------------------------------------

Missing-Blob Support
====================

Let me offer up an alternative idea for representing
missing blobs.  This differs from both of our previous
proposals.  (I don't have any code for this new proposal,
I just want to think out loud a bit and see if this is a
direction worth pursuing -- or a complete non-starter.)

Both proposals talk about detecting and adapting to a missing
blob and ways to recover -- when we fail to find a blob.
Comments on the thread asked about:
() being able to detect missing blobs vs corrupt repos
() being unable to detect duplicate blobs
() expense of blob search.

Suppose we store "positive" information about missing blobs?
This would let us know that a blob is intentionally missing
and possibly some meta-data about it.


1. Suppose we update the .pack file format slightly.
    () We use the value 5 in "enum object_type" to mean a
       "missing-blob".
    () We update git-pack-objects as I did in my RFC, but have it
       create type 5 entries for the blobs that are omitted,
       rather than nothing.
    () Hopefully, the same logic that currently keeps pack-objects
       from sending unnecessary blobs on subsequent fetches can
       also be used to keep it from sending unnecessary missing-blob
       entries.
    () The type 5 missing-blob entry would contain the SHA-1 of the
       blob and some meta-data to be explained later.

2. Make a similar change in the .idx format and git-index-pack
    to include them there.  Then blob lookup operations could
    definitively determine that a blob exists and is just not
    present locally.

3. With this, packfile-based blob-lookup operations can get a
    "missing-blob" result.
    () It should be possible to short-cut searching in other
       packfiles (because we don't have to assume that the blob
       was just misplaced in another packfile).
    () Lookup can still look for the corresponding loose blob
       (in case a previous lookup already "faulted it in").

4. We can then think about dynamically fetching it.
    () Several techniques for this are currently being
       discussed on the mailing list in other threads,
       so I won't go into this here.
    () There has also been debate about whether this should
       yield a loose blob or a new packfile.  I think both
       forms have merit and depend on whether we are limited
       to asking for a single blob or can make a batch request.
    () A dynamically-fetched loose blob is placed in the normal
       loose blob directory hierarchy so that subsequent
       lookups can find it as mentioned above.
    () A dynamically-fetched packfile (with one or more blobs)
       is written to the ODB and then the lookup operation
       completes.
       {} I want to isolate these packfiles from the main
          packfiles, so that they behave like a second-stage
          lookup and don't affect the caching/LRU nature of
          the existing first-stage packfile lookup.
       {} I also don't want the ambiguity of having two primary
          packfiles with a blob marked as missing in one and
          present in the other.

5. git-repack should be updated to "do the right thing" and
    squash missing-blob entries.

6. And etc.
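As a rough sketch of the lookup flow in items 3 and 4 above (everything here is invented for illustration -- a fake pack and loose store standing in for git's real object lookup, not actual git code):

```python
from enum import Enum

class Lookup(Enum):
    FOUND = 1      # object present in this packfile
    MISSING = 2    # missing-blob entry: intentionally omitted
    NOT_HERE = 3   # no entry at all: try the next packfile

class FakePack:
    # Stand-in for a packfile + idx: maps sha -> data, or sha -> None
    # to model a type-5 missing-blob entry.
    def __init__(self, entries):
        self.entries = entries

    def lookup(self, sha):
        if sha not in self.entries:
            return Lookup.NOT_HERE, None
        data = self.entries[sha]
        return (Lookup.MISSING, None) if data is None else (Lookup.FOUND, data)

def find_blob(sha, packs, loose, fetch_hook):
    for pack in packs:
        result, data = pack.lookup(sha)
        if result is Lookup.FOUND:
            return data
        if result is Lookup.MISSING:
            # Definitive answer: short-cut the remaining packfiles and
            # check only for a previously faulted-in loose blob, then
            # fall back to a dynamic fetch.
            return loose.get(sha) or fetch_hook(sha)
    return loose.get(sha)   # ordinary loose-object fallback
```

The point of the sketch is the early return on MISSING: once a pack says "intentionally omitted", no further packfiles are scanned.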


Missing-Blob Entry Data
=======================

A missing-blob entry needs to contain the SHA-1 value of
the blob (obviously).  Other fields are nice to have, but
are not necessary.  Here are a few fields to consider.

A. The SHA-1 (20 bytes)

B. The raw size of the blob (5? bytes).
    () This is the cleaned size of the file as stored.  The
       server does not (and should not) have any knowledge
       of the smudging that may happen.
    () This may be useful if whatever dynamic-fetch-hook
       wants to customize its behavior, such as individually
       fetching large blobs and batch fetching smaller ones
       from the same server.
    () GVFS found it necessary to create a custom server
       end-point to get blob size data so that "ls -l"
       could show file sizes for non-present virtualized
       files.
    () 5 bytes (uint:40) should be more than enough for this.

C. A server "hint" (20 bytes)
    () Instructions to help the client fetch the blob.
    () If I have multiple remotes configured, a missing-blob
       should be fetched from the same server that created
       the missing-blob entry (since it may be the only
       one that has it).
    () If a blob is very large (and was omitted for this
       reason), the server may want to redirect the client
       to a geographically closer CDN.
    () This is the SHA-1 of a file in the repository containing a
       hook (or a set of parameters to be used by a hook).
       {} This is a bit of *hand-wave* right now, but the
          idea is that you can use the information here to
          individually fetch a blob or batch fetch a set
          of blobs that have the same hint.
       {} Yes, there are security concerns here, so perhaps
          the hint file should just contain parameters for
          a stock git-fetch-pack or git-fetch-blob-pack or
          curl command (or wrapper script) that "does the
          right thing".
       {} I thought this would be more compact than listing
          detailed fetch data per-blob.  And we don't have
          to define yet another syntax.  For example, we can
          let the SHA-1 point to an administrator configured
          shell script and be done.
    () We assume that the SHA-1 file is present locally
       (not missing).  This might refer to a pinned file
       in a special ".git*" file (that we never omit) in
       HEAD.  Or it might be in a branch that all clients
       are assumed to have.
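A minimal sketch of how fields A, B, and C could be packed into a fixed 45-byte record (the layout, field order, and byte widths are assumptions for illustration, not a settled format):

```python
def encode_entry(blob_sha, raw_size, hint_sha):
    # 20-byte blob SHA-1 | 5-byte big-endian raw size | 20-byte hint SHA-1
    assert len(blob_sha) == 20 and len(hint_sha) == 20
    assert 0 <= raw_size < 1 << 40   # uint:40 caps a blob at 1 TiB
    return blob_sha + raw_size.to_bytes(5, "big") + hint_sha

def decode_entry(entry):
    assert len(entry) == 45
    blob_sha = entry[:20]
    raw_size = int.from_bytes(entry[20:25], "big")
    hint_sha = entry[25:]
    return blob_sha, raw_size, hint_sha
```

A fixed-width record keeps the entries cheap to scan and sort, which matters if a large repo omits millions of blobs.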


Concluding Thoughts
===================

Combining the ideas here with the partial clone/fetch
parameters and the various blob back-filling proposals
gives us the ability to create and work with sparse
repos.
() Filtering can be based upon blob size; this could be
    seen as an alternative solution to LFS for repos with
    large objects.
() Filtering could also be based upon pathnames (such as
    a sparse-checkout filter) and greatly help performance
    on very large repos where developers only work with
    small areas of the tree.

Thanks
Jeff

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 00/10] RFC Partial Clone and Fetch
  2017-05-03 16:38 ` Jeff Hostetler
@ 2017-05-03 18:27   ` Jonathan Nieder
  2017-05-04 16:51     ` Jeff Hostetler
  2017-05-08  0:15     ` Junio C Hamano
  2017-05-03 20:40   ` Jonathan Tan
  1 sibling, 2 replies; 25+ messages in thread
From: Jonathan Nieder @ 2017-05-03 18:27 UTC (permalink / raw)
  To: Jeff Hostetler
  Cc: git, jeffhost, peff, gitster, markbt, benpeart, jonathantanmy

Hi,

Jeff Hostetler wrote:
> On 3/8/2017 1:50 PM, git@jeffhostetler.com wrote:

>> [RFC] Partial Clone and Fetch
>> =============================
>> [...]
>> E. Unresolved Thoughts
>> ======================
>>
>> *TODO* The server should optionally return (in a side-band?) a list
>> of the blobs that it omitted from the packfile (and possibly the sizes
>> or sha1_object_info() data for them) during the fetch-pack/upload-pack
>> operation.  This would allow the client to distinguish from invalid
>> SHAs and missing ones.  Size information would allow the client to
>> maybe choose between various servers.
>
> Since I first posted this, Jonathan Tan has started a related
> discussion on missing blob support.
> https://public-inbox.org/git/CAGf8dgK05+f4uX-8+iMFvQd0n2JP6YxJ18ag8uDaEH6qc6SgVQ@mail.gmail.com/T/
>
> I want to respond to both of these threads here.

Thanks much for this.

> Missing-Blob Support
> ====================
>
> Let me offer up an alternative idea for representing
> missing blobs.  This differs from both of our previous
> proposals.  (I don't have any code for this new proposal,
> I just want to think out loud a bit and see if this is a
> direction worth pursuing -- or a complete non-starter.)
>
> Both proposals talk about detecting and adapting to a missing
> blob and ways to recover -- when we fail to find a blob.
> Comments on the thread asked about:
> () being able to detect missing blobs vs corrupt repos
> () being unable to detect duplicate blobs
> () expense of blob search.
>
> Suppose we store "positive" information about missing blobs?
> This would let us know that a blob is intentionally missing
> and possibly some meta-data about it.

We've discussed this a little informally but didn't go more into
it, so I'm glad you brought it up.

There are two use cases I care about.  I'll want names to refer to
them later, so describing them now:

 A. A p4 or svn style "monorepo" containing all code within a company.
    We want to make git scale well when working with such a
    repository.  Disk usage, network usage, index size, and object
    lookup time are all issues for such a repository.

    A repository can creep up in size so it starts falling into this
    category even though git coped well with it before.  Another way
    to end up in this category is a conversion from a version control
    system like p4 or svn.

 B. A more modestly sized repository with some large blobs in it.  At
    clone time we want to omit unneeded large blobs to speed up the
    clone, save disk space, and save bandwidth.

    For this kind of repository, the idx file already contained all
    those blobs and that was not causing problems --- the only problem
    was the actual blob size.

> 1. Suppose we update the .pack file format slightly.
[...]
> 2. Make a similar change in the .idx format and git-index-pack
>    to include them there.  Then blob lookup operations could
>    definitively determine that a blob exists and is just not
>    present locally.

Some nits:

- there shouldn't be any need for the blobs to even be mentioned in
  the .pack stored locally.  The .idx file maps from sha1 to offset
  within the packfile --- a special offset could mean "this is a
  missing blob".

- Git protocol works by sending pack files over the wire.  The idx
  file is not transmitted to the client --- the client instead
  reconstructs it from the pack file.  I assume this is why you
  stored the SHA-1 of the object in the pack file, but it could
  equally well be sent through another stream (since this proposal
  involves a change to git protocol anyway).

- However, the list of missing blobs can be inferred from the existing
  pack format, without a change to the pack format used in git
  protocol.  As part of constructing the idx, "git index-pack"
  inflates every object in the pack file sent by the server.  This
  means we know what blobs they reference, so we can easily produce a
  list for the idx file without changing the pack file format.

If this is done by only changing the idx file format and not the pack
file, then it does not affect the protocol.  That is good for
experimentation --- it lets us try out different formats client-side
without having to coordinate with server authors.
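To illustrate the reserved-offset idea, a toy model (the sentinel value and the class are invented for illustration; a real .idx is fanout tables plus binary data, not Python):

```python
import bisect

MISSING_OFFSET = 0xFFFFFFFF   # hypothetical sentinel meaning "missing blob"

class IdxSketch:
    def __init__(self, entries):
        # entries: (sha, offset) pairs; stored sorted by sha, as in .idx.
        self.shas, self.offsets = zip(*sorted(entries))

    def lookup(self, sha):
        i = bisect.bisect_left(self.shas, sha)
        if i == len(self.shas) or self.shas[i] != sha:
            return "absent"           # not in this pack at all
        if self.offsets[i] == MISSING_OFFSET:
            return "missing"          # known, intentionally omitted
        return self.offsets[i]        # byte offset into the packfile
```

The key property is that "missing" and "absent" become distinguishable answers without the .pack file on disk mentioning the omitted blobs at all.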

In case (A), this proposal means we get back some of the per-object
overhead that we were trying to avoid.  I would like to avoid that
if possible.

In case (B), this proposal doesn't hurt.

One problem with proposals so far has been how to handle repositories
with multiple remotes.  Having a local list of missing blobs is
convenient because it can be associated to the remote --- when a blob
is referenced later, we know which remote to ask for it.  This is
a niche feature, but it's a nice bonus.

[...]
> 3. With this, packfile-based blob-lookup operations can get a
>    "missing-blob" result.
>    () It should be possible to short-cut searching in other
>       packfiles (because we don't have to assume that the blob
>       was just misplaced in another packfile).
>    () Lookup can still look for the corresponding loose blob
>       (in case a previous lookup already "faulted it in").

The number of packfiles to search is always a problem for object
lookups --- we need to be able to keep the number of packfiles low.
If we solve that for missing blobs but not for the ones that are
present, then git's object access performance is still slow.

I have some ideas for improving git's gc behavior to bound the number
of packfiles better.  I think that's mostly orthogonal to this
proposal.  The main point of overlap is the number of packfiles
produced if someone uses one packfile per large blob, but that is bad
for the performance in successful lookups, not just negative lookups,
so it needs to be solved directly anyway.

> 4. We can then think about dynamically fetching it.

This also seems orthogonal to whether the missing blobs list exists.

[...]
> A missing-blob entry needs to contain the SHA-1 value of
> the blob (obviously).  Other fields are nice to have, but
> are not necessary.  Here are a few fields to consider.
[...]
> B. The raw size of the blob (5? bytes).

This would be a very nice thing.

If we were starting over, would this belong in the tree object?
(Part of the reason I ask is that we have an opportunity to sort
of start over as part of the transition away from SHA-1.)

[...]
> C. A server "hint" (20 bytes)
>    () Instructions to help the client fetch the blob.
>    () If I have multiple remotes configured, a missing-blob
>       should be fetched from the same server that created
>       the missing-blob entry (since it may be the only
>       one that has it).

I am worried about the implications of storing this kind of policy
information in the pack file.  There may be useful information along
these lines for a server to advertise, but I don't think it belongs in
the pack file.  Git protocol relies on it being cheap to stream pack
files as is.

[...]
> Combining the ideas here with the partial clone/fetch
> parameters and the various blob back-filling proposals
> gives us the ability to create and work with sparse
> repos.
> () Filtering can be based upon blob size; this could be
>    seen as an alternative solution to LFS for repos with
>    large objects.
> () Filtering could also be based upon pathnames (such as
>    a sparse-checkout filter) and greatly help performance
>    on very large repos where developers only work with
>    small areas of the tree.

To summarize my thoughts:

+ It is possible to construct a list of blobs missing from the
  packfile sent by the server, without changing the protocol.
  This is a very good thing, since it makes local experimentation
  possible.

- Except in the multiple-remotes case, I am mostly unconvinced about
  the benefit of having the list of missing blobs around locally.
  For example:

  - "git fsck" can use the list of missing blobs to see whether a
    reachable missing blob is intentional or a violation of the
    repository's integrity.  I don't think this is useful.  In a
    partial clone repository, missing blobs are always permitted
    and do not indicate repository corruption, since git operations
    have a way of coping with them (by fetching them on demand using
    read-blob protocol).
  - In a partial clone repository that is acting as a server, the
    list of missing blobs can be used to deny a read-blob request
    without passing the request on to "origin".  Without this
    check, such requests could be a DoS vector.  Since there are
    other, simpler requests that have the potential to DoS "origin"
    (e.g. using plain clone --mirror) in such a setup, I think it
    best to disallow read-blob requests to partial clone
    repositories altogether for now.

- I am also unconvinced about the server hint.

+ Having the list of missing blob sizes locally seems very useful.  I
  can imagine a client that wants this information for some blobs but
  not all --- e.g. they are only interested in a particular
  subdirectory so they want blob sizes from that directory.

The blob-size use case is compelling, but I still lean toward
constructing the list of blobs locally for now where it's useful,
instead of adding a missing-blob entry to the pack file.

I'll think more about blob sizes.  I agree that we need to do
something to handle them.

Thanks and hope that helps,
Jonathan


* Re: [PATCH 00/10] RFC Partial Clone and Fetch
  2017-05-03 16:38 ` Jeff Hostetler
  2017-05-03 18:27   ` Jonathan Nieder
@ 2017-05-03 20:40   ` Jonathan Tan
  2017-05-03 21:08     ` Jonathan Nieder
  1 sibling, 1 reply; 25+ messages in thread
From: Jonathan Tan @ 2017-05-03 20:40 UTC (permalink / raw)
  To: Jeff Hostetler, git; +Cc: jeffhost, peff, gitster, markbt, benpeart

On 05/03/2017 09:38 AM, Jeff Hostetler wrote:
>
>
> On 3/8/2017 1:50 PM, git@jeffhostetler.com wrote:
>> From: Jeff Hostetler <jeffhost@microsoft.com>
>>
>>
>> [RFC] Partial Clone and Fetch
>> =============================
>> [...]
>> E. Unresolved Thoughts
>> ======================
>>
>> *TODO* The server should optionally return (in a side-band?) a list
>> of the blobs that it omitted from the packfile (and possibly the sizes
>> or sha1_object_info() data for them) during the fetch-pack/upload-pack
>> operation.  This would allow the client to distinguish from invalid
>> SHAs and missing ones.  Size information would allow the client to
>> maybe choose between various servers.
>
> Since I first posted this, Jonathan Tan has started a related
> discussion on missing blob support.
> https://public-inbox.org/git/CAGf8dgK05+f4uX-8+iMFvQd0n2JP6YxJ18ag8uDaEH6qc6SgVQ@mail.gmail.com/T/
>
>
> I want to respond to both of these threads here.
> -------------------------------------------------

Thanks for your input. I see that you have explained both "storing
'positive' information about missing blobs" and "what to store with
that positive information"; I'll just comment on the former for now.

> Missing-Blob Support
> ====================
>
> Let me offer up an alternative idea for representing
> missing blobs.  This differs from both of our previous
> proposals.  (I don't have any code for this new proposal,
> I just want to think out loud a bit and see if this is a
> direction worth pursuing -- or a complete non-starter.)
>
> Both proposals talk about detecting and adapting to a missing
> blob and ways to recover -- when we fail to find a blob.
> Comments on the thread asked about:
> () being able to detect missing blobs vs corrupt repos
> () being unable to detect duplicate blobs
> () expense of blob search.
>
> Suppose we store "positive" information about missing blobs?
> This would let us know that a blob is intentionally missing
> and possibly some meta-data about it.

I thought about this (see "Some alternative designs" in [1]), listing 
some similar benefits, but concluded that "it is difficult to scale to 
large repos".

Firstly, to be clear, by large repos I meant (and mean) the svn-style 
"monorepos" that Jonathan Nieder mentions as use case "A" [2].

My concern is that such lists (whether in separate file(s) or in .idx 
files) would be too unwieldy to manipulate. Even if we design things to 
avoid modifying such lists (for example, by adding a new list whenever 
we fetch instead of trying to modify an existing one), we would at least 
need to sort their contents (for example, when generating an .idx in the 
first place). For a repo with 10M-100M blobs [3], this might be doable 
on today's computers, but I would be concerned if a repo would exceed 
such numbers.

[1] <20170426221346.25337-1-jonathantanmy@google.com>
[2] <20170503182725.GC28740@aiede.svl.corp.google.com>
[3] In Microsoft's announcement of Git Virtual File System [4], they 
mentioned "over 3.5 million files" in the Windows codebase. I'm not sure 
if this refers to files in a snapshot (that is, working copy) or all 
historical versions.
[4] 
https://blogs.msdn.microsoft.com/visualstudioalm/2017/02/03/announcing-gvfs-git-virtual-file-system/

> 1. Suppose we update the .pack file format slightly.
>    () We use the value 5 in "enum object_type" to mean a
>       "missing-blob".
>    () We update git-pack-object as I did in my RFC, but have it
>       create type 5 entries for the blobs that are omitted,
>       rather than nothing.
>    () Hopefully, the same logic that currently keeps pack-object
>       from sending unnecessary blobs on subsequent fetches can
>       also be used to keep it from sending unnecessary missing-blob
>       entries.
>    () The type 5 missing-blob entry would contain the SHA-1 of the
>       blob and some meta-data to be explained later.

My original idea was to have sorted list(s) of hashes in separate 
file(s) much like the currently existing shallow file; it would have the 
semantics of "a hash here might be present or absent; if it is absent, 
use the hook". (Initially I thought that one list would be sufficient, 
but after reading your idea and considering it some more, multiple lists 
might be better.)

Your idea of storing them in an .idx (and possibly corresponding .pack 
file) is similar, I think. Although mine is probably simpler - at least, 
we wouldn't need a new object_type.

As described above, I don't think this list-of-hashes idea will work 
(because of the large numbers of blobs involved), but I'll compare it to 
yours anyway just in case we end up being convinced that this general 
idea works.

> 2. Make a similar change in the .idx format and git-index-pack
>    to include them there.  Then blob lookup operations could
>    definitively determine that a blob exists and is just not
>    present locally.
>
> 3. With this, packfile-based blob-lookup operations can get a
>    "missing-blob" result.
>    () It should be possible to short-cut searching in other
>       packfiles (because we don't have to assume that the blob
>       was just misplaced in another packfile).
>    () Lookup can still look for the corresponding loose blob
>       (in case a previous lookup already "faulted it in").

The binary search to look up a packfile offset from a .idx file (which
involves disk reads) would take longer for all lookups (not just lookups 
for missing blobs) - I think I prefer keeping the lists separate, to 
avoid pessimizing the (likely) usual case where the relevant blobs are 
all already in local repo storage.
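The append-only sorted-lists idea might look roughly like this (names invented; each fetch contributes one sorted list, and the lists are consulted only after a normal object lookup fails, so present blobs pay no extra cost):

```python
import bisect

class MissingLists:
    def __init__(self):
        self.lists = []   # one sorted list of omitted hashes per fetch

    def add_fetch(self, hashes):
        # Append a new list instead of rewriting an existing one,
        # much like entries accumulating in the shallow file.
        self.lists.append(sorted(hashes))

    def is_intentionally_missing(self, sha):
        # Binary-search each per-fetch list in turn.
        for lst in self.lists:
            i = bisect.bisect_left(lst, sha)
            if i < len(lst) and lst[i] == sha:
                return True
        return False
```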

> 4. We can then think about dynamically fetching it.
>    () Several techniques for this are currently being
>       discussed on the mailing list in other threads,
>       so I won't go into this here.
>    () There has also been debate about whether this should
>       yield a loose blob or a new packfile.  I think both
>       forms have merit and depend on whether we are limited
>       to asking for a single blob or can make a batch request.
>    () A dynamically-fetched loose blob is placed in the normal
>       loose blob directory hierarchy so that subsequent
>       lookups can find it as mentioned above.
>    () A dynamically-fetched packfile (with one or more blobs)
>       is written to the ODB and then the lookup operation
>       completes.
>       {} I want to isolate these packfiles from the main
>          packfiles, so that they behave like a second-stage
>          lookup and don't affect the caching/LRU nature of
>          the existing first-stage packfile lookup.
>       {} I also don't want the ambiguity of having 2 primary
>          packfiles with a blob marked as missing in 1 and
>          present in the other.

With my idea, the second-stage lookup is done on the list of missing 
hashes; there is no division between packfiles.

> 5. git-repack should be updated to "do the right thing" and
>    squash missing-blob entries.
>
> 6. And etc.


* Re: [PATCH 00/10] RFC Partial Clone and Fetch
  2017-05-03 20:40   ` Jonathan Tan
@ 2017-05-03 21:08     ` Jonathan Nieder
  0 siblings, 0 replies; 25+ messages in thread
From: Jonathan Nieder @ 2017-05-03 21:08 UTC (permalink / raw)
  To: Jonathan Tan
  Cc: Jeff Hostetler, git, jeffhost, peff, gitster, markbt, benpeart

Hi,

Jonathan Tan wrote:

> The binary search to look up a packfile offset from a .idx file
> (which involves disk reads) would take longer for all lookups (not
> just lookups for missing blobs) - I think I prefer keeping the lists
> separate, to avoid pessimizing the (likely) usual case where the
> relevant blobs are all already in local repo storage.

Another relevant operation is looking up objects by offset or
index_nr.  The current implementation involves building an in-memory
reverse index on demand by reading the idx file and sorting it by
offset --- see pack-revindex.c::create_pack_revindex.  This takes O(n
log n) time where n is the size of the idx file.
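In outline, that on-demand construction looks like the following (a much-simplified Python rendering of the idea; function names are invented):

```python
import bisect

def build_revindex(idx_entries):
    # idx_entries: (sha, pack_offset) pairs in idx (sha-sorted) order.
    # Sorting (offset, index_nr) pairs is the O(n log n) step done
    # on demand when the reverse index is first needed.
    return sorted((off, nr) for nr, (_sha, off) in enumerate(idx_entries))

def object_at(revindex, offset):
    # Map a pack offset back to the object's position (index_nr) in the idx.
    i = bisect.bisect_left(revindex, (offset, -1))
    assert i < len(revindex) and revindex[i][0] == offset
    return revindex[i][1]
```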

That said, it could be avoided by storing an on-disk reverse index
with the pack.  That's something we've been wanting to do anyway.

Thanks,
Jonathan


* Re: [PATCH 00/10] RFC Partial Clone and Fetch
  2017-05-03 18:27   ` Jonathan Nieder
@ 2017-05-04 16:51     ` Jeff Hostetler
  2017-05-04 18:41       ` Jonathan Nieder
  2017-05-08  0:15     ` Junio C Hamano
  1 sibling, 1 reply; 25+ messages in thread
From: Jeff Hostetler @ 2017-05-04 16:51 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: git, jeffhost, peff, gitster, markbt, benpeart, jonathantanmy



On 5/3/2017 2:27 PM, Jonathan Nieder wrote:
> Hi,
>
> Jeff Hostetler wrote:
>> Missing-Blob Support
>> ====================
>>
>> Let me offer up an alternative idea for representing
>> missing blobs.  This differs from both of our previous
>> proposals.  (I don't have any code for this new proposal,
>> I just want to think out loud a bit and see if this is a
>> direction worth pursuing -- or a complete non-starter.)
>>
>> Both proposals talk about detecting and adapting to a missing
>> blob and ways to recover -- when we fail to find a blob.
>> Comments on the thread asked about:
>> () being able to detect missing blobs vs corrupt repos
>> () being unable to detect duplicate blobs
>> () expense of blob search.
>>
>> Suppose we store "positive" information about missing blobs?
>> This would let us know that a blob is intentionally missing
>> and possibly some meta-data about it.
>
> We've discussed this a little informally but didn't go more into
> it, so I'm glad you brought it up.
>
> There are two use cases I care about.  I'll want names to refer to
> them later, so describing them now:
>
>  A. A p4 or svn style "monorepo" containing all code within a company.
>     We want to make git scale well when working with such a
>     repository.  Disk usage, network usage, index size, and object
>     lookup time are all issues for such a repository.
>
>     A repository can creep up in size so it starts falling into this
>     category even though git coped well with it before.  Another way
>     to end up in this category is a conversion from a version control
>     system like p4 or svn.
>
>  B. A more modestly sized repository with some large blobs in it.  At
>     clone time we want to omit unneeded large blobs to speed up the
>     clone, save disk space, and save bandwidth.
>
>     For this kind of repository, the idx file already contained all
>     those blobs and that was not causing problems --- the only problem
>     was the actual blob size.

Yes, I've been primarily concerned with "case A" repos.
I work with the team converting the Windows source repo
to git.  This was discussed in Brussels as part of the
GVFS presentation.

The Windows tree has 3.5M files in the worktree for a simple checkout
of HEAD.  The index is 450MB.  There are 500K trees/folders in
the commit.  Multiply that by a scale factor considering the number
of trunk/release branches, number of developers, typical number of
commits per day per developer, and n years(decades) of history and
we get to a very large number....

FWIW, there's also a "case C" which has both, but that
just hurts to think about.

>
>> 1. Suppose we update the .pack file format slightly.
> [...]
>> 2. Make a similar change in the .idx format and git-index-pack
>>    to include them there.  Then blob lookup operations could
>>    definitively determine that a blob exists and is just not
>>    present locally.
>
> Some nits:
>
> - there shouldn't be any need for the blobs to even be mentioned in
>   the .pack stored locally.  The .idx file maps from sha1 to offset
>   within the packfile --- a special offset could mean "this is a
>   missing blob".
>
> - Git protocol works by sending pack files over the wire.  The idx
>   file is not transmitted to the client --- the client instead
>   reconstructs it from the pack file.  I assume this is why you
>   stored the SHA-1 of the object in the pack file, but it could
>   equally well be sent through another stream (since this proposal
>   involves a change to git protocol anyway).
>
> - However, the list of missing blobs can be inferred from the existing
>   pack format, without a change to the pack format used in git
>   protocol.  As part of constructing the idx, "git index-pack"
>   inflates every object in the pack file sent by the server.  This
>   means we know what blobs they reference, so we can easily produce a
>   list for the idx file without changing the pack file format.

In my original RFC there were comments/complaints that with
missing blobs we lose the ability to detect corruptions.  My
proposed changes to index-pack and rev-list (and suggestions
for other commands like fsck) just disabled those errors.
Personally, I'm OK with that, but I understand that others
would like to save the ability to distinguish between missing
and corrupted.

Right, only the .pack is sent over the wire.  And that's why I
suggest listing the missing SHAs in it.  I think we need the server
to send a list of them -- whether in individual per-file type-5
records as I suggested, or a single record containing a list of
SHAs+data (which I think I prefer in hindsight).

WRT being able to discover the missing blobs, is that true in
all cases?  I don't think it is for thin-packs -- where the server
only sends stuff you don't (supposedly) already have, right?

If instead, we have pack-object indicate that it *would have*
sent this blob normally, we don't change any of the semantics
of how pack files are assembled.  This gives the client a
definitive list of what's missing.


> If this is done by only changing the idx file format and not the pack
> file, then it does not affect the protocol.  That is good for
> experimentation --- it lets us try out different formats client-side
> without having to coordinate with server authors.
>
> In case (A), this proposal means we get back some of the per-object
> overhead that we were trying to avoid.  I would like to avoid that
> if possible.

If thin-packs are used, the fetch only receives information about
blobs that are new relative to the heads the client already has,
right?

>
> In case (B), this proposal doesn't hurt.
>
> One problem with proposals so far has been how to handle repositories
> with multiple remotes.  Having a local list of missing blobs is
> convenient because it can be associated to the remote --- when a blob
> is referenced later, we know which remote to ask for it.  This is
> a niche feature, but it's a nice bonus.

The more I think about it, I'd like to NOT put the list in the .idx
file.  If we put it in a separate peer file next to the *.{pack,idx}
we could decorate it with the name of the remote used in the fetch
or clone.

>
> [...]
>> 3. With this, packfile-based blob-lookup operations can get a
>>    "missing-blob" result.
>>    () It should be possible to short-cut searching in other
>>       packfiles (because we don't have to assume that the blob
>>       was just misplaced in another packfile).
>>    () Lookup can still look for the corresponding loose blob
>>       (in case a previous lookup already "faulted it in").
>
> The number of packfiles to search is always a problem for object
> lookups --- we need to be able to keep the number of packfiles low.
> If we solve that for missing blobs but not for the ones that are
> present, then git's object access performance is still slow.
>
> I have some ideas for improving git's gc behavior to bound the number
> of packfiles better.  I think that's mostly orthogonal to this
> proposal.  The main point of overlap is the number of packfiles
> produced if someone uses one packfile per large blob, but that is bad
> for the performance in successful lookups, not just negative lookups,
> so it needs to be solved directly anyway.
>

I've always wondered if repack, fetch, and friends should build
a meta-idx that merges all of the current .idx files, but that
is a different conversation....


>> 4. We can then think about dynamically fetching it.
>
> This also seems orthogonal to whether the missing blobs list exists.
>
> [...]
>> A missing-blob entry needs to contain the SHA-1 value of
>> the blob (obviously).  Other fields are nice to have, but
>> are not necessary.  Here are a few fields to consider.
> [...]
>> B. The raw size of the blob (5? bytes).
>
> This would be a very nice thing.
>
> If we were starting over, would this belong in the tree object?
> (Part of the reason I ask is that we have an opportunity to sort
> of start over as part of the transition away from SHA-1.)

Yes, putting the size in the tree would be nice.  That does
add 5-8 bytes to every entry in every tree (on top of the
larger hash), but it would be useful.

If we're going there, we might just define the new hash
as the concatenation of the SHA* and the length of the data
hashed.  So instead of a 32-byte SHA256, we'd have a (32 + 8)
byte value.  (Or maybe a (32 + 5) if we want to squeeze it.)
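That concatenation could look like this (purely illustrative, using the SHA-256 plus 8-byte length variant):

```python
import hashlib

def object_id(data):
    # 40-byte id: 32-byte SHA-256 || 8-byte big-endian length of the data.
    return hashlib.sha256(data).digest() + len(data).to_bytes(8, "big")

def size_from_id(oid):
    # The raw size is recoverable from the id alone, without the object.
    return int.from_bytes(oid[32:], "big")
```

With such an id, an "ls -l"-style size query needs only the tree entry, not the blob itself or an extra server round trip.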


>
> [...]
>> C. A server "hint" (20 bytes)
>>    () Instructions to help the client fetch the blob.
>>    () If I have multiple remotes configured, a missing-blob
>>       should be fetched from the same server that created
>>       the missing-blob entry (since it may be the only
>>       one that has it).
>
> I am worried about the implications of storing this kind of policy
> information in the pack file.  There may be useful information along
> these lines for a server to advertise, but I don't think it belongs in
> the pack file.  Git protocol relies on it being cheap to stream pack
> files as is.

Perhaps.  I'm not sure it would be that expensive -- I'm thinking
about it being a server constant rather than a per-user/per-fetch
field.  But maybe we don't need it, assuming we can correctly
associate a missing blob with the server/remote that omitted it.

>
> [...]
>
> Thanks and hope that helps,
> Jonathan
>

thanks!
Jeff

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH 00/10] RFC Partial Clone and Fetch
  2017-05-04 16:51     ` Jeff Hostetler
@ 2017-05-04 18:41       ` Jonathan Nieder
  0 siblings, 0 replies; 25+ messages in thread
From: Jonathan Nieder @ 2017-05-04 18:41 UTC (permalink / raw)
  To: Jeff Hostetler
  Cc: git, jeffhost, peff, gitster, markbt, benpeart, jonathantanmy

Hi again,

Jeff Hostetler wrote:

> In my original RFC there were comments/complaints that with
> missing blobs we lose the ability to detect corruptions.  My
> proposed changes to index-pack and rev-list (and suggestions
> for other commands like fsck) just disabled those errors.
> Personally, I'm OK with that, but I understand that others
> would like to save the ability to distinguish between missing
> and corrupted.

I'm also okay with it.  Just as a missing ref represents a
different valid state and thus passes fsck regardless of how it
came about, a missing blob in a partial clone is a valid state,
and it is sensible for it to pass fsck.

A person might object that previously a repository that passed "git
fsck" was a repository where "git fast-export --all" would succeed,
and if I omit a blob that is not present on the remote server then
that invariant is gone.  But that problem exists even if we have a
list of missing blobs.  The server could rewind history and garbage
collect, causing attempts on the client to fetch a previously
advertised missing blob to fail.  Or the server can disappear
completely, or it can lose all its data and have to be restored from
an older backup that is missing newer blobs.

> Right, only the .pack is sent over the wire.  And that's why I
> suggest listing the missing SHAs in it.  I think we need the server
> to send a list of them -- whether in individual per-file type-5
> records as I suggested, or a single record containing a list of
> SHAs+data (which I think I prefer in hindsight).

A list of SHAs+data sounds sensible as a way of conveying additional
per-blob information (such as size).

> WRT being able to discover the missing blobs, is that true in
> all cases?  I don't think it is for thin-packs -- where the server
> only sends stuff you don't (supposedly) already have, right?

Generate the list of blobs referenced by trees in the pack, when you
are inflating them in git index-pack.  Omit any objects that you
already know about.  The remainder is the set of missing blobs.

One thing this doesn't tell you is if the same missing blob is
available from multiple remotes.  It associates each blob with a
single remote, the first one it was discovered from.
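The procedure above can be sketched as follows. This is a rough illustration, assuming trees have already been inflated into (mode, name, oid) entries; the real work happens in C inside index-pack on the binary tree format.

```python
def missing_blobs(trees, have):
    """Collect ids referenced by the trees of a pack that are neither
    in the pack nor already present locally ('have' covers both)."""
    referenced = set()
    for entries in trees:                # each tree: list of (mode, name, oid)
        for mode, name, oid in entries:
            if mode != "40000":          # skip subtree entries; keep blobs/links
                referenced.add(oid)
    return referenced - have

trees = [[("100644", "README", "aaa1"), ("40000", "src", "bbb2")],
         [("100644", "main.c", "ccc3")]]
have = {"aaa1"}                          # locally known or in the pack
assert missing_blobs(trees, have) == {"ccc3"}
```

As noted, this only tells you *that* a blob is missing and which remote the pack came from; it cannot say whether other remotes also have it.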

> If instead, we have pack-object indicate that it *would have*
> sent this blob normally, we don't change any of the semantics
> of how pack files are assembled.  This gives the client a
> definitive list of what's missing.

If there is additional information the server wants to convey about
the missing blobs, then this makes sense to me --- it has to send it
somewhere, and a separate list outside the pack seems like a good
place to put that.

When there is no additional information beyond "this is a blob I am
omitting", there is nothing the wire protocol needs to convey.  But
you've convinced me that that's usually moot because the blob size
is valuable information.

[...]
> The more I think about it, I'd like to NOT put the list in the .idx
> file.  If we put it in a separate peer file next to the *.{pack,idx}
> we could decorate it with the name of the remote used in the fetch
> or clone.

I have no strong opinions about this in either direction.

Since it only affects the local repository format and doesn't affect
the protocol, we can experiment without too much fuss. :)

[...]
> I've always wondered if repack, fetch, and friends should build
> a meta-idx that merges all of the current .idx files, but that
> is a different conversation....

Yes, we've been thinking about a one-index-for-many-packs facility
to deal with the proliferation of packfiles with only one or a few
large objects without having to waste I/O copying them into a
concatenated pack file.

Another thing we're looking into is incorporating something like
Martin Fick's "git exproll" (
http://public-inbox.org/git/1375756727-1275-1-git-send-email-artagnon@gmail.com/)
into "git gc --auto" to more aggressively keep the number of packs
down.

> On 5/3/2017 2:27 PM, Jonathan Nieder wrote:

>> If we were starting over, would this belong in the tree object?
>> (Part of the reason I ask is that we have an opportunity to sort
>> of start over as part of the transition away from SHA-1.)
>
> Yes, putting the size in the tree would be nice.  That does
> add 5-8 bytes to every entry in every tree (on top of the
> larger hash), but it would be useful.
>
> If we're going there, we might just define the new hash
> as the concatenation of the SHA* and the length of the data
> hashed.  So instead of a 32-byte SHA256, we'd have a (32 + 8)
> byte value.  (Or maybe a (32 + 5) if we want to squeeze it.)

Thanks --- that feedback helps.

It doesn't stop us from having to figure something else out in the
short term, of course.

[...]
>> I am worried about the implications of storing this kind of policy
>> information in the pack file.  There may be useful information along
>> these lines for a server to advertise, but I don't think it belongs in
>> the pack file.  Git protocol relies on it being cheap to stream pack
>> files as is.
>
> Perhaps.  I'm not sure it would be that expensive -- I'm thinking
> of it as a server constant rather than a per-user/per-fetch
> field.  But maybe we don't need it, assuming we can correctly
> associate a missing blob with the server/remote that omitted it.

Right.

I think we're heading toward a consensus:

- that the server should (at least if requested?) inform the client of
  the blobs it is omitting and their sizes

- that this would go in the same response as the packfile, but
  out-of-line from the pack

- that it (at least optionally?) gets stored *somewhere* locally
  associated with the pack --- either through an extension to the idx
  file format or in a separate file alongside the pack file and idx
  file.  This doesn't affect the protocol so it's not expensive to
  experiment
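The out-of-line list in that consensus could be as simple as a flat sequence of fixed-width records. A toy encoding, assuming 20-byte SHA-1 ids and an 8-byte big-endian size per entry (both field widths are assumptions; no such file format exists yet):

```python
import struct

REC = 20 + 8  # hypothetical record: 20-byte oid + 8-byte size

def encode_missing_list(entries):
    """Serialize (oid, size) pairs into a flat byte string."""
    return b"".join(oid + struct.pack(">Q", size) for oid, size in entries)

def decode_missing_list(data):
    """Parse the flat byte string back into (oid, size) pairs."""
    return [(data[i:i + 20], struct.unpack(">Q", data[i + 20:i + REC])[0])
            for i in range(0, len(data), REC)]

entries = [(b"\x01" * 20, 4096), (b"\x02" * 20, 123456789)]
blob = encode_missing_list(entries)
assert decode_missing_list(blob) == entries
```

Such a file could sit next to the *.{pack,idx} pair and, as suggested earlier in the thread, be decorated with the name of the remote that produced it.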

Thanks for the thoughtful replies.

Sincerely,
Jonathan


* Re: [PATCH 00/10] RFC Partial Clone and Fetch
  2017-05-03 18:27   ` Jonathan Nieder
  2017-05-04 16:51     ` Jeff Hostetler
@ 2017-05-08  0:15     ` Junio C Hamano
  1 sibling, 0 replies; 25+ messages in thread
From: Junio C Hamano @ 2017-05-08  0:15 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Jeff Hostetler, git, jeffhost, peff, markbt, benpeart,
	jonathantanmy

Jonathan Nieder <jrnieder@gmail.com> writes:

> - there shouldn't be any need for the blobs to even be mentioned in
>   the .pack stored locally.  The .idx file maps from sha1 to offset
>   within the packfile --- a special offset could mean "this is a
>   missing blob".

Clever.

> - However, the list of missing blobs can be inferred from the existing
>   pack format, without a change to the pack format used in git
>   protocol.  As part of constructing the idx, "git index-pack"
>   inflates every object in the pack file sent by the server.  This
>   means we know what blobs they reference, so we can easily produce a
>   list for the idx file without changing the pack file format.

A minor wrinkle to keep in mind if you were to go this route is that
you'd need a way to tell the reason why a blob that is referenced by
a tree in the pack stream is not in the same pack stream.  

If the resulting repository on the receiving side has that blob
after the transfer, it is likely that the reason why the blob does
not appear in the pack is because the want/have/ack exchange told
the sending side that the receiving side has a commit that contains
the blob.  But when the blob does not exist in the receiving side
after the transfer, we cannot distinguish between two possible cases.  The
server may have actively wanted to omit it (i.e. the case we are
interested in in this discussion thread).  Or the receiving end said
that it has a commit that contains the blob, but due to preexisting
corruption, the receiving repository was missing the blob in
reality.


end of thread, other threads:[~2017-05-08  0:15 UTC | newest]

Thread overview: 25+ messages
2017-03-08 18:50 [PATCH 00/10] RFC Partial Clone and Fetch git
2017-03-08 18:50 ` [PATCH 01/10] pack-objects: eat CR in addition to LF after fgets git
2017-03-08 18:50 ` [PATCH 02/10] pack-objects: add --partial-by-size=n --partial-special git
2017-03-08 18:50 ` [PATCH 03/10] pack-objects: test for --partial-by-size --partial-special git
2017-03-08 18:50 ` [PATCH 04/10] upload-pack: add partial (sparse) fetch git
2017-03-08 18:50 ` [PATCH 05/10] fetch-pack: add partial-by-size and partial-special git
2017-03-08 18:50 ` [PATCH 06/10] rev-list: add --allow-partial option to relax connectivity checks git
2017-03-08 18:50 ` [PATCH 07/10] index-pack: add --allow-partial option to relax blob existence checks git
2017-03-08 18:50 ` [PATCH 08/10] fetch: add partial-by-size and partial-special arguments git
2017-03-08 18:50 ` [PATCH 09/10] clone: " git
2017-03-08 18:50 ` [PATCH 10/10] ls-partial: created command to list missing blobs git
2017-03-09 20:18 ` [PATCH 00/10] RFC Partial Clone and Fetch Jonathan Tan
2017-03-16 21:43   ` Jeff Hostetler
2017-03-17 14:13     ` Jeff Hostetler
2017-03-22 15:16 ` ankostis
2017-03-22 16:21   ` Johannes Schindelin
2017-03-22 17:51     ` Jeff Hostetler
2017-05-03 16:38 ` Jeff Hostetler
2017-05-03 18:27   ` Jonathan Nieder
2017-05-04 16:51     ` Jeff Hostetler
2017-05-04 18:41       ` Jonathan Nieder
2017-05-08  0:15     ` Junio C Hamano
2017-05-03 20:40   ` Jonathan Tan
2017-05-03 21:08     ` Jonathan Nieder
  -- strict thread matches above, loose matches on Subject: below --
2017-03-08 17:37 Jeff Hostetler
