git@vger.kernel.org mailing list mirror (one of many)
 help / Atom feed
* [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches)
@ 2017-09-29 20:11 Jonathan Tan
  2017-09-29 20:11 ` [PATCH 01/18] fsck: introduce partialclone extension Jonathan Tan
                   ` (20 more replies)
  0 siblings, 21 replies; 28+ messages in thread
From: Jonathan Tan @ 2017-09-29 20:11 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster, git, peartben, christian.couder

These patches are also available online:
https://github.com/jonathantanmy/git/commits/partialclone3

(I've announced it in another e-mail, but am now sending the patches to the
mailing list too.)

Here's an update of my work so far. Notable features:
 - These 18 patches allow a user to clone with --blob-max-bytes=<bytes>,
   creating a partial clone that is automatically configured to lazily
   fetch missing objects from the origin. The local repo also has fsck
   working offline, and GC working (albeit only on locally created
   objects).
 - Cloning and fetching is currently only able to exclude blobs by a
   size threshold, but the local repository is already capable of
   fetching missing objects of any type. For example, if a repository
   with missing trees or commits is generated by any tool (for example,
   a future version of Git), current Git with my patches will still be
   able to operate on them, automatically fetching those missing trees
   and commits when needed.
 - Missing blobs are fetched all at once during checkout.

Jeff Hostetler has sent out some object-filtering patches [1] that is a
superset of the object-filtering functionality that I have (in the
pack-objects patches). I have gone for the minimal approach here, but if
his patches are merged, I'll update my patch set to use those.

[1] https://public-inbox.org/git/20170922203017.53986-6-git@jeffhostetler.com/

Demo
====

Obtain a repository.

    $ make prefix=$HOME/local install
    $ cd $HOME/tmp
    $ git clone https://github.com/git/git

Make it advertise the new feature and allow requests for arbitrary blobs.

    $ git -C git config uploadpack.advertiseblobmaxbytes 1
    $ git -C git config uploadpack.allowanysha1inwant 1

Perform the partial clone and check that it is indeed smaller. Specify
"file://" in order to test the partial clone mechanism. (If not, Git will
perform a local clone, which unselectively copies every object.)

    $ git clone --blob-max-bytes=0 "file://$(pwd)/git" git2
    $ git clone "file://$(pwd)/git" git3
    $ du -sh git2 git3
    85M	git2
    130M	git3

Observe that the new repo is automatically configured to fetch missing objects
from the original repo. Subsequent fetches will also be partial.

    $ cat git2/.git/config
    [core]
    	repositoryformatversion = 1
    	filemode = true
    	bare = false
    	logallrefupdates = true
    [remote "origin"]
    	url = [snip]
    	fetch = +refs/heads/*:refs/remotes/origin/*
    	blobmaxbytes = 0
    [extensions]
    	partialclone = origin
    [branch "master"]
    	remote = origin
    	merge = refs/heads/master

Design
======

Local repository layout
-----------------------

A repository declares its dependence on a *promisor remote* (a remote that
declares that it can serve certain objects when requested) by a repository
extension "partialclone". `extensions.partialclone` must be set to the name of
the remote ("origin" in the demo above).

A packfile can be annotated as originating from the promisor remote by the
existence of a "(packfile name).promisor" file with arbitrary contents (similar
to the ".keep" file). Whenever a promisor remote sends an object, it declares
that it can serve every object directly or indirectly referenced by the sent
object.

A promisor packfile is a packfile annotated with the ".promisor" file. A
promisor object is an object that the promisor remote is known to be able to
serve, because it is an object in a promisor packfile or directly referred to by
one.

(In the future, we might need to add ".promisor" support to loose objects.)

Connectivity check and gc
-------------------------

The object walk done by the connectivity check (as used by fsck and fetch) stops
at all promisor objects.

The object walk done by gc also stops at all promisor objects. Only non-promisor
packfiles are deleted (if pack deletion is requested); promisor packfiles are
left alone. This maintains the distinction between promisor packfiles and
non-promisor packfiles. (In the future, we might need to do something more
sophisticated with promisor packfiles.)

Fetching of missing objects
---------------------------

When `sha1_object_info_extended()` (or similar) is invoked, it will
automatically attempt to fetch a missing object from the promisor remote if that
object is not in the local repository. For efficiency, no check is made as to
whether that object is known to be a promisor object or not.

This automatic fetching can be toggled on and off by the `fetch_if_missing`
global variable, and it is on by default.

The actual fetch is done through the fetch-pack/upload-pack protocol. Right now,
this uses the fact that upload-pack allows blob and tree "want"s, and this
incurs the overhead of the unnecessary ref advertisement. I hope that protocol
v2 will allow us to declare that blob and tree "want"s are allowed, and allow
the client to declare that it does not want the ref advertisement. All packfiles
downloaded in this way are annotated with ".promisor".

Fetching with `git fetch`
-------------------------

The fetch-pack/upload-pack protocol has also been extended to support omission
of blobs above a certain size. The client only allows this when fetching from
the promisor remote, and will annotate any packs received from the promisor
remote with ".promisor".

Jonathan Tan (18):
  fsck: introduce partialclone extension
  fsck: support refs pointing to promisor objects
  fsck: support referenced promisor objects
  fsck: support promisor objects as CLI argument
  index-pack: refactor writing of .keep files
  introduce fetch-object: fetch one promisor object
  sha1_file: support lazily fetching missing objects
  rev-list: support termination at promisor objects
  gc: do not repack promisor packfiles
  pack-objects: rename want_.* to ignore_.*
  pack-objects: support --blob-max-bytes
  fetch-pack: support excluding large blobs
  fetch: refactor calculation of remote list
  fetch: support excluding large blobs
  clone: support excluding large blobs
  clone: configure blobmaxbytes in created repos
  unpack-trees: batch fetching of missing blobs
  fetch-pack: restore save_commit_buffer after use

 Documentation/git-pack-objects.txt                |  12 +-
 Documentation/technical/pack-protocol.txt         |   9 +
 Documentation/technical/protocol-capabilities.txt |   7 +
 Documentation/technical/repository-version.txt    |  12 +
 Makefile                                          |   1 +
 builtin/cat-file.c                                |   2 +
 builtin/clone.c                                   |  24 +-
 builtin/fetch-pack.c                              |  21 ++
 builtin/fetch.c                                   |  36 ++-
 builtin/fsck.c                                    |  26 +-
 builtin/gc.c                                      |   3 +
 builtin/index-pack.c                              | 113 ++++---
 builtin/pack-objects.c                            |  97 ++++--
 builtin/prune.c                                   |   7 +
 builtin/repack.c                                  |   7 +-
 builtin/rev-list.c                                |  13 +
 cache.h                                           |  13 +-
 connected.c                                       |   1 +
 environment.c                                     |   1 +
 fetch-object.c                                    |  45 +++
 fetch-object.h                                    |  11 +
 fetch-pack.c                                      |  23 +-
 fetch-pack.h                                      |   3 +
 list-objects.c                                    |  16 +-
 object.c                                          |   2 +-
 packfile.c                                        |  77 ++++-
 packfile.h                                        |  13 +
 remote-curl.c                                     |  21 +-
 remote.c                                          |   2 +
 remote.h                                          |   2 +
 revision.c                                        |  33 ++-
 revision.h                                        |   5 +-
 setup.c                                           |   7 +-
 sha1_file.c                                       |  38 ++-
 t/t0410-partial-clone.sh                          | 343 ++++++++++++++++++++++
 t/t5300-pack-object.sh                            |  45 +++
 t/t5500-fetch-pack.sh                             | 115 ++++++++
 t/t5601-clone.sh                                  | 101 +++++++
 t/test-lib-functions.sh                           |  12 +
 transport-helper.c                                |   4 +
 transport.c                                       |  18 ++
 transport.h                                       |  12 +
 unpack-trees.c                                    |  22 ++
 upload-pack.c                                     |  16 +-
 44 files changed, 1278 insertions(+), 113 deletions(-)
 create mode 100644 fetch-object.c
 create mode 100644 fetch-object.h
 create mode 100755 t/t0410-partial-clone.sh

-- 
2.14.2.822.g60be5d43e6-goog


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 01/18] fsck: introduce partialclone extension
  2017-09-29 20:11 [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches) Jonathan Tan
@ 2017-09-29 20:11 ` Jonathan Tan
  2017-09-29 20:11 ` [PATCH 02/18] fsck: support refs pointing to promisor objects Jonathan Tan
                   ` (19 subsequent siblings)
  20 siblings, 0 replies; 28+ messages in thread
From: Jonathan Tan @ 2017-09-29 20:11 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster, git, peartben, christian.couder

Currently, Git does not support repos with very large numbers of objects
or repos that wish to minimize manipulation of certain blobs (for
example, because they are very large) very well, even if the user
operates mostly on part of the repo, because Git is designed on the
assumption that every referenced object is available somewhere in the
repo storage. In such an arrangement, the full set of objects is usually
available in remote storage, ready to be lazily downloaded.

Introduce the ability to have missing objects in a repo.  This
functionality is guarded behind a new repository extension option
`extensions.partialClone`. See the update to
Documentation/technical/repository-version.txt in this patch for more
information.

Teach fsck about the new state of affairs. In this commit, teach fsck
that missing promisor objects referenced from the reflog are not an
error case; in future commits, fsck will be taught about other cases.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 Documentation/technical/repository-version.txt | 12 ++++
 builtin/fsck.c                                 |  2 +-
 cache.h                                        |  5 +-
 environment.c                                  |  1 +
 packfile.c                                     | 77 +++++++++++++++++++++++-
 packfile.h                                     | 13 +++++
 setup.c                                        |  7 ++-
 t/t0410-partial-clone.sh                       | 81 ++++++++++++++++++++++++++
 8 files changed, 192 insertions(+), 6 deletions(-)
 create mode 100755 t/t0410-partial-clone.sh

diff --git a/Documentation/technical/repository-version.txt b/Documentation/technical/repository-version.txt
index 00ad37986..074ccb9e0 100644
--- a/Documentation/technical/repository-version.txt
+++ b/Documentation/technical/repository-version.txt
@@ -86,3 +86,15 @@ for testing format-1 compatibility.
 When the config key `extensions.preciousObjects` is set to `true`,
 objects in the repository MUST NOT be deleted (e.g., by `git-prune` or
 `git repack -d`).
+
+`partialClone`
+~~~~~~~~~~~~~~
+
+When the config key `extensions.partialClone` is set, a remote is
+designated as a "promisor remote". Objects referenced by packed objects
+obtained from that promisor remote do not need to be in the local repo.
+Instead, the promisor remote promises that all such objects can be
+fetched from it in the future, and Git will fetch those objects when
+needed.
+
+The value of this key is the name of the aforementioned promisor remote.
diff --git a/builtin/fsck.c b/builtin/fsck.c
index 1e4c471b4..97d1e621e 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -403,7 +403,7 @@ static void fsck_handle_reflog_oid(const char *refname, struct object_id *oid,
 					xstrfmt("%s@{%"PRItime"}", refname, timestamp));
 			obj->flags |= USED;
 			mark_object_reachable(obj);
-		} else {
+		} else if (!is_promisor_object(oid)) {
 			error("%s: invalid reflog entry %s", refname, oid_to_hex(oid));
 			errors_found |= ERROR_REACHABLE;
 		}
diff --git a/cache.h b/cache.h
index a916bc79e..437206d06 100644
--- a/cache.h
+++ b/cache.h
@@ -853,10 +853,12 @@ extern int grafts_replace_parents;
 #define GIT_REPO_VERSION 0
 #define GIT_REPO_VERSION_READ 1
 extern int repository_format_precious_objects;
+extern char *repository_format_partial_clone;
 
 struct repository_format {
 	int version;
 	int precious_objects;
+	char *partial_clone;
 	int is_bare;
 	char *work_tree;
 	struct string_list unknown_extensions;
@@ -1578,7 +1580,8 @@ extern struct packed_git {
 	unsigned pack_local:1,
 		 pack_keep:1,
 		 freshened:1,
-		 do_not_close:1;
+		 do_not_close:1,
+		 pack_promisor:1;
 	unsigned char sha1[20];
 	struct revindex_entry *revindex;
 	/* something like ".git/objects/pack/xxxxx.pack" */
diff --git a/environment.c b/environment.c
index f1f934b6f..624385fb2 100644
--- a/environment.c
+++ b/environment.c
@@ -27,6 +27,7 @@ int warn_ambiguous_refs = 1;
 int warn_on_object_refname_ambiguity = 1;
 int ref_paranoia = -1;
 int repository_format_precious_objects;
+char *repository_format_partial_clone;
 const char *git_commit_encoding;
 const char *git_log_output_encoding;
 const char *apply_default_whitespace;
diff --git a/packfile.c b/packfile.c
index f86fa051c..88f424320 100644
--- a/packfile.c
+++ b/packfile.c
@@ -8,6 +8,11 @@
 #include "list.h"
 #include "streaming.h"
 #include "sha1-lookup.h"
+#include "commit.h"
+#include "object.h"
+#include "tag.h"
+#include "tree-walk.h"
+#include "tree.h"
 
 char *odb_pack_name(struct strbuf *buf,
 		    const unsigned char *sha1,
@@ -638,10 +643,10 @@ struct packed_git *add_packed_git(const char *path, size_t path_len, int local)
 		return NULL;
 
 	/*
-	 * ".pack" is long enough to hold any suffix we're adding (and
+	 * ".promisor" is long enough to hold any suffix we're adding (and
 	 * the use xsnprintf double-checks that)
 	 */
-	alloc = st_add3(path_len, strlen(".pack"), 1);
+	alloc = st_add3(path_len, strlen(".promisor"), 1);
 	p = alloc_packed_git(alloc);
 	memcpy(p->pack_name, path, path_len);
 
@@ -649,6 +654,10 @@ struct packed_git *add_packed_git(const char *path, size_t path_len, int local)
 	if (!access(p->pack_name, F_OK))
 		p->pack_keep = 1;
 
+	xsnprintf(p->pack_name + path_len, alloc - path_len, ".promisor");
+	if (!access(p->pack_name, F_OK))
+		p->pack_promisor = 1;
+
 	xsnprintf(p->pack_name + path_len, alloc - path_len, ".pack");
 	if (stat(p->pack_name, &st) || !S_ISREG(st.st_mode)) {
 		free(p);
@@ -776,7 +785,8 @@ static void prepare_packed_git_one(char *objdir, int local)
 		if (ends_with(de->d_name, ".idx") ||
 		    ends_with(de->d_name, ".pack") ||
 		    ends_with(de->d_name, ".bitmap") ||
-		    ends_with(de->d_name, ".keep"))
+		    ends_with(de->d_name, ".keep") ||
+		    ends_with(de->d_name, ".promisor"))
 			string_list_append(&garbage, path.buf);
 		else
 			report_garbage(PACKDIR_FILE_GARBAGE, path.buf);
@@ -1884,6 +1894,9 @@ int for_each_packed_object(each_packed_object_fn cb, void *data, unsigned flags)
 	for (p = packed_git; p; p = p->next) {
 		if ((flags & FOR_EACH_OBJECT_LOCAL_ONLY) && !p->pack_local)
 			continue;
+		if ((flags & FOR_EACH_OBJECT_PROMISOR_ONLY) &&
+		    !p->pack_promisor)
+			continue;
 		if (open_pack_index(p)) {
 			pack_errors = 1;
 			continue;
@@ -1894,3 +1907,61 @@ int for_each_packed_object(each_packed_object_fn cb, void *data, unsigned flags)
 	}
 	return r ? r : pack_errors;
 }
+
+static int add_promisor_object(const struct object_id *oid,
+			       struct packed_git *pack,
+			       uint32_t pos,
+			       void *set_)
+{
+	struct oidset *set = set_;
+	struct object *obj = parse_object(oid);
+	if (!obj)
+		return 1;
+
+	oidset_insert(set, oid);
+
+	/*
+	 * If this is a tree, commit, or tag, the objects it refers
+	 * to are also promisor objects. (Blobs refer to no objects.)
+	 */
+	if (obj->type == OBJ_TREE) {
+		struct tree *tree = (struct tree *)obj;
+		struct tree_desc desc;
+		struct name_entry entry;
+		if (init_tree_desc_gently(&desc, tree->buffer, tree->size))
+			/*
+			 * Error messages are given when packs are
+			 * verified, so do not print any here.
+			 */
+			return 0;
+		while (tree_entry_gently(&desc, &entry))
+			oidset_insert(set, entry.oid);
+	} else if (obj->type == OBJ_COMMIT) {
+		struct commit *commit = (struct commit *) obj;
+		struct commit_list *parents = commit->parents;
+
+		oidset_insert(set, &commit->tree->object.oid);
+		for (; parents; parents = parents->next)
+			oidset_insert(set, &parents->item->object.oid);
+	} else if (obj->type == OBJ_TAG) {
+		struct tag *tag = (struct tag *) obj;
+		oidset_insert(set, &tag->tagged->oid);
+	}
+	return 0;
+}
+
+int is_promisor_object(const struct object_id *oid)
+{
+	static struct oidset promisor_objects;
+	static int promisor_objects_prepared;
+
+	if (!promisor_objects_prepared) {
+		if (repository_format_partial_clone) {
+			for_each_packed_object(add_promisor_object,
+					       &promisor_objects,
+					       FOR_EACH_OBJECT_PROMISOR_ONLY);
+		}
+		promisor_objects_prepared = 1;
+	}
+	return oidset_contains(&promisor_objects, oid);
+}
diff --git a/packfile.h b/packfile.h
index 0cdeb54dc..a7fca598d 100644
--- a/packfile.h
+++ b/packfile.h
@@ -1,6 +1,8 @@
 #ifndef PACKFILE_H
 #define PACKFILE_H
 
+#include "oidset.h"
+
 /*
  * Generate the filename to be used for a pack file with checksum "sha1" and
  * extension "ext". The result is written into the strbuf "buf", overwriting
@@ -124,6 +126,11 @@ extern int has_sha1_pack(const unsigned char *sha1);
 
 extern int has_pack_index(const unsigned char *sha1);
 
+/*
+ * Only iterate over packs obtained from the promisor remote.
+ */
+#define FOR_EACH_OBJECT_PROMISOR_ONLY 2
+
 /*
  * Iterate over packed objects in both the local
  * repository and any alternates repositories (unless the
@@ -135,4 +142,10 @@ typedef int each_packed_object_fn(const struct object_id *oid,
 				  void *data);
 extern int for_each_packed_object(each_packed_object_fn, void *, unsigned flags);
 
+/*
+ * Return 1 if an object in a promisor packfile is or refers to the given
+ * object, 0 otherwise.
+ */
+extern int is_promisor_object(const struct object_id *oid);
+
 #endif
diff --git a/setup.c b/setup.c
index 6d8380acd..35c9da806 100644
--- a/setup.c
+++ b/setup.c
@@ -420,7 +420,11 @@ static int check_repo_format(const char *var, const char *value, void *vdata)
 			;
 		else if (!strcmp(ext, "preciousobjects"))
 			data->precious_objects = git_config_bool(var, value);
-		else
+		else if (!strcmp(ext, "partialclone")) {
+			if (!value)
+				return config_error_nonbool(var);
+			data->partial_clone = xstrdup(value);
+		} else
 			string_list_append(&data->unknown_extensions, ext);
 	} else if (strcmp(var, "core.bare") == 0) {
 		data->is_bare = git_config_bool(var, value);
@@ -463,6 +467,7 @@ static int check_repository_format_gently(const char *gitdir, int *nongit_ok)
 	}
 
 	repository_format_precious_objects = candidate.precious_objects;
+	repository_format_partial_clone = candidate.partial_clone;
 	string_list_clear(&candidate.unknown_extensions, 0);
 	if (!has_common) {
 		if (candidate.is_bare != -1) {
diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
new file mode 100755
index 000000000..3ddb3b98f
--- /dev/null
+++ b/t/t0410-partial-clone.sh
@@ -0,0 +1,81 @@
+#!/bin/sh
+
+test_description='partial clone'
+
+. ./test-lib.sh
+
+delete_object () {
+	rm $1/.git/objects/$(echo $2 | sed -e 's|^..|&/|')
+}
+
+pack_as_from_promisor () {
+	HASH=$(git -C repo pack-objects .git/objects/pack/pack) &&
+	>repo/.git/objects/pack/pack-$HASH.promisor
+}
+
+test_expect_success 'missing reflog object, but promised by a commit, passes fsck' '
+	test_create_repo repo &&
+	test_commit -C repo my_commit &&
+
+	A=$(git -C repo commit-tree -m a HEAD^{tree}) &&
+	C=$(git -C repo commit-tree -m c -p $A HEAD^{tree}) &&
+
+	# Reference $A only from reflog, and delete it
+	git -C repo branch my_branch "$A" &&
+	git -C repo branch -f my_branch my_commit &&
+	delete_object repo "$A" &&
+
+	# State that we got $C, which refers to $A, from promisor
+	printf "$C\n" | pack_as_from_promisor &&
+
+	# Normally, it fails
+	test_must_fail git -C repo fsck &&
+
+	# But with the extension, it succeeds
+	git -C repo config core.repositoryformatversion 1 &&
+	git -C repo config extensions.partialclone "arbitrary string" &&
+	git -C repo fsck
+'
+
+test_expect_success 'missing reflog object, but promised by a tag, passes fsck' '
+	rm -rf repo &&
+	test_create_repo repo &&
+	test_commit -C repo my_commit &&
+
+	A=$(git -C repo commit-tree -m a HEAD^{tree}) &&
+	git -C repo tag -a -m d my_tag_name $A &&
+	T=$(git -C repo rev-parse my_tag_name) &&
+	git -C repo tag -d my_tag_name &&
+
+	# Reference $A only from reflog, and delete it
+	git -C repo branch my_branch "$A" &&
+	git -C repo branch -f my_branch my_commit &&
+	delete_object repo "$A" &&
+
+	# State that we got $T, which refers to $A, from promisor
+	printf "$T\n" | pack_as_from_promisor &&
+
+	git -C repo config core.repositoryformatversion 1 &&
+	git -C repo config extensions.partialclone "arbitrary string" &&
+	git -C repo fsck
+'
+
+test_expect_success 'missing reflog object alone fails fsck, even with extension set' '
+	rm -rf repo &&
+	test_create_repo repo &&
+	test_commit -C repo my_commit &&
+
+	A=$(git -C repo commit-tree -m a HEAD^{tree}) &&
+	B=$(git -C repo commit-tree -m b HEAD^{tree}) &&
+
+	# Reference $A only from reflog, and delete it
+	git -C repo branch my_branch "$A" &&
+	git -C repo branch -f my_branch my_commit &&
+	delete_object repo "$A" &&
+
+	git -C repo config core.repositoryformatversion 1 &&
+	git -C repo config extensions.partialclone "arbitrary string" &&
+	test_must_fail git -C repo fsck
+'
+
+test_done
-- 
2.14.2.822.g60be5d43e6-goog


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 02/18] fsck: support refs pointing to promisor objects
  2017-09-29 20:11 [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches) Jonathan Tan
  2017-09-29 20:11 ` [PATCH 01/18] fsck: introduce partialclone extension Jonathan Tan
@ 2017-09-29 20:11 ` Jonathan Tan
  2017-09-29 20:11 ` [PATCH 03/18] fsck: support referenced " Jonathan Tan
                   ` (18 subsequent siblings)
  20 siblings, 0 replies; 28+ messages in thread
From: Jonathan Tan @ 2017-09-29 20:11 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster, git, peartben, christian.couder

Teach fsck to not treat refs referring to missing promisor objects as an
error when extensions.partialclone is set.

For the purposes of warning about no default refs, such refs are still
treated as legitimate refs.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 builtin/fsck.c           |  8 ++++++++
 t/t0410-partial-clone.sh | 24 ++++++++++++++++++++++++
 2 files changed, 32 insertions(+)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index 97d1e621e..f1529527b 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -439,6 +439,14 @@ static int fsck_handle_ref(const char *refname, const struct object_id *oid,
 
 	obj = parse_object(oid);
 	if (!obj) {
+		if (is_promisor_object(oid)) {
+			/*
+			 * Increment default_refs anyway, because this is a
+			 * valid ref.
+			 */
+			 default_refs++;
+			 return 0;
+		}
 		error("%s: invalid sha1 pointer %s", refname, oid_to_hex(oid));
 		errors_found |= ERROR_REACHABLE;
 		/* We'll continue with the rest despite the error.. */
diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
index 3ddb3b98f..bf75162c1 100755
--- a/t/t0410-partial-clone.sh
+++ b/t/t0410-partial-clone.sh
@@ -13,6 +13,14 @@ pack_as_from_promisor () {
 	>repo/.git/objects/pack/pack-$HASH.promisor
 }
 
+promise_and_delete () {
+	HASH=$(git -C repo rev-parse "$1") &&
+	git -C repo tag -a -m message my_annotated_tag "$HASH" &&
+	git -C repo rev-parse my_annotated_tag | pack_as_from_promisor &&
+	git -C repo tag -d my_annotated_tag &&
+	delete_object repo "$HASH"
+}
+
 test_expect_success 'missing reflog object, but promised by a commit, passes fsck' '
 	test_create_repo repo &&
 	test_commit -C repo my_commit &&
@@ -78,4 +86,20 @@ test_expect_success 'missing reflog object alone fails fsck, even with extension
 	test_must_fail git -C repo fsck
 '
 
+test_expect_success 'missing ref object, but promised, passes fsck' '
+	rm -rf repo &&
+	test_create_repo repo &&
+	test_commit -C repo my_commit &&
+
+	A=$(git -C repo commit-tree -m a HEAD^{tree}) &&
+
+	# Reference $A only from ref
+	git -C repo branch my_branch "$A" &&
+	promise_and_delete "$A" &&
+
+	git -C repo config core.repositoryformatversion 1 &&
+	git -C repo config extensions.partialclone "arbitrary string" &&
+	git -C repo fsck
+'
+
 test_done
-- 
2.14.2.822.g60be5d43e6-goog


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 03/18] fsck: support referenced promisor objects
  2017-09-29 20:11 [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches) Jonathan Tan
  2017-09-29 20:11 ` [PATCH 01/18] fsck: introduce partialclone extension Jonathan Tan
  2017-09-29 20:11 ` [PATCH 02/18] fsck: support refs pointing to promisor objects Jonathan Tan
@ 2017-09-29 20:11 ` " Jonathan Tan
  2017-09-29 20:11 ` [PATCH 04/18] fsck: support promisor objects as CLI argument Jonathan Tan
                   ` (17 subsequent siblings)
  20 siblings, 0 replies; 28+ messages in thread
From: Jonathan Tan @ 2017-09-29 20:11 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster, git, peartben, christian.couder

Teach fsck to not treat missing promisor objects indirectly pointed to
by refs as an error when extensions.partialclone is set.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 builtin/fsck.c           | 11 +++++++++++
 t/t0410-partial-clone.sh | 23 +++++++++++++++++++++++
 2 files changed, 34 insertions(+)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index f1529527b..4492a4fab 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -149,6 +149,15 @@ static int mark_object(struct object *obj, int type, void *data, struct fsck_opt
 	if (obj->flags & REACHABLE)
 		return 0;
 	obj->flags |= REACHABLE;
+
+	if (is_promisor_object(&obj->oid))
+		/*
+		 * Further recursion does not need to be performed on this
+		 * object since it is a promisor object (so it does not need to
+		 * be added to "pending").
+		 */
+		return 0;
+
 	if (!(obj->flags & HAS_OBJ)) {
 		if (parent && !has_object_file(&obj->oid)) {
 			printf("broken link from %7s %s\n",
@@ -213,6 +222,8 @@ static void check_reachable_object(struct object *obj)
 	 * do a full fsck
 	 */
 	if (!(obj->flags & HAS_OBJ)) {
+		if (is_promisor_object(&obj->oid))
+			return;
 		if (has_sha1_pack(obj->oid.hash))
 			return; /* it is in pack - forget about it */
 		printf("missing %s %s\n", printable_type(obj),
diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
index bf75162c1..4f9931f9b 100755
--- a/t/t0410-partial-clone.sh
+++ b/t/t0410-partial-clone.sh
@@ -102,4 +102,27 @@ test_expect_success 'missing ref object, but promised, passes fsck' '
 	git -C repo fsck
 '
 
+test_expect_success 'missing object, but promised, passes fsck' '
+	rm -rf repo &&
+	test_create_repo repo &&
+	test_commit -C repo 1 &&
+	test_commit -C repo 2 &&
+	test_commit -C repo 3 &&
+	git -C repo tag -a annotated_tag -m "annotated tag" &&
+
+	C=$(git -C repo rev-parse 1) &&
+	T=$(git -C repo rev-parse 2^{tree}) &&
+	B=$(git hash-object repo/3.t) &&
+	AT=$(git -C repo rev-parse annotated_tag) &&
+
+	promise_and_delete "$C" &&
+	promise_and_delete "$T" &&
+	promise_and_delete "$B" &&
+	promise_and_delete "$AT" &&
+
+	git -C repo config core.repositoryformatversion 1 &&
+	git -C repo config extensions.partialclone "arbitrary string" &&
+	git -C repo fsck
+'
+
 test_done
-- 
2.14.2.822.g60be5d43e6-goog


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 04/18] fsck: support promisor objects as CLI argument
  2017-09-29 20:11 [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches) Jonathan Tan
                   ` (2 preceding siblings ...)
  2017-09-29 20:11 ` [PATCH 03/18] fsck: support referenced " Jonathan Tan
@ 2017-09-29 20:11 ` Jonathan Tan
  2017-09-29 20:11 ` [PATCH 05/18] index-pack: refactor writing of .keep files Jonathan Tan
                   ` (16 subsequent siblings)
  20 siblings, 0 replies; 28+ messages in thread
From: Jonathan Tan @ 2017-09-29 20:11 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster, git, peartben, christian.couder

Teach fsck to not treat missing promisor objects provided on the CLI as
an error when extensions.partialclone is set.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 builtin/fsck.c           |  2 ++
 t/t0410-partial-clone.sh | 13 +++++++++++++
 2 files changed, 15 insertions(+)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index 4492a4fab..f6cb4d755 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -755,6 +755,8 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
 			struct object *obj = lookup_object(oid.hash);
 
 			if (!obj || !(obj->flags & HAS_OBJ)) {
+				if (is_promisor_object(&oid))
+					continue;
 				error("%s: object missing", oid_to_hex(&oid));
 				errors_found |= ERROR_OBJECT;
 				continue;
diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
index 4f9931f9b..e96f436b0 100755
--- a/t/t0410-partial-clone.sh
+++ b/t/t0410-partial-clone.sh
@@ -125,4 +125,17 @@ test_expect_success 'missing object, but promised, passes fsck' '
 	git -C repo fsck
 '
 
+test_expect_success 'missing CLI object, but promised, passes fsck' '
+	rm -rf repo &&
+	test_create_repo repo &&
+	test_commit -C repo my_commit &&
+
+	A=$(git -C repo commit-tree -m a HEAD^{tree}) &&
+	promise_and_delete "$A" &&
+
+	git -C repo config core.repositoryformatversion 1 &&
+	git -C repo config extensions.partialclone "arbitrary string" &&
+	git -C repo fsck "$A"
+'
+
 test_done
-- 
2.14.2.822.g60be5d43e6-goog


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 05/18] index-pack: refactor writing of .keep files
  2017-09-29 20:11 [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches) Jonathan Tan
                   ` (3 preceding siblings ...)
  2017-09-29 20:11 ` [PATCH 04/18] fsck: support promisor objects as CLI argument Jonathan Tan
@ 2017-09-29 20:11 ` Jonathan Tan
  2017-09-29 20:11 ` [PATCH 06/18] introduce fetch-object: fetch one promisor object Jonathan Tan
                   ` (15 subsequent siblings)
  20 siblings, 0 replies; 28+ messages in thread
From: Jonathan Tan @ 2017-09-29 20:11 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster, git, peartben, christian.couder

In a subsequent commit, index-pack will be taught to write ".promisor"
files which are similar to the ".keep" files it knows how to write.
Refactor the writing of ".keep" files, so that the implementation of
writing ".promisor" files becomes easier.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 builtin/index-pack.c | 99 ++++++++++++++++++++++++++++------------------------
 1 file changed, 53 insertions(+), 46 deletions(-)

diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index f2be145e1..7ad170590 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -1389,15 +1389,58 @@ static void fix_unresolved_deltas(struct sha1file *f)
 	free(sorted_by_pos);
 }
 
+static const char *derive_filename(const char *pack_name, const char *suffix,
+				   struct strbuf *buf)
+{
+	size_t len;
+	if (!strip_suffix(pack_name, ".pack", &len))
+		die(_("packfile name '%s' does not end with '.pack'"),
+		    pack_name);
+	strbuf_add(buf, pack_name, len);
+	strbuf_addch(buf, '.');
+	strbuf_addstr(buf, suffix);
+	return buf->buf;
+}
+
+static void write_special_file(const char *suffix, const char *msg,
+			       const char *pack_name, const unsigned char *sha1,
+			       const char **report)
+{
+	struct strbuf name_buf = STRBUF_INIT;
+	const char *filename;
+	int fd;
+	int msg_len = strlen(msg);
+
+	if (pack_name)
+		filename = derive_filename(pack_name, suffix, &name_buf);
+	else
+		filename = odb_pack_name(&name_buf, sha1, suffix);
+
+	fd = odb_pack_keep(filename);
+	if (fd < 0) {
+		if (errno != EEXIST)
+			die_errno(_("cannot write %s file '%s'"),
+				  suffix, filename);
+	} else {
+		if (msg_len > 0) {
+			write_or_die(fd, msg, msg_len);
+			write_or_die(fd, "\n", 1);
+		}
+		if (close(fd) != 0)
+			die_errno(_("cannot close written %s file '%s'"),
+				  suffix, filename);
+		*report = suffix;
+	}
+	strbuf_release(&name_buf);
+}
+
 static void final(const char *final_pack_name, const char *curr_pack_name,
 		  const char *final_index_name, const char *curr_index_name,
-		  const char *keep_name, const char *keep_msg,
-		  unsigned char *sha1)
+		  const char *keep_msg, unsigned char *sha1)
 {
 	const char *report = "pack";
 	struct strbuf pack_name = STRBUF_INIT;
 	struct strbuf index_name = STRBUF_INIT;
-	struct strbuf keep_name_buf = STRBUF_INIT;
 	int err;
 
 	if (!from_stdin) {
@@ -1409,28 +1452,9 @@ static void final(const char *final_pack_name, const char *curr_pack_name,
 			die_errno(_("error while closing pack file"));
 	}
 
-	if (keep_msg) {
-		int keep_fd, keep_msg_len = strlen(keep_msg);
-
-		if (!keep_name)
-			keep_name = odb_pack_name(&keep_name_buf, sha1, "keep");
-
-		keep_fd = odb_pack_keep(keep_name);
-		if (keep_fd < 0) {
-			if (errno != EEXIST)
-				die_errno(_("cannot write keep file '%s'"),
-					  keep_name);
-		} else {
-			if (keep_msg_len > 0) {
-				write_or_die(keep_fd, keep_msg, keep_msg_len);
-				write_or_die(keep_fd, "\n", 1);
-			}
-			if (close(keep_fd) != 0)
-				die_errno(_("cannot close written keep file '%s'"),
-					  keep_name);
-			report = "keep";
-		}
-	}
+	if (keep_msg)
+		write_special_file("keep", keep_msg, final_pack_name, sha1,
+				   &report);
 
 	if (final_pack_name != curr_pack_name) {
 		if (!final_pack_name)
@@ -1472,7 +1496,6 @@ static void final(const char *final_pack_name, const char *curr_pack_name,
 
 	strbuf_release(&index_name);
 	strbuf_release(&pack_name);
-	strbuf_release(&keep_name_buf);
 }
 
 static int git_index_pack_config(const char *k, const char *v, void *cb)
@@ -1615,26 +1638,13 @@ static void show_pack_info(int stat_only)
 	}
 }
 
-static const char *derive_filename(const char *pack_name, const char *suffix,
-				   struct strbuf *buf)
-{
-	size_t len;
-	if (!strip_suffix(pack_name, ".pack", &len))
-		die(_("packfile name '%s' does not end with '.pack'"),
-		    pack_name);
-	strbuf_add(buf, pack_name, len);
-	strbuf_addstr(buf, suffix);
-	return buf->buf;
-}
-
 int cmd_index_pack(int argc, const char **argv, const char *prefix)
 {
 	int i, fix_thin_pack = 0, verify = 0, stat_only = 0;
 	const char *curr_index;
 	const char *index_name = NULL, *pack_name = NULL;
-	const char *keep_name = NULL, *keep_msg = NULL;
-	struct strbuf index_name_buf = STRBUF_INIT,
-		      keep_name_buf = STRBUF_INIT;
+	const char *keep_msg = NULL;
+	struct strbuf index_name_buf = STRBUF_INIT;
 	struct pack_idx_entry **idx_objects;
 	struct pack_idx_option opts;
 	unsigned char pack_sha1[20];
@@ -1745,9 +1755,7 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 	if (from_stdin && !startup_info->have_repository)
 		die(_("--stdin requires a git repository"));
 	if (!index_name && pack_name)
-		index_name = derive_filename(pack_name, ".idx", &index_name_buf);
-	if (keep_msg && !keep_name && pack_name)
-		keep_name = derive_filename(pack_name, ".keep", &keep_name_buf);
+		index_name = derive_filename(pack_name, "idx", &index_name_buf);
 
 	if (verify) {
 		if (!index_name)
@@ -1795,13 +1803,12 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 	if (!verify)
 		final(pack_name, curr_pack,
 		      index_name, curr_index,
-		      keep_name, keep_msg,
+		      keep_msg,
 		      pack_sha1);
 	else
 		close(input_fd);
 	free(objects);
 	strbuf_release(&index_name_buf);
-	strbuf_release(&keep_name_buf);
 	if (pack_name == NULL)
 		free((void *) curr_pack);
 	if (index_name == NULL)
-- 
2.14.2.822.g60be5d43e6-goog


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 06/18] introduce fetch-object: fetch one promisor object
  2017-09-29 20:11 [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches) Jonathan Tan
                   ` (4 preceding siblings ...)
  2017-09-29 20:11 ` [PATCH 05/18] index-pack: refactor writing of .keep files Jonathan Tan
@ 2017-09-29 20:11 ` Jonathan Tan
  2017-09-29 20:11 ` [PATCH 07/18] sha1_file: support lazily fetching missing objects Jonathan Tan
                   ` (14 subsequent siblings)
  20 siblings, 0 replies; 28+ messages in thread
From: Jonathan Tan @ 2017-09-29 20:11 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster, git, peartben, christian.couder

Introduce fetch-object, providing the ability to fetch one object from a
promisor remote.

This uses fetch-pack. To do this, the transport mechanism has been
updated with 2 flags, "from-promisor" to indicate that the resulting
pack comes from a promisor remote (and thus should be annotated as such
by index-pack), and "no-haves" to suppress the sending of "have" lines.

This will be tested in a subsequent commit.

NEEDSWORK: update this when we have more information about protocol v2,
which should allow a way to suppress the ref advertisement and
officially allow any object type to be "want"-ed.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 Makefile             |  1 +
 builtin/fetch-pack.c |  8 ++++++++
 builtin/index-pack.c | 16 +++++++++++++---
 fetch-object.c       | 23 +++++++++++++++++++++++
 fetch-object.h       |  6 ++++++
 fetch-pack.c         |  8 ++++++--
 fetch-pack.h         |  2 ++
 remote-curl.c        | 14 +++++++++++++-
 transport.c          |  8 ++++++++
 transport.h          |  8 ++++++++
 10 files changed, 88 insertions(+), 6 deletions(-)
 create mode 100644 fetch-object.c
 create mode 100644 fetch-object.h

diff --git a/Makefile b/Makefile
index ed5960e6b..4303ef6f8 100644
--- a/Makefile
+++ b/Makefile
@@ -789,6 +789,7 @@ LIB_OBJS += ewah/ewah_bitmap.o
 LIB_OBJS += ewah/ewah_io.o
 LIB_OBJS += ewah/ewah_rlw.o
 LIB_OBJS += exec_cmd.o
+LIB_OBJS += fetch-object.o
 LIB_OBJS += fetch-pack.o
 LIB_OBJS += fsck.o
 LIB_OBJS += gettext.o
diff --git a/builtin/fetch-pack.c b/builtin/fetch-pack.c
index 366b9d13f..9f303cf98 100644
--- a/builtin/fetch-pack.c
+++ b/builtin/fetch-pack.c
@@ -143,6 +143,14 @@ int cmd_fetch_pack(int argc, const char **argv, const char *prefix)
 			args.update_shallow = 1;
 			continue;
 		}
+		if (!strcmp("--from-promisor", arg)) {
+			args.from_promisor = 1;
+			continue;
+		}
+		if (!strcmp("--no-haves", arg)) {
+			args.no_haves = 1;
+			continue;
+		}
 		usage(fetch_pack_usage);
 	}
 	if (deepen_not.nr)
diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 7ad170590..14ebb2f17 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -1429,14 +1429,16 @@ static void write_special_file(const char *suffix, const char *msg,
 		if (close(fd) != 0)
 			die_errno(_("cannot close written %s file '%s'"),
 				  suffix, filename);
-		*report = suffix;
+		if (report)
+			*report = suffix;
 	}
 	strbuf_release(&name_buf);
 }
 
 static void final(const char *final_pack_name, const char *curr_pack_name,
 		  const char *final_index_name, const char *curr_index_name,
-		  const char *keep_msg, unsigned char *sha1)
+		  const char *keep_msg, const char *promisor_msg,
+		  unsigned char *sha1)
 {
 	const char *report = "pack";
 	struct strbuf pack_name = STRBUF_INIT;
@@ -1455,6 +1457,9 @@ static void final(const char *final_pack_name, const char *curr_pack_name,
 	if (keep_msg)
 		write_special_file("keep", keep_msg, final_pack_name, sha1,
 				   &report);
+	if (promisor_msg)
+		write_special_file("promisor", promisor_msg, final_pack_name,
+				   sha1, NULL);
 
 	if (final_pack_name != curr_pack_name) {
 		if (!final_pack_name)
@@ -1644,6 +1649,7 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 	const char *curr_index;
 	const char *index_name = NULL, *pack_name = NULL;
 	const char *keep_msg = NULL;
+	const char *promisor_msg = NULL;
 	struct strbuf index_name_buf = STRBUF_INIT;
 	struct pack_idx_entry **idx_objects;
 	struct pack_idx_option opts;
@@ -1693,6 +1699,10 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 				keep_msg = "";
 			} else if (starts_with(arg, "--keep=")) {
 				keep_msg = arg + 7;
+			} else if (!strcmp(arg, "--promisor")) {
+				promisor_msg = "";
+			} else if (starts_with(arg, "--promisor=")) {
+				promisor_msg = arg + strlen("--promisor=");
 			} else if (starts_with(arg, "--threads=")) {
 				char *end;
 				nr_threads = strtoul(arg+10, &end, 0);
@@ -1803,7 +1813,7 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 	if (!verify)
 		final(pack_name, curr_pack,
 		      index_name, curr_index,
-		      keep_msg,
+		      keep_msg, promisor_msg,
 		      pack_sha1);
 	else
 		close(input_fd);
diff --git a/fetch-object.c b/fetch-object.c
new file mode 100644
index 000000000..f89dbba75
--- /dev/null
+++ b/fetch-object.c
@@ -0,0 +1,23 @@
+#include "cache.h"
+#include "packfile.h"
+#include "pkt-line.h"
+#include "strbuf.h"
+#include "transport.h"
+
+void fetch_object(const char *remote_name, const unsigned char *sha1)
+{
+	struct remote *remote;
+	struct transport *transport;
+	struct ref *ref;
+
+	remote = remote_get(remote_name);
+	if (!remote->url[0])
+		die(_("Remote with no URL"));
+	transport = transport_get(remote, remote->url[0]);
+
+	ref = alloc_ref(sha1_to_hex(sha1));
+	hashcpy(ref->old_oid.hash, sha1);
+	transport_set_option(transport, TRANS_OPT_FROM_PROMISOR, "1");
+	transport_set_option(transport, TRANS_OPT_NO_HAVES, "1");
+	transport_fetch_refs(transport, ref);
+}
diff --git a/fetch-object.h b/fetch-object.h
new file mode 100644
index 000000000..f371300c8
--- /dev/null
+++ b/fetch-object.h
@@ -0,0 +1,6 @@
+#ifndef FETCH_OBJECT_H
+#define FETCH_OBJECT_H
+
+extern void fetch_object(const char *remote_name, const unsigned char *sha1);
+
+#endif
diff --git a/fetch-pack.c b/fetch-pack.c
index 105506e9a..d376c4ef1 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -450,6 +450,8 @@ static int find_common(struct fetch_pack_args *args,
 
 	flushes = 0;
 	retval = -1;
+	if (args->no_haves)
+		goto done;
 	while ((oid = get_rev())) {
 		packet_buf_write(&req_buf, "have %s\n", oid_to_hex(oid));
 		print_verbose(args, "have %s", oid_to_hex(oid));
@@ -832,7 +834,7 @@ static int get_pack(struct fetch_pack_args *args,
 		argv_array_push(&cmd.args, alternate_shallow_file);
 	}
 
-	if (do_keep) {
+	if (do_keep || args->from_promisor) {
 		if (pack_lockfile)
 			cmd.out = -1;
 		cmd_name = "index-pack";
@@ -842,7 +844,7 @@ static int get_pack(struct fetch_pack_args *args,
 			argv_array_push(&cmd.args, "-v");
 		if (args->use_thin_pack)
 			argv_array_push(&cmd.args, "--fix-thin");
-		if (args->lock_pack || unpack_limit) {
+		if (do_keep && (args->lock_pack || unpack_limit)) {
 			char hostname[HOST_NAME_MAX + 1];
 			if (xgethostname(hostname, sizeof(hostname)))
 				xsnprintf(hostname, sizeof(hostname), "localhost");
@@ -852,6 +854,8 @@ static int get_pack(struct fetch_pack_args *args,
 		}
 		if (args->check_self_contained_and_connected)
 			argv_array_push(&cmd.args, "--check-self-contained-and-connected");
+		if (args->from_promisor)
+			argv_array_push(&cmd.args, "--promisor");
 	}
 	else {
 		cmd_name = "unpack-objects";
diff --git a/fetch-pack.h b/fetch-pack.h
index b6aeb43a8..84904c348 100644
--- a/fetch-pack.h
+++ b/fetch-pack.h
@@ -29,6 +29,8 @@ struct fetch_pack_args {
 	unsigned cloning:1;
 	unsigned update_shallow:1;
 	unsigned deepen:1;
+	unsigned from_promisor:1;
+	unsigned no_haves:1;
 };
 
 /*
diff --git a/remote-curl.c b/remote-curl.c
index 0053b0954..34a81b8d3 100644
--- a/remote-curl.c
+++ b/remote-curl.c
@@ -33,7 +33,9 @@ struct options {
 		thin : 1,
 		/* One of the SEND_PACK_PUSH_CERT_* constants. */
 		push_cert : 2,
-		deepen_relative : 1;
+		deepen_relative : 1,
+		from_promisor : 1,
+		no_haves : 1;
 };
 static struct options options;
 static struct string_list cas_options = STRING_LIST_INIT_DUP;
@@ -157,6 +159,12 @@ static int set_option(const char *name, const char *value)
 			return -1;
 		return 0;
 #endif /* LIBCURL_VERSION_NUM >= 0x070a08 */
+	} else if (!strcmp(name, "from-promisor")) {
+		options.from_promisor = 1;
+		return 0;
+	} else if (!strcmp(name, "no-haves")) {
+		options.no_haves = 1;
+		return 0;
 	} else {
 		return 1 /* unsupported */;
 	}
@@ -822,6 +830,10 @@ static int fetch_git(struct discovery *heads,
 				 options.deepen_not.items[i].string);
 	if (options.deepen_relative && options.depth)
 		argv_array_push(&args, "--deepen-relative");
+	if (options.from_promisor)
+		argv_array_push(&args, "--from-promisor");
+	if (options.no_haves)
+		argv_array_push(&args, "--no-haves");
 	argv_array_push(&args, url.buf);
 
 	for (i = 0; i < nr_heads; i++) {
diff --git a/transport.c b/transport.c
index d75ff0514..fb46e33bf 100644
--- a/transport.c
+++ b/transport.c
@@ -161,6 +161,12 @@ static int set_git_option(struct git_transport_options *opts,
 	} else if (!strcmp(name, TRANS_OPT_DEEPEN_RELATIVE)) {
 		opts->deepen_relative = !!value;
 		return 0;
+	} else if (!strcmp(name, TRANS_OPT_FROM_PROMISOR)) {
+		opts->from_promisor = !!value;
+		return 0;
+	} else if (!strcmp(name, TRANS_OPT_NO_HAVES)) {
+		opts->no_haves = !!value;
+		return 0;
 	}
 	return 1;
 }
@@ -229,6 +235,8 @@ static int fetch_refs_via_pack(struct transport *transport,
 		data->options.check_self_contained_and_connected;
 	args.cloning = transport->cloning;
 	args.update_shallow = data->options.update_shallow;
+	args.from_promisor = data->options.from_promisor;
+	args.no_haves = data->options.no_haves;
 
 	if (!data->got_remote_heads) {
 		connect_setup(transport, 0);
diff --git a/transport.h b/transport.h
index bc5571574..67428f6df 100644
--- a/transport.h
+++ b/transport.h
@@ -15,6 +15,8 @@ struct git_transport_options {
 	unsigned self_contained_and_connected : 1;
 	unsigned update_shallow : 1;
 	unsigned deepen_relative : 1;
+	unsigned from_promisor : 1;
+	unsigned no_haves : 1;
 	int depth;
 	const char *deepen_since;
 	const struct string_list *deepen_not;
@@ -210,6 +212,12 @@ void transport_check_allowed(const char *type);
 /* Send push certificates */
 #define TRANS_OPT_PUSH_CERT "pushcert"
 
+/* Indicate that these objects are being fetched by a promisor */
+#define TRANS_OPT_FROM_PROMISOR "from-promisor"
+
+/* Do not send "have" lines */
+#define TRANS_OPT_NO_HAVES "no-haves"
+
 /**
  * Returns 0 if the option was used, non-zero otherwise. Prints a
  * message to stderr if the option is not used.
-- 
2.14.2.822.g60be5d43e6-goog


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 07/18] sha1_file: support lazily fetching missing objects
  2017-09-29 20:11 [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches) Jonathan Tan
                   ` (5 preceding siblings ...)
  2017-09-29 20:11 ` [PATCH 06/18] introduce fetch-object: fetch one promisor object Jonathan Tan
@ 2017-09-29 20:11 ` Jonathan Tan
  2017-10-12 14:42   ` Christian Couder
  2017-09-29 20:11 ` [PATCH 08/18] rev-list: support termination at promisor " Jonathan Tan
                   ` (13 subsequent siblings)
  20 siblings, 1 reply; 28+ messages in thread
From: Jonathan Tan @ 2017-09-29 20:11 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster, git, peartben, christian.couder

Teach sha1_file to fetch objects from the remote configured in
extensions.partialClone whenever an object is requested but missing.

The fetching of objects can be suppressed through a global variable.
This is used by fsck and index-pack.

However, by default, such fetching is not suppressed. This is meant as a
temporary measure to ensure that all Git commands work in such a
situation. Future patches will update some commands to either tolerate
missing objects (without fetching them) or be more efficient in fetching
them.

In order to determine the code changes in sha1_file.c necessary, I
investigated the following:
 (1) functions in sha1_file.c that take in a hash, without the user
     regarding how the object is stored (loose or packed)
 (2) functions in packfile.c (because I need to check callers that know
     about the loose/packed distinction and operate on both differently,
     and ensure that they can handle the concept of objects that are
     neither loose nor packed)

(1) is handled by the modification to sha1_object_info_extended().

For (2), I looked at for_each_packed_object and others.  For
for_each_packed_object, the callers either already work or are fixed in
this patch:
 - reachable - only to find recent objects
 - builtin/fsck - already knows about missing objects
 - builtin/cat-file - warning message added in this commit

Callers of the other functions do not need to be changed:
 - parse_pack_index
   - http - indirectly from http_get_info_packs
   - find_pack_entry_one
     - this searches a single pack that is provided as an argument; the
       caller already knows (through other means) that the sought object
       is in a specific pack
 - find_sha1_pack
   - fast-import - appears to be an optimization to not store a file if
     it is already in a pack
   - http-walker - to search through a struct alt_base
   - http-push - to search through remote packs
 - has_sha1_pack
   - builtin/fsck - already knows about promisor objects
   - builtin/count-objects - informational purposes only (check if loose
     object is also packed)
   - builtin/prune-packed - check if object to be pruned is packed (if
     not, don't prune it)
   - revision - used to exclude packed objects if requested by user
   - diff - just for optimization

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 builtin/cat-file.c       |  2 ++
 builtin/fetch-pack.c     |  2 ++
 builtin/fsck.c           |  3 +++
 builtin/index-pack.c     |  6 ++++++
 cache.h                  |  8 ++++++++
 fetch-object.c           |  3 +++
 sha1_file.c              | 38 ++++++++++++++++++++++++------------
 t/t0410-partial-clone.sh | 51 ++++++++++++++++++++++++++++++++++++++++++++++++
 8 files changed, 100 insertions(+), 13 deletions(-)

diff --git a/builtin/cat-file.c b/builtin/cat-file.c
index 4ccbfaac3..7aa1159ce 100644
--- a/builtin/cat-file.c
+++ b/builtin/cat-file.c
@@ -474,6 +474,8 @@ static int batch_objects(struct batch_options *opt)
 
 		for_each_loose_object(batch_loose_object, &sa, 0);
 		for_each_packed_object(batch_packed_object, &sa, 0);
+		if (repository_format_partial_clone)
+			warning("This repository has extensions.partialClone set. Some objects may not be loaded.");
 
 		cb.opt = opt;
 		cb.expand = &data;
diff --git a/builtin/fetch-pack.c b/builtin/fetch-pack.c
index 9f303cf98..9a7ebf6e9 100644
--- a/builtin/fetch-pack.c
+++ b/builtin/fetch-pack.c
@@ -53,6 +53,8 @@ int cmd_fetch_pack(int argc, const char **argv, const char *prefix)
 	struct oid_array shallow = OID_ARRAY_INIT;
 	struct string_list deepen_not = STRING_LIST_INIT_DUP;
 
+	fetch_if_missing = 0;
+
 	packet_trace_identity("fetch-pack");
 
 	memset(&args, 0, sizeof(args));
diff --git a/builtin/fsck.c b/builtin/fsck.c
index f6cb4d755..155ca663b 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -683,6 +683,9 @@ int cmd_fsck(int argc, const char **argv, const char *prefix)
 	int i;
 	struct alternate_object_database *alt;
 
+	/* fsck knows how to handle missing promisor objects */
+	fetch_if_missing = 0;
+
 	errors_found = 0;
 	check_replace_refs = 0;
 
diff --git a/builtin/index-pack.c b/builtin/index-pack.c
index 14ebb2f17..440558046 100644
--- a/builtin/index-pack.c
+++ b/builtin/index-pack.c
@@ -1657,6 +1657,12 @@ int cmd_index_pack(int argc, const char **argv, const char *prefix)
 	unsigned foreign_nr = 1;	/* zero is a "good" value, assume bad */
 	int report_end_of_input = 0;
 
+	/*
+	 * index-pack never needs to fetch missing objects, since it only
+	 * accesses the repo to do hash collision checks
+	 */
+	fetch_if_missing = 0;
+
 	if (argc == 2 && !strcmp(argv[1], "-h"))
 		usage(index_pack_usage);
 
diff --git a/cache.h b/cache.h
index 437206d06..58d3f8986 100644
--- a/cache.h
+++ b/cache.h
@@ -1721,6 +1721,14 @@ struct object_info {
 #define OBJECT_INFO_QUICK 8
 extern int sha1_object_info_extended(const unsigned char *, struct object_info *, unsigned flags);
 
+/*
+ * Set this to 0 to prevent sha1_object_info_extended() from fetching missing
+ * blobs. This has a difference only if extensions.partialClone is set.
+ *
+ * Its default value is 1.
+ */
+extern int fetch_if_missing;
+
 /* Dumb servers support */
 extern int update_server_info(int);
 
diff --git a/fetch-object.c b/fetch-object.c
index f89dbba75..369b61c0e 100644
--- a/fetch-object.c
+++ b/fetch-object.c
@@ -9,7 +9,9 @@ void fetch_object(const char *remote_name, const unsigned char *sha1)
 	struct remote *remote;
 	struct transport *transport;
 	struct ref *ref;
+	int original_fetch_if_missing = fetch_if_missing;
 
+	fetch_if_missing = 0;
 	remote = remote_get(remote_name);
 	if (!remote->url[0])
 		die(_("Remote with no URL"));
@@ -20,4 +22,5 @@ void fetch_object(const char *remote_name, const unsigned char *sha1)
 	transport_set_option(transport, TRANS_OPT_FROM_PROMISOR, "1");
 	transport_set_option(transport, TRANS_OPT_NO_HAVES, "1");
 	transport_fetch_refs(transport, ref);
+	fetch_if_missing = original_fetch_if_missing;
 }
diff --git a/sha1_file.c b/sha1_file.c
index b4a67bb83..77aa0ffdf 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -29,6 +29,7 @@
 #include "mergesort.h"
 #include "quote.h"
 #include "packfile.h"
+#include "fetch-object.h"
 
 const unsigned char null_sha1[GIT_MAX_RAWSZ];
 const struct object_id null_oid;
@@ -1149,6 +1150,8 @@ static int sha1_loose_object_info(const unsigned char *sha1,
 	return (status < 0) ? status : 0;
 }
 
+int fetch_if_missing = 1;
+
 int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi, unsigned flags)
 {
 	static struct object_info blank_oi = OBJECT_INFO_INIT;
@@ -1157,6 +1160,7 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
 	const unsigned char *real = (flags & OBJECT_INFO_LOOKUP_REPLACE) ?
 				    lookup_replace_object(sha1) :
 				    sha1;
+	int already_retried = 0;
 
 	if (!oi)
 		oi = &blank_oi;
@@ -1181,28 +1185,36 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
 		}
 	}
 
-	if (!find_pack_entry(real, &e)) {
-		/* Most likely it's a loose object. */
-		if (!sha1_loose_object_info(real, oi, flags))
-			return 0;
+retry:
+	if (find_pack_entry(real, &e))
+		goto found_packed;
 
-		/* Not a loose object; someone else may have just packed it. */
-		if (flags & OBJECT_INFO_QUICK) {
-			return -1;
-		} else {
-			reprepare_packed_git();
-			if (!find_pack_entry(real, &e))
-				return -1;
-		}
+	/* Most likely it's a loose object. */
+	if (!sha1_loose_object_info(real, oi, flags))
+		return 0;
+
+	/* Not a loose object; someone else may have just packed it. */
+	reprepare_packed_git();
+	if (find_pack_entry(real, &e))
+		goto found_packed;
+
+	/* Check if it is a missing object */
+	if (fetch_if_missing && repository_format_partial_clone &&
+	    !already_retried) {
+		fetch_object(repository_format_partial_clone, real);
+		already_retried = 1;
+		goto retry;
 	}
 
+	return -1;
+
+found_packed:
 	if (oi == &blank_oi)
 		/*
 		 * We know that the caller doesn't actually need the
 		 * information below, so return early.
 		 */
 		return 0;
-
 	rtype = packed_object_info(e.p, e.offset, oi);
 	if (rtype < 0) {
 		mark_bad_packed_object(e.p, real);
diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
index e96f436b0..8a90f6ab3 100755
--- a/t/t0410-partial-clone.sh
+++ b/t/t0410-partial-clone.sh
@@ -138,4 +138,55 @@ test_expect_success 'missing CLI object, but promised, passes fsck' '
 	git -C repo fsck "$A"
 '
 
+test_expect_success 'fetching of missing objects' '
+	rm -rf repo &&
+	test_create_repo server &&
+	test_commit -C server foo &&
+	git -C server repack -a -d --write-bitmap-index &&
+
+	git clone "file://$(pwd)/server" repo &&
+	HASH=$(git -C repo rev-parse foo) &&
+	rm -rf repo/.git/objects/* &&
+
+	git -C repo config core.repositoryformatversion 1 &&
+	git -C repo config extensions.partialclone "origin" &&
+	git -C repo cat-file -p "$HASH" &&
+
+	# Ensure that the .promisor file is written, and check that its
+	# associated packfile contains the object
+	ls repo/.git/objects/pack/pack-*.promisor >promisorlist &&
+	test_line_count = 1 promisorlist &&
+	IDX=$(cat promisorlist | sed "s/promisor$/idx/") &&
+	git verify-pack --verbose "$IDX" | grep "$HASH"
+'
+
+LIB_HTTPD_PORT=12345  # default port, 410, cannot be used as non-root
+. "$TEST_DIRECTORY"/lib-httpd.sh
+start_httpd
+
+test_expect_success 'fetching of missing objects from an HTTP server' '
+	rm -rf repo &&
+	SERVER="$HTTPD_DOCUMENT_ROOT_PATH/server" &&
+	test_create_repo "$SERVER" &&
+	test_commit -C "$SERVER" foo &&
+	git -C "$SERVER" repack -a -d --write-bitmap-index &&
+
+	git clone $HTTPD_URL/smart/server repo &&
+	HASH=$(git -C repo rev-parse foo) &&
+	rm -rf repo/.git/objects/* &&
+
+	git -C repo config core.repositoryformatversion 1 &&
+	git -C repo config extensions.partialclone "origin" &&
+	git -C repo cat-file -p "$HASH" &&
+
+	# Ensure that the .promisor file is written, and check that its
+	# associated packfile contains the object
+	ls repo/.git/objects/pack/pack-*.promisor >promisorlist &&
+	test_line_count = 1 promisorlist &&
+	IDX=$(cat promisorlist | sed "s/promisor$/idx/") &&
+	git verify-pack --verbose "$IDX" | grep "$HASH"
+'
+
+stop_httpd
+
 test_done
-- 
2.14.2.822.g60be5d43e6-goog


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 08/18] rev-list: support termination at promisor objects
  2017-09-29 20:11 [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches) Jonathan Tan
                   ` (6 preceding siblings ...)
  2017-09-29 20:11 ` [PATCH 07/18] sha1_file: support lazily fetching missing objects Jonathan Tan
@ 2017-09-29 20:11 ` " Jonathan Tan
  2017-09-29 20:11 ` [PATCH 09/18] gc: do not repack promisor packfiles Jonathan Tan
                   ` (12 subsequent siblings)
  20 siblings, 0 replies; 28+ messages in thread
From: Jonathan Tan @ 2017-09-29 20:11 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster, git, peartben, christian.couder

Teach rev-list to support termination of an object traversal at any
object from a promisor remote (whether one that the local repo also has,
or one that the local repo knows about because it has another promisor
object that references it).

This will be used subsequently in gc and in the connectivity check used
by fetch.

For efficiency, if an object is referenced by a promisor object, and is
in the local repo only as a non-promisor object, object traversal will
not stop there. This is to avoid building the list of promisor object
references.

(In list-objects.c, the case where obj is NULL in process_blob() and
process_tree() do not need to be changed because those happen only when
there is a conflict between the expected type and the existing object.
If the object doesn't exist, an object will be synthesized, which is
fine.)

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 builtin/rev-list.c       |  13 ++++++
 list-objects.c           |  16 +++++++-
 object.c                 |   2 +-
 revision.c               |  33 +++++++++++++++-
 revision.h               |   5 ++-
 t/t0410-partial-clone.sh | 101 +++++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 165 insertions(+), 5 deletions(-)

diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index c1c74d4a7..bba800cb9 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -12,6 +12,7 @@
 #include "bisect.h"
 #include "progress.h"
 #include "reflog-walk.h"
+#include "packfile.h"
 
 static const char rev_list_usage[] =
 "git rev-list [OPTION] <commit-id>... [ -- paths... ]\n"
@@ -287,6 +288,18 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix)
 	init_revisions(&revs, prefix);
 	revs.abbrev = DEFAULT_ABBREV;
 	revs.commit_format = CMIT_FMT_UNSPECIFIED;
+
+	/*
+	 * Scan the argument list before invoking setup_revisions(), so that we
+	 * know if fetch_if_missing needs to be set to 0.
+	 */
+	for (i = 1; i < argc; i++) {
+		if (!strcmp(argv[i], "--exclude-promisor-objects")) {
+			fetch_if_missing = 0;
+			break;
+		}
+	}
+
 	argc = setup_revisions(argc, argv, &revs, NULL);
 
 	memset(&info, 0, sizeof(info));
diff --git a/list-objects.c b/list-objects.c
index b3931fa43..8a98192dd 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -7,6 +7,7 @@
 #include "tree-walk.h"
 #include "revision.h"
 #include "list-objects.h"
+#include "packfile.h"
 
 static void process_blob(struct rev_info *revs,
 			 struct blob *blob,
@@ -24,6 +25,13 @@ static void process_blob(struct rev_info *revs,
 		die("bad blob object");
 	if (obj->flags & (UNINTERESTING | SEEN))
 		return;
+	if (!has_object_file(&obj->oid)) {
+		if (revs->exclude_promisor_objects &&
+		    is_promisor_object(&obj->oid)) {
+			return;
+		}
+		/* error message will be reported later */
+	}
 	obj->flags |= SEEN;
 
 	pathlen = path->len;
@@ -77,6 +85,8 @@ static void process_tree(struct rev_info *revs,
 	enum interesting match = revs->diffopt.pathspec.nr == 0 ?
 		all_entries_interesting: entry_not_interesting;
 	int baselen = base->len;
+	int gently = revs->ignore_missing_links ||
+		     revs->exclude_promisor_objects;
 
 	if (!revs->tree_objects)
 		return;
@@ -84,9 +94,13 @@ static void process_tree(struct rev_info *revs,
 		die("bad tree object");
 	if (obj->flags & (UNINTERESTING | SEEN))
 		return;
-	if (parse_tree_gently(tree, revs->ignore_missing_links) < 0) {
+	if (parse_tree_gently(tree, gently) < 0) {
 		if (revs->ignore_missing_links)
 			return;
+		if (revs->exclude_promisor_objects &&
+		    is_promisor_object(&obj->oid)) {
+			return;
+		}
 		die("bad tree object %s", oid_to_hex(&obj->oid));
 	}
 
diff --git a/object.c b/object.c
index 321d7e920..0f2490145 100644
--- a/object.c
+++ b/object.c
@@ -252,7 +252,7 @@ struct object *parse_object(const struct object_id *oid)
 	if (obj && obj->parsed)
 		return obj;
 
-	if ((obj && obj->type == OBJ_BLOB) ||
+	if ((obj && obj->type == OBJ_BLOB && has_object_file(oid)) ||
 	    (!obj && has_object_file(oid) &&
 	     sha1_object_info(oid->hash, NULL) == OBJ_BLOB)) {
 		if (check_sha1_signature(repl, NULL, 0, NULL) < 0) {
diff --git a/revision.c b/revision.c
index f9a90d71d..f92d03bfe 100644
--- a/revision.c
+++ b/revision.c
@@ -197,6 +197,8 @@ static struct object *get_reference(struct rev_info *revs, const char *name,
 	if (!object) {
 		if (revs->ignore_missing)
 			return object;
+		if (revs->exclude_promisor_objects && is_promisor_object(oid))
+			return NULL;
 		die("bad object %s", name);
 	}
 	object->flags |= flags;
@@ -789,9 +791,17 @@ static int add_parents_to_list(struct rev_info *revs, struct commit *commit,
 
 	for (parent = commit->parents; parent; parent = parent->next) {
 		struct commit *p = parent->item;
-
-		if (parse_commit_gently(p, revs->ignore_missing_links) < 0)
+		int gently = revs->ignore_missing_links ||
+			     revs->exclude_promisor_objects;
+		if (parse_commit_gently(p, gently) < 0) {
+			if (revs->exclude_promisor_objects &&
+			    is_promisor_object(&p->object.oid)) {
+				if (revs->first_parent_only)
+					break;
+				continue;
+			}
 			return -1;
+		}
 		if (revs->show_source && !p->util)
 			p->util = commit->util;
 		p->object.flags |= left_flag;
@@ -2103,6 +2113,10 @@ static int handle_revision_opt(struct rev_info *revs, int argc, const char **arg
 		revs->limited = 1;
 	} else if (!strcmp(arg, "--ignore-missing")) {
 		revs->ignore_missing = 1;
+	} else if (!strcmp(arg, "--exclude-promisor-objects")) {
+		if (fetch_if_missing)
+			die("BUG: exclude_promisor_objects can only be used when fetch_if_missing is 0");
+		revs->exclude_promisor_objects = 1;
 	} else {
 		int opts = diff_opt_parse(&revs->diffopt, argv, argc, revs->prefix);
 		if (!opts)
@@ -2848,6 +2862,16 @@ void reset_revision_walk(void)
 	clear_object_flags(SEEN | ADDED | SHOWN);
 }
 
+static int mark_uninteresting(const struct object_id *oid,
+			      struct packed_git *pack,
+			      uint32_t pos,
+			      void *unused)
+{
+	struct object *o = parse_object(oid);
+	o->flags |= UNINTERESTING | SEEN;
+	return 0;
+}
+
 int prepare_revision_walk(struct rev_info *revs)
 {
 	int i;
@@ -2876,6 +2900,11 @@ int prepare_revision_walk(struct rev_info *revs)
 	    (revs->limited && limiting_can_increase_treesame(revs)))
 		revs->treesame.name = "treesame";
 
+	if (revs->exclude_promisor_objects) {
+		for_each_packed_object(mark_uninteresting, NULL,
+				       FOR_EACH_OBJECT_PROMISOR_ONLY);
+	}
+
 	if (revs->no_walk != REVISION_WALK_NO_WALK_UNSORTED)
 		commit_list_sort_by_date(&revs->commits);
 	if (revs->no_walk)
diff --git a/revision.h b/revision.h
index 3a3d3e2cf..341711ab7 100644
--- a/revision.h
+++ b/revision.h
@@ -121,7 +121,10 @@ struct rev_info {
 			bisect:1,
 			ancestry_path:1,
 			first_parent_only:1,
-			line_level_traverse:1;
+			line_level_traverse:1,
+
+			/* for internal use only */
+			exclude_promisor_objects:1;
 
 	/* Diff flags */
 	unsigned int	diff:1,
diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
index 8a90f6ab3..3ca6af5cd 100755
--- a/t/t0410-partial-clone.sh
+++ b/t/t0410-partial-clone.sh
@@ -160,6 +160,107 @@ test_expect_success 'fetching of missing objects' '
 	git verify-pack --verbose "$IDX" | grep "$HASH"
 '
 
+test_expect_success 'rev-list stops traversal at missing and promised commit' '
+	rm -rf repo &&
+	test_create_repo repo &&
+	test_commit -C repo foo &&
+	test_commit -C repo bar &&
+
+	FOO=$(git -C repo rev-parse foo) &&
+	promise_and_delete "$FOO" &&
+
+	git -C repo config core.repositoryformatversion 1 &&
+	git -C repo config extensions.partialclone "arbitrary string" &&
+	git -C repo rev-list --exclude-promisor-objects --objects bar >out &&
+	grep $(git -C repo rev-parse bar) out &&
+	! grep $FOO out
+'
+
+test_expect_success 'rev-list stops traversal at missing and promised tree' '
+	rm -rf repo &&
+	test_create_repo repo &&
+	test_commit -C repo foo &&
+	mkdir repo/a_dir &&
+	echo something >repo/a_dir/something &&
+	git -C repo add a_dir/something &&
+	git -C repo commit -m bar &&
+
+	# foo^{tree} (tree referenced from commit)
+	TREE=$(git -C repo rev-parse foo^{tree}) &&
+
+	# a tree referenced by HEAD^{tree} (tree referenced from tree)
+	TREE2=$(git -C repo ls-tree HEAD^{tree} | grep " tree " | head -1 | cut -b13-52) &&
+
+	promise_and_delete "$TREE" &&
+	promise_and_delete "$TREE2" &&
+
+	git -C repo config core.repositoryformatversion 1 &&
+	git -C repo config extensions.partialclone "arbitrary string" &&
+	git -C repo rev-list --exclude-promisor-objects --objects HEAD >out &&
+	grep $(git -C repo rev-parse foo) out &&
+	! grep $TREE out &&
+	grep $(git -C repo rev-parse HEAD) out &&
+	! grep $TREE2 out
+'
+
+test_expect_success 'rev-list stops traversal at missing and promised blob' '
+	rm -rf repo &&
+	test_create_repo repo &&
+	echo something >repo/something &&
+	git -C repo add something &&
+	git -C repo commit -m foo &&
+
+	BLOB=$(git -C repo hash-object -w something) &&
+	promise_and_delete "$BLOB" &&
+
+	git -C repo config core.repositoryformatversion 1 &&
+	git -C repo config extensions.partialclone "arbitrary string" &&
+	git -C repo rev-list --exclude-promisor-objects --objects HEAD >out &&
+	grep $(git -C repo rev-parse HEAD) out &&
+	! grep $BLOB out
+'
+
+test_expect_success 'rev-list stops traversal at promisor commit, tree, and blob' '
+	rm -rf repo &&
+	test_create_repo repo &&
+	test_commit -C repo foo &&
+	test_commit -C repo bar &&
+	test_commit -C repo baz &&
+
+	COMMIT=$(git -C repo rev-parse foo) &&
+	TREE=$(git -C repo rev-parse bar^{tree}) &&
+	BLOB=$(git hash-object repo/baz.t) &&
+	printf "%s\n%s\n%s\n" $COMMIT $TREE $BLOB | pack_as_from_promisor &&
+
+	git -C repo config core.repositoryformatversion 1 &&
+	git -C repo config extensions.partialclone "arbitrary string" &&
+	git -C repo rev-list --exclude-promisor-objects --objects HEAD >out &&
+	! grep $COMMIT out &&
+	! grep $TREE out &&
+	! grep $BLOB out &&
+	grep $(git -C repo rev-parse bar) out  # sanity check that some walking was done
+'
+
+test_expect_success 'rev-list accepts missing and promised objects on command line' '
+	rm -rf repo &&
+	test_create_repo repo &&
+	test_commit -C repo foo &&
+	test_commit -C repo bar &&
+	test_commit -C repo baz &&
+
+	COMMIT=$(git -C repo rev-parse foo) &&
+	TREE=$(git -C repo rev-parse bar^{tree}) &&
+	BLOB=$(git hash-object repo/baz.t) &&
+
+	promise_and_delete $COMMIT &&
+	promise_and_delete $TREE &&
+	promise_and_delete $BLOB &&
+
+	git -C repo config core.repositoryformatversion 1 &&
+	git -C repo config extensions.partialclone "arbitrary string" &&
+	git -C repo rev-list --exclude-promisor-objects --objects "$COMMIT" "$TREE" "$BLOB"
+'
+
 LIB_HTTPD_PORT=12345  # default port, 410, cannot be used as non-root
 . "$TEST_DIRECTORY"/lib-httpd.sh
 start_httpd
-- 
2.14.2.822.g60be5d43e6-goog


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 09/18] gc: do not repack promisor packfiles
  2017-09-29 20:11 [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches) Jonathan Tan
                   ` (7 preceding siblings ...)
  2017-09-29 20:11 ` [PATCH 08/18] rev-list: support termination at promisor " Jonathan Tan
@ 2017-09-29 20:11 ` Jonathan Tan
  2017-09-29 20:11 ` [PATCH 10/18] pack-objects: rename want_.* to ignore_.* Jonathan Tan
                   ` (11 subsequent siblings)
  20 siblings, 0 replies; 28+ messages in thread
From: Jonathan Tan @ 2017-09-29 20:11 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster, git, peartben, christian.couder

Teach gc to stop traversal at promisor objects, and to leave promisor
packfiles alone. This has the effect of only repacking non-promisor
packfiles, and preserves the distinction between promisor packfiles and
non-promisor packfiles.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 builtin/gc.c             |  3 +++
 builtin/pack-objects.c   | 10 ++++++++++
 builtin/prune.c          |  7 +++++++
 builtin/repack.c         |  7 +++++--
 t/t0410-partial-clone.sh | 52 +++++++++++++++++++++++++++++++++++++++++++++++-
 5 files changed, 76 insertions(+), 3 deletions(-)

diff --git a/builtin/gc.c b/builtin/gc.c
index c22787ac7..7a9632bba 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -458,6 +458,9 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 			argv_array_push(&prune, prune_expire);
 			if (quiet)
 				argv_array_push(&prune, "--no-progress");
+			if (repository_format_partial_clone)
+				argv_array_push(&prune,
+						"--exclude-promisor-objects");
 			if (run_command_v_opt(prune.argv, RUN_GIT_CMD))
 				return error(FAILED_RUN, prune.argv[0]);
 		}
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index a57b4f058..958822bf4 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -73,6 +73,8 @@ static int use_bitmap_index = -1;
 static int write_bitmap_index;
 static uint16_t write_bitmap_options;
 
+static int exclude_promisor_objects;
+
 static unsigned long delta_cache_size = 0;
 static unsigned long max_delta_cache_size = 256 * 1024 * 1024;
 static unsigned long cache_max_small_delta_size = 1000;
@@ -2952,6 +2954,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 			 N_("use a bitmap index if available to speed up counting objects")),
 		OPT_BOOL(0, "write-bitmap-index", &write_bitmap_index,
 			 N_("write a bitmap index together with the pack index")),
+		OPT_BOOL(0, "exclude-promisor-objects", &exclude_promisor_objects,
+			 N_("do not pack objects in promisor packfiles")),
 		OPT_END(),
 	};
 
@@ -2997,6 +3001,12 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		argv_array_push(&rp, "--unpacked");
 	}
 
+	if (exclude_promisor_objects) {
+		use_internal_rev_list = 1;
+		fetch_if_missing = 0;
+		argv_array_push(&rp, "--exclude-promisor-objects");
+	}
+
 	if (!reuse_object)
 		reuse_delta = 0;
 	if (pack_compression_level == -1)
diff --git a/builtin/prune.c b/builtin/prune.c
index cddabf26a..be34645dc 100644
--- a/builtin/prune.c
+++ b/builtin/prune.c
@@ -101,12 +101,15 @@ int cmd_prune(int argc, const char **argv, const char *prefix)
 {
 	struct rev_info revs;
 	struct progress *progress = NULL;
+	int exclude_promisor_objects = 0;
 	const struct option options[] = {
 		OPT__DRY_RUN(&show_only, N_("do not remove, show only")),
 		OPT__VERBOSE(&verbose, N_("report pruned objects")),
 		OPT_BOOL(0, "progress", &show_progress, N_("show progress")),
 		OPT_EXPIRY_DATE(0, "expire", &expire,
 				N_("expire objects older than <time>")),
+		OPT_BOOL(0, "exclude-promisor-objects", &exclude_promisor_objects,
+			 N_("limit traversal to objects outside promisor packfiles")),
 		OPT_END()
 	};
 	char *s;
@@ -139,6 +142,10 @@ int cmd_prune(int argc, const char **argv, const char *prefix)
 		show_progress = isatty(2);
 	if (show_progress)
 		progress = start_delayed_progress(_("Checking connectivity"), 0);
+	if (exclude_promisor_objects) {
+		fetch_if_missing = 0;
+		revs.exclude_promisor_objects = 1;
+	}
 
 	mark_reachable_objects(&revs, 1, expire, progress);
 	stop_progress(&progress);
diff --git a/builtin/repack.c b/builtin/repack.c
index f17a68a17..f43317bb5 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -83,7 +83,8 @@ static void remove_pack_on_signal(int signo)
 
 /*
  * Adds all packs hex strings to the fname list, which do not
- * have a corresponding .keep file.
+ * have a corresponding .keep or .promisor file. These packs are not to
+ * be kept if we are going to pack everything into one file.
  */
 static void get_non_kept_pack_filenames(struct string_list *fname_list)
 {
@@ -101,7 +102,8 @@ static void get_non_kept_pack_filenames(struct string_list *fname_list)
 
 		fname = xmemdupz(e->d_name, len);
 
-		if (!file_exists(mkpath("%s/%s.keep", packdir, fname)))
+		if (!file_exists(mkpath("%s/%s.keep", packdir, fname)) &&
+		    !file_exists(mkpath("%s/%s.promisor", packdir, fname)))
 			string_list_append_nodup(fname_list, fname);
 		else
 			free(fname);
@@ -232,6 +234,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	argv_array_push(&cmd.args, "--all");
 	argv_array_push(&cmd.args, "--reflog");
 	argv_array_push(&cmd.args, "--indexed-objects");
+	argv_array_push(&cmd.args, "--exclude-promisor-objects");
 	if (window)
 		argv_array_pushf(&cmd.args, "--window=%s", window);
 	if (window_memory)
diff --git a/t/t0410-partial-clone.sh b/t/t0410-partial-clone.sh
index 3ca6af5cd..071736c7b 100755
--- a/t/t0410-partial-clone.sh
+++ b/t/t0410-partial-clone.sh
@@ -11,13 +11,15 @@ delete_object () {
 pack_as_from_promisor () {
 	HASH=$(git -C repo pack-objects .git/objects/pack/pack) &&
 	>repo/.git/objects/pack/pack-$HASH.promisor
+	echo $HASH
 }
 
 promise_and_delete () {
 	HASH=$(git -C repo rev-parse "$1") &&
 	git -C repo tag -a -m message my_annotated_tag "$HASH" &&
 	git -C repo rev-parse my_annotated_tag | pack_as_from_promisor &&
-	git -C repo tag -d my_annotated_tag &&
+	# tag -d prints a message to stdout, so redirect it
+	git -C repo tag -d my_annotated_tag >/dev/null &&
 	delete_object repo "$HASH"
 }
 
@@ -261,6 +263,54 @@ test_expect_success 'rev-list accepts missing and promised objects on command li
 	git -C repo rev-list --exclude-promisor-objects --objects "$COMMIT" "$TREE" "$BLOB"
 '
 
+test_expect_success 'gc does not repack promisor objects' '
+	rm -rf repo &&
+	test_create_repo repo &&
+	test_commit -C repo my_commit &&
+
+	TREE_HASH=$(git -C repo rev-parse HEAD^{tree}) &&
+	HASH=$(printf "$TREE_HASH\n" | pack_as_from_promisor) &&
+
+	git -C repo config core.repositoryformatversion 1 &&
+	git -C repo config extensions.partialclone "arbitrary string" &&
+	git -C repo gc &&
+
+	# Ensure that the promisor packfile still exists, and remove it
+	test -e repo/.git/objects/pack/pack-$HASH.pack &&
+	rm repo/.git/objects/pack/pack-$HASH.* &&
+
+	# Ensure that the single other pack contains the commit, but not the tree
+	ls repo/.git/objects/pack/pack-*.pack >packlist &&
+	test_line_count = 1 packlist &&
+	git verify-pack repo/.git/objects/pack/pack-*.pack -v >out &&
+	grep "$(git -C repo rev-parse HEAD)" out &&
+	! grep "$TREE_HASH" out
+'
+
+test_expect_success 'gc stops traversal when a missing but promised object is reached' '
+	rm -rf repo &&
+	test_create_repo repo &&
+	test_commit -C repo my_commit &&
+
+	TREE_HASH=$(git -C repo rev-parse HEAD^{tree}) &&
+	HASH=$(promise_and_delete $TREE_HASH) &&
+
+	git -C repo config core.repositoryformatversion 1 &&
+	git -C repo config extensions.partialclone "arbitrary string" &&
+	git -C repo gc &&
+
+	# Ensure that the promisor packfile still exists, and remove it
+	test -e repo/.git/objects/pack/pack-$HASH.pack &&
+	rm repo/.git/objects/pack/pack-$HASH.* &&
+
+	# Ensure that the single other pack contains the commit, but not the tree
+	ls repo/.git/objects/pack/pack-*.pack >packlist &&
+	test_line_count = 1 packlist &&
+	git verify-pack repo/.git/objects/pack/pack-*.pack -v >out &&
+	grep "$(git -C repo rev-parse HEAD)" out &&
+	! grep "$TREE_HASH" out
+'
+
 LIB_HTTPD_PORT=12345  # default port, 410, cannot be used as non-root
 . "$TEST_DIRECTORY"/lib-httpd.sh
 start_httpd
-- 
2.14.2.822.g60be5d43e6-goog


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 10/18] pack-objects: rename want_.* to ignore_.*
  2017-09-29 20:11 [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches) Jonathan Tan
                   ` (8 preceding siblings ...)
  2017-09-29 20:11 ` [PATCH 09/18] gc: do not repack promisor packfiles Jonathan Tan
@ 2017-09-29 20:11 ` Jonathan Tan
  2017-09-29 20:11 ` [PATCH 11/18] pack-objects: support --blob-max-bytes Jonathan Tan
                   ` (10 subsequent siblings)
  20 siblings, 0 replies; 28+ messages in thread
From: Jonathan Tan @ 2017-09-29 20:11 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster, git, peartben, christian.couder

Currently, in pack_objects, add_object_entry() distinguishes between 2
types of non-preferred-base objects:

 (1) objects that should not be in "to_pack" because an option like
     --local or --honor-pack-keep is set
 (2) objects that should be in "to_pack"

A subsequent commit will teach pack-objects to exclude certain objects
(specifically, blobs that are larger than a user-given threshold and are
not potentially a Git special file [1]), but unlike in (1) above, these
exclusions need to be reported. So now we have 3 types of
non-preferred-base objects:

 (a) objects that should not be in "to_pack" and should not be reported
     because an option like --local or --honor-pack-keep is set
 (b) objects that should not be in "to_pack" and should be reported
     because they are blobs that are oversized and non-Git-special
 (c) objects that should be in "to_pack"

add_object_entry() will be taught to distinguish between these 3 types
by the following:

 - renaming want_found_object() and want_object_in_pack() to ignore_.*
   to make it clear that they only check for (a) (this commit)
 - adding a new function to check for (b) and using it within
   add_object_entry() (a subsequent commit)

The alternative would be to retain the names want_found_object() and
want_object_in_pack() and have them handle both the cases (a) and (b),
but that would result in more complicated code.

We also take the opportunity to use the terminology "preferred_base"
instead of "excluded" in these methods. It is true that preferred bases
are not included in the final packfile generation, but at this point in
the code, there is no exclusion taking place - on the contrary, if
something is "excluded", it is in fact guaranteed to be in to_pack.

[1] For the purposes of pack_objects, a blob is a Git special file if it
    appears in a to-be-packed tree with a filename beginning with
    ".git".

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 builtin/pack-objects.c | 56 +++++++++++++++++++++++++-------------------------
 1 file changed, 28 insertions(+), 28 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 958822bf4..ef0b61d5f 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -949,12 +949,12 @@ static int have_duplicate_entry(const unsigned char *sha1,
 	return 1;
 }
 
-static int want_found_object(int exclude, struct packed_git *p)
+static int ignore_found_object(int preferred_base, struct packed_git *p)
 {
-	if (exclude)
-		return 1;
-	if (incremental)
+	if (preferred_base)
 		return 0;
+	if (incremental)
+		return 1;
 
 	/*
 	 * When asked to do --local (do not include an object that appears in a
@@ -972,19 +972,19 @@ static int want_found_object(int exclude, struct packed_git *p)
 	 */
 	if (!ignore_packed_keep &&
 	    (!local || !have_non_local_packs))
-		return 1;
+		return 0;
 
 	if (local && !p->pack_local)
-		return 0;
+		return 1;
 	if (ignore_packed_keep && p->pack_local && p->pack_keep)
-		return 0;
+		return 1;
 
 	/* we don't know yet; keep looking for more packs */
 	return -1;
 }
 
 /*
- * Check whether we want the object in the pack (e.g., we do not want
+ * Check whether we should ignore this object (e.g., we do not want
  * objects found in non-local stores if the "--local" option was used).
  *
  * If the caller already knows an existing pack it wants to take the object
@@ -992,16 +992,16 @@ static int want_found_object(int exclude, struct packed_git *p)
  * function finds if there is any pack that has the object and returns the pack
  * and its offset in these variables.
  */
-static int want_object_in_pack(const unsigned char *sha1,
-			       int exclude,
-			       struct packed_git **found_pack,
-			       off_t *found_offset)
+static int ignore_object(const unsigned char *sha1,
+			 int preferred_base,
+			 struct packed_git **found_pack,
+			 off_t *found_offset)
 {
 	struct mru_entry *entry;
-	int want;
+	int ignore;
 
-	if (!exclude && local && has_loose_object_nonlocal(sha1))
-		return 0;
+	if (!preferred_base && local && has_loose_object_nonlocal(sha1))
+		return 1;
 
 	/*
 	 * If we already know the pack object lives in, start checks from that
@@ -1009,9 +1009,9 @@ static int want_object_in_pack(const unsigned char *sha1,
 	 * are present we will determine the answer right now.
 	 */
 	if (*found_pack) {
-		want = want_found_object(exclude, *found_pack);
-		if (want != -1)
-			return want;
+		ignore = ignore_found_object(preferred_base, *found_pack);
+		if (ignore != -1)
+			return ignore;
 	}
 
 	for (entry = packed_git_mru->head; entry; entry = entry->next) {
@@ -1030,15 +1030,15 @@ static int want_object_in_pack(const unsigned char *sha1,
 				*found_offset = offset;
 				*found_pack = p;
 			}
-			want = want_found_object(exclude, p);
-			if (!exclude && want > 0)
+			ignore = ignore_found_object(preferred_base, p);
+			if (!preferred_base && ignore > 0)
 				mru_mark(packed_git_mru, entry);
-			if (want != -1)
-				return want;
+			if (ignore != -1)
+				return ignore;
 		}
 	}
 
-	return 1;
+	return 0;
 }
 
 static void create_object_entry(const unsigned char *sha1,
@@ -1073,16 +1073,16 @@ static const char no_closure_warning[] = N_(
 );
 
 static int add_object_entry(const unsigned char *sha1, enum object_type type,
-			    const char *name, int exclude)
+			    const char *name, int preferred_base)
 {
 	struct packed_git *found_pack = NULL;
 	off_t found_offset = 0;
 	uint32_t index_pos;
 
-	if (have_duplicate_entry(sha1, exclude, &index_pos))
+	if (have_duplicate_entry(sha1, preferred_base, &index_pos))
 		return 0;
 
-	if (!want_object_in_pack(sha1, exclude, &found_pack, &found_offset)) {
+	if (ignore_object(sha1, preferred_base, &found_pack, &found_offset)) {
 		/* The pack is missing an object, so it will not have closure */
 		if (write_bitmap_index) {
 			warning(_(no_closure_warning));
@@ -1092,7 +1092,7 @@ static int add_object_entry(const unsigned char *sha1, enum object_type type,
 	}
 
 	create_object_entry(sha1, type, pack_name_hash(name),
-			    exclude, name && no_try_delta(name),
+			    preferred_base, name && no_try_delta(name),
 			    index_pos, found_pack, found_offset);
 
 	display_progress(progress_state, nr_result);
@@ -1109,7 +1109,7 @@ static int add_object_entry_from_bitmap(const unsigned char *sha1,
 	if (have_duplicate_entry(sha1, 0, &index_pos))
 		return 0;
 
-	if (!want_object_in_pack(sha1, 0, &pack, &offset))
+	if (ignore_object(sha1, 0, &pack, &offset))
 		return 0;
 
 	create_object_entry(sha1, type, name_hash, 0, 0, index_pos, pack, offset);
-- 
2.14.2.822.g60be5d43e6-goog


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 11/18] pack-objects: support --blob-max-bytes
  2017-09-29 20:11 [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches) Jonathan Tan
                   ` (9 preceding siblings ...)
  2017-09-29 20:11 ` [PATCH 10/18] pack-objects: rename want_.* to ignore_.* Jonathan Tan
@ 2017-09-29 20:11 ` Jonathan Tan
  2017-09-29 20:11 ` [PATCH 12/18] fetch-pack: support excluding large blobs Jonathan Tan
                   ` (9 subsequent siblings)
  20 siblings, 0 replies; 28+ messages in thread
From: Jonathan Tan @ 2017-09-29 20:11 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster, git, peartben, christian.couder

As part of an effort to improve Git support for very large repositories
in which clients typically have only a subset of all version-controlled
blobs, teach pack-objects to support --blob-max-bytes, packing only
blobs not exceeding that size unless the blob corresponds to a file
whose name starts with ".git". upload-pack will eventually be taught to
use this new parameter if needed to exclude certain blobs during a fetch
or clone, potentially drastically reducing network consumption when
serving these very large repositories.

Since any excluded blob should not act as a delta base, I did this by
avoiding such oversized blobs from ever going into the "to_pack" data
structure, which contains both preferred bases (which do not go into the
final packfile but can act as delta bases) and the objects to be packed
(which go into the final packfile and also can act as delta bases). To
that end, add_object_entry() has been modified to exclude the
appropriate non-preferred-base objects. (Preferred bases, regardless of
size, still enter "to_pack" - they are something that the client
indicates that it has, not something that the server needs to serve, so
no exclusion occurs.)

If bitmaps are to be used, we would not know if a blob corresponded to a
file whose name starts with ".git". For this reason, when invoked with
--blob-max-bytes, pack-objects will not use bitmaps.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 Documentation/git-pack-objects.txt | 12 +++++++++-
 builtin/pack-objects.c             | 33 ++++++++++++++++++++++++++--
 t/t5300-pack-object.sh             | 45 ++++++++++++++++++++++++++++++++++++++
 t/test-lib-functions.sh            | 12 ++++++++++
 4 files changed, 99 insertions(+), 3 deletions(-)

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index 473a16135..ddce2d4c1 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -12,7 +12,8 @@ SYNOPSIS
 'git pack-objects' [-q | --progress | --all-progress] [--all-progress-implied]
 	[--no-reuse-delta] [--delta-base-offset] [--non-empty]
 	[--local] [--incremental] [--window=<n>] [--depth=<n>]
-	[--revs [--unpacked | --all]] [--stdout | base-name]
+	[--revs [--unpacked | --all]]
+	[--stdout [--blob-max-bytes=<n>] | base-name]
 	[--shallow] [--keep-true-parents] < object-list
 
 
@@ -236,6 +237,15 @@ So does `git bundle` (see linkgit:git-bundle[1]) when it creates a bundle.
 	With this option, parents that are hidden by grafts are packed
 	nevertheless.
 
+--blob-max-bytes=<n>::
+	This option can only be used with --stdout. If specified, a blob
+	larger than this will not be packed unless a to-be-packed tree
+	has that blob with a filename beginning with ".git".  The size
+	can be suffixed with "k", "m", or "g", and may be "0".
++
+If specified, after printing the packfile, pack-objects will print, in packet
+format (pkt-line), the names of excluded blobs and their sizes.
+
 SEE ALSO
 --------
 linkgit:git-rev-list[1]
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index ef0b61d5f..8686b4351 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -26,6 +26,8 @@
 #include "argv-array.h"
 #include "mru.h"
 #include "packfile.h"
+#include "pkt-line.h"
+#include "varint.h"
 
 static const char *pack_usage[] = {
 	N_("git pack-objects --stdout [<options>...] [< <ref-list> | < <object-list>]"),
@@ -81,6 +83,8 @@ static unsigned long cache_max_small_delta_size = 1000;
 
 static unsigned long window_memory_limit = 0;
 
+static long blob_max_bytes = -1;
+
 /*
  * stats
  */
@@ -1072,6 +1076,27 @@ static const char no_closure_warning[] = N_(
 "disabling bitmap writing, as some objects are not being packed"
 );
 
+/*
+ * Returns 1 if the given blob does not meet any defined blob size
+ * limits.  Blobs that appear as a tree entry whose basename begins with
+ * ".git" are never considered oversized.
+ */
+static int oversized(const unsigned char *sha1, const char *name) {
+	const char *last_slash, *basename;
+	unsigned long size;
+
+	if (blob_max_bytes < 0)
+		return 0;
+
+	last_slash = strrchr(name, '/');
+	basename = last_slash ? last_slash + 1 : name;
+	if (starts_with(basename, ".git"))
+		return 0;
+
+	sha1_object_info(sha1, &size);
+	return size > blob_max_bytes;
+}
+
 static int add_object_entry(const unsigned char *sha1, enum object_type type,
 			    const char *name, int preferred_base)
 {
@@ -1082,7 +1107,8 @@ static int add_object_entry(const unsigned char *sha1, enum object_type type,
 	if (have_duplicate_entry(sha1, preferred_base, &index_pos))
 		return 0;
 
-	if (ignore_object(sha1, preferred_base, &found_pack, &found_offset)) {
+	if (ignore_object(sha1, preferred_base, &found_pack, &found_offset) ||
+	    (!preferred_base && type == OBJ_BLOB && oversized(sha1, name))) {
 		/* The pack is missing an object, so it will not have closure */
 		if (write_bitmap_index) {
 			warning(_(no_closure_warning));
@@ -2956,6 +2982,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 			 N_("write a bitmap index together with the pack index")),
 		OPT_BOOL(0, "exclude-promisor-objects", &exclude_promisor_objects,
 			 N_("do not pack objects in promisor packfiles")),
+		OPT_MAGNITUDE(0, "blob-max-bytes", &blob_max_bytes,
+			      N_("exclude blobs larger than this")),
 		OPT_END(),
 	};
 
@@ -3054,7 +3082,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		use_bitmap_index = use_bitmap_index_default;
 
 	/* "hard" reasons not to use bitmaps; these just won't work at all */
-	if (!use_internal_rev_list || (!pack_to_stdout && write_bitmap_index) || is_repository_shallow())
+	if (!use_internal_rev_list || (!pack_to_stdout && write_bitmap_index) ||
+	    is_repository_shallow() || (blob_max_bytes >= 0))
 		use_bitmap_index = 0;
 
 	if (pack_to_stdout || !rev_list_all)
diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh
index 9c68b9925..697cc20f2 100755
--- a/t/t5300-pack-object.sh
+++ b/t/t5300-pack-object.sh
@@ -457,6 +457,51 @@ test_expect_success !PTHREADS,C_LOCALE_OUTPUT 'pack-objects --threads=N or pack.
 	grep -F "no threads support, ignoring pack.threads" err
 '
 
+lcut () {
+	perl -e '$/ = undef; $_ = <>; s/^.{'$1'}//s; print $_'
+}
+
+test_expect_success '--blob-size-limit works with multiple excluded' '
+	rm -rf server &&
+	git init server &&
+	printf a > server/a &&
+	printf b > server/b &&
+	printf c-very-long-file > server/c &&
+	printf d-very-long-file > server/d &&
+	git -C server add a b c d &&
+	git -C server commit -m x &&
+
+	git -C server rev-parse HEAD >objects &&
+	git -C server pack-objects --revs --stdout --blob-max-bytes=10 <objects >my.pack &&
+
+	# Ensure that only the small blobs are in the packfile
+	git index-pack my.pack &&
+	git verify-pack -v my.idx >objectlist &&
+	grep $(git hash-object server/a) objectlist &&
+	grep $(git hash-object server/b) objectlist &&
+	! grep $(git hash-object server/c) objectlist &&
+	! grep $(git hash-object server/d) objectlist
+'
+
+test_expect_success '--blob-size-limit never excludes special files' '
+	rm -rf server &&
+	git init server &&
+	printf a-very-long-file > server/a &&
+	printf a-very-long-file > server/.git-a &&
+	printf b-very-long-file > server/b &&
+	git -C server add a .git-a b &&
+	git -C server commit -m x &&
+
+	git -C server rev-parse HEAD >objects &&
+	git -C server pack-objects --revs --stdout --blob-max-bytes=10 <objects >my.pack &&
+
+	# Ensure that the .git-a blob is in the packfile, despite also
+	# appearing as a non-.git file
+	git index-pack my.pack &&
+	git verify-pack -v my.idx >objectlist &&
+	grep $(git hash-object server/a) objectlist
+'
+
 #
 # WARNING!
 #
diff --git a/t/test-lib-functions.sh b/t/test-lib-functions.sh
index 1701fe2a0..07b79c7ad 100644
--- a/t/test-lib-functions.sh
+++ b/t/test-lib-functions.sh
@@ -1020,3 +1020,15 @@ nongit () {
 		"$@"
 	)
 }
+
+# Converts big-endian pairs of hexadecimal digits into bytes. For example,
+# "printf 61620d0a | hex_pack" results in "ab\r\n".
+hex_pack () {
+	perl -e '$/ = undef; $input = <>; print pack("H*", $input)'
+}
+
+# Converts bytes into big-endian pairs of hexadecimal digits. For example,
+# "printf 'ab\r\n' | hex_unpack" results in "61620d0a".
+hex_unpack () {
+	perl -e '$/ = undef; $input = <>; print unpack("H2" x length($input), $input)'
+}
-- 
2.14.2.822.g60be5d43e6-goog


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 12/18] fetch-pack: support excluding large blobs
  2017-09-29 20:11 [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches) Jonathan Tan
                   ` (10 preceding siblings ...)
  2017-09-29 20:11 ` [PATCH 11/18] pack-objects: support --blob-max-bytes Jonathan Tan
@ 2017-09-29 20:11 ` Jonathan Tan
  2017-09-29 20:11 ` [PATCH 13/18] fetch: refactor calculation of remote list Jonathan Tan
                   ` (8 subsequent siblings)
  20 siblings, 0 replies; 28+ messages in thread
From: Jonathan Tan @ 2017-09-29 20:11 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster, git, peartben, christian.couder

Teach fetch-pack and upload-pack to support excluding large blobs
through a blob-max-bytes parameter.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 Documentation/technical/pack-protocol.txt         |  9 ++++++++
 Documentation/technical/protocol-capabilities.txt |  7 ++++++
 builtin/fetch-pack.c                              | 11 +++++++++
 fetch-pack.c                                      | 11 +++++++++
 fetch-pack.h                                      |  1 +
 t/t5500-fetch-pack.sh                             | 27 +++++++++++++++++++++++
 upload-pack.c                                     | 16 +++++++++++++-
 7 files changed, 81 insertions(+), 1 deletion(-)

diff --git a/Documentation/technical/pack-protocol.txt b/Documentation/technical/pack-protocol.txt
index ed1eae8b8..db0e1150b 100644
--- a/Documentation/technical/pack-protocol.txt
+++ b/Documentation/technical/pack-protocol.txt
@@ -212,6 +212,7 @@ out of what the server said it could do with the first 'want' line.
   upload-request    =  want-list
 		       *shallow-line
 		       *1depth-request
+		       *1blob-max-bytes
 		       flush-pkt
 
   want-list         =  first-want
@@ -223,10 +224,14 @@ out of what the server said it could do with the first 'want' line.
 		       PKT-LINE("deepen-since" SP timestamp) /
 		       PKT-LINE("deepen-not" SP ref)
 
+  blob-max-bytes    =  PKT-LINE("blob-max-bytes" SP magnitude)
+
   first-want        =  PKT-LINE("want" SP obj-id SP capability-list)
   additional-want   =  PKT-LINE("want" SP obj-id)
 
   depth             =  1*DIGIT
+
+  magnitude         =  1*DIGIT
 ----
 
 Clients MUST send all the obj-ids it wants from the reference
@@ -249,6 +254,10 @@ complete those commits. Commits whose parents are not received as a
 result are defined as shallow and marked as such in the server. This
 information is sent back to the client in the next step.
 
+The client can optionally request a partial packfile that omits blobs
+above a certain size threshold using "blob-max-bytes". Files whose names
+start with ".git" are always included in the packfile, however.
+
 Once all the 'want's and 'shallow's (and optional 'deepen') are
 transferred, clients MUST send a flush-pkt, to tell the server side
 that it is done sending the list.
diff --git a/Documentation/technical/protocol-capabilities.txt b/Documentation/technical/protocol-capabilities.txt
index 26dcc6f50..7e878fce5 100644
--- a/Documentation/technical/protocol-capabilities.txt
+++ b/Documentation/technical/protocol-capabilities.txt
@@ -309,3 +309,10 @@ to accept a signed push certificate, and asks the <nonce> to be
 included in the push certificate.  A send-pack client MUST NOT
 send a push-cert packet unless the receive-pack server advertises
 this capability.
+
+blob-max-bytes
+--------------
+
+If the upload-pack server advertises this capability, fetch-pack
+may send "blob-max-bytes" to request the server to omit blobs above a
+certain size from the packfile.
diff --git a/builtin/fetch-pack.c b/builtin/fetch-pack.c
index 9a7ebf6e9..116be9bf5 100644
--- a/builtin/fetch-pack.c
+++ b/builtin/fetch-pack.c
@@ -4,6 +4,7 @@
 #include "remote.h"
 #include "connect.h"
 #include "sha1-array.h"
+#include "config.h"
 
 static const char fetch_pack_usage[] =
 "git fetch-pack [--all] [--stdin] [--quiet | -q] [--keep | -k] [--thin] "
@@ -153,6 +154,16 @@ int cmd_fetch_pack(int argc, const char **argv, const char *prefix)
 			args.no_haves = 1;
 			continue;
 		}
+		if (skip_prefix(arg, "--blob-max-bytes=", &arg)) {
+			unsigned long *ptr = xmalloc(sizeof(*ptr));
+			if (!git_parse_ulong(arg, ptr)) {
+				error("Invalid --blob-max-bytes value: %s",
+				      arg);
+				usage(fetch_pack_usage);
+			}
+			args.blob_max_bytes = ptr;
+			continue;
+		}
 		usage(fetch_pack_usage);
 	}
 	if (deepen_not.nr)
diff --git a/fetch-pack.c b/fetch-pack.c
index d376c4ef1..19b8e9322 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -26,6 +26,7 @@ static int prefer_ofs_delta = 1;
 static int no_done;
 static int deepen_since_ok;
 static int deepen_not_ok;
+static int blob_max_bytes_ok;
 static int fetch_fsck_objects = -1;
 static int transfer_fsck_objects = -1;
 static int agent_supported;
@@ -407,6 +408,13 @@ static int find_common(struct fetch_pack_args *args,
 			packet_buf_write(&req_buf, "deepen-not %s", s->string);
 		}
 	}
+	if (args->blob_max_bytes) {
+		if (blob_max_bytes_ok)
+			packet_buf_write(&req_buf, "blob-max-bytes %ld",
+					 *args->blob_max_bytes);
+		else
+			warning("blob-max-bytes not recognized by server, ignoring");
+	}
 	packet_buf_flush(&req_buf);
 	state_len = req_buf.len;
 
@@ -983,6 +991,8 @@ static struct ref *do_fetch_pack(struct fetch_pack_args *args,
 		die(_("Server does not support --shallow-exclude"));
 	if (!server_supports("deepen-relative") && args->deepen_relative)
 		die(_("Server does not support --deepen"));
+	if (server_supports("blob-max-bytes"))
+		blob_max_bytes_ok = 1;
 
 	if (everything_local(args, &ref, sought, nr_sought)) {
 		packet_flush(fd[1]);
@@ -1169,6 +1179,7 @@ struct ref *fetch_pack(struct fetch_pack_args *args,
 		packet_flush(fd[1]);
 		die(_("no matching remote head"));
 	}
+
 	prepare_shallow_info(&si, shallow);
 	ref_cpy = do_fetch_pack(args, fd, ref, sought, nr_sought,
 				&si, pack_lockfile);
diff --git a/fetch-pack.h b/fetch-pack.h
index 84904c348..3743a0ab2 100644
--- a/fetch-pack.h
+++ b/fetch-pack.h
@@ -12,6 +12,7 @@ struct fetch_pack_args {
 	int depth;
 	const char *deepen_since;
 	const struct string_list *deepen_not;
+	unsigned long *blob_max_bytes;
 	unsigned deepen_relative:1;
 	unsigned quiet:1;
 	unsigned keep_pack:1;
diff --git a/t/t5500-fetch-pack.sh b/t/t5500-fetch-pack.sh
index 80a1a3239..62e384230 100755
--- a/t/t5500-fetch-pack.sh
+++ b/t/t5500-fetch-pack.sh
@@ -755,4 +755,31 @@ test_expect_success 'fetching deepen' '
 	)
 '
 
+test_expect_success '--blob-max-bytes' '
+	rm -rf server client &&
+	test_create_repo server &&
+	test_commit -C server one &&
+	test_config -C server uploadpack.advertiseblobmaxbytes 1 &&
+
+	test_create_repo client &&
+	git -C client fetch-pack --blob-max-bytes=0 ../server HEAD &&
+
+	# Ensure that object is not inadvertently fetched
+	test_must_fail git -C client cat-file -e $(git hash-object server/one.t)
+'
+
+test_expect_success '--blob-max-bytes has no effect if support for it is not advertised' '
+	rm -rf server client &&
+	test_create_repo server &&
+	test_commit -C server one &&
+
+	test_create_repo client &&
+	git -C client fetch-pack --blob-max-bytes=0 ../server HEAD 2> err &&
+
+	# Ensure that object is fetched
+	git -C client cat-file -e $(git hash-object server/one.t) &&
+
+	test_i18ngrep "blob-max-bytes not recognized by server" err
+'
+
 test_done
diff --git a/upload-pack.c b/upload-pack.c
index 7efff2fbf..484704eb6 100644
--- a/upload-pack.c
+++ b/upload-pack.c
@@ -63,6 +63,8 @@ static int use_sideband;
 static int advertise_refs;
 static int stateless_rpc;
 static const char *pack_objects_hook;
+static int advertise_blob_max_bytes;
+static char *blob_max_bytes;
 
 static void reset_timeout(void)
 {
@@ -131,6 +133,8 @@ static void create_pack_file(void)
 		argv_array_push(&pack_objects.args, "--delta-base-offset");
 	if (use_include_tag)
 		argv_array_push(&pack_objects.args, "--include-tag");
+	if (blob_max_bytes)
+		argv_array_push(&pack_objects.args, blob_max_bytes);
 
 	pack_objects.in = -1;
 	pack_objects.out = -1;
@@ -794,6 +798,13 @@ static void receive_needs(void)
 			deepen_rev_list = 1;
 			continue;
 		}
+		if (skip_prefix(line, "blob-max-bytes ", &arg)) {
+			unsigned long s;
+			if (!git_parse_ulong(arg, &s))
+				die("git upload-pack: invalid blob-max-bytes value: %s", line);
+			blob_max_bytes = xstrfmt("--blob-max-bytes=%lu", s);
+			continue;
+		}
 		if (!skip_prefix(line, "want ", &arg) ||
 		    get_oid_hex(arg, &oid_buf))
 			die("git upload-pack: protocol error, "
@@ -940,7 +951,7 @@ static int send_ref(const char *refname, const struct object_id *oid,
 		struct strbuf symref_info = STRBUF_INIT;
 
 		format_symref_info(&symref_info, cb_data);
-		packet_write_fmt(1, "%s %s%c%s%s%s%s%s agent=%s\n",
+		packet_write_fmt(1, "%s %s%c%s%s%s%s%s%s agent=%s\n",
 			     oid_to_hex(oid), refname_nons,
 			     0, capabilities,
 			     (allow_unadvertised_object_request & ALLOW_TIP_SHA1) ?
@@ -949,6 +960,7 @@ static int send_ref(const char *refname, const struct object_id *oid,
 				     " allow-reachable-sha1-in-want" : "",
 			     stateless_rpc ? " no-done" : "",
 			     symref_info.buf,
+			     advertise_blob_max_bytes ? " blob-max-bytes" : "",
 			     git_user_agent_sanitized());
 		strbuf_release(&symref_info);
 	} else {
@@ -1028,6 +1040,8 @@ static int upload_pack_config(const char *var, const char *value, void *unused)
 	} else if (current_config_scope() != CONFIG_SCOPE_REPO) {
 		if (!strcmp("uploadpack.packobjectshook", var))
 			return git_config_string(&pack_objects_hook, var, value);
+	} else if (!strcmp("uploadpack.advertiseblobmaxbytes", var)) {
+		advertise_blob_max_bytes = git_config_bool(var, value);
 	}
 	return parse_hide_refs_config(var, value, "uploadpack");
 }
-- 
2.14.2.822.g60be5d43e6-goog


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 13/18] fetch: refactor calculation of remote list
  2017-09-29 20:11 [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches) Jonathan Tan
                   ` (11 preceding siblings ...)
  2017-09-29 20:11 ` [PATCH 12/18] fetch-pack: support excluding large blobs Jonathan Tan
@ 2017-09-29 20:11 ` Jonathan Tan
  2017-09-29 20:11 ` [PATCH 14/18] fetch: support excluding large blobs Jonathan Tan
                   ` (7 subsequent siblings)
  20 siblings, 0 replies; 28+ messages in thread
From: Jonathan Tan @ 2017-09-29 20:11 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster, git, peartben, christian.couder

Separate out the calculation of remotes to be fetched from and the
actual fetching. This will allow us to include an additional step before
the actual fetching in a subsequent commit.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 builtin/fetch.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/builtin/fetch.c b/builtin/fetch.c
index 225c73492..1b1f03923 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -1322,7 +1322,7 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
 {
 	int i;
 	struct string_list list = STRING_LIST_INIT_DUP;
-	struct remote *remote;
+	struct remote *remote = NULL;
 	int result = 0;
 	struct argv_array argv_gc_auto = ARGV_ARRAY_INIT;
 
@@ -1367,17 +1367,14 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
 		else if (argc > 1)
 			die(_("fetch --all does not make sense with refspecs"));
 		(void) for_each_remote(get_one_remote_for_fetch, &list);
-		result = fetch_multiple(&list);
 	} else if (argc == 0) {
 		/* No arguments -- use default remote */
 		remote = remote_get(NULL);
-		result = fetch_one(remote, argc, argv);
 	} else if (multiple) {
 		/* All arguments are assumed to be remotes or groups */
 		for (i = 0; i < argc; i++)
 			if (!add_remote_or_group(argv[i], &list))
 				die(_("No such remote or remote group: %s"), argv[i]);
-		result = fetch_multiple(&list);
 	} else {
 		/* Single remote or group */
 		(void) add_remote_or_group(argv[0], &list);
@@ -1385,14 +1382,19 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
 			/* More than one remote */
 			if (argc > 1)
 				die(_("Fetching a group and specifying refspecs does not make sense"));
-			result = fetch_multiple(&list);
 		} else {
 			/* Zero or one remotes */
 			remote = remote_get(argv[0]);
-			result = fetch_one(remote, argc-1, argv+1);
+			argc--;
+			argv++;
 		}
 	}
 
+	if (remote)
+		result = fetch_one(remote, argc, argv);
+	else
+		result = fetch_multiple(&list);
+
 	if (!result && (recurse_submodules != RECURSE_SUBMODULES_OFF)) {
 		struct argv_array options = ARGV_ARRAY_INIT;
 
-- 
2.14.2.822.g60be5d43e6-goog


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 14/18] fetch: support excluding large blobs
  2017-09-29 20:11 [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches) Jonathan Tan
                   ` (12 preceding siblings ...)
  2017-09-29 20:11 ` [PATCH 13/18] fetch: refactor calculation of remote list Jonathan Tan
@ 2017-09-29 20:11 ` Jonathan Tan
  2017-09-29 20:11 ` [PATCH 15/18] clone: " Jonathan Tan
                   ` (6 subsequent siblings)
  20 siblings, 0 replies; 28+ messages in thread
From: Jonathan Tan @ 2017-09-29 20:11 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster, git, peartben, christian.couder

Teach fetch to support excluding large blobs through a blob-max-bytes
parameter. This is only allowed for the remote configured in
extensions.partialclone.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 builtin/fetch.c       | 22 ++++++++++++++++++++--
 connected.c           |  1 +
 remote-curl.c         |  7 +++++++
 t/t5500-fetch-pack.sh | 36 ++++++++++++++++++++++++++++++++++++
 transport-helper.c    |  4 ++++
 transport.c           | 10 ++++++++++
 transport.h           |  4 ++++
 7 files changed, 82 insertions(+), 2 deletions(-)

diff --git a/builtin/fetch.c b/builtin/fetch.c
index 1b1f03923..07beaf5b5 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -55,6 +55,7 @@ static int recurse_submodules_default = RECURSE_SUBMODULES_ON_DEMAND;
 static int shown_url = 0;
 static int refmap_alloc, refmap_nr;
 static const char **refmap_array;
+static const char *blob_max_bytes;
 
 static int git_fetch_config(const char *k, const char *v, void *cb)
 {
@@ -160,6 +161,8 @@ static struct option builtin_fetch_options[] = {
 			TRANSPORT_FAMILY_IPV4),
 	OPT_SET_INT('6', "ipv6", &family, N_("use IPv6 addresses only"),
 			TRANSPORT_FAMILY_IPV6),
+	OPT_STRING(0, "blob-max-bytes", &blob_max_bytes, N_("bytes"),
+		   N_("do not fetch blobs above this size")),
 	OPT_END()
 };
 
@@ -1044,6 +1047,10 @@ static struct transport *prepare_transport(struct remote *remote, int deepen)
 		set_option(transport, TRANS_OPT_DEEPEN_RELATIVE, "yes");
 	if (update_shallow)
 		set_option(transport, TRANS_OPT_UPDATE_SHALLOW, "yes");
+	if (blob_max_bytes) {
+		set_option(transport, TRANS_OPT_BLOB_MAX_BYTES, blob_max_bytes);
+		set_option(transport, TRANS_OPT_FROM_PROMISOR, "1");
+	}
 	return transport;
 }
 
@@ -1328,6 +1335,8 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
 
 	packet_trace_identity("fetch");
 
+	fetch_if_missing = 0;
+
 	/* Record the command line for the reflog */
 	strbuf_addstr(&default_rla, "fetch");
 	for (i = 1; i < argc; i++)
@@ -1361,6 +1370,9 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
 	if (depth || deepen_since || deepen_not.nr)
 		deepen = 1;
 
+	if (blob_max_bytes && !repository_format_partial_clone)
+		die("--blob-max-bytes can only be used when extensions.partialClone is set");
+
 	if (all) {
 		if (argc == 1)
 			die(_("fetch --all does not take a repository argument"));
@@ -1390,10 +1402,16 @@ int cmd_fetch(int argc, const char **argv, const char *prefix)
 		}
 	}
 
-	if (remote)
+	if (remote) {
+		if (blob_max_bytes &&
+		    strcmp(remote->name, repository_format_partial_clone))
+			die(_("--blob-max-bytes can only be used with the remote configured in core.partialClone"));
 		result = fetch_one(remote, argc, argv);
-	else
+	} else {
+		if (blob_max_bytes)
+			die(_("--blob-max-bytes can only be used with the remote configured in core.partialClone"));
 		result = fetch_multiple(&list);
+	}
 
 	if (!result && (recurse_submodules != RECURSE_SUBMODULES_OFF)) {
 		struct argv_array options = ARGV_ARRAY_INIT;
diff --git a/connected.c b/connected.c
index f416b0505..a51c01d63 100644
--- a/connected.c
+++ b/connected.c
@@ -56,6 +56,7 @@ int check_connected(sha1_iterate_fn fn, void *cb_data,
 	argv_array_push(&rev_list.args,"rev-list");
 	argv_array_push(&rev_list.args, "--objects");
 	argv_array_push(&rev_list.args, "--stdin");
+	argv_array_push(&rev_list.args, "--exclude-promisor-objects");
 	argv_array_push(&rev_list.args, "--not");
 	argv_array_push(&rev_list.args, "--all");
 	argv_array_push(&rev_list.args, "--quiet");
diff --git a/remote-curl.c b/remote-curl.c
index 34a81b8d3..18fd184c1 100644
--- a/remote-curl.c
+++ b/remote-curl.c
@@ -24,6 +24,7 @@ struct options {
 	char *deepen_since;
 	struct string_list deepen_not;
 	struct string_list push_options;
+	char *blob_max_bytes;
 	unsigned progress : 1,
 		check_self_contained_and_connected : 1,
 		cloning : 1,
@@ -165,6 +166,9 @@ static int set_option(const char *name, const char *value)
 	} else if (!strcmp(name, "no-haves")) {
 		options.no_haves = 1;
 		return 0;
+	} else if (!strcmp(name, "blob-max-bytes")) {
+		options.blob_max_bytes = xstrdup(value);
+		return 0;
 	} else {
 		return 1 /* unsupported */;
 	}
@@ -834,6 +838,9 @@ static int fetch_git(struct discovery *heads,
 		argv_array_push(&args, "--from-promisor");
 	if (options.no_haves)
 		argv_array_push(&args, "--no-haves");
+	if (options.blob_max_bytes)
+		argv_array_pushf(&args, "--blob-max-bytes=%s",
+				 options.blob_max_bytes);
 	argv_array_push(&args, url.buf);
 
 	for (i = 0; i < nr_heads; i++) {
diff --git a/t/t5500-fetch-pack.sh b/t/t5500-fetch-pack.sh
index 62e384230..b2682862f 100755
--- a/t/t5500-fetch-pack.sh
+++ b/t/t5500-fetch-pack.sh
@@ -782,4 +782,40 @@ test_expect_success '--blob-max-bytes has no effect if support for it is not adv
 	test_i18ngrep "blob-max-bytes not recognized by server" err
 '
 
+fetch_blob_max_bytes () {
+	SERVER="$1"
+	URL="$2"
+
+	rm -rf "$SERVER" client &&
+	test_create_repo "$SERVER" &&
+	test_commit -C "$SERVER" one &&
+	test_config -C "$SERVER" uploadpack.advertiseblobmaxbytes 1 &&
+
+	git clone "$URL" client &&
+	test_config -C client extensions.partialclone origin &&
+
+	test_commit -C "$SERVER" two &&
+
+	git -C client fetch --blob-max-bytes=0 origin HEAD:somewhere &&
+
+	# Ensure that commit is fetched, but blob is not
+	test_config -C client extensions.partialclone "arbitrary string" &&
+	git -C client cat-file -e $(git -C "$SERVER" rev-parse two) &&
+	test_must_fail git -C client cat-file -e $(git hash-object "$SERVER/two.t")
+}
+
+test_expect_success 'fetch with --blob-max-bytes' '
+	fetch_blob_max_bytes server server
+'
+
+. "$TEST_DIRECTORY"/lib-httpd.sh
+start_httpd
+
+test_expect_success 'fetch with --blob-max-bytes and HTTP' '
+	fetch_blob_max_bytes "$HTTPD_DOCUMENT_ROOT_PATH/server" "$HTTPD_URL/smart/server"
+'
+
+stop_httpd
+
+
 test_done
diff --git a/transport-helper.c b/transport-helper.c
index 42b960ff8..9c1874b70 100644
--- a/transport-helper.c
+++ b/transport-helper.c
@@ -672,6 +672,10 @@ static int fetch(struct transport *transport,
 	if (data->transport_options.update_shallow)
 		set_helper_option(transport, "update-shallow", "true");
 
+	if (data->transport_options.blob_max_bytes)
+		set_helper_option(transport, "blob-max-bytes",
+				  data->transport_options.blob_max_bytes);
+
 	if (data->fetch)
 		return fetch_with_fetch(transport, nr_heads, to_fetch);
 
diff --git a/transport.c b/transport.c
index fb46e33bf..6408af19e 100644
--- a/transport.c
+++ b/transport.c
@@ -167,6 +167,9 @@ static int set_git_option(struct git_transport_options *opts,
 	} else if (!strcmp(name, TRANS_OPT_NO_HAVES)) {
 		opts->no_haves = !!value;
 		return 0;
+	} else if (!strcmp(name, TRANS_OPT_BLOB_MAX_BYTES)) {
+		opts->blob_max_bytes = value;
+		return 0;
 	}
 	return 1;
 }
@@ -237,6 +240,13 @@ static int fetch_refs_via_pack(struct transport *transport,
 	args.update_shallow = data->options.update_shallow;
 	args.from_promisor = data->options.from_promisor;
 	args.no_haves = data->options.no_haves;
+	if (data->options.blob_max_bytes) {
+		unsigned long *ptr = xmalloc(sizeof(*ptr));
+		if (!git_parse_ulong(data->options.blob_max_bytes, ptr))
+			die("Invalid blob-max-bytes value: %s",
+			    data->options.blob_max_bytes);
+		args.blob_max_bytes = ptr;
+	}
 
 	if (!data->got_remote_heads) {
 		connect_setup(transport, 0);
diff --git a/transport.h b/transport.h
index 67428f6df..79d4442b1 100644
--- a/transport.h
+++ b/transport.h
@@ -23,6 +23,7 @@ struct git_transport_options {
 	const char *uploadpack;
 	const char *receivepack;
 	struct push_cas_option *cas;
+	const char *blob_max_bytes;
 };
 
 enum transport_family {
@@ -218,6 +219,9 @@ void transport_check_allowed(const char *type);
 /* Do not send "have" lines */
 #define TRANS_OPT_NO_HAVES "no-haves"
 
+/* Exclude blobs above a certain size */
+#define TRANS_OPT_BLOB_MAX_BYTES "blob-max-bytes"
+
 /**
  * Returns 0 if the option was used, non-zero otherwise. Prints a
  * message to stderr if the option is not used.
-- 
2.14.2.822.g60be5d43e6-goog


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 15/18] clone: support excluding large blobs
  2017-09-29 20:11 [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches) Jonathan Tan
                   ` (13 preceding siblings ...)
  2017-09-29 20:11 ` [PATCH 14/18] fetch: support excluding large blobs Jonathan Tan
@ 2017-09-29 20:11 ` " Jonathan Tan
  2017-09-29 20:11 ` [PATCH 16/18] clone: configure blobmaxbytes in created repos Jonathan Tan
                   ` (5 subsequent siblings)
  20 siblings, 0 replies; 28+ messages in thread
From: Jonathan Tan @ 2017-09-29 20:11 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster, git, peartben, christian.couder

Teach clone to support excluding large blobs through a blob-max-bytes
parameter.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 builtin/clone.c  | 23 +++++++++++++++++++++--
 t/t5601-clone.sh | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 70 insertions(+), 2 deletions(-)

diff --git a/builtin/clone.c b/builtin/clone.c
index dbddd98f8..4c2193dc4 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -60,6 +60,7 @@ static struct string_list option_optional_reference = STRING_LIST_INIT_NODUP;
 static int option_dissociate;
 static int max_jobs = -1;
 static struct string_list option_recurse_submodules = STRING_LIST_INIT_NODUP;
+static char *blob_max_bytes;
 
 static int recurse_submodules_cb(const struct option *opt,
 				 const char *arg, int unset)
@@ -135,6 +136,8 @@ static struct option builtin_clone_options[] = {
 			TRANSPORT_FAMILY_IPV4),
 	OPT_SET_INT('6', "ipv6", &family, N_("use IPv6 addresses only"),
 			TRANSPORT_FAMILY_IPV6),
+	OPT_STRING(0, "blob-max-bytes", &blob_max_bytes, N_("bytes"),
+		   N_("do not fetch blobs above this size")),
 	OPT_END()
 };
 
@@ -886,6 +889,8 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 	struct refspec *refspec;
 	const char *fetch_pattern;
 
+	fetch_if_missing = 0;
+
 	packet_trace_identity("clone");
 	argc = parse_options(argc, argv, prefix, builtin_clone_options,
 			     builtin_clone_usage, 0);
@@ -1104,7 +1109,13 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 		transport_set_option(transport, TRANS_OPT_UPLOADPACK,
 				     option_upload_pack);
 
-	if (transport->smart_options && !deepen)
+	if (blob_max_bytes) {
+		transport_set_option(transport, TRANS_OPT_BLOB_MAX_BYTES,
+				     blob_max_bytes);
+		transport_set_option(transport, TRANS_OPT_FROM_PROMISOR, "1");
+	}
+
+	if (transport->smart_options && !deepen && !blob_max_bytes)
 		transport->smart_options->check_self_contained_and_connected = 1;
 
 	refs = transport_get_remote_refs(transport);
@@ -1164,13 +1175,20 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 	write_refspec_config(src_ref_prefix, our_head_points_at,
 			remote_head_points_at, &branch_top);
 
+	if (blob_max_bytes) {
+		git_config_set("core.repositoryformatversion", "1");
+		git_config_set("extensions.partialclone", "origin");
+		repository_format_partial_clone = "origin";
+	}
+
 	if (is_local)
 		clone_local(path, git_dir);
 	else if (refs && complete_refs_before_fetch)
 		transport_fetch_refs(transport, mapped_refs);
 
 	update_remote_refs(refs, mapped_refs, remote_head_points_at,
-			   branch_top.buf, reflog_msg.buf, transport, !is_local);
+			   branch_top.buf, reflog_msg.buf, transport,
+			   !is_local && !blob_max_bytes);
 
 	update_head(our_head_points_at, remote_head, reflog_msg.buf);
 
@@ -1191,6 +1209,7 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 	}
 
 	junk_mode = JUNK_LEAVE_REPO;
+	fetch_if_missing = 1;
 	err = checkout(submodule_progress);
 
 	strbuf_release(&reflog_msg);
diff --git a/t/t5601-clone.sh b/t/t5601-clone.sh
index 9c56f771b..951b1ffa8 100755
--- a/t/t5601-clone.sh
+++ b/t/t5601-clone.sh
@@ -571,4 +571,53 @@ test_expect_success 'GIT_TRACE_PACKFILE produces a usable pack' '
 	git -C replay.git index-pack -v --stdin <tmp.pack
 '
 
+partial_clone () {
+	SERVER="$1" &&
+	URL="$2" &&
+
+	rm -rf "$SERVER" client &&
+	test_create_repo "$SERVER" &&
+	test_commit -C "$SERVER" one &&
+	HASH1=$(git hash-object "$SERVER/one.t") &&
+	git -C "$SERVER" revert HEAD &&
+	test_commit -C "$SERVER" two &&
+	HASH2=$(git hash-object "$SERVER/two.t") &&
+	test_config -C "$SERVER" uploadpack.advertiseblobmaxbytes 1 &&
+	test_config -C "$SERVER" uploadpack.allowanysha1inwant 1 &&
+
+	git clone --blob-max-bytes=0 "$URL" client &&
+
+	git -C client fsck &&
+
+	# Ensure that unneeded blobs are not inadvertently fetched.
+	test_config -C client extensions.partialclone "not a remote" &&
+	test_must_fail git -C client cat-file -e "$HASH1" &&
+
+	# But this blob was fetched, because clone performs an initial checkout
+	git -C client cat-file -e "$HASH2"
+}
+
+test_expect_success 'partial clone' '
+	partial_clone server "file://$(pwd)/server"
+'
+
+test_expect_success 'partial clone: warn if server does not support blob-max-bytes' '
+	rm -rf server client &&
+	test_create_repo server &&
+	test_commit -C server one &&
+
+	git clone --blob-max-bytes=0 "file://$(pwd)/server" client 2> err &&
+
+	test_i18ngrep "blob-max-bytes not recognized by server" err
+'
+
+. "$TEST_DIRECTORY"/lib-httpd.sh
+start_httpd
+
+test_expect_success 'partial clone using HTTP' '
+	partial_clone "$HTTPD_DOCUMENT_ROOT_PATH/server" "$HTTPD_URL/smart/server"
+'
+
+stop_httpd
+
 test_done
-- 
2.14.2.822.g60be5d43e6-goog


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 16/18] clone: configure blobmaxbytes in created repos
  2017-09-29 20:11 [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches) Jonathan Tan
                   ` (14 preceding siblings ...)
  2017-09-29 20:11 ` [PATCH 15/18] clone: " Jonathan Tan
@ 2017-09-29 20:11 ` Jonathan Tan
  2017-09-29 20:11 ` [PATCH 17/18] unpack-trees: batch fetching of missing blobs Jonathan Tan
                   ` (4 subsequent siblings)
  20 siblings, 0 replies; 28+ messages in thread
From: Jonathan Tan @ 2017-09-29 20:11 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster, git, peartben, christian.couder

Teach clone to configure blobmaxbytes in any repos that it generates
when the --blob-max-bytes parameter is set. Also teach fetch to use this
parameter.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 builtin/clone.c       |  1 +
 builtin/fetch.c       |  4 ++++
 remote.c              |  2 ++
 remote.h              |  2 ++
 t/t5500-fetch-pack.sh | 64 ++++++++++++++++++++++++++++++++++++++++++++++-----
 5 files changed, 67 insertions(+), 6 deletions(-)

diff --git a/builtin/clone.c b/builtin/clone.c
index 4c2193dc4..58cbc8ae3 100644
--- a/builtin/clone.c
+++ b/builtin/clone.c
@@ -1179,6 +1179,7 @@ int cmd_clone(int argc, const char **argv, const char *prefix)
 		git_config_set("core.repositoryformatversion", "1");
 		git_config_set("extensions.partialclone", "origin");
 		repository_format_partial_clone = "origin";
+		git_config_set("remote.origin.blobmaxbytes", blob_max_bytes);
 	}
 
 	if (is_local)
diff --git a/builtin/fetch.c b/builtin/fetch.c
index 07beaf5b5..ace238554 100644
--- a/builtin/fetch.c
+++ b/builtin/fetch.c
@@ -1050,6 +1050,10 @@ static struct transport *prepare_transport(struct remote *remote, int deepen)
 	if (blob_max_bytes) {
 		set_option(transport, TRANS_OPT_BLOB_MAX_BYTES, blob_max_bytes);
 		set_option(transport, TRANS_OPT_FROM_PROMISOR, "1");
+	} else if (remote->blob_max_bytes) {
+		set_option(transport, TRANS_OPT_BLOB_MAX_BYTES,
+			   remote->blob_max_bytes);
+		set_option(transport, TRANS_OPT_FROM_PROMISOR, "1");
 	}
 	return transport;
 }
diff --git a/remote.c b/remote.c
index 411309006..eade3c312 100644
--- a/remote.c
+++ b/remote.c
@@ -440,6 +440,8 @@ static int handle_config(const char *key, const char *value, void *cb)
 					 key, value);
 	} else if (!strcmp(subkey, "vcs")) {
 		return git_config_string(&remote->foreign_vcs, key, value);
+	} else if (!strcmp(subkey, "blobmaxbytes")) {
+		return git_config_string(&remote->blob_max_bytes, key, value);
 	}
 	return 0;
 }
diff --git a/remote.h b/remote.h
index 2ecf4c8c7..3d56e62b7 100644
--- a/remote.h
+++ b/remote.h
@@ -56,6 +56,8 @@ struct remote {
 	 */
 	char *http_proxy;
 	char *http_proxy_authmethod;
+
+	const char *blob_max_bytes;
 };
 
 struct remote *remote_get(const char *name);
diff --git a/t/t5500-fetch-pack.sh b/t/t5500-fetch-pack.sh
index b2682862f..ee533ea32 100755
--- a/t/t5500-fetch-pack.sh
+++ b/t/t5500-fetch-pack.sh
@@ -782,9 +782,9 @@ test_expect_success '--blob-max-bytes has no effect if support for it is not adv
 	test_i18ngrep "blob-max-bytes not recognized by server" err
 '
 
-fetch_blob_max_bytes () {
-	SERVER="$1"
-	URL="$2"
+setup_blob_max_bytes () {
+	SERVER="$1" &&
+	URL="$2" &&
 
 	rm -rf "$SERVER" client &&
 	test_create_repo "$SERVER" &&
@@ -794,7 +794,11 @@ fetch_blob_max_bytes () {
 	git clone "$URL" client &&
 	test_config -C client extensions.partialclone origin &&
 
-	test_commit -C "$SERVER" two &&
+	test_commit -C "$SERVER" two
+}
+
+do_blob_max_bytes() {
+	SERVER="$1" &&
 
 	git -C client fetch --blob-max-bytes=0 origin HEAD:somewhere &&
 
@@ -805,14 +809,62 @@ fetch_blob_max_bytes () {
 }
 
 test_expect_success 'fetch with --blob-max-bytes' '
-	fetch_blob_max_bytes server server
+	setup_blob_max_bytes server server &&
+	do_blob_max_bytes server
+'
+
+test_expect_success 'fetch respects configured blobmaxbytes' '
+	setup_blob_max_bytes server server &&
+
+	test_config -C client remote.origin.blobmaxbytes 0 &&
+
+	git -C client fetch origin HEAD:somewhere &&
+
+	# Ensure that commit is fetched, but blob is not
+	test_config -C client extensions.partialclone "arbitrary string" &&
+	git -C client cat-file -e $(git -C server rev-parse two) &&
+	test_must_fail git -C client cat-file -e $(git hash-object server/two.t)
+'
+
+test_expect_success 'pull respects configured blobmaxbytes' '
+	setup_blob_max_bytes server server &&
+
+	# Hide two.t from tip so that client does not load it upon the
+	# automatic checkout that pull performs
+	git -C server rm two.t &&
+	test_commit -C server three &&
+
+	test_config -C server uploadpack.allowanysha1inwant 1 &&
+	test_config -C client remote.origin.blobmaxbytes 0 &&
+
+	git -C client pull origin &&
+
+	# Ensure that commit is fetched, but blob is not
+	test_config -C client extensions.partialclone "arbitrary string" &&
+	git -C client cat-file -e $(git -C server rev-parse two) &&
+	test_must_fail git -C client cat-file -e $(git hash-object server/two.t)
+'
+
+test_expect_success 'clone configures blobmaxbytes' '
+	rm -rf server client &&
+	test_create_repo server &&
+	test_commit -C server one &&
+	test_commit -C server two &&
+	test_config -C server uploadpack.allowanysha1inwant 1 &&
+
+	git clone --blob-max-bytes=12345 server client &&
+
+	# Ensure that we can, for example, checkout HEAD^
+	rm -rf client/.git/objects/* &&
+	git -C client checkout HEAD^
 '
 
 . "$TEST_DIRECTORY"/lib-httpd.sh
 start_httpd
 
 test_expect_success 'fetch with --blob-max-bytes and HTTP' '
-	fetch_blob_max_bytes "$HTTPD_DOCUMENT_ROOT_PATH/server" "$HTTPD_URL/smart/server"
+	setup_blob_max_bytes "$HTTPD_DOCUMENT_ROOT_PATH/server" "$HTTPD_URL/smart/server" &&
+	do_blob_max_bytes "$HTTPD_DOCUMENT_ROOT_PATH/server"
 '
 
 stop_httpd
-- 
2.14.2.822.g60be5d43e6-goog


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 17/18] unpack-trees: batch fetching of missing blobs
  2017-09-29 20:11 [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches) Jonathan Tan
                   ` (15 preceding siblings ...)
  2017-09-29 20:11 ` [PATCH 16/18] clone: configure blobmaxbytes in created repos Jonathan Tan
@ 2017-09-29 20:11 ` Jonathan Tan
  2017-09-29 20:11 ` [PATCH 18/18] fetch-pack: restore save_commit_buffer after use Jonathan Tan
                   ` (3 subsequent siblings)
  20 siblings, 0 replies; 28+ messages in thread
From: Jonathan Tan @ 2017-09-29 20:11 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster, git, peartben, christian.couder

When running checkout, first prefetch all blobs that are to be updated
but are missing. This means that only one pack is downloaded during such
operations, instead of one per missing blob.

This operates only on the blob level - if a repository has a missing
tree, they are still fetched one at a time.

This does not use the delayed checkout mechanism introduced in commit
2841e8f ("convert: add "status=delayed" to filter process protocol",
2017-06-30) due to significant conceptual differences - in particular,
for partial clones, we already know what needs to be fetched based on
the contents of the local repo alone, whereas for status=delayed, it is
the filter process that tells us what needs to be checked in the end.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 fetch-object.c   | 27 +++++++++++++++++++++++----
 fetch-object.h   |  5 +++++
 t/t5601-clone.sh | 52 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 unpack-trees.c   | 22 ++++++++++++++++++++++
 4 files changed, 102 insertions(+), 4 deletions(-)

diff --git a/fetch-object.c b/fetch-object.c
index 369b61c0e..21b4dfafc 100644
--- a/fetch-object.c
+++ b/fetch-object.c
@@ -3,12 +3,12 @@
 #include "pkt-line.h"
 #include "strbuf.h"
 #include "transport.h"
+#include "fetch-object.h"
 
-void fetch_object(const char *remote_name, const unsigned char *sha1)
+static void fetch_refs(const char *remote_name, struct ref *ref)
 {
 	struct remote *remote;
 	struct transport *transport;
-	struct ref *ref;
 	int original_fetch_if_missing = fetch_if_missing;
 
 	fetch_if_missing = 0;
@@ -17,10 +17,29 @@ void fetch_object(const char *remote_name, const unsigned char *sha1)
 		die(_("Remote with no URL"));
 	transport = transport_get(remote, remote->url[0]);
 
-	ref = alloc_ref(sha1_to_hex(sha1));
-	hashcpy(ref->old_oid.hash, sha1);
 	transport_set_option(transport, TRANS_OPT_FROM_PROMISOR, "1");
 	transport_set_option(transport, TRANS_OPT_NO_HAVES, "1");
 	transport_fetch_refs(transport, ref);
 	fetch_if_missing = original_fetch_if_missing;
 }
+
+void fetch_object(const char *remote_name, const unsigned char *sha1)
+{
+	struct ref *ref = alloc_ref(sha1_to_hex(sha1));
+	hashcpy(ref->old_oid.hash, sha1);
+	fetch_refs(remote_name, ref);
+}
+
+void fetch_objects(const char *remote_name, const struct oid_array *to_fetch)
+{
+	struct ref *ref = NULL;
+	int i;
+
+	for (i = 0; i < to_fetch->nr; i++) {
+		struct ref *new_ref = alloc_ref(oid_to_hex(&to_fetch->oid[i]));
+		oidcpy(&new_ref->old_oid, &to_fetch->oid[i]);
+		new_ref->next = ref;
+		ref = new_ref;
+	}
+	fetch_refs(remote_name, ref);
+}
diff --git a/fetch-object.h b/fetch-object.h
index f371300c8..4b269d07e 100644
--- a/fetch-object.h
+++ b/fetch-object.h
@@ -1,6 +1,11 @@
 #ifndef FETCH_OBJECT_H
 #define FETCH_OBJECT_H
 
+#include "sha1-array.h"
+
 extern void fetch_object(const char *remote_name, const unsigned char *sha1);
 
+extern void fetch_objects(const char *remote_name,
+			  const struct oid_array *to_fetch);
+
 #endif
diff --git a/t/t5601-clone.sh b/t/t5601-clone.sh
index 951b1ffa8..0ca24c215 100755
--- a/t/t5601-clone.sh
+++ b/t/t5601-clone.sh
@@ -611,6 +611,58 @@ test_expect_success 'partial clone: warn if server does not support blob-max-byt
 	test_i18ngrep "blob-max-bytes not recognized by server" err
 '
 
+test_expect_success 'batch missing blob request during checkout' '
+	rm -rf server client &&
+
+	test_create_repo server &&
+	echo a >server/a &&
+	echo b >server/b &&
+	git -C server add a b &&
+
+	git -C server commit -m x &&
+	echo aa >server/a &&
+	echo bb >server/b &&
+	git -C server add a b &&
+	git -C server commit -m x &&
+
+	test_config -C server uploadpack.advertiseblobmaxbytes 1 &&
+	test_config -C server uploadpack.allowanysha1inwant 1 &&
+
+	git clone --blob-max-bytes=0 "file://$(pwd)/server" client &&
+
+	# Ensure that there is only one negotiation by checking that there is
+	# only "done" line sent. ("done" marks the end of negotiation.)
+	GIT_TRACE_PACKET="$(pwd)/trace" git -C client checkout HEAD^ &&
+	grep "git> done" trace >done_lines &&
+	test_line_count = 1 done_lines
+'
+
+test_expect_success 'batch missing blob request does not inadvertently try to fetch gitlinks' '
+	rm -rf server client &&
+
+	test_create_repo repo_for_submodule &&
+	test_commit -C repo_for_submodule x &&
+
+	test_create_repo server &&
+	echo a >server/a &&
+	echo b >server/b &&
+	git -C server add a b &&
+	git -C server commit -m x &&
+
+	echo aa >server/a &&
+	echo bb >server/b &&
+	# Also add a gitlink pointing to an arbitrary repository
+	git -C server submodule add "$(pwd)/repo_for_submodule" c &&
+	git -C server add a b c &&
+	git -C server commit -m x &&
+
+	test_config -C server uploadpack.advertiseblobmaxbytes 1 &&
+	test_config -C server uploadpack.allowanysha1inwant 1 &&
+
+	# Make sure that it succeeds
+	git clone --blob-max-bytes=0 "file://$(pwd)/server" client
+'
+
 . "$TEST_DIRECTORY"/lib-httpd.sh
 start_httpd
 
diff --git a/unpack-trees.c b/unpack-trees.c
index 71b70ccb1..73a1cdb43 100644
--- a/unpack-trees.c
+++ b/unpack-trees.c
@@ -14,6 +14,7 @@
 #include "dir.h"
 #include "submodule.h"
 #include "submodule-config.h"
+#include "fetch-object.h"
 
 /*
  * Error messages expected by scripts out of plumbing commands such as
@@ -369,6 +370,27 @@ static int check_updates(struct unpack_trees_options *o)
 		load_gitmodules_file(index, &state);
 
 	enable_delayed_checkout(&state);
+	if (repository_format_partial_clone && o->update && !o->dry_run) {
+		/*
+		 * Prefetch the objects that are to be checked out in the loop
+		 * below.
+		 */
+		struct oid_array to_fetch = OID_ARRAY_INIT;
+		int fetch_if_missing_store = fetch_if_missing;
+		fetch_if_missing = 0;
+		for (i = 0; i < index->cache_nr; i++) {
+			struct cache_entry *ce = index->cache[i];
+			if ((ce->ce_flags & CE_UPDATE) &&
+			    !S_ISGITLINK(ce->ce_mode)) {
+				if (!has_object_file(&ce->oid))
+					oid_array_append(&to_fetch, &ce->oid);
+			}
+		}
+		if (to_fetch.nr)
+			fetch_objects(repository_format_partial_clone,
+				      &to_fetch);
+		fetch_if_missing = fetch_if_missing_store;
+	}
 	for (i = 0; i < index->cache_nr; i++) {
 		struct cache_entry *ce = index->cache[i];
 
-- 
2.14.2.822.g60be5d43e6-goog


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 18/18] fetch-pack: restore save_commit_buffer after use
  2017-09-29 20:11 [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches) Jonathan Tan
                   ` (16 preceding siblings ...)
  2017-09-29 20:11 ` [PATCH 17/18] unpack-trees: batch fetching of missing blobs Jonathan Tan
@ 2017-09-29 20:11 ` Jonathan Tan
  2017-09-29 21:08 ` [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches) Johannes Schindelin
                   ` (2 subsequent siblings)
  20 siblings, 0 replies; 28+ messages in thread
From: Jonathan Tan @ 2017-09-29 20:11 UTC (permalink / raw)
  To: git; +Cc: Jonathan Tan, gitster, git, peartben, christian.couder

In fetch-pack, the global variable save_commit_buffer is set to 0, but
not restored to its original value after use.

In particular, if show_log() (in log-tree.c) is invoked after
fetch_pack() in the same process, show_log() will return before printing
out the commit message (because the invocation to
get_cached_commit_buffer() returns NULL, because the commit buffer was
not saved). I discovered this when attempting to run "git log -S" in a
partial clone, triggering the case where revision walking lazily loads
missing objects.

Therefore, restore save_commit_buffer to its original value after use.

An alternative to solve the problem I had is to replace
get_cached_commit_buffer() with get_commit_buffer(). That invocation was
introduced in commit a97934d ("use get_cached_commit_buffer where
appropriate", 2014-06-13) to replace "commit->buffer" introduced in
commit 3131b71 ("Add "--show-all" revision walker flag for debugging",
2008-02-13). In the latter commit, the commit author seems to be
deciding between not showing an unparsed commit at all and showing an
unparsed commit without the message (which is what the commit does), and
did not mention parsing the unparsed commit, so I prefer to preserve the
existing behavior.

Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
---
 fetch-pack.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fetch-pack.c b/fetch-pack.c
index 19b8e9322..8e6f54547 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -719,6 +719,7 @@ static int everything_local(struct fetch_pack_args *args,
 {
 	struct ref *ref;
 	int retval;
+	int old_save_commit_buffer = save_commit_buffer;
 	timestamp_t cutoff = 0;
 
 	save_commit_buffer = 0;
@@ -786,6 +787,9 @@ static int everything_local(struct fetch_pack_args *args,
 		print_verbose(args, _("already have %s (%s)"), oid_to_hex(remote),
 			      ref->name);
 	}
+
+	save_commit_buffer = old_save_commit_buffer;
+
 	return retval;
 }
 
-- 
2.14.2.822.g60be5d43e6-goog


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches)
  2017-09-29 20:11 [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches) Jonathan Tan
                   ` (17 preceding siblings ...)
  2017-09-29 20:11 ` [PATCH 18/18] fetch-pack: restore save_commit_buffer after use Jonathan Tan
@ 2017-09-29 21:08 ` Johannes Schindelin
  2017-10-02  4:23 ` Junio C Hamano
  2017-10-03  6:15 ` Christian Couder
  20 siblings, 0 replies; 28+ messages in thread
From: Johannes Schindelin @ 2017-09-29 21:08 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, gitster, git, peartben, christian.couder

Hi Jonathan,

On Fri, 29 Sep 2017, Jonathan Tan wrote:

> Jeff Hostetler has sent out some object-filtering patches [1] that is a
> superset of the object-filtering functionality that I have (in the
> pack-objects patches). I have gone for the minimal approach here, but if
> his patches are merged, I'll update my patch set to use those.

I wish there was a way for you to work *with* Jeff on this. It seems that
your aims are similar enough for that (you both need changes in the
protocol) yet different enough to allow for talking past each other (big
blobs vs narrow clone).

And I get the impression that in this instance, it slows everything down
to build competing, large patch series rather than building on top of each
other's work.

Additionally, I am not helping by pestering Jeff all the time about
different issues, so it is partially my fault.

But maybe there is a chance to *really* go for a minimal approach, as in
"incremental enough that you can share the first <N> patches"? And even
better: "come up with those first <N> patches together"?

Ciao,
Dscho



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches)
  2017-09-29 20:11 [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches) Jonathan Tan
                   ` (18 preceding siblings ...)
  2017-09-29 21:08 ` [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches) Johannes Schindelin
@ 2017-10-02  4:23 ` Junio C Hamano
  2017-10-03  6:15 ` Christian Couder
  20 siblings, 0 replies; 28+ messages in thread
From: Junio C Hamano @ 2017-10-02  4:23 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, git, peartben, christian.couder

Jonathan Tan <jonathantanmy@google.com> writes:

> Jeff Hostetler has sent out some object-filtering patches [1] that is a
> superset of the object-filtering functionality that I have (in the
> pack-objects patches). I have gone for the minimal approach here, but if
> his patches are merged, I'll update my patch set to use those.
>
> [1] https://public-inbox.org/git/20170922203017.53986-6-git@jeffhostetler.com/

Sounds good.  Or perhaps rebasing the other way around, if we feel
that the "fsck with known-missing object" part of your series is
with a better done-ness than Jeff's series (which is my impression
but I has an obvious bias that I happened to have reviewed your
series with finer toothed comb before I saw Jeff's series).

Thanks for working well together ;-).

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches)
  2017-09-29 20:11 [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches) Jonathan Tan
                   ` (19 preceding siblings ...)
  2017-10-02  4:23 ` Junio C Hamano
@ 2017-10-03  6:15 ` Christian Couder
  2017-10-03  8:50   ` Junio C Hamano
  20 siblings, 1 reply; 28+ messages in thread
From: Christian Couder @ 2017-10-03  6:15 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, Junio C Hamano, Jeff Hostetler, Ben Peart

On Fri, Sep 29, 2017 at 10:11 PM, Jonathan Tan <jonathantanmy@google.com> wrote:
> These patches are also available online:
> https://github.com/jonathantanmy/git/commits/partialclone3
>
> (I've announced it in another e-mail, but am now sending the patches to the
> mailing list too.)
>
> Here's an update of my work so far. Notable features:
>  - These 18 patches allow a user to clone with --blob-max-bytes=<bytes>,
>    creating a partial clone that is automatically configured to lazily
>    fetch missing objects from the origin. The local repo also has fsck
>    working offline, and GC working (albeit only on locally created
>    objects).
>  - Cloning and fetching is currently only able to exclude blobs by a
>    size threshold, but the local repository is already capable of
>    fetching missing objects of any type. For example, if a repository
>    with missing trees or commits is generated by any tool (for example,
>    a future version of Git), current Git with my patches will still be
>    able to operate on them, automatically fetching those missing trees
>    and commits when needed.
>  - Missing blobs are fetched all at once during checkout.

Could you give a bit more details about the use cases this is designed for?
It seems that when people review my work they want a lot of details
about the use cases, so I guess they would also be interesting in
getting this kind of information for your work too.

Could this support users who would be interested in lazily cloning
only one kind of files, for example *.jpeg?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches)
  2017-10-03  6:15 ` Christian Couder
@ 2017-10-03  8:50   ` Junio C Hamano
  2017-10-03 14:39     ` Jeff Hostetler
  0 siblings, 1 reply; 28+ messages in thread
From: Junio C Hamano @ 2017-10-03  8:50 UTC (permalink / raw)
  To: Christian Couder; +Cc: Jonathan Tan, git, Jeff Hostetler, Ben Peart

Christian Couder <christian.couder@gmail.com> writes:

> Could you give a bit more details about the use cases this is designed for?
> It seems that when people review my work they want a lot of details
> about the use cases, so I guess they would also be interesting in
> getting this kind of information for your work too.
>
> Could this support users who would be interested in lazily cloning
> only one kind of files, for example *.jpeg?

I do not know about others, but the reason why I was not interested
in finding out "use cases" is because the value of this series is
use-case agnostic.

At least to me, the most interesting part of the series is that it
allows you to receive a set of objects transferred from the other
side that lack some of objects that would otherwise be required to
be here for connectivity purposes, and it introduces a mechanism to
allow object transfer layer, gc and fsck to work well together in
the resulting repository that deliberately lacks some objects.  The
transfer layer marks the objects obtained from a specific remote as
such, and gc and fsck are taught not to try to follow a missing link
or diagnose a missing link as an error, if a missing link is
expected using the mark the transfer layer left.

and it does so in such a way that it is use-case agnostic.  The
mechanism does not care how the objects to be omitted were chosen,
and how the omission criteria were negotiated between the sender and
the receiver of the pack.

I think the series comes with one filter that is size-based, but I
view it as a technology demonstration.  It does not have to be the
primary use case.  IOW, I do not think the series is meant as a
declaration that size-based filtering is the most important thing
and other omission criteria are less important.

You should be able to build path based omission (i.e. narrow clone)
or blobtype based omission.  Depending on your needs, you may want
different object omission criteria.  It is something you can build
on top.  And the work done on transfer/gc/fsck in this series does
not have to change to accommodate these different "use cases".



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches)
  2017-10-03  8:50   ` Junio C Hamano
@ 2017-10-03 14:39     ` Jeff Hostetler
  2017-10-03 23:42       ` Jonathan Tan
  0 siblings, 1 reply; 28+ messages in thread
From: Jeff Hostetler @ 2017-10-03 14:39 UTC (permalink / raw)
  To: Junio C Hamano, Christian Couder; +Cc: Jonathan Tan, git, Ben Peart



On 10/3/2017 4:50 AM, Junio C Hamano wrote:
> Christian Couder <christian.couder@gmail.com> writes:
> 
>> Could you give a bit more details about the use cases this is designed for?
>> It seems that when people review my work they want a lot of details
>> about the use cases, so I guess they would also be interesting in
>> getting this kind of information for your work too.
>>
>> Could this support users who would be interested in lazily cloning
>> only one kind of files, for example *.jpeg?
> 
> I do not know about others, but the reason why I was not interested
> in finding out "use cases" is because the value of this series is
> use-case agnostic.
> 
> At least to me, the most interesting part of the series is that it
> allows you to receive a set of objects transferred from the other
> side that lack some of objects that would otherwise be required to
> be here for connectivity purposes, and it introduces a mechanism to
> allow object transfer layer, gc and fsck to work well together in
> the resulting repository that deliberately lacks some objects.  The
> transfer layer marks the objects obtained from a specific remote as
> such, and gc and fsck are taught not to try to follow a missing link
> or diagnose a missing link as an error, if a missing link is
> expected using the mark the transfer layer left.
> 
> and it does so in such a way that it is use-case agnostic.  The
> mechanism does not care how the objects to be omitted were chosen,
> and how the omission criteria were negotiated between the sender and
> the receiver of the pack.
> 
> I think the series comes with one filter that is size-based, but I
> view it as a technology demonstration.  It does not have to be the
> primary use case.  IOW, I do not think the series is meant as a
> declaration that size-based filtering is the most important thing
> and other omission criteria are less important.
> 
> You should be able to build path based omission (i.e. narrow clone)
> or blobtype based omission.  Depending on your needs, you may want
> different object omission criteria.  It is something you can build
> on top.  And the work done on transfer/gc/fsck in this series does
> not have to change to accommodate these different "use cases".
> 

Agreed.

There are lots of reasons for wanting partial clones (and we've been
flinging lots of RFCs at each other that each seem to have a different
base assumption (small-blobs-only vs sparse-checkout vs <whatever>))
and not reaching consensus or closure.

The main thing is to allow the client to use partial clone to request
a "subset", let the server deliver that "subset", and let the client
tooling deal with an incomplete view of the repo.

As I see it there are the following major parts to partial clone:
1. How to let git-clone (and later git-fetch) specify the desired
    subset of objects that it wants?  (A ref-relative request.)
2. How to let the server and git-pack-objects build that incomplete
    packfile?
3. How to remember in the local config that a partial clone (or
    fetch) was used and that missing object should be expected?
4. How to dynamically fetch individual missing objects individually?
     (Not a ref-relative request.)
5. How to augment the local ODB with partial clone information and
    let git-fsck (and friends) perform limited consistency checking?
6. Methods to bulk fetching missing objects (whether in a pre-verb
    hook or in unpack-tree)
7. Miscellaneous issues (e.g. fixing places that accidentally cause
    a missing object to be fetched that don't really need it).

My proposal [1] includes a generic filtering mechanism that handles 3
types of filtering and makes it easy to add other techniques as we
see fit.  It slips in at the list-objects / traverse_commit_list
level and hides all of the details from rev-list and pack-objects.
I have a follow on proposal [2] that extends the filtering parameter
handling to git-clone, git-fetch, git-fetch-pack, git-upload-pack
and the pack protocol.  That takes care of items 1 and 2 above.

Jonathan's proposal [3] includes code to update the local config,
dynamically fetch individual objects, and handle the local ODB and
fsck consistency checking.  That takes care of items 3, 4, and 5.

As was suggested above, I think we should merge our efforts:
using my filtering for 1 and 2 and Jonathan's code for 3, 4, and 5.
I would need to eliminate the "relax" options in favor of his
is_promised() functionality for index-pack and similar.  And omit
his blob-max-bytes changes from pack-objects, the protocol and
related commands.

That should be a good first step.

We both have thoughts on bulk fetching (mine in pre-verb hooks and
his in unpack-tree).  We don't need this immediately, but can wait
until the above is working to revisit.

[1] https://github.com/jeffhostetler/git/pull/3
[2]https://github.com/jeffhostetler/git/pull/4
[3] https://github.com/jonathantanmy/git/tree/partialclone3

Thoughts?

Thanks,
Jeff

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches)
  2017-10-03 14:39     ` Jeff Hostetler
@ 2017-10-03 23:42       ` Jonathan Tan
  2017-10-04 13:30         ` Jeff Hostetler
  0 siblings, 1 reply; 28+ messages in thread
From: Jonathan Tan @ 2017-10-03 23:42 UTC (permalink / raw)
  To: Jeff Hostetler; +Cc: Junio C Hamano, Christian Couder, git, Ben Peart

On Tue, Oct 3, 2017 at 7:39 AM, Jeff Hostetler <git@jeffhostetler.com> wrote:
>
> As I see it there are the following major parts to partial clone:
> 1. How to let git-clone (and later git-fetch) specify the desired
>    subset of objects that it wants?  (A ref-relative request.)
> 2. How to let the server and git-pack-objects build that incomplete
>    packfile?
> 3. How to remember in the local config that a partial clone (or
>    fetch) was used and that missing object should be expected?
> 4. How to dynamically fetch individual missing objects individually?
>     (Not a ref-relative request.)
> 5. How to augment the local ODB with partial clone information and
>    let git-fsck (and friends) perform limited consistency checking?
> 6. Methods to bulk fetching missing objects (whether in a pre-verb
>    hook or in unpack-tree)
> 7. Miscellaneous issues (e.g. fixing places that accidentally cause
>    a missing object to be fetched that don't really need it).

Thanks for the enumeration.

> As was suggested above, I think we should merge our efforts:
> using my filtering for 1 and 2 and Jonathan's code for 3, 4, and 5.
> I would need to eliminate the "relax" options in favor of his
> is_promised() functionality for index-pack and similar.  And omit
> his blob-max-bytes changes from pack-objects, the protocol and
> related commands.
>
> That should be a good first step.

This sounds good to me. Jeff Hostetler's filtering (all blobs, blobs
by size, blobs by sparse checkout specification) is more comprehensive
than mine, so removing blob-max-bytes from my code is not a problem.

> We both have thoughts on bulk fetching (mine in pre-verb hooks and
> his in unpack-tree).  We don't need this immediately, but can wait
> until the above is working to revisit.

Agreed.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches)
  2017-10-03 23:42       ` Jonathan Tan
@ 2017-10-04 13:30         ` Jeff Hostetler
  0 siblings, 0 replies; 28+ messages in thread
From: Jeff Hostetler @ 2017-10-04 13:30 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Junio C Hamano, Christian Couder, git, Ben Peart



On 10/3/2017 7:42 PM, Jonathan Tan wrote:
> On Tue, Oct 3, 2017 at 7:39 AM, Jeff Hostetler <git@jeffhostetler.com> wrote:
>>
>> As I see it there are the following major parts to partial clone:
>> 1. How to let git-clone (and later git-fetch) specify the desired
>>     subset of objects that it wants?  (A ref-relative request.)
>> 2. How to let the server and git-pack-objects build that incomplete
>>     packfile?
>> 3. How to remember in the local config that a partial clone (or
>>     fetch) was used and that missing object should be expected?
>> 4. How to dynamically fetch individual missing objects individually?
>>      (Not a ref-relative request.)
>> 5. How to augment the local ODB with partial clone information and
>>     let git-fsck (and friends) perform limited consistency checking?
>> 6. Methods to bulk fetching missing objects (whether in a pre-verb
>>     hook or in unpack-tree)
>> 7. Miscellaneous issues (e.g. fixing places that accidentally cause
>>     a missing object to be fetched that don't really need it).
> 
> Thanks for the enumeration.
> 
>> As was suggested above, I think we should merge our efforts:
>> using my filtering for 1 and 2 and Jonathan's code for 3, 4, and 5.
>> I would need to eliminate the "relax" options in favor of his
>> is_promised() functionality for index-pack and similar.  And omit
>> his blob-max-bytes changes from pack-objects, the protocol and
>> related commands.
>>
>> That should be a good first step.
> 
> This sounds good to me. Jeff Hostetler's filtering (all blobs, blobs
> by size, blobs by sparse checkout specification) is more comprehensive
> than mine, so removing blob-max-bytes from my code is not a problem.
> 
>> We both have thoughts on bulk fetching (mine in pre-verb hooks and
>> his in unpack-tree).  We don't need this immediately, but can wait
>> until the above is working to revisit.
> 
> Agreed.
> 

Thanks.

I'll make a first pass at merging our efforts then and
post something shortly.

Jeff


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 07/18] sha1_file: support lazily fetching missing objects
  2017-09-29 20:11 ` [PATCH 07/18] sha1_file: support lazily fetching missing objects Jonathan Tan
@ 2017-10-12 14:42   ` Christian Couder
  2017-10-12 15:45     ` Christian Couder
  0 siblings, 1 reply; 28+ messages in thread
From: Christian Couder @ 2017-10-12 14:42 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, Junio C Hamano, Jeff Hostetler, Ben Peart

On Fri, Sep 29, 2017 at 10:11 PM, Jonathan Tan <jonathantanmy@google.com> wrote:

> diff --git a/sha1_file.c b/sha1_file.c
> index b4a67bb83..77aa0ffdf 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -29,6 +29,7 @@
>  #include "mergesort.h"
>  #include "quote.h"
>  #include "packfile.h"
> +#include "fetch-object.h"
>
>  const unsigned char null_sha1[GIT_MAX_RAWSZ];
>  const struct object_id null_oid;
> @@ -1149,6 +1150,8 @@ static int sha1_loose_object_info(const unsigned char *sha1,
>         return (status < 0) ? status : 0;
>  }
>
> +int fetch_if_missing = 1;
> +
>  int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi, unsigned flags)
>  {
>         static struct object_info blank_oi = OBJECT_INFO_INIT;
> @@ -1157,6 +1160,7 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
>         const unsigned char *real = (flags & OBJECT_INFO_LOOKUP_REPLACE) ?
>                                     lookup_replace_object(sha1) :
>                                     sha1;
> +       int already_retried = 0;
>
>         if (!oi)
>                 oi = &blank_oi;
> @@ -1181,28 +1185,36 @@ int sha1_object_info_extended(const unsigned char *sha1, struct object_info *oi,
>                 }
>         }
>
> -       if (!find_pack_entry(real, &e)) {
> -               /* Most likely it's a loose object. */
> -               if (!sha1_loose_object_info(real, oi, flags))
> -                       return 0;
> +retry:
> +       if (find_pack_entry(real, &e))
> +               goto found_packed;
>
> -               /* Not a loose object; someone else may have just packed it. */
> -               if (flags & OBJECT_INFO_QUICK) {
> -                       return -1;
> -               } else {
> -                       reprepare_packed_git();
> -                       if (!find_pack_entry(real, &e))
> -                               return -1;
> -               }
> +       /* Most likely it's a loose object. */
> +       if (!sha1_loose_object_info(real, oi, flags))
> +               return 0;
> +
> +       /* Not a loose object; someone else may have just packed it. */
> +       reprepare_packed_git();
> +       if (find_pack_entry(real, &e))
> +               goto found_packed;
> +
> +       /* Check if it is a missing object */
> +       if (fetch_if_missing && repository_format_partial_clone &&
> +           !already_retried) {
> +               fetch_object(repository_format_partial_clone, real);
> +               already_retried = 1;
> +               goto retry;
>         }
>
> +       return -1;
> +
> +found_packed:
>         if (oi == &blank_oi)
>                 /*
>                  * We know that the caller doesn't actually need the
>                  * information below, so return early.
>                  */
>                 return 0;
> -
>         rtype = packed_object_info(e.p, e.offset, oi);
>         if (rtype < 0) {
>                 mark_bad_packed_object(e.p, real);

Instead of adding labels and gotos, I would suggest adding a new
function like the following does on top of your changes:

diff --git a/sha1_file.c b/sha1_file.c
index cc1aa0bd7f..02a6ed1e9b 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1171,18 +1171,40 @@ static int sha1_loose_object_info(const
unsigned char *sha1,
        return (status < 0) ? status : 0;
 }

+int try_find_packed_entry_or_loose_object(const unsigned char *real,
struct object_info *oi,
+                                         unsigned flags, struct
pack_entry *e, int retry)
+{
+       if (find_pack_entry(real, e))
+               return 1;
+
+       /* Most likely it's a loose object. */
+       if (!sha1_loose_object_info(real, oi, flags))
+               return 0;
+
+       /* Not a loose object; someone else may have just packed it. */
+       reprepare_packed_git();
+       if (find_pack_entry(real, e))
+               return 1;
+
+       /* Check if it is a missing object */
+       if (fetch_if_missing && repository_format_partial_clone && retry) {
+               fetch_object(repository_format_partial_clone, real);
+               return try_find_packed_entry_or_loose_object(real, oi,
flags, e, 0);
+       }
+
+       return -1;
+}
+
 int fetch_if_missing = 1;

 int sha1_object_info_extended(const unsigned char *sha1, struct
object_info *oi, unsigned flags)
 {
        static struct object_info blank_oi = OBJECT_INFO_INIT;
        struct pack_entry e;
-       int rtype;
+       int rtype, res;
        const unsigned char *real = (flags & OBJECT_INFO_LOOKUP_REPLACE) ?
                                    lookup_replace_object(sha1) :
                                    sha1;
-       int already_retried = 0;
-
        if (!oi)
                oi = &blank_oi;

@@ -1206,30 +1228,10 @@ int sha1_object_info_extended(const unsigned
char *sha1, struct object_info *oi,
                }
        }

-retry:
-       if (find_pack_entry(real, &e))
-               goto found_packed;
-
-       /* Most likely it's a loose object. */
-       if (!sha1_loose_object_info(real, oi, flags))
-               return 0;
-
-       /* Not a loose object; someone else may have just packed it. */
-       reprepare_packed_git();
-       if (find_pack_entry(real, &e))
-               goto found_packed;
-
-       /* Check if it is a missing object */
-       if (fetch_if_missing && repository_format_partial_clone &&
-           !already_retried) {
-               fetch_object(repository_format_partial_clone, real);
-               already_retried = 1;
-               goto retry;
-       }
-
-       return -1;
+       res = try_find_packed_entry_or_loose_object(real, oi, flags, &e, 1);
+       if (res < 1)
+               return res;

-found_packed:
        if (oi == &blank_oi)
                /*
                 * We know that the caller doesn't actually need the

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 07/18] sha1_file: support lazily fetching missing objects
  2017-10-12 14:42   ` Christian Couder
@ 2017-10-12 15:45     ` Christian Couder
  0 siblings, 0 replies; 28+ messages in thread
From: Christian Couder @ 2017-10-12 15:45 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, Junio C Hamano, Jeff Hostetler, Ben Peart

On Thu, Oct 12, 2017 at 4:42 PM, Christian Couder
<christian.couder@gmail.com> wrote:
>
> Instead of adding labels and gotos, I would suggest adding a new
> function like the following does on top of your changes:

Sorry for the usual Gmail related problems. You can find the patch and
a further refactoring that adds an object_info_from_pack_entry()
function there:

https://github.com/chriscool/git/commits/pu-partial-clone-lazy-fetch-improved

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, back to index

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-29 20:11 [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches) Jonathan Tan
2017-09-29 20:11 ` [PATCH 01/18] fsck: introduce partialclone extension Jonathan Tan
2017-09-29 20:11 ` [PATCH 02/18] fsck: support refs pointing to promisor objects Jonathan Tan
2017-09-29 20:11 ` [PATCH 03/18] fsck: support referenced " Jonathan Tan
2017-09-29 20:11 ` [PATCH 04/18] fsck: support promisor objects as CLI argument Jonathan Tan
2017-09-29 20:11 ` [PATCH 05/18] index-pack: refactor writing of .keep files Jonathan Tan
2017-09-29 20:11 ` [PATCH 06/18] introduce fetch-object: fetch one promisor object Jonathan Tan
2017-09-29 20:11 ` [PATCH 07/18] sha1_file: support lazily fetching missing objects Jonathan Tan
2017-10-12 14:42   ` Christian Couder
2017-10-12 15:45     ` Christian Couder
2017-09-29 20:11 ` [PATCH 08/18] rev-list: support termination at promisor " Jonathan Tan
2017-09-29 20:11 ` [PATCH 09/18] gc: do not repack promisor packfiles Jonathan Tan
2017-09-29 20:11 ` [PATCH 10/18] pack-objects: rename want_.* to ignore_.* Jonathan Tan
2017-09-29 20:11 ` [PATCH 11/18] pack-objects: support --blob-max-bytes Jonathan Tan
2017-09-29 20:11 ` [PATCH 12/18] fetch-pack: support excluding large blobs Jonathan Tan
2017-09-29 20:11 ` [PATCH 13/18] fetch: refactor calculation of remote list Jonathan Tan
2017-09-29 20:11 ` [PATCH 14/18] fetch: support excluding large blobs Jonathan Tan
2017-09-29 20:11 ` [PATCH 15/18] clone: " Jonathan Tan
2017-09-29 20:11 ` [PATCH 16/18] clone: configure blobmaxbytes in created repos Jonathan Tan
2017-09-29 20:11 ` [PATCH 17/18] unpack-trees: batch fetching of missing blobs Jonathan Tan
2017-09-29 20:11 ` [PATCH 18/18] fetch-pack: restore save_commit_buffer after use Jonathan Tan
2017-09-29 21:08 ` [PATCH 00/18] Partial clone (from clone to lazy fetch in 18 patches) Johannes Schindelin
2017-10-02  4:23 ` Junio C Hamano
2017-10-03  6:15 ` Christian Couder
2017-10-03  8:50   ` Junio C Hamano
2017-10-03 14:39     ` Jeff Hostetler
2017-10-03 23:42       ` Jonathan Tan
2017-10-04 13:30         ` Jeff Hostetler

git@vger.kernel.org mailing list mirror (one of many)

Archives are clonable:
	git clone --mirror https://public-inbox.org/git
	git clone --mirror http://ou63pmih66umazou.onion/git
	git clone --mirror http://czquwvybam4bgbro.onion/git
	git clone --mirror http://hjrcffqmbrq6wope.onion/git

Newsgroups are available over NNTP:
	nntp://news.public-inbox.org/inbox.comp.version-control.git
	nntp://ou63pmih66umazou.onion/inbox.comp.version-control.git
	nntp://czquwvybam4bgbro.onion/inbox.comp.version-control.git
	nntp://hjrcffqmbrq6wope.onion/inbox.comp.version-control.git
	nntp://news.gmane.org/gmane.comp.version-control.git

 note: .onion URLs require Tor: https://www.torproject.org/
       or Tor2web: https://www.tor2web.org/

AGPL code for this site: git clone https://public-inbox.org/ public-inbox