git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [PATCH 00/13] WIP Partial clone part 1: object filtering
@ 2017-10-24 18:53 Jeff Hostetler
  2017-10-24 18:53 ` [PATCH 01/13] dir: allow exclusions from blob in addition to file Jeff Hostetler
                   ` (14 more replies)
  0 siblings, 15 replies; 37+ messages in thread
From: Jeff Hostetler @ 2017-10-24 18:53 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, jonathantanmy, Jeff Hostetler

From: Jeff Hostetler <jeffhost@microsoft.com>

I've been working with Jonathan Tan to combine our partial clone
proposals.  This patch series represents a first step in that effort
and introduces an object filtering mechanism to select unwanted
objects.

[1] traverse_commit_list and list-objects is extended to allow
    various filters.
[2] rev-list is extended to expose filtering.  This allows testing
    of the filtering options.  And can be used later to predict
    missing objects before commands like checkout or merge.
[3] pack-objects is extended to use filtering parameters and build
    packfiles that omit unwanted objects.

This patch series lays the ground work for subsequent parts which
will extend clone, fetch, fetch-pack, upload-pack, fsck, and etc.


Jeff Hostetler (13):
  dir: allow exclusions from blob in addition to file
  list-objects-filter-map: extend oidmap to collect omitted objects
  list-objects: filter objects in traverse_commit_list
  list-objects-filter-blobs-none: add filter to omit all blobs
  list-objects-filter-blobs-limit: add large blob filtering
  list-objects-filter-sparse: add sparse filter
  list-objects-filter-options: common argument parsing
  list-objects: add traverse_commit_list_filtered method
  extension.partialclone: introduce partial clone extension
  rev-list: add list-objects filtering support
  t6112: rev-list object filtering test
  pack-objects: add list-objects filtering
  t5317: pack-objects object filtering test

 Documentation/git-pack-objects.txt             |   8 +-
 Documentation/git-rev-list.txt                 |   5 +-
 Documentation/rev-list-options.txt             |  30 ++
 Documentation/technical/repository-version.txt |  22 ++
 Makefile                                       |   6 +
 builtin/pack-objects.c                         |  18 +-
 builtin/rev-list.c                             |  84 +++++-
 cache.h                                        |   4 +
 config.h                                       |   3 +
 dir.c                                          |  51 +++-
 dir.h                                          |   3 +
 environment.c                                  |   2 +
 list-objects-filter-blobs-limit.c              | 146 ++++++++++
 list-objects-filter-blobs-limit.h              |  18 ++
 list-objects-filter-blobs-none.c               |  83 ++++++
 list-objects-filter-blobs-none.h               |  18 ++
 list-objects-filter-map.c                      |  63 ++++
 list-objects-filter-map.h                      |  26 ++
 list-objects-filter-options.c                  | 101 +++++++
 list-objects-filter-options.h                  |  50 ++++
 list-objects-filter-sparse.c                   | 241 ++++++++++++++++
 list-objects-filter-sparse.h                   |  30 ++
 list-objects.c                                 | 111 +++++--
 list-objects.h                                 |  43 ++-
 partial-clone-utils.c                          |  99 +++++++
 partial-clone-utils.h                          |  34 +++
 setup.c                                        |  15 +
 t/t5317-pack-objects-filter-objects.sh         | 384 +++++++++++++++++++++++++
 t/t6112-rev-list-filters-objects.sh            | 223 ++++++++++++++
 29 files changed, 1897 insertions(+), 24 deletions(-)
 create mode 100644 list-objects-filter-blobs-limit.c
 create mode 100644 list-objects-filter-blobs-limit.h
 create mode 100644 list-objects-filter-blobs-none.c
 create mode 100644 list-objects-filter-blobs-none.h
 create mode 100644 list-objects-filter-map.c
 create mode 100644 list-objects-filter-map.h
 create mode 100644 list-objects-filter-options.c
 create mode 100644 list-objects-filter-options.h
 create mode 100644 list-objects-filter-sparse.c
 create mode 100644 list-objects-filter-sparse.h
 create mode 100644 partial-clone-utils.c
 create mode 100644 partial-clone-utils.h
 create mode 100755 t/t5317-pack-objects-filter-objects.sh
 create mode 100755 t/t6112-rev-list-filters-objects.sh

-- 
2.9.3


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH 01/13] dir: allow exclusions from blob in addition to file
  2017-10-24 18:53 [PATCH 00/13] WIP Partial clone part 1: object filtering Jeff Hostetler
@ 2017-10-24 18:53 ` Jeff Hostetler
  2017-10-25  4:05   ` Eric Sunshine
  2017-10-25  6:43   ` Junio C Hamano
  2017-10-24 18:53 ` [PATCH 02/13] list-objects-filter-map: extend oidmap to collect omitted objects Jeff Hostetler
                   ` (13 subsequent siblings)
  14 siblings, 2 replies; 37+ messages in thread
From: Jeff Hostetler @ 2017-10-24 18:53 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, jonathantanmy, Jeff Hostetler

From: Jeff Hostetler <jeffhost@microsoft.com>

Refactor add_excludes() to separate the reading of the
exclude file into a buffer and the parsing of the buffer
into exclude_list items.

Add add_excludes_from_blob_to_list() to allow an exclude
file be specified with an OID.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 dir.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++--
 dir.h |  3 +++
 2 files changed, 52 insertions(+), 2 deletions(-)

diff --git a/dir.c b/dir.c
index 1d17b80..d848f2b 100644
--- a/dir.c
+++ b/dir.c
@@ -739,6 +739,10 @@ static void invalidate_directory(struct untracked_cache *uc,
 		dir->dirs[i]->recurse = 0;
 }
 
+static int add_excludes_from_buffer(char *buf, size_t size,
+				    const char *base, int baselen,
+				    struct exclude_list *el);
+
 /*
  * Given a file with name "fname", read it (either from disk, or from
  * an index if 'istate' is non-null), parse it and store the
@@ -754,9 +758,9 @@ static int add_excludes(const char *fname, const char *base, int baselen,
 			struct sha1_stat *sha1_stat)
 {
 	struct stat st;
-	int fd, i, lineno = 1;
+	int fd;
 	size_t size = 0;
-	char *buf, *entry;
+	char *buf;
 
 	fd = open(fname, O_RDONLY);
 	if (fd < 0 || fstat(fd, &st) < 0) {
@@ -813,6 +817,17 @@ static int add_excludes(const char *fname, const char *base, int baselen,
 		}
 	}
 
+	add_excludes_from_buffer(buf, size, base, baselen, el);
+	return 0;
+}
+
+static int add_excludes_from_buffer(char *buf, size_t size,
+				    const char *base, int baselen,
+				    struct exclude_list *el)
+{
+	int i, lineno = 1;
+	char *entry;
+
 	el->filebuf = buf;
 
 	if (skip_utf8_bom(&buf, size))
@@ -841,6 +856,38 @@ int add_excludes_from_file_to_list(const char *fname, const char *base,
 	return add_excludes(fname, base, baselen, el, istate, NULL);
 }
 
+int add_excludes_from_blob_to_list(
+	struct object_id *oid,
+	const char *base, int baselen,
+	struct exclude_list *el)
+{
+	char *buf;
+	unsigned long size;
+	enum object_type type;
+
+	buf = read_sha1_file(oid->hash, &type, &size);
+	if (!buf)
+		return -1;
+
+	if (type != OBJ_BLOB) {
+		free(buf);
+		return -1;
+	}
+
+	if (size == 0) {
+		free(buf);
+		return 0;
+	}
+
+	if (buf[size - 1] != '\n') {
+		buf = xrealloc(buf, st_add(size, 1));
+		buf[size++] = '\n';
+	}
+
+	add_excludes_from_buffer(buf, size, base, baselen, el);
+	return 0;
+}
+
 struct exclude_list *add_exclude_list(struct dir_struct *dir,
 				      int group_type, const char *src)
 {
diff --git a/dir.h b/dir.h
index e371705..1bcf391 100644
--- a/dir.h
+++ b/dir.h
@@ -256,6 +256,9 @@ extern struct exclude_list *add_exclude_list(struct dir_struct *dir,
 extern int add_excludes_from_file_to_list(const char *fname, const char *base, int baselen,
 					  struct exclude_list *el, struct  index_state *istate);
 extern void add_excludes_from_file(struct dir_struct *, const char *fname);
+extern int add_excludes_from_blob_to_list(struct object_id *oid,
+					  const char *base, int baselen,
+					  struct exclude_list *el);
 extern void parse_exclude_pattern(const char **string, int *patternlen, unsigned *flags, int *nowildcardlen);
 extern void add_exclude(const char *string, const char *base,
 			int baselen, struct exclude_list *el, int srcpos);
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 02/13] list-objects-filter-map: extend oidmap to collect omitted objects
  2017-10-24 18:53 [PATCH 00/13] WIP Partial clone part 1: object filtering Jeff Hostetler
  2017-10-24 18:53 ` [PATCH 01/13] dir: allow exclusions from blob in addition to file Jeff Hostetler
@ 2017-10-24 18:53 ` Jeff Hostetler
  2017-10-25  7:10   ` Junio C Hamano
  2017-10-24 18:53 ` [PATCH 03/13] list-objects: filter objects in traverse_commit_list Jeff Hostetler
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 37+ messages in thread
From: Jeff Hostetler @ 2017-10-24 18:53 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, jonathantanmy, Jeff Hostetler

From: Jeff Hostetler <jeffhost@microsoft.com>

Create helper class to extend oidmap to collect a list of
omitted or missing objects during traversal.

This will be used in a later commit by the list-object filtering
code.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 Makefile                  |  1 +
 list-objects-filter-map.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++
 list-objects-filter-map.h | 26 +++++++++++++++++++
 3 files changed, 90 insertions(+)
 create mode 100644 list-objects-filter-map.c
 create mode 100644 list-objects-filter-map.h

diff --git a/Makefile b/Makefile
index cd75985..e59f12d 100644
--- a/Makefile
+++ b/Makefile
@@ -807,6 +807,7 @@ LIB_OBJS += levenshtein.o
 LIB_OBJS += line-log.o
 LIB_OBJS += line-range.o
 LIB_OBJS += list-objects.o
+LIB_OBJS += list-objects-filter-map.o
 LIB_OBJS += ll-merge.o
 LIB_OBJS += lockfile.o
 LIB_OBJS += log-tree.o
diff --git a/list-objects-filter-map.c b/list-objects-filter-map.c
new file mode 100644
index 0000000..7e496b3
--- /dev/null
+++ b/list-objects-filter-map.c
@@ -0,0 +1,63 @@
+#include "cache.h"
+#include "list-objects-filter-map.h"
+
+int list_objects_filter_map_insert(struct oidmap *map,
+				   const struct object_id *oid,
+				   const char *pathname, enum object_type type)
+{
+	size_t len, size;
+	struct list_objects_filter_map_entry *e;
+
+	if (oidmap_get(map, oid))
+		return 1;
+
+	len = ((pathname && *pathname) ? strlen(pathname) : 0);
+	size = (offsetof(struct list_objects_filter_map_entry, pathname) + len + 1);
+	e = xcalloc(1, size);
+
+	oidcpy(&e->entry.oid, oid);
+	e->type = type;
+	if (pathname && *pathname)
+		strcpy(e->pathname, pathname);
+
+	oidmap_put(map, e);
+	return 0;
+}
+
+static int my_cmp(const void *a, const void *b)
+{
+	const struct oidmap_entry *ea, *eb;
+
+	ea = *(const struct oidmap_entry **)a;
+	eb = *(const struct oidmap_entry **)b;
+
+	return oidcmp(&ea->oid, &eb->oid);
+}
+
+void list_objects_filter_map_foreach(struct oidmap *map,
+				     list_objects_filter_map_foreach_cb cb,
+				     void *cb_data)
+{
+	struct hashmap_iter iter;
+	struct list_objects_filter_map_entry **array;
+	struct list_objects_filter_map_entry *e;
+	int k, nr;
+
+	nr = hashmap_get_size(&map->map);
+	if (!nr)
+		return;
+
+	array = xcalloc(nr, sizeof(*e));
+
+	k = 0;
+	hashmap_iter_init(&map->map, &iter);
+	while ((e = hashmap_iter_next(&iter)))
+		array[k++] = e;
+
+	QSORT(array, nr, my_cmp);
+
+	for (k = 0; k < nr; k++)
+		cb(k, nr, array[k], cb_data);
+
+	free(array);
+}
diff --git a/list-objects-filter-map.h b/list-objects-filter-map.h
new file mode 100644
index 0000000..794fc81
--- /dev/null
+++ b/list-objects-filter-map.h
@@ -0,0 +1,26 @@
+#ifndef LIST_OBJECTS_FILTER_MAP_H
+#define LIST_OBJECTS_FILTER_MAP_H
+
+#include "oidmap.h"
+
+struct list_objects_filter_map_entry {
+	struct oidmap_entry entry; /* must be first */
+	enum object_type type;
+	char pathname[FLEX_ARRAY];
+};
+
+extern int list_objects_filter_map_insert(
+	struct oidmap *map,
+	const struct object_id *oid,
+	const char *pathname, enum object_type type);
+
+typedef void (*list_objects_filter_map_foreach_cb)(
+	int i, int i_limit,
+	struct list_objects_filter_map_entry *e, void *cb_data);
+
+extern void list_objects_filter_map_foreach(
+	struct oidmap *map,
+	list_objects_filter_map_foreach_cb cb,
+	void *cb_data);
+
+#endif /* LIST_OBJECTS_FILTER_MAP_H */
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 03/13] list-objects: filter objects in traverse_commit_list
  2017-10-24 18:53 [PATCH 00/13] WIP Partial clone part 1: object filtering Jeff Hostetler
  2017-10-24 18:53 ` [PATCH 01/13] dir: allow exclusions from blob in addition to file Jeff Hostetler
  2017-10-24 18:53 ` [PATCH 02/13] list-objects-filter-map: extend oidmap to collect omitted objects Jeff Hostetler
@ 2017-10-24 18:53 ` Jeff Hostetler
  2017-10-25  4:05   ` Jonathan Tan
  2017-10-24 18:53 ` [PATCH 04/13] list-objects-filter-blobs-none: add filter to omit all blobs Jeff Hostetler
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 37+ messages in thread
From: Jeff Hostetler @ 2017-10-24 18:53 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, jonathantanmy, Jeff Hostetler

From: Jeff Hostetler <jeffhost@microsoft.com>

Create traverse_commit_list_filtered() and add filtering
interface to allow certain objects to be omitted (not shown)
during a traversal.

Update traverse_commit_list() to be a wrapper for the above.

Filtering will be used in a future commit by rev-list and
pack-objects for narrow/partial clone/fetch to omit certain
blobs from the output.

traverse_bitmap_commit_list() does not work with filtering.
If a packfile bitmap is present, it will not be used.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 list-objects.c | 66 ++++++++++++++++++++++++++++++++++++++++++++--------------
 list-objects.h | 32 +++++++++++++++++++++++++++-
 2 files changed, 81 insertions(+), 17 deletions(-)

diff --git a/list-objects.c b/list-objects.c
index b3931fa..3e86008 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -13,10 +13,13 @@ static void process_blob(struct rev_info *revs,
 			 show_object_fn show,
 			 struct strbuf *path,
 			 const char *name,
-			 void *cb_data)
+			 void *cb_data,
+			 filter_object_fn filter,
+			 void *filter_data)
 {
 	struct object *obj = &blob->object;
 	size_t pathlen;
+	list_objects_filter_result r = LOFR_MARK_SEEN | LOFR_SHOW;
 
 	if (!revs->blob_objects)
 		return;
@@ -24,11 +27,15 @@ static void process_blob(struct rev_info *revs,
 		die("bad blob object");
 	if (obj->flags & (UNINTERESTING | SEEN))
 		return;
-	obj->flags |= SEEN;
 
 	pathlen = path->len;
 	strbuf_addstr(path, name);
-	show(obj, path->buf, cb_data);
+	if (filter)
+		r = filter(LOFT_BLOB, obj, path->buf, &path->buf[pathlen], filter_data);
+	if (r & LOFR_MARK_SEEN)
+		obj->flags |= SEEN;
+	if (r & LOFR_SHOW)
+		show(obj, path->buf, cb_data);
 	strbuf_setlen(path, pathlen);
 }
 
@@ -69,7 +76,9 @@ static void process_tree(struct rev_info *revs,
 			 show_object_fn show,
 			 struct strbuf *base,
 			 const char *name,
-			 void *cb_data)
+			 void *cb_data,
+			 filter_object_fn filter,
+			 void *filter_data)
 {
 	struct object *obj = &tree->object;
 	struct tree_desc desc;
@@ -77,6 +86,7 @@ static void process_tree(struct rev_info *revs,
 	enum interesting match = revs->diffopt.pathspec.nr == 0 ?
 		all_entries_interesting: entry_not_interesting;
 	int baselen = base->len;
+	list_objects_filter_result r = LOFR_MARK_SEEN | LOFR_SHOW;
 
 	if (!revs->tree_objects)
 		return;
@@ -90,9 +100,13 @@ static void process_tree(struct rev_info *revs,
 		die("bad tree object %s", oid_to_hex(&obj->oid));
 	}
 
-	obj->flags |= SEEN;
 	strbuf_addstr(base, name);
-	show(obj, base->buf, cb_data);
+	if (filter)
+		r = filter(LOFT_BEGIN_TREE, obj, base->buf, &base->buf[baselen], filter_data);
+	if (r & LOFR_MARK_SEEN)
+		obj->flags |= SEEN;
+	if (r & LOFR_SHOW)
+		show(obj, base->buf, cb_data);
 	if (base->len)
 		strbuf_addch(base, '/');
 
@@ -112,7 +126,7 @@ static void process_tree(struct rev_info *revs,
 			process_tree(revs,
 				     lookup_tree(entry.oid),
 				     show, base, entry.path,
-				     cb_data);
+				     cb_data, filter, filter_data);
 		else if (S_ISGITLINK(entry.mode))
 			process_gitlink(revs, entry.oid->hash,
 					show, base, entry.path,
@@ -121,8 +135,17 @@ static void process_tree(struct rev_info *revs,
 			process_blob(revs,
 				     lookup_blob(entry.oid),
 				     show, base, entry.path,
-				     cb_data);
+				     cb_data, filter, filter_data);
 	}
+
+	if (filter) {
+		r = filter(LOFT_END_TREE, obj, base->buf, &base->buf[baselen], filter_data);
+		if (r & LOFR_MARK_SEEN)
+			obj->flags |= SEEN;
+		if (r & LOFR_SHOW)
+			show(obj, base->buf, cb_data);
+	}
+
 	strbuf_setlen(base, baselen);
 	free_tree_buffer(tree);
 }
@@ -183,10 +206,10 @@ static void add_pending_tree(struct rev_info *revs, struct tree *tree)
 	add_pending_object(revs, &tree->object, "");
 }
 
-void traverse_commit_list(struct rev_info *revs,
-			  show_commit_fn show_commit,
-			  show_object_fn show_object,
-			  void *data)
+void traverse_commit_list_worker(
+	struct rev_info *revs,
+	show_commit_fn show_commit, show_object_fn show_object, void *show_data,
+	filter_object_fn filter, void *filter_data)
 {
 	int i;
 	struct commit *commit;
@@ -200,7 +223,7 @@ void traverse_commit_list(struct rev_info *revs,
 		 */
 		if (commit->tree)
 			add_pending_tree(revs, commit->tree);
-		show_commit(commit, data);
+		show_commit(commit, show_data);
 	}
 	for (i = 0; i < revs->pending.nr; i++) {
 		struct object_array_entry *pending = revs->pending.objects + i;
@@ -211,19 +234,19 @@ void traverse_commit_list(struct rev_info *revs,
 			continue;
 		if (obj->type == OBJ_TAG) {
 			obj->flags |= SEEN;
-			show_object(obj, name, data);
+			show_object(obj, name, show_data);
 			continue;
 		}
 		if (!path)
 			path = "";
 		if (obj->type == OBJ_TREE) {
 			process_tree(revs, (struct tree *)obj, show_object,
-				     &base, path, data);
+				     &base, path, show_data, filter, filter_data);
 			continue;
 		}
 		if (obj->type == OBJ_BLOB) {
 			process_blob(revs, (struct blob *)obj, show_object,
-				     &base, path, data);
+				     &base, path, show_data, filter, filter_data);
 			continue;
 		}
 		die("unknown pending object %s (%s)",
@@ -232,3 +255,14 @@ void traverse_commit_list(struct rev_info *revs,
 	object_array_clear(&revs->pending);
 	strbuf_release(&base);
 }
+
+void traverse_commit_list(struct rev_info *revs,
+			  show_commit_fn show_commit,
+			  show_object_fn show_object,
+			  void *show_data)
+{
+	traverse_commit_list_worker(
+		revs,
+		show_commit, show_object, show_data,
+		NULL, NULL);
+}
diff --git a/list-objects.h b/list-objects.h
index 0cebf85..43a06fb 100644
--- a/list-objects.h
+++ b/list-objects.h
@@ -8,4 +8,34 @@ void traverse_commit_list(struct rev_info *, show_commit_fn, show_object_fn, voi
 typedef void (*show_edge_fn)(struct commit *);
 void mark_edges_uninteresting(struct rev_info *, show_edge_fn);
 
-#endif
+enum list_objects_filter_result {
+	LOFR_ZERO      = 0,
+	LOFR_MARK_SEEN = 1<<0,
+	LOFR_SHOW      = 1<<1,
+};
+
+/* See object.h and revision.h */
+#define FILTER_REVISIT (1<<25)
+
+enum list_objects_filter_type {
+	LOFT_BEGIN_TREE,
+	LOFT_END_TREE,
+	LOFT_BLOB
+};
+
+typedef enum list_objects_filter_result list_objects_filter_result;
+typedef enum list_objects_filter_type list_objects_filter_type;
+
+typedef list_objects_filter_result (*filter_object_fn)(
+	list_objects_filter_type filter_type,
+	struct object *obj,
+	const char *pathname,
+	const char *filename,
+	void *filter_data);
+
+void traverse_commit_list_worker(
+	struct rev_info *,
+	show_commit_fn, show_object_fn, void *show_data,
+	filter_object_fn filter, void *filter_data);
+
+#endif /* LIST_OBJECTS_H */
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 04/13] list-objects-filter-blobs-none: add filter to omit all blobs
  2017-10-24 18:53 [PATCH 00/13] WIP Partial clone part 1: object filtering Jeff Hostetler
                   ` (2 preceding siblings ...)
  2017-10-24 18:53 ` [PATCH 03/13] list-objects: filter objects in traverse_commit_list Jeff Hostetler
@ 2017-10-24 18:53 ` Jeff Hostetler
  2017-10-24 18:53 ` [PATCH 05/13] list-objects-filter-blobs-limit: add large blob filtering Jeff Hostetler
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Jeff Hostetler @ 2017-10-24 18:53 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, jonathantanmy, Jeff Hostetler

From: Jeff Hostetler <jeffhost@microsoft.com>

Create a simple filter for traverse_commit_list_worker() to omit
all blobs from the result.

This filter will be used in a future commit by rev-list and pack-objects
to create a "commits and trees" result.  This is intended for partial
clone and fetch support.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 Makefile                         |  1 +
 list-objects-filter-blobs-none.c | 83 ++++++++++++++++++++++++++++++++++++++++
 list-objects-filter-blobs-none.h | 18 +++++++++
 3 files changed, 102 insertions(+)
 create mode 100644 list-objects-filter-blobs-none.c
 create mode 100644 list-objects-filter-blobs-none.h

diff --git a/Makefile b/Makefile
index e59f12d..7e9d1f4 100644
--- a/Makefile
+++ b/Makefile
@@ -807,6 +807,7 @@ LIB_OBJS += levenshtein.o
 LIB_OBJS += line-log.o
 LIB_OBJS += line-range.o
 LIB_OBJS += list-objects.o
+LIB_OBJS += list-objects-filter-blobs-none.o
 LIB_OBJS += list-objects-filter-map.o
 LIB_OBJS += ll-merge.o
 LIB_OBJS += lockfile.o
diff --git a/list-objects-filter-blobs-none.c b/list-objects-filter-blobs-none.c
new file mode 100644
index 0000000..1b548b9
--- /dev/null
+++ b/list-objects-filter-blobs-none.c
@@ -0,0 +1,83 @@
+#include "cache.h"
+#include "dir.h"
+#include "tag.h"
+#include "commit.h"
+#include "tree.h"
+#include "blob.h"
+#include "diff.h"
+#include "tree-walk.h"
+#include "revision.h"
+#include "list-objects.h"
+#include "list-objects-filter-blobs-none.h"
+
+#define DEFAULT_MAP_SIZE (16*1024)
+
+/*
+ * A filter for list-objects to omit ALL blobs from the traversal.
+ * And to OPTIONALLY collect a list of the omitted OIDs.
+ */
+struct filter_blobs_none_data {
+	struct oidmap *omits;
+};
+
+static list_objects_filter_result filter_blobs_none(
+	list_objects_filter_type filter_type,
+	struct object *obj,
+	const char *pathname,
+	const char *filename,
+	void *filter_data_)
+{
+	struct filter_blobs_none_data *filter_data = filter_data_;
+
+	switch (filter_type) {
+	default:
+		die("unkown filter_type");
+		return LOFR_ZERO;
+
+	case LOFT_BEGIN_TREE:
+		assert(obj->type == OBJ_TREE);
+		/* always include all tree objects */
+		return LOFR_MARK_SEEN | LOFR_SHOW;
+
+	case LOFT_END_TREE:
+		assert(obj->type == OBJ_TREE);
+		return LOFR_ZERO;
+
+	case LOFT_BLOB:
+		assert(obj->type == OBJ_BLOB);
+		assert((obj->flags & SEEN) == 0);
+
+		if (filter_data->omits)
+			list_objects_filter_map_insert(
+				filter_data->omits, &obj->oid, pathname,
+				obj->type);
+
+		return LOFR_MARK_SEEN; /* but not LOFR_SHOW (hard omit) */
+	}
+}
+
+void traverse_commit_list__blobs_none(
+	struct rev_info *revs,
+	show_commit_fn show_commit,
+	show_object_fn show_object,
+	list_objects_filter_map_foreach_cb print_omitted_object,
+	void *ctx_data)
+{
+	struct filter_blobs_none_data d;
+
+	memset(&d, 0, sizeof(d));
+	if (print_omitted_object) {
+		d.omits = xcalloc(1, sizeof(*d.omits));
+		oidmap_init(d.omits, DEFAULT_MAP_SIZE);
+	}
+
+	traverse_commit_list_worker(revs, show_commit, show_object, ctx_data,
+				    filter_blobs_none, &d);
+
+	if (print_omitted_object) {
+		list_objects_filter_map_foreach(d.omits,
+						print_omitted_object,
+						ctx_data);
+		oidmap_free(d.omits, 1);
+	}
+}
diff --git a/list-objects-filter-blobs-none.h b/list-objects-filter-blobs-none.h
new file mode 100644
index 0000000..363c9de
--- /dev/null
+++ b/list-objects-filter-blobs-none.h
@@ -0,0 +1,18 @@
+#ifndef LIST_OBJECTS_FILTER_BLOBS_NONE_H
+#define LIST_OBJECTS_FILTER_BLOBS_NONE_H
+
+#include "list-objects-filter-map.h"
+
+/*
+ * A filter for list-objects to omit ALL blobs
+ * from the traversal.
+ */
+void traverse_commit_list__blobs_none(
+	struct rev_info *revs,
+	show_commit_fn show_commit,
+	show_object_fn show_object,
+	list_objects_filter_map_foreach_cb print_omitted_object,
+	void *ctx_data);
+
+#endif /* LIST_OBJECTS_FILTER_BLOBS_NONE_H */
+
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 05/13] list-objects-filter-blobs-limit: add large blob filtering
  2017-10-24 18:53 [PATCH 00/13] WIP Partial clone part 1: object filtering Jeff Hostetler
                   ` (3 preceding siblings ...)
  2017-10-24 18:53 ` [PATCH 04/13] list-objects-filter-blobs-none: add filter to omit all blobs Jeff Hostetler
@ 2017-10-24 18:53 ` Jeff Hostetler
  2017-10-24 18:53 ` [PATCH 06/13] list-objects-filter-sparse: add sparse filter Jeff Hostetler
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Jeff Hostetler @ 2017-10-24 18:53 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, jonathantanmy, Jeff Hostetler

From: Jeff Hostetler <jeffhost@microsoft.com>

Create a filter for traverse_commit_list_worker() to omit blobs
larger than a requested size from the result, but always include
".git*" special files.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 Makefile                          |   1 +
 list-objects-filter-blobs-limit.c | 146 ++++++++++++++++++++++++++++++++++++++
 list-objects-filter-blobs-limit.h |  18 +++++
 3 files changed, 165 insertions(+)
 create mode 100644 list-objects-filter-blobs-limit.c
 create mode 100644 list-objects-filter-blobs-limit.h

diff --git a/Makefile b/Makefile
index 7e9d1f4..0fdeabb 100644
--- a/Makefile
+++ b/Makefile
@@ -807,6 +807,7 @@ LIB_OBJS += levenshtein.o
 LIB_OBJS += line-log.o
 LIB_OBJS += line-range.o
 LIB_OBJS += list-objects.o
+LIB_OBJS += list-objects-filter-blobs-limit.o
 LIB_OBJS += list-objects-filter-blobs-none.o
 LIB_OBJS += list-objects-filter-map.o
 LIB_OBJS += ll-merge.o
diff --git a/list-objects-filter-blobs-limit.c b/list-objects-filter-blobs-limit.c
new file mode 100644
index 0000000..f68963d
--- /dev/null
+++ b/list-objects-filter-blobs-limit.c
@@ -0,0 +1,146 @@
+#include "cache.h"
+#include "dir.h"
+#include "tag.h"
+#include "commit.h"
+#include "tree.h"
+#include "blob.h"
+#include "diff.h"
+#include "tree-walk.h"
+#include "revision.h"
+#include "list-objects.h"
+#include "list-objects-filter-blobs-limit.h"
+
+#define DEFAULT_MAP_SIZE (16*1024)
+
+/*
+ * A filter for list-objects to omit large blobs,
+ * but always include ".git*" special files.
+ * And to OPTIONALLY collect a list of the omitted OIDs.
+ */
+struct filter_blobs_limit_data {
+	struct oidmap *omits;
+	unsigned long max_bytes;
+};
+
+static list_objects_filter_result filter_blobs_limit(
+	list_objects_filter_type filter_type,
+	struct object *obj,
+	const char *pathname,
+	const char *filename,
+	void *filter_data_)
+{
+	struct filter_blobs_limit_data *filter_data = filter_data_;
+	struct list_objects_filter_data_entry *entry;
+	unsigned long object_length;
+	enum object_type t;
+	int is_special_filename;
+
+	switch (filter_type) {
+	default:
+		die("unkown filter_type");
+		return LOFR_ZERO;
+
+	case LOFT_BEGIN_TREE:
+		assert(obj->type == OBJ_TREE);
+		/* always include all tree objects */
+		return LOFR_MARK_SEEN | LOFR_SHOW;
+
+	case LOFT_END_TREE:
+		assert(obj->type == OBJ_TREE);
+		return LOFR_ZERO;
+
+	case LOFT_BLOB:
+		assert(obj->type == OBJ_BLOB);
+		assert((obj->flags & SEEN) == 0);
+
+		is_special_filename = ((strncmp(filename, ".git", 4) == 0) &&
+				       filename[4]);
+
+		/*
+		 * If we are keeping a list of the omitted objects
+		 * for the caller *AND* we previously "provisionally"
+		 * omitted this object (because of size) *AND* it now
+		 * has a special filename, make it not-omitted.
+		 * Otherwise, continue to provisionally omit it.
+		 */
+		if (filter_data->omits &&
+		    oidmap_get(filter_data->omits, &obj->oid)) {
+			if (!is_special_filename)
+				return LOFR_ZERO;
+			entry = oidmap_remove(filter_data->omits, &obj->oid);
+			free(entry);
+			return LOFR_MARK_SEEN | LOFR_SHOW;
+		}
+
+		/*
+		 * If filename matches ".git*", always include it (regardless
+		 * of size).  (This may include blobs that we do not have
+		 * locally.)
+		 */
+		if (is_special_filename)
+			return LOFR_MARK_SEEN | LOFR_SHOW;
+
+		t = sha1_object_info(obj->oid.hash, &object_length);
+		if (t != OBJ_BLOB) { /* probably OBJ_NONE */
+			/*
+			 * We DO NOT have the blob locally, so we cannot
+			 * apply the size filter criteria.  Be conservative
+			 * and force show it (and let the caller deal with
+			 * the ambiguity).  (This matches the behavior above
+			 * when the special filename matches.)
+			 */
+			return LOFR_MARK_SEEN | LOFR_SHOW;
+		}
+
+		if (object_length < filter_data->max_bytes)
+			return LOFR_MARK_SEEN | LOFR_SHOW;
+
+		/*
+		 * Provisionally omit it.  We've already established
+		 * that this blob is too big and doesn't have a special
+		 * filename, so we *WANT* to omit it.  However, there
+		 * may be a special file elsewhere in the tree that
+		 * references this same blob, so we cannot reject it
+		 * just yet.  Leave the LOFR_ bits unset so that *IF*
+		 * the blob appears again in the traversal, we will
+		 * be asked again.
+		 *
+		 * If we are keeping a list of the ommitted objects,
+		 * provisionally add it to the list.
+		 */
+
+		if (filter_data->omits)
+			list_objects_filter_map_insert(filter_data->omits,
+						       &obj->oid, pathname,
+						       obj->type);
+
+		return LOFR_ZERO;
+	}
+}
+
+void traverse_commit_list__blobs_limit(
+	struct rev_info *revs,
+	show_commit_fn show_commit,
+	show_object_fn show_object,
+	list_objects_filter_map_foreach_cb print_omitted_object,
+	void *ctx_data,
+	unsigned long large_byte_limit)
+{
+	struct filter_blobs_limit_data d;
+
+	memset(&d, 0, sizeof(d));
+	if (print_omitted_object) {
+		d.omits = xcalloc(1, sizeof(*d.omits));
+		oidmap_init(d.omits, DEFAULT_MAP_SIZE);
+	}
+	d.max_bytes = large_byte_limit;
+
+	traverse_commit_list_worker(revs, show_commit, show_object, ctx_data,
+				    filter_blobs_limit, &d);
+
+	if (print_omitted_object) {
+		list_objects_filter_map_foreach(d.omits, print_omitted_object,
+						ctx_data);
+		oidmap_free(d.omits, 1);
+	}
+}
diff --git a/list-objects-filter-blobs-limit.h b/list-objects-filter-blobs-limit.h
new file mode 100644
index 0000000..ea05088
--- /dev/null
+++ b/list-objects-filter-blobs-limit.h
@@ -0,0 +1,18 @@
+#ifndef LIST_OBJECTS_FILTER_BLOBS_LIMIT_H
+#define LIST_OBJECTS_FILTER_BLOBS_LIMIT_H
+
+#include "list-objects-filter-map.h"
+
+/*
+ * A filter for list-objects to omit large blobs,
+ * but always include ".git*" special files.
+ */
+void traverse_commit_list__blobs_limit(
+	struct rev_info *revs,
+	show_commit_fn show_commit,
+	show_object_fn show_object,
+	list_objects_filter_map_foreach_cb print_omitted_object,
+	void *ctx_data,
+	unsigned long large_byte_limit);
+
+#endif /* LIST_OBJECTS_FILTER_BLOBS_LIMIT_H */
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 06/13] list-objects-filter-sparse: add sparse filter
  2017-10-24 18:53 [PATCH 00/13] WIP Partial clone part 1: object filtering Jeff Hostetler
                   ` (4 preceding siblings ...)
  2017-10-24 18:53 ` [PATCH 05/13] list-objects-filter-blobs-limit: add large blob filtering Jeff Hostetler
@ 2017-10-24 18:53 ` Jeff Hostetler
  2017-10-24 18:53 ` [PATCH 07/13] list-objects-filter-options: common argument parsing Jeff Hostetler
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Jeff Hostetler @ 2017-10-24 18:53 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, jonathantanmy, Jeff Hostetler

From: Jeff Hostetler <jeffhost@microsoft.com>

Create a filter for traverse_commit_list_worker() to only include
the blobs the would be referenced by a sparse-checkout using the
given specification.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 Makefile                     |   1 +
 list-objects-filter-sparse.c | 241 +++++++++++++++++++++++++++++++++++++++++++
 list-objects-filter-sparse.h |  30 ++++++
 3 files changed, 272 insertions(+)
 create mode 100644 list-objects-filter-sparse.c
 create mode 100644 list-objects-filter-sparse.h

diff --git a/Makefile b/Makefile
index 0fdeabb..fc82664 100644
--- a/Makefile
+++ b/Makefile
@@ -810,6 +810,7 @@ LIB_OBJS += list-objects.o
 LIB_OBJS += list-objects-filter-blobs-limit.o
 LIB_OBJS += list-objects-filter-blobs-none.o
 LIB_OBJS += list-objects-filter-map.o
+LIB_OBJS += list-objects-filter-sparse.o
 LIB_OBJS += ll-merge.o
 LIB_OBJS += lockfile.o
 LIB_OBJS += log-tree.o
diff --git a/list-objects-filter-sparse.c b/list-objects-filter-sparse.c
new file mode 100644
index 0000000..386b667
--- /dev/null
+++ b/list-objects-filter-sparse.c
@@ -0,0 +1,241 @@
+#include "cache.h"
+#include "dir.h"
+#include "tag.h"
+#include "commit.h"
+#include "tree.h"
+#include "blob.h"
+#include "diff.h"
+#include "tree-walk.h"
+#include "revision.h"
+#include "list-objects.h"
+#include "list-objects-filter-sparse.h"
+
+#define DEFAULT_MAP_SIZE (16*1024)
+
+/*
+ * A filter driven by a sparse-checkout specification to only
+ * include blobs that a sparse checkout would populate.
+ *
+ * The sparse-checkout spec can be loaded from a blob with the
+ * given OID or from a local pathname.  We allow an OID because
+ * the repo may be bare or we may be doing the filtering on the
+ * server.
+ */
+struct frame {
+	int defval;
+	int child_prov_omit : 1;
+};
+
+struct filter_use_sparse_data {
+	struct oidmap *omits;
+	struct exclude_list el;
+
+	size_t nr, alloc;
+	struct frame *array_frame;
+};
+
+static list_objects_filter_result filter_use_sparse(
+	list_objects_filter_type filter_type,
+	struct object *obj,
+	const char *pathname,
+	const char *filename,
+	void *filter_data_)
+{
+	struct filter_use_sparse_data *filter_data = filter_data_;
+	struct list_objects_filter_map_entry *entry_prev = NULL;
+	int val, dtype;
+	struct frame *frame;
+
+	switch (filter_type) {
+	default:
+		die("unkown filter_type");
+		return LOFR_ZERO;
+
+	case LOFT_BEGIN_TREE:
+		assert(obj->type == OBJ_TREE);
+		dtype = DT_DIR;
+		val = is_excluded_from_list(pathname, strlen(pathname),
+					    filename, &dtype, &filter_data->el,
+					    &the_index);
+		if (val < 0)
+			val = filter_data->array_frame[filter_data->nr].defval;
+
+		ALLOC_GROW(filter_data->array_frame, filter_data->nr + 1,
+			   filter_data->alloc);
+		filter_data->nr++;
+		filter_data->array_frame[filter_data->nr].defval = val;
+		filter_data->array_frame[filter_data->nr].child_prov_omit = 0;
+
+		/*
+		 * A directory with this tree OID may appear in multiple
+		 * places in the tree. (Think of a directory move, with
+		 * no other changes.)  And with a different pathname, the
+		 * is_excluded...() results for this directory and items
+		 * contained within it may be different.  So we cannot
+		 * mark it SEEN (yet), since that will prevent process_tree()
+		 * from revisiting this tree object with other pathnames.
+		 *
+		 * Only SHOW the tree object the first time we visit this
+		 * tree object.
+		 *
+		 * We always show all tree objects.  A future optimization
+		 * may want to attempt to narrow this.
+		 */
+		if (obj->flags & FILTER_REVISIT)
+			return LOFR_ZERO;
+		obj->flags |= FILTER_REVISIT;
+		return LOFR_SHOW;
+
+	case LOFT_END_TREE:
+		assert(obj->type == OBJ_TREE);
+		assert(filter_data->nr > 0);
+
+		frame = &filter_data->array_frame[filter_data->nr];
+		filter_data->nr--;
+
+		/*
+		 * Tell our parent directory if any of our children were
+		 * provisionally omitted.
+		 */
+		filter_data->array_frame[filter_data->nr].child_prov_omit |=
+			frame->child_prov_omit;
+
+		/*
+		 * If there are NO provisionally omitted child objects (ALL child
+		 * objects in this folder were INCLUDED), then we can mark the
+		 * folder as SEEN (so we will not have to revisit it again).
+		 */
+		if (!frame->child_prov_omit)
+			return LOFR_MARK_SEEN;
+		return LOFR_ZERO;
+
+	case LOFT_BLOB:
+		assert(obj->type == OBJ_BLOB);
+		assert((obj->flags & SEEN) == 0);
+
+		frame = &filter_data->array_frame[filter_data->nr];
+
+		/*
+		 * If we are keeping a list of the omitted objects
+		 * for the caller *AND* we previsously provisionally
+		 * omitted this object (because the THEN pathname
+		 * is excluded) *AND* it has the same pathname, we
+		 * can avoid duplicating the is_excluded lookup
+		 * costs and continue provisionally omitting it.
+		 */
+		if (filter_data->omits) {
+			entry_prev = oidmap_get(
+				filter_data->omits, &obj->oid);
+			if (entry_prev &&
+			    !strcmp(pathname, entry_prev->pathname)) {
+				frame->child_prov_omit = 1;
+				return LOFR_ZERO;
+			}
+		}
+
+		dtype = DT_REG;
+		val = is_excluded_from_list(pathname, strlen(pathname),
+					    filename, &dtype, &filter_data->el,
+					    &the_index);
+		if (val < 0)
+			val = frame->defval;
+		if (val > 0) {
+			if (entry_prev) {
+				entry_prev = oidmap_remove(filter_data->omits,
+							   &obj->oid);
+				free(entry_prev);
+			}
+			return LOFR_MARK_SEEN | LOFR_SHOW;
+		}
+
+		/*
+		 * Provisionally omit it.  We've already established that
+		 * this pathname is not in the sparse-checkout specification
+		 * with the CURRENT pathname, so we *WANT* to omit this blob.
+		 *
+		 * However, a pathname elsewhere in the tree may also
+		 * reference this same blob, so we cannot reject it yet.
+		 * Leave the LOFR_ bits unset so that if the blob appears
+		 * again in the traversal, we will be asked again.
+		 *
+		 * The pathname that we associate with this omit is just
+		 * the first one we saw for this blob.  Other instances of
+		 * this blob may have other pathnames and that is fine.
+		 * We just use it for perf to do the entry_prev lookup
+		 * above (because most of the time, the blob will be in
+		 * the same place as we walk the commits).
+		 */
+		if (filter_data->omits)
+			list_objects_filter_map_insert(filter_data->omits,
+						       &obj->oid, pathname,
+						       obj->type);
+
+		frame->child_prov_omit = 1;
+		return LOFR_ZERO;
+	}
+}
+
+static void do_sparse(
+	struct filter_use_sparse_data *d,
+	struct rev_info *revs,
+	show_commit_fn show_commit,
+	show_object_fn show_object,
+	list_objects_filter_map_foreach_cb print_omitted_object,
+	void *ctx_data)
+{
+	ALLOC_GROW(d->array_frame, d->nr + 1, d->alloc);
+	d->array_frame[d->nr].defval = 0; /* default to include */
+	d->array_frame[d->nr].child_prov_omit = 0;
+
+	traverse_commit_list_worker(revs, show_commit, show_object, ctx_data,
+				    filter_use_sparse, d);
+
+	if (print_omitted_object) {
+		list_objects_filter_map_foreach(d->omits, print_omitted_object, ctx_data);
+		oidmap_free(d->omits, 1);
+	}
+}
+
+void traverse_commit_list__sparse_oid(
+	struct rev_info *revs,
+	show_commit_fn show_commit,
+	show_object_fn show_object,
+	list_objects_filter_map_foreach_cb print_omitted_object,
+	void *ctx_data,
+	struct object_id *oid)
+{
+	struct filter_use_sparse_data d;
+
+	memset(&d, 0, sizeof(d));
+	if (print_omitted_object) {
+		d.omits = xcalloc(1, sizeof(*d.omits));
+		oidmap_init(d.omits, DEFAULT_MAP_SIZE);
+	}
+	if (add_excludes_from_blob_to_list(oid, NULL, 0, &d.el) < 0)
+		die("could not load filter specification");
+
+	do_sparse(&d, revs, show_commit, show_object, print_omitted_object,
+		  ctx_data);
+}
+
+void traverse_commit_list__sparse_path(
+	struct rev_info *revs,
+	show_commit_fn show_commit,
+	show_object_fn show_object,
+	list_objects_filter_map_foreach_cb print_omitted_object,
+	void *ctx_data,
+	const char *path)
+{
+	struct filter_use_sparse_data d;
+
+	memset(&d, 0, sizeof(d));
+	if (print_omitted_object) {
+		d.omits = xcalloc(1, sizeof(*d.omits));
+		oidmap_init(d.omits, DEFAULT_MAP_SIZE);
+	}
+	if (add_excludes_from_file_to_list(path, NULL, 0, &d.el, NULL) < 0)
+		die("could not load filter specification");
+
+	do_sparse(&d, revs, show_commit, show_object, print_omitted_object,
+		  ctx_data);
+}
diff --git a/list-objects-filter-sparse.h b/list-objects-filter-sparse.h
new file mode 100644
index 0000000..6c715bf
--- /dev/null
+++ b/list-objects-filter-sparse.h
@@ -0,0 +1,30 @@
+#ifndef LIST_OBJECTS_FILTERS_SPARSE_H
+#define LIST_OBJECTS_FILTERS_SPARSE_H
+
+#include "list-objects-filter-map.h"
+
+/*
+ * A filter driven by a sparse-checkout specification to only
+ * include blobs that a sparse checkout would populate.
+ *
+ * The sparse-checkout spec can be loaded from a blob with the
+ * given OID, a blob with a blob-ish path, or from a local pathname.
+ * We allow an OID because the repo may be bare or we may be doing
+ * the filtering on the server.
+ */
+void traverse_commit_list__sparse_oid(
+	struct rev_info *revs,
+	show_commit_fn show_commit,
+	show_object_fn show_object,
+	list_objects_filter_map_foreach_cb print_omitted_object,
+	void *ctx_data,
+	struct object_id *oid);
+void traverse_commit_list__sparse_path(
+	struct rev_info *revs,
+	show_commit_fn show_commit,
+	show_object_fn show_object,
+	list_objects_filter_map_foreach_cb print_omitted_object,
+	void *ctx_data,
+	const char *path);
+
+#endif /* LIST_OBJECTS_FILTERS_SPARSE_H */
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 07/13] list-objects-filter-options: common argument parsing
  2017-10-24 18:53 [PATCH 00/13] WIP Partial clone part 1: object filtering Jeff Hostetler
                   ` (5 preceding siblings ...)
  2017-10-24 18:53 ` [PATCH 06/13] list-objects-filter-sparse: add sparse filter Jeff Hostetler
@ 2017-10-24 18:53 ` Jeff Hostetler
  2017-10-25  4:14   ` Jonathan Tan
  2017-10-24 18:53 ` [PATCH 08/13] list-objects: add traverse_commit_list_filtered method Jeff Hostetler
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 37+ messages in thread
From: Jeff Hostetler @ 2017-10-24 18:53 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, jonathantanmy, Jeff Hostetler

From: Jeff Hostetler <jeffhost@microsoft.com>

Create common routines and defines for parsing
list-objects-filter-related command line arguments and
pack-protocol fields.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 Makefile                      |   1 +
 list-objects-filter-options.c | 101 ++++++++++++++++++++++++++++++++++++++++++
 list-objects-filter-options.h |  50 +++++++++++++++++++++
 3 files changed, 152 insertions(+)
 create mode 100644 list-objects-filter-options.c
 create mode 100644 list-objects-filter-options.h

diff --git a/Makefile b/Makefile
index fc82664..b9ff0b4 100644
--- a/Makefile
+++ b/Makefile
@@ -810,6 +810,7 @@ LIB_OBJS += list-objects.o
 LIB_OBJS += list-objects-filter-blobs-limit.o
 LIB_OBJS += list-objects-filter-blobs-none.o
 LIB_OBJS += list-objects-filter-map.o
+LIB_OBJS += list-objects-filter-options.o
 LIB_OBJS += list-objects-filter-sparse.o
 LIB_OBJS += ll-merge.o
 LIB_OBJS += lockfile.o
diff --git a/list-objects-filter-options.c b/list-objects-filter-options.c
new file mode 100644
index 0000000..40f48ac
--- /dev/null
+++ b/list-objects-filter-options.c
@@ -0,0 +1,101 @@
+#include "cache.h"
+#include "commit.h"
+#include "config.h"
+#include "revision.h"
+#include "list-objects.h"
+#include "list-objects-filter-options.h"
+
+/*
+ * Parse value of the argument to the "filter" keword.
+ * On the command line this looks like: --filter=<arg>
+ * and in the pack protocol as: filter <arg>
+ *
+ * <arg> ::= blob:none
+ *           blob:limit:<n>[kmg]
+ *           sparse:oid:<oid-expression>
+ *           sparse:path:<pathname>
+ */
+int parse_list_objects_filter(struct list_objects_filter_options *filter_options,
+			      const char *arg)
+{
+	struct object_context oc;
+	struct object_id sparse_oid;
+	const char *v0;
+	const char *v1;
+
+	if (filter_options->choice)
+		die(_("multiple object filter types cannot be combined"));
+
+	/*
+	 * TODO consider rejecting 'arg' if it contains any
+	 * TODO injection characters (since we might send this
+	 * TODO to a sub-command or to the server and we don't
+	 * TODO want to deal with legacy quoting/escaping for
+	 * TODO a new feature).
+	 */
+
+	filter_options->raw_value = strdup(arg);
+
+	if (skip_prefix(arg, "blob:", &v0) || skip_prefix(arg, "blobs:", &v0)) {
+		if (!strcmp(v0, "none")) {
+			filter_options->choice = LOFC_BLOB_NONE;
+			return 0;
+		}
+
+		if (skip_prefix(v0, "limit=", &v1) &&
+		    git_parse_ulong(v1, &filter_options->blob_limit_value)) {
+			filter_options->choice = LOFC_BLOB_LIMIT;
+			return 0;
+		}
+	}
+	else if (skip_prefix(arg, "sparse:", &v0)) {
+		if (skip_prefix(v0, "oid=", &v1)) {
+			filter_options->choice = LOFC_SPARSE_OID;
+			if (!get_oid_with_context(v1, GET_OID_BLOB,
+						  &sparse_oid, &oc)) {
+				/*
+				 * We successfully converted the <oid-expr>
+				 * into an actual OID.  Rewrite the raw_value
+				 * in canonoical form with just the OID.
+				 * (If we send this request to the server, we
+				 * want an absolute expression rather than a
+				 * local-ref-relative expression.)
+				 */
+				free((char *)filter_options->raw_value);
+				filter_options->raw_value =
+					xstrfmt("sparse:oid=%s",
+						oid_to_hex(&sparse_oid));
+				filter_options->sparse_oid_value =
+					oiddup(&sparse_oid);
+			} else {
+				/*
+				 * We could not turn the <oid-expr> into an
+				 * OID.  Leave the raw_value as is in case
+				 * the server can parse it.  (It may refer to
+				 * a branch, commit, or blob we don't have.)
+				 */
+			}
+			return 0;
+		}
+
+		if (skip_prefix(v0, "path=", &v1)) {
+			filter_options->choice = LOFC_SPARSE_PATH;
+			filter_options->sparse_path_value = strdup(v1);
+			return 0;
+		}
+	}
+
+	die(_("invalid filter expression '%s'"), arg);
+	return 0;
+}
+
+int opt_parse_list_objects_filter(const struct option *opt,
+				  const char *arg, int unset)
+{
+	struct list_objects_filter_options *filter_options = opt->value;
+
+	assert(arg);
+	assert(!unset);
+
+	return parse_list_objects_filter(filter_options, arg);
+}
diff --git a/list-objects-filter-options.h b/list-objects-filter-options.h
new file mode 100644
index 0000000..23bd68e
--- /dev/null
+++ b/list-objects-filter-options.h
@@ -0,0 +1,50 @@
+#ifndef LIST_OBJECTS_FILTER_OPTIONS_H
+#define LIST_OBJECTS_FILTER_OPTIONS_H
+
+#include "parse-options.h"
+
+/*
+ * Common declarations and utilities for filtering objects (such as omitting
+ * large blobs) in list_objects:traverse_commit_list() and git-rev-list.
+ */
+
+enum list_objects_filter_choice {
+	LOFC_DISABLED = 0,
+	LOFC_BLOB_NONE,
+	LOFC_BLOB_LIMIT,
+	LOFC_SPARSE_OID,
+	LOFC_SPARSE_PATH,
+};
+
+struct list_objects_filter_options {
+	/*
+	 * The raw argument value given on the command line or
+	 * protocol request.  (The part after the "--keyword=".)
+	 */
+	char *raw_value;
+
+	/*
+	 * Parsed values. Only 1 will be set depending on the flags below.
+	 */
+	struct object_id *sparse_oid_value;
+	char *sparse_path_value;
+	unsigned long blob_limit_value;
+
+	enum list_objects_filter_choice choice;
+};
+
+/* Normalized command line arguments */
+#define CL_ARG__FILTER "filter"
+
+int parse_list_objects_filter(struct list_objects_filter_options *filter_options,
+			      const char *arg);
+
+int opt_parse_list_objects_filter(const struct option *opt,
+				  const char *arg, int unset);
+
+#define OPT_PARSE_LIST_OBJECTS_FILTER(fo) \
+	{ OPTION_CALLBACK, 0, CL_ARG__FILTER, fo, N_("args"), \
+	  N_("object filtering"), PARSE_OPT_NONEG, \
+	  opt_parse_list_objects_filter }
+
+#endif /* LIST_OBJECTS_FILTER_OPTIONS_H */
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 08/13] list-objects: add traverse_commit_list_filtered method
  2017-10-24 18:53 [PATCH 00/13] WIP Partial clone part 1: object filtering Jeff Hostetler
                   ` (6 preceding siblings ...)
  2017-10-24 18:53 ` [PATCH 07/13] list-objects-filter-options: common argument parsing Jeff Hostetler
@ 2017-10-24 18:53 ` Jeff Hostetler
  2017-10-25  4:24   ` Jonathan Tan
  2017-10-24 18:53 ` [PATCH 09/13] extension.partialclone: introduce partial clone extension Jeff Hostetler
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 37+ messages in thread
From: Jeff Hostetler @ 2017-10-24 18:53 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, jonathantanmy, Jeff Hostetler

From: Jeff Hostetler <jeffhost@microsoft.com>

Add traverse_commit_list_filtered() wrapper around the various
filter methods using common data in object_filter_options.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 list-objects.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
 list-objects.h | 11 +++++++++++
 2 files changed, 56 insertions(+)

diff --git a/list-objects.c b/list-objects.c
index 3e86008..4ce2593 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -7,6 +7,9 @@
 #include "tree-walk.h"
 #include "revision.h"
 #include "list-objects.h"
+#include "list-objects-filter-blobs-none.h"
+#include "list-objects-filter-blobs-limit.h"
+#include "list-objects-filter-sparse.h"
 
 static void process_blob(struct rev_info *revs,
 			 struct blob *blob,
@@ -266,3 +269,45 @@ void traverse_commit_list(struct rev_info *revs,
 		show_commit, show_object, show_data,
 		NULL, NULL);
 }
+
+void traverse_commit_list_filtered(
+	struct list_objects_filter_options *filter_options,
+	struct rev_info *revs,
+	show_commit_fn show_commit,
+	show_object_fn show_object,
+	list_objects_filter_map_foreach_cb print_omitted_object,
+	void *show_data)
+{
+	switch (filter_options->choice) {
+	case LOFC_DISABLED:
+		traverse_commit_list(revs, show_commit, show_object, show_data);
+		return;
+
+	case LOFC_BLOB_NONE:
+		traverse_commit_list__blobs_none(
+			revs, show_commit, show_object, print_omitted_object,
+			show_data);
+		return;
+
+	case LOFC_BLOB_LIMIT:
+		traverse_commit_list__blobs_limit(
+			revs, show_commit, show_object, print_omitted_object,
+			show_data, filter_options->blob_limit_value);
+		return;
+
+	case LOFC_SPARSE_OID:
+		traverse_commit_list__sparse_oid(
+			revs, show_commit, show_object, print_omitted_object,
+			show_data, filter_options->sparse_oid_value);
+		return;
+
+	case LOFC_SPARSE_PATH:
+		traverse_commit_list__sparse_path(
+			revs, show_commit, show_object, print_omitted_object,
+			show_data, filter_options->sparse_path_value);
+		return;
+
+	default:
+		die("unspecified list-objects filter");
+	}
+}
diff --git a/list-objects.h b/list-objects.h
index 43a06fb..d14b0e0 100644
--- a/list-objects.h
+++ b/list-objects.h
@@ -1,6 +1,9 @@
 #ifndef LIST_OBJECTS_H
 #define LIST_OBJECTS_H
 
+#include "list-objects-filter-map.h"
+#include "list-objects-filter-options.h"
+
 typedef void (*show_commit_fn)(struct commit *, void *);
 typedef void (*show_object_fn)(struct object *, const char *, void *);
 void traverse_commit_list(struct rev_info *, show_commit_fn, show_object_fn, void *);
@@ -38,4 +41,12 @@ void traverse_commit_list_worker(
 	show_commit_fn, show_object_fn, void *show_data,
 	filter_object_fn filter, void *filter_data);
 
+void traverse_commit_list_filtered(
+	struct list_objects_filter_options *filter_options,
+	struct rev_info *revs,
+	show_commit_fn show_commit,
+	show_object_fn show_object,
+	list_objects_filter_map_foreach_cb print_omitted_object,
+	void *show_data);
+
 #endif /* LIST_OBJECTS_H */
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 09/13] extension.partialclone: introduce partial clone extension
  2017-10-24 18:53 [PATCH 00/13] WIP Partial clone part 1: object filtering Jeff Hostetler
                   ` (7 preceding siblings ...)
  2017-10-24 18:53 ` [PATCH 08/13] list-objects: add traverse_commit_list_filtered method Jeff Hostetler
@ 2017-10-24 18:53 ` Jeff Hostetler
  2017-10-24 18:53 ` [PATCH 10/13] rev-list: add list-objects filtering support Jeff Hostetler
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Jeff Hostetler @ 2017-10-24 18:53 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, jonathantanmy, Jeff Hostetler

From: Jeff Hostetler <jeffhost@microsoft.com>

Introduce the ability to have missing objects in a repo.  This
functionality is guarded by new repository extension options:
    `extensions.partialcloneremote` and
    `extensions.partialclonefilter`.

See the update to Documentation/technical/repository-version.txt
in this patch for more information.

This patch is part of a patch originally authored by:
Jonathan Tan <jonathantanmy@google.com>

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 Documentation/technical/repository-version.txt | 22 ++++++
 Makefile                                       |  1 +
 cache.h                                        |  4 ++
 config.h                                       |  3 +
 environment.c                                  |  2 +
 partial-clone-utils.c                          | 99 ++++++++++++++++++++++++++
 partial-clone-utils.h                          | 34 +++++++++
 setup.c                                        | 15 ++++
 8 files changed, 180 insertions(+)
 create mode 100644 partial-clone-utils.c
 create mode 100644 partial-clone-utils.h

diff --git a/Documentation/technical/repository-version.txt b/Documentation/technical/repository-version.txt
index 00ad379..9d488db 100644
--- a/Documentation/technical/repository-version.txt
+++ b/Documentation/technical/repository-version.txt
@@ -86,3 +86,25 @@ for testing format-1 compatibility.
 When the config key `extensions.preciousObjects` is set to `true`,
 objects in the repository MUST NOT be deleted (e.g., by `git-prune` or
 `git repack -d`).
+
+`partialcloneremote`
+~~~~~~~~~~~~~~~~~~~~
+
+When the config key `extensions.partialcloneremote` is set, it indicates
+that the repo was created with a partial clone (or later performed
+a partial fetch) and that the remote may have omitted sending
+certain unwanted objects.  Such a remote is called a "promisor remote"
+and it promises that all such omitted objects can be fetched from it
+in the future.
+
+The value of this key is the name of the promisor remote.
+
+`partialclonefilter`
+~~~~~~~~~~~~~~~~~~~~
+
+When the config key `extensions.partialclonefilter` is set, it gives
+the initial filter expression used to create the partial clone.
+This value becomed the default filter expression for subsequent
+fetches (called "partial fetches") from the promisor remote.  This
+value may also be set by the first explicit partial fetch following a
+normal clone.
diff --git a/Makefile b/Makefile
index b9ff0b4..38632fb 100644
--- a/Makefile
+++ b/Makefile
@@ -841,6 +841,7 @@ LIB_OBJS += pack-write.o
 LIB_OBJS += pager.o
 LIB_OBJS += parse-options.o
 LIB_OBJS += parse-options-cb.o
+LIB_OBJS += partial-clone-utils.o
 LIB_OBJS += patch-delta.o
 LIB_OBJS += patch-ids.o
 LIB_OBJS += path.o
diff --git a/cache.h b/cache.h
index 6440e2b..4b785c0 100644
--- a/cache.h
+++ b/cache.h
@@ -860,12 +860,16 @@ extern int grafts_replace_parents;
 #define GIT_REPO_VERSION 0
 #define GIT_REPO_VERSION_READ 1
 extern int repository_format_precious_objects;
+extern char *repository_format_partial_clone_remote;
+extern char *repository_format_partial_clone_filter;
 
 struct repository_format {
 	int version;
 	int precious_objects;
 	int is_bare;
 	char *work_tree;
+	char *partial_clone_remote; /* value of extensions.partialcloneremote */
+	char *partial_clone_filter; /* value of extensions.partialclonefilter */
 	struct string_list unknown_extensions;
 };
 
diff --git a/config.h b/config.h
index a49d264..90544ef 100644
--- a/config.h
+++ b/config.h
@@ -34,6 +34,9 @@ struct config_options {
 	const char *git_dir;
 };
 
+#define KEY_PARTIALCLONEREMOTE "partialcloneremote"
+#define KEY_PARTIALCLONEFILTER "partialclonefilter"
+
 typedef int (*config_fn_t)(const char *, const char *, void *);
 extern int git_default_config(const char *, const char *, void *);
 extern int git_config_from_file(config_fn_t fn, const char *, void *);
diff --git a/environment.c b/environment.c
index 8289c25..2fcf9bb 100644
--- a/environment.c
+++ b/environment.c
@@ -27,6 +27,8 @@ int warn_ambiguous_refs = 1;
 int warn_on_object_refname_ambiguity = 1;
 int ref_paranoia = -1;
 int repository_format_precious_objects;
+char *repository_format_partial_clone_remote;
+char *repository_format_partial_clone_filter;
 const char *git_commit_encoding;
 const char *git_log_output_encoding;
 const char *apply_default_whitespace;
diff --git a/partial-clone-utils.c b/partial-clone-utils.c
new file mode 100644
index 0000000..8c925ae
--- /dev/null
+++ b/partial-clone-utils.c
@@ -0,0 +1,99 @@
+#include "cache.h"
+#include "config.h"
+#include "partial-clone-utils.h"
+
+int is_partial_clone_registered(void)
+{
+	if (repository_format_partial_clone_remote ||
+	    repository_format_partial_clone_filter)
+		return 1;
+
+	return 0;
+}
+
+void partial_clone_utils_register(
+	const struct list_objects_filter_options *filter_options,
+	const char *remote,
+	const char *cmd_name)
+{
+	struct strbuf buf = STRBUF_INIT;
+
+	if (is_partial_clone_registered()) {
+		/*
+		 * The original partial-clone or a previous partial-fetch
+		 * already registered the partial-clone settings.
+		 * If we get here, we are in a subsequent partial-* command
+		 * (with explicit filter args on the command line).
+		 *
+		 * For now, we restrict subsequent commands to one
+		 * consistent with the original request.  We may relax
+		 * this later after we get more experience with the
+		 * partial-clone feature.
+		 *
+		 * [] Restrict to same remote because our dynamic
+		 *    object loading only knows how to fetch objects
+		 *    from 1 remote.
+		 */
+		assert(filter_options && filter_options->choice);
+		assert(remote && *remote);
+
+		if (strcmp(remote, repository_format_partial_clone_remote))
+			die("%s --%s currently limited to remote '%s'",
+			    cmd_name, CL_ARG__FILTER,
+			    repository_format_partial_clone_remote);
+
+		/*
+		 * Treat the (possibly new) filter-spec as transient;
+		 * use it for the current command, but do not overwrite
+		 * the default.
+		 */
+		return;
+	}
+
+	repository_format_partial_clone_remote = xstrdup(remote);
+	repository_format_partial_clone_filter = xstrdup(filter_options->raw_value);
+
+	/*
+	 * Force repo version > 0 to enable extensions namespace.
+	 */
+	git_config_set("core.repositoryformatversion", "1");
+
+	/*
+	 * Use the "extensions" namespace in the config to record
+	 * the name of the remote used in the partial clone.
+	 * This will help us return to that server when we need
+	 * to backfill missing objects.
+	 *
+	 * It is also used to indicate that there *MAY* be
+	 * missing objects so that subsequent commands don't
+	 * immediately die if they hit one.
+	 *
+	 * Also remember the initial filter settings used by
+	 * clone as a default for future fetches.
+	 */
+	git_config_set("extensions." KEY_PARTIALCLONEREMOTE,
+		       repository_format_partial_clone_remote);
+	git_config_set("extensions." KEY_PARTIALCLONEFILTER,
+		       repository_format_partial_clone_filter);
+
+	/*
+	 * TODO Do we need to record both partial-clone
+	 * parameters in the extensions namespace and in the
+	 * section for the remote?
+	 *
+	 * Or should we just remember 1 in each, as in:
+	 *     "extension.partialcloneremote=<remote>"
+	 *     "remote.<remote>.filter=<filter-spec>"
+	 * The issue is when can we set both of the
+	 * repository_format_partial_clone_* globals
+	 * durint subsequent startups.
+	 * See setup.c:check_repo_format().
+	 */
+	strbuf_addf(&buf, "remote.%s.%s", remote, KEY_PARTIALCLONEREMOTE);
+	git_config_set(buf.buf, repository_format_partial_clone_remote);
+
+	strbuf_addf(&buf, "remote.%s.%s", remote, KEY_PARTIALCLONEFILTER);
+	git_config_set(buf.buf, repository_format_partial_clone_filter);
+
+	strbuf_release(&buf);
+}
diff --git a/partial-clone-utils.h b/partial-clone-utils.h
new file mode 100644
index 0000000..b527570
--- /dev/null
+++ b/partial-clone-utils.h
@@ -0,0 +1,34 @@
+#ifndef PARTIAL_CLONE_UTILS_H
+#define PARTIAL_CLONE_UTILS_H
+
+#include "list-objects-filter-options.h"
+
+/*
+ * Register that partial-clone was used to create the repo and
+ * update the config on disk.
+ *
+ * If nothing else, this indicates that the ODB may have missing
+ * objects and that various commands should handle that gracefully.
+ *
+ * Record the remote used for the clone so that we know where
+ * to get missing objects in the future.
+ *
+ * Also record the filter expression so that we know something
+ * about the missing objects (e.g., size-limit vs sparse).
+ *
+ * May also be used by a partial-fetch following a normal clone
+ * to turn on the above tracking.
+ */ 
+extern void partial_clone_utils_register(
+	const struct list_objects_filter_options *filter_options,
+	const char *remote,
+	const char *cmd_name);
+
+/*
+ * Return 1 if partial-clone was used to create the repo
+ * or a subsequent partial-fetch was used.  This is an
+ * indicator that there may be missing objects.
+ */
+extern int is_partial_clone_registered(void);
+
+#endif /* PARTIAL_CLONE_UTILS_H */
diff --git a/setup.c b/setup.c
index 03f51e0..bc4133d 100644
--- a/setup.c
+++ b/setup.c
@@ -420,6 +420,19 @@ static int check_repo_format(const char *var, const char *value, void *vdata)
 			;
 		else if (!strcmp(ext, "preciousobjects"))
 			data->precious_objects = git_config_bool(var, value);
+
+		else if (!strcmp(ext, KEY_PARTIALCLONEREMOTE))
+			if (!value)
+				return config_error_nonbool(var);
+			else
+				data->partial_clone_remote = xstrdup(value);
+
+		else if (!strcmp(ext, KEY_PARTIALCLONEFILTER))
+			if (!value)
+				return config_error_nonbool(var);
+			else
+				data->partial_clone_filter = xstrdup(value);
+
 		else
 			string_list_append(&data->unknown_extensions, ext);
 	} else if (strcmp(var, "core.bare") == 0) {
@@ -463,6 +476,8 @@ static int check_repository_format_gently(const char *gitdir, int *nongit_ok)
 	}
 
 	repository_format_precious_objects = candidate.precious_objects;
+	repository_format_partial_clone_remote = candidate.partial_clone_remote;
+	repository_format_partial_clone_filter = candidate.partial_clone_filter;
 	string_list_clear(&candidate.unknown_extensions, 0);
 	if (!has_common) {
 		if (candidate.is_bare != -1) {
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 10/13] rev-list: add list-objects filtering support
  2017-10-24 18:53 [PATCH 00/13] WIP Partial clone part 1: object filtering Jeff Hostetler
                   ` (8 preceding siblings ...)
  2017-10-24 18:53 ` [PATCH 09/13] extension.partialclone: introduce partial clone extension Jeff Hostetler
@ 2017-10-24 18:53 ` Jeff Hostetler
  2017-10-25  4:41   ` Jonathan Tan
  2017-10-24 18:53 ` [PATCH 11/13] t6112: rev-list object filtering test Jeff Hostetler
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 37+ messages in thread
From: Jeff Hostetler @ 2017-10-24 18:53 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, jonathantanmy, Jeff Hostetler

From: Jeff Hostetler <jeffhost@microsoft.com>

Teach rev-list to use the filtering provided by the
traverse_commit_list_filtered() interface to omit
unwanted objects from the result.

This feature is only enabled when one of the "--objects*"
options are used.

Furthermore, when the "--filter-print-omitted" option is
used, the omitted objects are printed at the end.  These
are marked with a "~".  This option can be combined with
"--quiet" to get a list of just the omitted objects.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 Documentation/git-rev-list.txt     |  5 ++-
 Documentation/rev-list-options.txt | 30 ++++++++++++++
 builtin/rev-list.c                 | 84 +++++++++++++++++++++++++++++++++++++-
 3 files changed, 116 insertions(+), 3 deletions(-)

diff --git a/Documentation/git-rev-list.txt b/Documentation/git-rev-list.txt
index ef22f17..6d2e60d 100644
--- a/Documentation/git-rev-list.txt
+++ b/Documentation/git-rev-list.txt
@@ -47,7 +47,10 @@ SYNOPSIS
 	     [ --fixed-strings | -F ]
 	     [ --date=<format>]
 	     [ [ --objects | --objects-edge | --objects-edge-aggressive ]
-	       [ --unpacked ] ]
+	       [ --unpacked ]
+	       [ --filter=<filter-spec> ] ]
+	     [ --filter-print-missing ]
+	     [ --filter-print-omitted ]
 	     [ --pretty | --header ]
 	     [ --bisect ]
 	     [ --bisect-vars ]
diff --git a/Documentation/rev-list-options.txt b/Documentation/rev-list-options.txt
index 7d860bf..88f8878 100644
--- a/Documentation/rev-list-options.txt
+++ b/Documentation/rev-list-options.txt
@@ -706,6 +706,36 @@ ifdef::git-rev-list[]
 --unpacked::
 	Only useful with `--objects`; print the object IDs that are not
 	in packs.
+
+--filter=<filter-spec>::
+	Only useful with one of the `--objects*`; omits objects (usually
+	blobs) from the list of printed objects.  The '<filter-spec>'
+	may be one of the following:
++
+The form '--filter=blob:none' omits all blobs.
++
+The form '--filter=blob:limit=<n>[kmg]' omits blobs larger than n bytes
+or units.  The value may be zero.  Special files matching '.git*' are
+alwayse included, regardless of size.
++
+The form '--filter=sparse:oid=<oid-ish>' uses a sparse-checkout
+specification contained in the object (or the object that the expression
+evaluates to) to omit blobs not required by the corresponding sparse
+checkout.
++
+The form '--filter=sparse:path=<path>' similarly uses a sparse-checkout
+specification contained in <path>.
+
+--filter-print-missing::
+	Prints a list of the missing objects for the requested traversal.
+	Object IDs are prefixed with a ``?'' character.  The object type
+	is printed after the ID.  This may be used with or without any of
+	the above filtering options.
+
+--filter-print-omitted::
+	Only useful with one of the above `--filter*`; prints a list
+	of the omitted objects.  Object IDs are prefixed with a ``~''
+	character.
 endif::git-rev-list[]
 
 --no-walk[=(sorted|unsorted)]::
diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index c1c74d4..7a0353f 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -12,6 +12,7 @@
 #include "bisect.h"
 #include "progress.h"
 #include "reflog-walk.h"
+#include "partial-clone-utils.h"
 
 static const char rev_list_usage[] =
 "git rev-list [OPTION] <commit-id>... [ -- paths... ]\n"
@@ -54,6 +55,11 @@ static const char rev_list_usage[] =
 
 static struct progress *progress;
 static unsigned progress_counter;
+static struct list_objects_filter_options filter_options;
+static struct oidmap missing_objects;
+static int arg_print_missing;
+static int arg_print_omitted;
+#define DEFAULT_MAP_SIZE (16*1024)
 
 static void finish_commit(struct commit *commit, void *data);
 static void show_commit(struct commit *commit, void *data)
@@ -181,8 +187,26 @@ static void finish_commit(struct commit *commit, void *data)
 static void finish_object(struct object *obj, const char *name, void *cb_data)
 {
 	struct rev_list_info *info = cb_data;
-	if (obj->type == OBJ_BLOB && !has_object_file(&obj->oid))
+	if (obj->type == OBJ_BLOB && !has_object_file(&obj->oid)) {
+		if (arg_print_missing) {
+			list_objects_filter_map_insert(
+				&missing_objects, &obj->oid, name, obj->type);
+			return;
+		}
+
+		/*
+		 * Relax consistency checks when we expect missing
+		 * objects because of partial-clone or a previous
+		 * partial-fetch.
+		 *
+		 * Note that this is independent of any filtering that
+		 * we are doing in this run.
+		 */
+		if (is_partial_clone_registered())
+			return;
+
 		die("missing blob object '%s'", oid_to_hex(&obj->oid));
+	}
 	if (info->revs->verify_objects && !obj->parsed && obj->type != OBJ_COMMIT)
 		parse_object(&obj->oid);
 }
@@ -202,6 +226,22 @@ static void show_edge(struct commit *commit)
 	printf("-%s\n", oid_to_hex(&commit->object.oid));
 }
 
+static void print_omitted_object(int i, int i_limit, struct list_objects_filter_map_entry *e, void *cb_data)
+{
+	/* struct rev_list_info *info = cb_data; */
+	const char *tn = typename(e->type);
+
+	printf("~%s %s\n", oid_to_hex(&e->entry.oid), tn);
+}
+
+static void print_missing_object(int i, int i_limit, struct list_objects_filter_map_entry *e, void *cb_data)
+{
+	/* struct rev_list_info *info = cb_data; */
+	const char *tn = typename(e->type);
+
+	printf("?%s %s\n", oid_to_hex(&e->entry.oid), tn);
+}
+
 static void print_var_str(const char *var, const char *val)
 {
 	printf("%s='%s'\n", var, val);
@@ -335,6 +375,26 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix)
 			show_progress = arg;
 			continue;
 		}
+
+		if (skip_prefix(arg, ("--" CL_ARG__FILTER "="), &arg)) {
+			parse_list_objects_filter(&filter_options, arg);
+			if (filter_options.choice && !revs.blob_objects)
+				die(_("object filtering requires --objects"));
+			if (filter_options.choice == LOFC_SPARSE_OID &&
+			    !filter_options.sparse_oid_value)
+				die(_("invalid sparse value '%s'"),
+				    filter_options.raw_value);
+			continue;
+		}
+		if (!strcmp(arg, "--filter-print-missing")) {
+			arg_print_missing = 1;
+			continue;
+		}
+		if (!strcmp(arg, "--filter-print-omitted")) {
+			arg_print_omitted = 1;
+			continue;
+		}
+		
 		usage(rev_list_usage);
 
 	}
@@ -360,6 +420,9 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix)
 	if (revs.show_notes)
 		die(_("rev-list does not support display of notes"));
 
+	if (filter_options.choice && use_bitmap_index)
+		die(_("cannot combine --use-bitmap-index with object filtering"));
+
 	save_commit_buffer = (revs.verbose_header ||
 			      revs.grep_filter.pattern_list ||
 			      revs.grep_filter.header_list);
@@ -404,7 +467,24 @@ int cmd_rev_list(int argc, const char **argv, const char *prefix)
 			return show_bisect_vars(&info, reaches, all);
 	}
 
-	traverse_commit_list(&revs, show_commit, show_object, &info);
+	if (arg_print_missing) {
+		memset(&missing_objects, 0, sizeof(missing_objects));
+		oidmap_init(&missing_objects, DEFAULT_MAP_SIZE);
+	}
+
+	if (filter_options.choice)
+		traverse_commit_list_filtered(&filter_options, &revs,
+			show_commit, show_object,
+			(arg_print_omitted ? print_omitted_object : NULL),
+			&info);
+	else
+		traverse_commit_list(&revs, show_commit, show_object, &info);
+
+	if (arg_print_missing) {
+		list_objects_filter_map_foreach(&missing_objects,
+						print_missing_object, &info);
+		oidmap_free(&missing_objects, 1);
+	}
 
 	stop_progress(&progress);
 
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 11/13] t6112: rev-list object filtering test
  2017-10-24 18:53 [PATCH 00/13] WIP Partial clone part 1: object filtering Jeff Hostetler
                   ` (9 preceding siblings ...)
  2017-10-24 18:53 ` [PATCH 10/13] rev-list: add list-objects filtering support Jeff Hostetler
@ 2017-10-24 18:53 ` Jeff Hostetler
  2017-10-24 18:53 ` [PATCH 12/13] pack-objects: add list-objects filtering Jeff Hostetler
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Jeff Hostetler @ 2017-10-24 18:53 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, jonathantanmy, Jeff Hostetler

From: Jeff Hostetler <jeffhost@microsoft.com>

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 t/t6112-rev-list-filters-objects.sh | 223 ++++++++++++++++++++++++++++++++++++
 1 file changed, 223 insertions(+)
 create mode 100755 t/t6112-rev-list-filters-objects.sh

diff --git a/t/t6112-rev-list-filters-objects.sh b/t/t6112-rev-list-filters-objects.sh
new file mode 100755
index 0000000..26fa12f
--- /dev/null
+++ b/t/t6112-rev-list-filters-objects.sh
@@ -0,0 +1,223 @@
+#!/bin/sh
+
+test_description='git rev-list with object filtering for partial clone'
+
+. ./test-lib.sh
+
+# Test the blob:none filter.
+
+test_expect_success 'setup r1' '
+	echo "{print \$1}" >print_1.awk &&
+	echo "{print \$2}" >print_2.awk &&
+
+	git init r1 &&
+	for n in 1 2 3 4 5
+	do
+		echo "This is file: $n" > r1/file.$n
+		git -C r1 add file.$n
+		git -C r1 commit -m "$n"
+	done
+'
+
+test_expect_success 'verify blob:none omits all 5 blobs' '
+	git -C r1 ls-files -s file.1 file.2 file.3 file.4 file.5 \
+		| awk -f print_2.awk \
+		| sort >expected &&
+	git -C r1 rev-list HEAD --quiet --objects --filter-print-omitted --filter=blob:none \
+		| awk -f print_1.awk \
+		| sed "s/~//" >observed &&
+	test_cmp observed expected
+'
+
+test_expect_success 'verify emitted+omitted == all' '
+	git -C r1 rev-list HEAD --objects \
+		| awk -f print_1.awk \
+		| sort >expected &&
+	git -C r1 rev-list HEAD --objects --filter-print-omitted --filter=blob:none \
+		| awk -f print_1.awk \
+		| sed "s/~//" \
+		| sort >observed &&
+	test_cmp observed expected
+'
+
+
+# Test blob:limit=<n>[kmg] filter.
+# We boundary test around the size parameter.  The filter is strictly less than
+# the value, so size 500 and 1000 should have the same results, but 1001 should
+# filter more.
+
+test_expect_success 'setup r2' '
+	git init r2 &&
+	for n in 1000 10000
+	do
+		printf "%"$n"s" X > r2/large.$n
+		git -C r2 add large.$n
+		git -C r2 commit -m "$n"
+	done
+'
+
+test_expect_success 'verify blob:limit=500 omits all blobs' '
+	git -C r2 ls-files -s large.1000 large.10000 \
+		| awk -f print_2.awk \
+		| sort >expected &&
+	git -C r2 rev-list HEAD --quiet --objects --filter-print-omitted --filter=blob:limit=500 \
+		| awk -f print_1.awk \
+		| sed "s/~//" >observed &&
+	test_cmp observed expected
+'
+
+test_expect_success 'verify emitted+omitted == all' '
+	git -C r2 rev-list HEAD --objects \
+		| awk -f print_1.awk \
+		| sort >expected &&
+	git -C r2 rev-list HEAD --objects --filter-print-omitted --filter=blob:limit=500 \
+		| awk -f print_1.awk \
+		| sed "s/~//" \
+		| sort >observed &&
+	test_cmp observed expected
+'
+
+test_expect_success 'verify blob:limit=1000' '
+	git -C r2 ls-files -s large.1000 large.10000 \
+		| awk -f print_2.awk \
+		| sort >expected &&
+	git -C r2 rev-list HEAD --quiet --objects --filter-print-omitted --filter=blob:limit=1000 \
+		| awk -f print_1.awk \
+		| sed "s/~//" >observed &&
+	test_cmp observed expected
+'
+
+test_expect_success 'verify blob:limit=1001' '
+	git -C r2 ls-files -s large.10000 \
+		| awk -f print_2.awk \
+		| sort >expected &&
+	git -C r2 rev-list HEAD --quiet --objects --filter-print-omitted --filter=blob:limit=1001 \
+		| awk -f print_1.awk \
+		| sed "s/~//" >observed &&
+	test_cmp observed expected
+'
+
+test_expect_success 'verify blob:limit=1k' '
+	git -C r2 ls-files -s large.10000 \
+		| awk -f print_2.awk \
+		| sort >expected &&
+	git -C r2 rev-list HEAD --quiet --objects --filter-print-omitted --filter=blob:limit=1k \
+		| awk -f print_1.awk \
+		| sed "s/~//" >observed &&
+	test_cmp observed expected
+'
+
+test_expect_success 'verify blob:limit=1m' '
+	cat </dev/null >expected &&
+	git -C r2 rev-list HEAD --quiet --objects --filter-print-omitted --filter=blob:limit=1m \
+		| awk -f print_1.awk \
+		| sed "s/~//" >observed &&
+	test_cmp observed expected
+'
+
+# Test sparse:path=<path> filter.
+# Use a local file containing a sparse-checkout specification to filter
+# out blobs not required for the corresponding sparse-checkout.  We do not
+# require sparse-checkout to actually be enabled.
+
+test_expect_success 'setup r3' '
+	git init r3 &&
+	mkdir r3/dir1 &&
+	for n in sparse1 sparse2
+	do
+		echo "This is file: $n" > r3/$n
+		git -C r3 add $n
+		echo "This is file: dir1/$n" > r3/dir1/$n
+		git -C r3 add dir1/$n
+	done &&
+	git -C r3 commit -m "sparse" &&
+	echo dir1/ >pattern1 &&
+	echo sparse1 >pattern2
+'
+
+test_expect_success 'verify sparse:path=pattern1 omits top-level files' '
+	git -C r3 ls-files -s sparse1 sparse2 \
+		| awk -f print_2.awk \
+		| sort >expected &&
+	git -C r3 rev-list HEAD --quiet --objects --filter-print-omitted --filter=sparse:path=../pattern1 \
+		| awk -f print_1.awk \
+		| sed "s/~//" >observed &&
+	test_cmp observed expected
+'
+
+test_expect_success 'verify sparse:path=pattern2 omits both sparse2 files' '
+	git -C r3 ls-files -s sparse2 dir1/sparse2 \
+		| awk -f print_2.awk \
+		| sort >expected &&
+	git -C r3 rev-list HEAD --quiet --objects --filter-print-omitted --filter=sparse:path=../pattern2 \
+		| awk -f print_1.awk \
+		| sed "s/~//" >observed &&
+	test_cmp observed expected
+'
+
+# Test sparse:oid=<oid-ish> filter.
+# Like sparse:path, but we get the sparse-checkout specification from
+# a blob rather than a file on disk.
+
+test_expect_success 'setup r3 part 2' '
+	echo dir1/ >r3/pattern &&
+	git -C r3 add pattern &&
+	git -C r3 commit -m "pattern"
+'
+
+test_expect_success 'verify sparse:oid=OID omits top-level files' '
+	git -C r3 ls-files -s pattern sparse1 sparse2 \
+		| awk -f print_2.awk \
+		| sort >expected &&
+	oid=$(git -C r3 ls-files -s pattern | awk -f print_2.awk) &&
+	git -C r3 rev-list HEAD --quiet --objects --filter-print-omitted --filter=sparse:oid=$oid \
+		| awk -f print_1.awk \
+		| sed "s/~//" >observed &&
+	test_cmp observed expected
+'
+
+test_expect_success 'verify sparse:oid=oid-ish omits top-level files' '
+	git -C r3 ls-files -s pattern sparse1 sparse2 \
+		| awk -f print_2.awk \
+		| sort >expected &&
+	git -C r3 rev-list HEAD --quiet --objects --filter-print-omitted --filter=sparse:oid=master:pattern \
+		| awk -f print_1.awk \
+		| sed "s/~//" >observed &&
+	test_cmp observed expected
+'
+
+# Delete some loose objects and use rev-list, but WITHOUT any filtering.
+# This models previously omitted objects that we did not receive.
+
+test_expect_success 'rev-list W/ print-missing' '
+	git -C r1 ls-files -s file.1 file.2 file.3 file.4 file.5 \
+		| awk -f print_2.awk \
+		| sort >expected &&
+	for id in `cat expected | sed "s|..|&/|"`
+	do
+		rm r1/.git/objects/$id
+	done &&
+	git -C r1 rev-list --quiet HEAD --filter-print-missing --objects \
+		| awk -f print_1.awk \
+		| sed "s/?//" \
+		| sort >observed &&
+	test_cmp observed expected
+'
+
+test_expect_success 'rev-list W/O print-missing fails' '
+	test_must_fail git -C r1 rev-list --quiet --objects HEAD
+'
+
+test_expect_success 'rev-list W/ extension.partialcloneremote set succeeds' '
+	git -C r1 config --local core.repositoryformatversion 1 &&
+	git -C r1 config --local extensions.partialcloneremote "origin" &&
+	git -C r1 rev-list --quiet --objects HEAD
+'
+
+test_expect_success 'rev-list W/ extension.partialclonefilter set succeeds' '
+	git -C r1 config --local core.repositoryformatversion 1 &&
+	git -C r1 config --local extensions.partialclonefilter "something" &&
+	git -C r1 rev-list --quiet --objects HEAD
+'
+
+test_done
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 12/13] pack-objects: add list-objects filtering
  2017-10-24 18:53 [PATCH 00/13] WIP Partial clone part 1: object filtering Jeff Hostetler
                   ` (10 preceding siblings ...)
  2017-10-24 18:53 ` [PATCH 11/13] t6112: rev-list object filtering test Jeff Hostetler
@ 2017-10-24 18:53 ` Jeff Hostetler
  2017-10-24 18:53 ` [PATCH 13/13] t5317: pack-objects object filtering test Jeff Hostetler
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Jeff Hostetler @ 2017-10-24 18:53 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, jonathantanmy, Jeff Hostetler

From: Jeff Hostetler <jeffhost@microsoft.com>

Teach pack-objects to use the filtering provided by the
traverse_commit_list_filtered() interface to omit unwanted
objects from the resulting packfile.

This feature is intended for partial clone/fetch.

Filtering requires the use of the "--stdout" option.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 Documentation/git-pack-objects.txt |  8 +++++++-
 builtin/pack-objects.c             | 18 +++++++++++++++++-
 2 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/Documentation/git-pack-objects.txt b/Documentation/git-pack-objects.txt
index 473a161..8b4a223 100644
--- a/Documentation/git-pack-objects.txt
+++ b/Documentation/git-pack-objects.txt
@@ -12,7 +12,8 @@ SYNOPSIS
 'git pack-objects' [-q | --progress | --all-progress] [--all-progress-implied]
 	[--no-reuse-delta] [--delta-base-offset] [--non-empty]
 	[--local] [--incremental] [--window=<n>] [--depth=<n>]
-	[--revs [--unpacked | --all]] [--stdout | base-name]
+	[--revs [--unpacked | --all]]
+	[--stdout [--filter=<filter-spec>] | base-name]
 	[--shallow] [--keep-true-parents] < object-list
 
 
@@ -236,6 +237,11 @@ So does `git bundle` (see linkgit:git-bundle[1]) when it creates a bundle.
 	With this option, parents that are hidden by grafts are packed
 	nevertheless.
 
+--filter=<filter-spec>::
+	Requires `--stdout`.  Omits certain objects (usually blobs) from
+	the resulting packfile.  See linkgit:git-rev-list[1] for valid
+	`<filter-spec>` forms.
+
 SEE ALSO
 --------
 linkgit:git-rev-list[1]
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 6e77dfd..a251850 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -79,6 +79,8 @@ static unsigned long cache_max_small_delta_size = 1000;
 
 static unsigned long window_memory_limit = 0;
 
+static struct list_objects_filter_options filter_options;
+
 /*
  * stats
  */
@@ -2816,7 +2818,12 @@ static void get_object_list(int ac, const char **av)
 	if (prepare_revision_walk(&revs))
 		die("revision walk setup failed");
 	mark_edges_uninteresting(&revs, show_edge);
-	traverse_commit_list(&revs, show_commit, show_object, NULL);
+	if (filter_options.choice)
+		traverse_commit_list_filtered(&filter_options, &revs,
+					      show_commit, show_object,
+					      NULL, NULL);
+	else
+		traverse_commit_list(&revs, show_commit, show_object, NULL);
 
 	if (unpack_unreachable_expiration) {
 		revs.ignore_missing_links = 1;
@@ -2952,6 +2959,9 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 			 N_("use a bitmap index if available to speed up counting objects")),
 		OPT_BOOL(0, "write-bitmap-index", &write_bitmap_index,
 			 N_("write a bitmap index together with the pack index")),
+
+		OPT_PARSE_LIST_OBJECTS_FILTER(&filter_options),
+
 		OPT_END(),
 	};
 
@@ -3028,6 +3038,12 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (!rev_list_all || !rev_list_reflog || !rev_list_index)
 		unpack_unreachable_expiration = 0;
 
+	if (filter_options.choice) {
+		if (!pack_to_stdout)
+			die("cannot use filtering with an indexable pack.");
+		use_bitmap_index = 0;
+	}
+
 	/*
 	 * "soft" reasons not to use bitmaps - for on-disk repack by default we want
 	 *
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 13/13] t5317: pack-objects object filtering test
  2017-10-24 18:53 [PATCH 00/13] WIP Partial clone part 1: object filtering Jeff Hostetler
                   ` (11 preceding siblings ...)
  2017-10-24 18:53 ` [PATCH 12/13] pack-objects: add list-objects filtering Jeff Hostetler
@ 2017-10-24 18:53 ` Jeff Hostetler
  2017-10-25  4:57 ` [PATCH 00/13] WIP Partial clone part 1: object filtering Jonathan Tan
  2017-10-25  5:00 ` Junio C Hamano
  14 siblings, 0 replies; 37+ messages in thread
From: Jeff Hostetler @ 2017-10-24 18:53 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, jonathantanmy, Jeff Hostetler

From: Jeff Hostetler <jeffhost@microsoft.com>

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 t/t5317-pack-objects-filter-objects.sh | 384 +++++++++++++++++++++++++++++++++
 1 file changed, 384 insertions(+)
 create mode 100755 t/t5317-pack-objects-filter-objects.sh

diff --git a/t/t5317-pack-objects-filter-objects.sh b/t/t5317-pack-objects-filter-objects.sh
new file mode 100755
index 0000000..ef7a8f6
--- /dev/null
+++ b/t/t5317-pack-objects-filter-objects.sh
@@ -0,0 +1,384 @@
+#!/bin/sh
+
+test_description='git pack-objects with object filtering for partial clone'
+
+. ./test-lib.sh
+
+# Test blob:none filter.
+
+test_expect_success 'setup r1' '
+	echo "{print \$1}" >print_1.awk &&
+	echo "{print \$2}" >print_2.awk &&
+
+	git init r1 &&
+	for n in 1 2 3 4 5
+	do
+		echo "This is file: $n" > r1/file.$n
+		git -C r1 add file.$n
+		git -C r1 commit -m "$n"
+	done
+'
+
+test_expect_success 'verify blob count in normal packfile' '
+	git -C r1 ls-files -s file.1 file.2 file.3 file.4 file.5 \
+		| awk -f print_2.awk \
+		| sort >expected &&
+	git -C r1 pack-objects --rev --stdout >all.pack <<-EOF &&
+	HEAD
+	EOF
+	git -C r1 index-pack ../all.pack &&
+	git -C r1 verify-pack -v ../all.pack \
+		| grep blob \
+		| awk -f print_1.awk \
+		| sort >observed &&
+	test_cmp observed expected
+'
+
+test_expect_success 'verify blob:none packfile has no blobs' '
+	git -C r1 pack-objects --rev --stdout --filter=blob:none >filter.pack <<-EOF &&
+	HEAD
+	EOF
+	git -C r1 index-pack ../filter.pack &&
+	git -C r1 verify-pack -v ../filter.pack \
+		| grep blob \
+		| awk -f print_1.awk \
+		| sort >observed &&
+	nr=$(wc -l <observed) &&
+	test 0 -eq $nr
+'
+
+test_expect_success 'verify normal and blob:none packfiles have same commits/trees' '
+	git -C r1 verify-pack -v ../all.pack \
+		| grep -E "commit|tree" \
+		| awk -f print_1.awk \
+		| sort >expected &&
+	git -C r1 verify-pack -v ../filter.pack \
+		| grep -E "commit|tree" \
+		| awk -f print_1.awk \
+		| sort >observed &&
+	test_cmp observed expected
+'
+
+# Test blob:limit=<n>[kmg] filter.
+# We boundary test around the size parameter.  The filter is strictly less than
+# the value, so size 500 and 1000 should have the same results, but 1001 should
+# filter more.
+
+test_expect_success 'setup r2' '
+	git init r2 &&
+	for n in 1000 10000
+	do
+		printf "%"$n"s" X > r2/large.$n
+		git -C r2 add large.$n
+		git -C r2 commit -m "$n"
+	done
+'
+
+test_expect_success 'verify blob count in normal packfile' '
+	git -C r2 ls-files -s large.1000 large.10000 \
+		| awk -f print_2.awk \
+		| sort >expected &&
+	git -C r2 pack-objects --rev --stdout >all.pack <<-EOF &&
+	HEAD
+	EOF
+	git -C r2 index-pack ../all.pack &&
+	git -C r2 verify-pack -v ../all.pack \
+		| grep blob \
+		| awk -f print_1.awk \
+		| sort >observed &&
+	test_cmp observed expected
+'
+
+test_expect_success 'verify blob:limit=500 omits all blobs' '
+	git -C r2 pack-objects --rev --stdout --filter=blob:limit=500 >filter.pack <<-EOF &&
+	HEAD
+	EOF
+	git -C r2 index-pack ../filter.pack &&
+	git -C r2 verify-pack -v ../filter.pack \
+		| grep blob \
+		| awk -f print_1.awk \
+		| sort >observed &&
+	nr=$(wc -l <observed) &&
+	test 0 -eq $nr
+'
+
+test_expect_success 'verify blob:limit=1000' '
+	git -C r2 pack-objects --rev --stdout --filter=blob:limit=1000 >filter.pack <<-EOF &&
+	HEAD
+	EOF
+	git -C r2 index-pack ../filter.pack &&
+	git -C r2 verify-pack -v ../filter.pack \
+		| grep blob \
+		| awk -f print_1.awk \
+		| sort >observed &&
+	nr=$(wc -l <observed) &&
+	test 0 -eq $nr
+'
+
+test_expect_success 'verify blob:limit=1001' '
+	git -C r2 ls-files -s large.1000 \
+		| awk -f print_2.awk \
+		| sort >expected &&
+	git -C r2 pack-objects --rev --stdout --filter=blob:limit=1001 >filter.pack <<-EOF &&
+	HEAD
+	EOF
+	git -C r2 index-pack ../filter.pack &&
+	git -C r2 verify-pack -v ../filter.pack \
+		| grep blob \
+		| awk -f print_1.awk \
+		| sort >observed &&
+	test_cmp observed expected
+'
+
+test_expect_success 'verify blob:limit=10001' '
+	git -C r2 ls-files -s large.1000 large.10000 \
+		| awk -f print_2.awk \
+		| sort >expected &&
+	git -C r2 pack-objects --rev --stdout --filter=blob:limit=10001 >filter.pack <<-EOF &&
+	HEAD
+	EOF
+	git -C r2 index-pack ../filter.pack &&
+	git -C r2 verify-pack -v ../filter.pack \
+		| grep blob \
+		| awk -f print_1.awk \
+		| sort >observed &&
+	test_cmp observed expected
+'
+
+test_expect_success 'verify blob:limit=1k' '
+	git -C r2 ls-files -s large.1000 \
+		| awk -f print_2.awk \
+		| sort >expected &&
+	git -C r2 pack-objects --rev --stdout --filter=blob:limit=1k >filter.pack <<-EOF &&
+	HEAD
+	EOF
+	git -C r2 index-pack ../filter.pack &&
+	git -C r2 verify-pack -v ../filter.pack \
+		| grep blob \
+		| awk -f print_1.awk \
+		| sort >observed &&
+	test_cmp observed expected
+'
+
+test_expect_success 'verify blob:limit=1m' '
+	git -C r2 ls-files -s large.1000 large.10000 \
+		| awk -f print_2.awk \
+		| sort >expected &&
+	git -C r2 pack-objects --rev --stdout --filter=blob:limit=1m >filter.pack <<-EOF &&
+	HEAD
+	EOF
+	git -C r2 index-pack ../filter.pack &&
+	git -C r2 verify-pack -v ../filter.pack \
+		| grep blob \
+		| awk -f print_1.awk \
+		| sort >observed &&
+	test_cmp observed expected
+'
+
+test_expect_success 'verify normal and blob:limit packfiles have same commits/trees' '
+	git -C r2 verify-pack -v ../all.pack \
+		| grep -E "commit|tree" \
+		| awk -f print_1.awk \
+		| sort >expected &&
+	git -C r2 verify-pack -v ../filter.pack \
+		| grep -E "commit|tree" \
+		| awk -f print_1.awk \
+		| sort >observed &&
+	test_cmp observed expected
+'
+
+# Test sparse:path=<path> filter.
+# Use a local file containing a sparse-checkout specification to filter
+# out blobs not required for the corresponding sparse-checkout.  We do not
+# require sparse-checkout to actually be enabled.
+
+test_expect_success 'setup r3' '
+	git init r3 &&
+	mkdir r3/dir1 &&
+	for n in sparse1 sparse2
+	do
+		echo "This is file: $n" > r3/$n
+		git -C r3 add $n
+		echo "This is file: dir1/$n" > r3/dir1/$n
+		git -C r3 add dir1/$n
+	done &&
+	git -C r3 commit -m "sparse" &&
+	echo dir1/ >pattern1 &&
+	echo sparse1 >pattern2
+'
+
+test_expect_success 'verify blob count in normal packfile' '
+	git -C r3 ls-files -s sparse1 sparse2 dir1/sparse1 dir1/sparse2 \
+		| awk -f print_2.awk \
+		| sort >expected &&
+	git -C r3 pack-objects --rev --stdout >all.pack <<-EOF &&
+	HEAD
+	EOF
+	git -C r3 index-pack ../all.pack &&
+	git -C r3 verify-pack -v ../all.pack \
+		| grep blob \
+		| awk -f print_1.awk \
+		| sort >observed &&
+	test_cmp observed expected
+'
+
+test_expect_success 'verify sparse:path=pattern1' '
+	git -C r3 ls-files -s dir1/sparse1 dir1/sparse2 \
+		| awk -f print_2.awk \
+		| sort >expected &&
+	git -C r3 pack-objects --rev --stdout --filter=sparse:path=../pattern1 >filter.pack <<-EOF &&
+	HEAD
+	EOF
+	git -C r3 index-pack ../filter.pack &&
+	git -C r3 verify-pack -v ../filter.pack \
+		| grep blob \
+		| awk -f print_1.awk \
+		| sort >observed &&
+	test_cmp observed expected
+'
+
+test_expect_success 'verify normal and sparse:path=pattern1 packfiles have same commits/trees' '
+	git -C r3 verify-pack -v ../all.pack \
+		| grep -E "commit|tree" \
+		| awk -f print_1.awk \
+		| sort >expected &&
+	git -C r3 verify-pack -v ../filter.pack \
+		| grep -E "commit|tree" \
+		| awk -f print_1.awk \
+		| sort >observed &&
+	test_cmp observed expected
+'
+
+test_expect_success 'verify sparse:path=pattern2' '
+	git -C r3 ls-files -s sparse1 dir1/sparse1 \
+		| awk -f print_2.awk \
+		| sort >expected &&
+	git -C r3 pack-objects --rev --stdout --filter=sparse:path=../pattern2 >filter.pack <<-EOF &&
+	HEAD
+	EOF
+	git -C r3 index-pack ../filter.pack &&
+	git -C r3 verify-pack -v ../filter.pack \
+		| grep blob \
+		| awk -f print_1.awk \
+		| sort >observed &&
+	test_cmp observed expected
+'
+
+test_expect_success 'verify normal and sparse:path=pattern2 packfiles have same commits/trees' '
+	git -C r3 verify-pack -v ../all.pack \
+		| grep -E "commit|tree" \
+		| awk -f print_1.awk \
+		| sort >expected &&
+	git -C r3 verify-pack -v ../filter.pack \
+		| grep -E "commit|tree" \
+		| awk -f print_1.awk \
+		| sort >observed &&
+	test_cmp observed expected
+'
+
+# Test sparse:oid=<oid-ish> filter.
+# Like sparse:path, but we get the sparse-checkout specification from
+# a blob rather than a file on disk.
+
+test_expect_success 'setup r4' '
+	git init r4 &&
+	mkdir r4/dir1 &&
+	for n in sparse1 sparse2
+	do
+		echo "This is file: $n" > r4/$n
+		git -C r4 add $n
+		echo "This is file: dir1/$n" > r4/dir1/$n
+		git -C r4 add dir1/$n
+	done &&
+	echo dir1/ >r4/pattern &&
+	git -C r4 add pattern &&
+	git -C r4 commit -m "pattern"
+'
+
+test_expect_success 'verify blob count in normal packfile' '
+	git -C r4 ls-files -s pattern sparse1 sparse2 dir1/sparse1 dir1/sparse2 \
+		| awk -f print_2.awk \
+		| sort >expected &&
+	git -C r4 pack-objects --rev --stdout >all.pack <<-EOF &&
+	HEAD
+	EOF
+	git -C r4 index-pack ../all.pack &&
+	git -C r4 verify-pack -v ../all.pack \
+		| grep blob \
+		| awk -f print_1.awk \
+		| sort >observed &&
+	test_cmp observed expected
+'
+
+test_expect_success 'verify sparse:oid=OID' '
+	git -C r4 ls-files -s dir1/sparse1 dir1/sparse2 \
+		| awk -f print_2.awk \
+		| sort >expected &&
+	oid=$(git -C r4 ls-files -s pattern | awk -f print_2.awk) &&
+	git -C r4 pack-objects --rev --stdout --filter=sparse:oid=$oid >filter.pack <<-EOF &&
+	HEAD
+	EOF
+	git -C r4 index-pack ../filter.pack &&
+	git -C r4 verify-pack -v ../filter.pack \
+		| grep blob \
+		| awk -f print_1.awk \
+		| sort >observed &&
+	test_cmp observed expected
+'
+
+test_expect_success 'verify sparse:oid=oid-ish' '
+	git -C r4 ls-files -s dir1/sparse1 dir1/sparse2 \
+		| awk -f print_2.awk \
+		| sort >expected &&
+	git -C r4 pack-objects --rev --stdout --filter=sparse:oid=master:pattern >filter.pack <<-EOF &&
+	HEAD
+	EOF
+	git -C r4 index-pack ../filter.pack &&
+	git -C r4 verify-pack -v ../filter.pack \
+		| grep blob \
+		| awk -f print_1.awk \
+		| sort >observed &&
+	test_cmp observed expected
+'
+
+# Delete some loose objects and use pack-objects, but WITHOUT any filtering.
+# This models previously omitted objects that we did not receive.
+
+test_expect_success 'setup r1 - delete loose blobs' '
+	git -C r1 ls-files -s file.1 file.2 file.3 file.4 file.5 \
+		| awk -f print_2.awk \
+		| sort >expected &&
+	for id in `cat expected | sed "s|..|&/|"`
+	do
+		rm r1/.git/objects/$id
+	done
+'
+
+test_expect_success 'verify pack-objects fails w/ missing objects' '
+	test_must_fail git -C r1 pack-objects --rev --stdout >miss.pack <<-EOF
+	HEAD
+	EOF
+'
+
+if ! test_have_prereq TODO; then
+	skip_all='TODO Allow pack-objects to work with missing objects'
+	test_done
+fi
+
+test_expect_success 'verify pack-objects w/ extension.partialcloneremote set succeeds' '
+	git -C r1 config --local core.repositoryformatversion 1 &&
+	git -C r1 config --local extensions.partialcloneremote "origin" &&
+	git -C r1 pack-objects --rev --stdout >miss.pack <<-EOF
+	HEAD
+	EOF
+'
+
+test_expect_success 'veify pack-objects w/ extension.partialclonefilter set succeeds' '
+	git -C r1 config --local core.repositoryformatversion 1 &&
+	git -C r1 config --local extensions.partialclonefilter "something" &&
+	git -C r1 pack-objects --rev --stdout >miss.pack <<-EOF
+	HEAD
+	EOF
+'
+
+test_done
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH 03/13] list-objects: filter objects in traverse_commit_list
  2017-10-24 18:53 ` [PATCH 03/13] list-objects: filter objects in traverse_commit_list Jeff Hostetler
@ 2017-10-25  4:05   ` Jonathan Tan
  2017-10-25 19:25     ` Jeff Hostetler
  0 siblings, 1 reply; 37+ messages in thread
From: Jonathan Tan @ 2017-10-25  4:05 UTC (permalink / raw)
  To: Jeff Hostetler
  Cc: Git mailing list, Junio C Hamano, Jeff King, Jeff Hostetler

On Tue, Oct 24, 2017 at 11:53 AM, Jeff Hostetler <git@jeffhostetler.com> wrote:

> +enum list_objects_filter_result {
> +       LOFR_ZERO      = 0,
> +       LOFR_MARK_SEEN = 1<<0,

Probably worth documenting, something like /* Mark this object so that
it is skipped for the rest of the traversal. */

> +       LOFR_SHOW      = 1<<1,

And something like /* Invoke show_object_fn on this object. This
object may be revisited unless LOFR_MARK_SEEN is also set. */

> +};
> +
> +/* See object.h and revision.h */
> +#define FILTER_REVISIT (1<<25)

I think this should be declared closer to its use - in the sparse
filter code or in the file that uses it. Wherever it is, also update
the chart in object.h to indicate that we're using this 25th bit.

> +
> +enum list_objects_filter_type {
> +       LOFT_BEGIN_TREE,
> +       LOFT_END_TREE,
> +       LOFT_BLOB
> +};
> +
> +typedef enum list_objects_filter_result list_objects_filter_result;
> +typedef enum list_objects_filter_type list_objects_filter_type;

I don't think we typedef enums in Git code.

> +
> +typedef list_objects_filter_result (*filter_object_fn)(
> +       list_objects_filter_type filter_type,
> +       struct object *obj,
> +       const char *pathname,
> +       const char *filename,
> +       void *filter_data);
> +
> +void traverse_commit_list_worker(
> +       struct rev_info *,
> +       show_commit_fn, show_object_fn, void *show_data,
> +       filter_object_fn filter, void *filter_data);

I think things would be much clearer if a default filter was declared
(matching the behavior as of this patch when filter == NULL), say:
static inline default_filter(args) { switch(filter_type) { case
LOFT_BEGIN_TREE: return LOFR_MARK_SEEN | LOFR_SHOW; case
LOFT_END_TREE: return LOFT_ZERO; ...

And inline traverse_commit_list() instead of putting it in the .c file.

This would reduce or eliminate the need to document
traverse_commit_list_worker, including what happens if filter is NULL,
and explain how a user would make their own filter_object_fn.

> +
> +#endif /* LIST_OBJECTS_H */
> --
> 2.9.3
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 01/13] dir: allow exclusions from blob in addition to file
  2017-10-24 18:53 ` [PATCH 01/13] dir: allow exclusions from blob in addition to file Jeff Hostetler
@ 2017-10-25  4:05   ` Eric Sunshine
  2017-10-25  6:43   ` Junio C Hamano
  1 sibling, 0 replies; 37+ messages in thread
From: Eric Sunshine @ 2017-10-25  4:05 UTC (permalink / raw)
  To: Jeff Hostetler
  Cc: Git List, Junio C Hamano, Jeff King, jonathantanmy,
	Jeff Hostetler

On Tue, Oct 24, 2017 at 2:53 PM, Jeff Hostetler <git@jeffhostetler.com> wrote:
> Refactor add_excludes() to separate the reading of the
> exclude file into a buffer and the parsing of the buffer
> into exclude_list items.
>
> Add add_excludes_from_blob_to_list() to allow an exclude
> file be specified with an OID.
>
> Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
> ---
> diff --git a/dir.c b/dir.c
> @@ -841,6 +856,38 @@ int add_excludes_from_file_to_list(const char *fname, const char *base,
> +int add_excludes_from_blob_to_list(
> +       struct object_id *oid,
> +       const char *base, int baselen,
> +       struct exclude_list *el)
> +{
> +       char *buf;
> +       unsigned long size;
> +       enum object_type type;
> +
> +       buf = read_sha1_file(oid->hash, &type, &size);
> +       if (!buf)
> +               return -1;
> +
> +       if (type != OBJ_BLOB) {
> +               free(buf);
> +               return -1;
> +       }
> +
> +       if (size == 0) {
> +               free(buf);
> +               return 0;
> +       }
> +
> +       if (buf[size - 1] != '\n') {
> +               buf = xrealloc(buf, st_add(size, 1));
> +               buf[size++] = '\n';
> +       }
> +
> +       add_excludes_from_buffer(buf, size, base, baselen, el);

Seeing all the free()'s above, at first glance, one wonders why 'buf'
isn't freed here after add_excludes_from_buffer(), however an
examination of that function shows that 'buf' is assigned to
el->filebuf and later freed by clear_exclude_list(). Okay.

> +       return 0;

Should this be returning the result of add_excludes_from_buffer()
rather than unconditionally returning 0?

> +}

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 07/13] list-objects-filter-options: common argument parsing
  2017-10-24 18:53 ` [PATCH 07/13] list-objects-filter-options: common argument parsing Jeff Hostetler
@ 2017-10-25  4:14   ` Jonathan Tan
  2017-10-25 19:28     ` Jeff Hostetler
  0 siblings, 1 reply; 37+ messages in thread
From: Jonathan Tan @ 2017-10-25  4:14 UTC (permalink / raw)
  To: Jeff Hostetler
  Cc: Git mailing list, Junio C Hamano, Jeff King, Jeff Hostetler

On Tue, Oct 24, 2017 at 11:53 AM, Jeff Hostetler <git@jeffhostetler.com> wrote:
> + * <arg> ::= blob:none
> + *           blob:limit:<n>[kmg]
> + *           sparse:oid:<oid-expression>
> + *           sparse:path:<pathname>

I notice in the code below that there are some usages of "=" instead
of ":" - could you clarify which one it is? (Ideally this would point
to one point of documentation which serves as both user and technical
documentation.)

> + */
> +int parse_list_objects_filter(struct list_objects_filter_options *filter_options,
> +                             const char *arg)
> +{
> +       struct object_context oc;
> +       struct object_id sparse_oid;
> +       const char *v0;
> +       const char *v1;
> +
> +       if (filter_options->choice)
> +               die(_("multiple object filter types cannot be combined"));
> +
> +       /*
> +        * TODO consider rejecting 'arg' if it contains any
> +        * TODO injection characters (since we might send this
> +        * TODO to a sub-command or to the server and we don't
> +        * TODO want to deal with legacy quoting/escaping for
> +        * TODO a new feature).
> +        */
> +
> +       filter_options->raw_value = strdup(arg);
> +
> +       if (skip_prefix(arg, "blob:", &v0) || skip_prefix(arg, "blobs:", &v0)) {

I know that some people prefer leniency, but I think it's better to
standardize on one form ("blob" instead of both "blob" and "blobs").

> +               if (!strcmp(v0, "none")) {
> +                       filter_options->choice = LOFC_BLOB_NONE;
> +                       return 0;
> +               }
> +
> +               if (skip_prefix(v0, "limit=", &v1) &&
> +                   git_parse_ulong(v1, &filter_options->blob_limit_value)) {
> +                       filter_options->choice = LOFC_BLOB_LIMIT;
> +                       return 0;
> +               }
> +       }
> +       else if (skip_prefix(arg, "sparse:", &v0)) {
> +               if (skip_prefix(v0, "oid=", &v1)) {
> +                       filter_options->choice = LOFC_SPARSE_OID;
> +                       if (!get_oid_with_context(v1, GET_OID_BLOB,
> +                                                 &sparse_oid, &oc)) {
> +                               /*
> +                                * We successfully converted the <oid-expr>
> +                                * into an actual OID.  Rewrite the raw_value
> +                                * in canonoical form with just the OID.
> +                                * (If we send this request to the server, we
> +                                * want an absolute expression rather than a
> +                                * local-ref-relative expression.)
> +                                */
> +                               free((char *)filter_options->raw_value);
> +                               filter_options->raw_value =
> +                                       xstrfmt("sparse:oid=%s",
> +                                               oid_to_hex(&sparse_oid));
> +                               filter_options->sparse_oid_value =
> +                                       oiddup(&sparse_oid);
> +                       } else {
> +                               /*
> +                                * We could not turn the <oid-expr> into an
> +                                * OID.  Leave the raw_value as is in case
> +                                * the server can parse it.  (It may refer to
> +                                * a branch, commit, or blob we don't have.)
> +                                */
> +                       }
> +                       return 0;
> +               }
> +
> +               if (skip_prefix(v0, "path=", &v1)) {
> +                       filter_options->choice = LOFC_SPARSE_PATH;
> +                       filter_options->sparse_path_value = strdup(v1);
> +                       return 0;
> +               }
> +       }
> +
> +       die(_("invalid filter expression '%s'"), arg);
> +       return 0;
> +}
> +
> +int opt_parse_list_objects_filter(const struct option *opt,
> +                                 const char *arg, int unset)
> +{
> +       struct list_objects_filter_options *filter_options = opt->value;
> +
> +       assert(arg);
> +       assert(!unset);
> +
> +       return parse_list_objects_filter(filter_options, arg);
> +}
> diff --git a/list-objects-filter-options.h b/list-objects-filter-options.h
> new file mode 100644
> index 0000000..23bd68e
> --- /dev/null
> +++ b/list-objects-filter-options.h
> @@ -0,0 +1,50 @@
> +#ifndef LIST_OBJECTS_FILTER_OPTIONS_H
> +#define LIST_OBJECTS_FILTER_OPTIONS_H
> +
> +#include "parse-options.h"
> +
> +/*
> + * Common declarations and utilities for filtering objects (such as omitting
> + * large blobs) in list_objects:traverse_commit_list() and git-rev-list.
> + */
> +
> +enum list_objects_filter_choice {
> +       LOFC_DISABLED = 0,
> +       LOFC_BLOB_NONE,
> +       LOFC_BLOB_LIMIT,
> +       LOFC_SPARSE_OID,
> +       LOFC_SPARSE_PATH,
> +};
> +
> +struct list_objects_filter_options {
> +       /*
> +        * The raw argument value given on the command line or
> +        * protocol request.  (The part after the "--keyword=".)
> +        */
> +       char *raw_value;
> +
> +       /*
> +        * Parsed values. Only 1 will be set depending on the flags below.
> +        */
> +       struct object_id *sparse_oid_value;
> +       char *sparse_path_value;
> +       unsigned long blob_limit_value;
> +
> +       enum list_objects_filter_choice choice;
> +};
> +
> +/* Normalized command line arguments */
> +#define CL_ARG__FILTER "filter"
> +
> +int parse_list_objects_filter(struct list_objects_filter_options *filter_options,
> +                             const char *arg);
> +
> +int opt_parse_list_objects_filter(const struct option *opt,
> +                                 const char *arg, int unset);
> +
> +#define OPT_PARSE_LIST_OBJECTS_FILTER(fo) \
> +       { OPTION_CALLBACK, 0, CL_ARG__FILTER, fo, N_("args"), \
> +         N_("object filtering"), PARSE_OPT_NONEG, \
> +         opt_parse_list_objects_filter }

Thanks - this does make it easier to have a standard argument name and
description everywhere.

> +
> +#endif /* LIST_OBJECTS_FILTER_OPTIONS_H */
> --
> 2.9.3
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 08/13] list-objects: add traverse_commit_list_filtered method
  2017-10-24 18:53 ` [PATCH 08/13] list-objects: add traverse_commit_list_filtered method Jeff Hostetler
@ 2017-10-25  4:24   ` Jonathan Tan
  2017-10-25 19:29     ` Jeff Hostetler
  0 siblings, 1 reply; 37+ messages in thread
From: Jonathan Tan @ 2017-10-25  4:24 UTC (permalink / raw)
  To: Jeff Hostetler
  Cc: Git mailing list, Junio C Hamano, Jeff King, Jeff Hostetler

On Tue, Oct 24, 2017 at 11:53 AM, Jeff Hostetler <git@jeffhostetler.com> wrote:
> +void traverse_commit_list_filtered(
> +       struct list_objects_filter_options *filter_options,
> +       struct rev_info *revs,
> +       show_commit_fn show_commit,
> +       show_object_fn show_object,
> +       list_objects_filter_map_foreach_cb print_omitted_object,
> +       void *show_data);

So the function call chain, if we wanted a filtered traversal, is:
traverse_commit_list_filtered -> traverse_commit_list__sparse_path
(and friends, and each algorithm is in its own file) ->
traverse_commit_list_worker

This makes the implementation of each algorithm more easily understood
(since they are all in their own files), but also increases the number
of global functions and code files. I personally would combine the
traverse_commit_list__* functions into one file
(list-objects-filtered.c), make them static, and also put
traverse_commit_list_filtered in there, but I understand that other
people in the Git project may differ on this.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 10/13] rev-list: add list-objects filtering support
  2017-10-24 18:53 ` [PATCH 10/13] rev-list: add list-objects filtering support Jeff Hostetler
@ 2017-10-25  4:41   ` Jonathan Tan
  2017-10-25 19:37     ` Jeff Hostetler
  0 siblings, 1 reply; 37+ messages in thread
From: Jonathan Tan @ 2017-10-25  4:41 UTC (permalink / raw)
  To: Jeff Hostetler
  Cc: Git mailing list, Junio C Hamano, Jeff King, Jeff Hostetler

On Tue, Oct 24, 2017 at 11:53 AM, Jeff Hostetler <git@jeffhostetler.com> wrote:
>  static void finish_object(struct object *obj, const char *name, void *cb_data)
>  {
>         struct rev_list_info *info = cb_data;
> -       if (obj->type == OBJ_BLOB && !has_object_file(&obj->oid))
> +       if (obj->type == OBJ_BLOB && !has_object_file(&obj->oid)) {
> +               if (arg_print_missing) {
> +                       list_objects_filter_map_insert(
> +                               &missing_objects, &obj->oid, name, obj->type);
> +                       return;
> +               }
> +
> +               /*
> +                * Relax consistency checks when we expect missing
> +                * objects because of partial-clone or a previous
> +                * partial-fetch.
> +                *
> +                * Note that this is independent of any filtering that
> +                * we are doing in this run.
> +                */
> +               if (is_partial_clone_registered())
> +                       return;
> +
>                 die("missing blob object '%s'", oid_to_hex(&obj->oid));

I'm fine with arg_print_missing suppressing lazy fetching (when I
rebase my patches on this, I'll have to ensure that fetch_if_missing
is set to 0 if arg_print_missing is true), but I think that the
behavior when arg_print_missing is false should be the opposite - we
should let has_object_file() perform the lazy fetching, and die if it
returns false (that is, if the fetching failed).

> +       }
>         if (info->revs->verify_objects && !obj->parsed && obj->type != OBJ_COMMIT)
>                 parse_object(&obj->oid);
>  }

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 00/13] WIP Partial clone part 1: object filtering
  2017-10-24 18:53 [PATCH 00/13] WIP Partial clone part 1: object filtering Jeff Hostetler
                   ` (12 preceding siblings ...)
  2017-10-24 18:53 ` [PATCH 13/13] t5317: pack-objects object filtering test Jeff Hostetler
@ 2017-10-25  4:57 ` Jonathan Tan
  2017-10-25  5:00 ` Junio C Hamano
  14 siblings, 0 replies; 37+ messages in thread
From: Jonathan Tan @ 2017-10-25  4:57 UTC (permalink / raw)
  To: Jeff Hostetler
  Cc: Git mailing list, Junio C Hamano, Jeff King, Jeff Hostetler

On Tue, Oct 24, 2017 at 11:53 AM, Jeff Hostetler <git@jeffhostetler.com> wrote:
> From: Jeff Hostetler <jeffhost@microsoft.com>
>
> I've been working with Jonathan Tan to combine our partial clone
> proposals.  This patch series represents a first step in that effort
> and introduces an object filtering mechanism to select unwanted
> objects.
>
> [1] traverse_commit_list and list-objects is extended to allow
>     various filters.
> [2] rev-list is extended to expose filtering.  This allows testing
>     of the filtering options.  And can be used later to predict
>     missing objects before commands like checkout or merge.
> [3] pack-objects is extended to use filtering parameters and build
>     packfiles that omit unwanted objects.
>
> This patch series lays the ground work for subsequent parts which
> will extend clone, fetch, fetch-pack, upload-pack, fsck, and etc.

Thanks - I've taken a look at all of them except for the partialclone
extension one, which I've only glanced over briefly. Apart from some
style issues (indentation and typedef-ing of enums) I think that they
generally look all right.

One possible issue with using a formatted filter string (e.g.
blob:none) directly passed to the server as opposed to actual
client-interpreted flags (e.g. --blob-byte-limit=0) is that a user may
be confused if the version of Git they're using supports fancier
filters, which will work if they're running rev-list locally, but not
when they're fetching from a not-so-fancy Git server. Maybe that is
fine, though - something we've discussed internally is an ability to
offload some calculations (e.g. git log -S) to Git servers, and if we
end up doing something similar to that, users will need to deal with
this problem anyway.

The reason why I only glanced over the partialclone patch is because I
think that that change needs more discussion than the rest, and it
would be good to get the others in first. I know that you included the
partialclone patch because it is needed by the rev-list one, but as I
commented in the rev-list one, I think that it does not need to be
aware of partial clones, at least for now. The overall ideas in the
partialclone patch seem good, though - in particular, one conceptual
difference from my patch [1] is that the filter setting is a property
of the repository as opposed to the remote, which does seem like an
improvement, because it makes it easier to make and explain e.g. a
"git log --use-repo-filter -S" command that uses the preconfigured
config.

[1] https://public-inbox.org/git/407a298b52a9e0a2ee4135fe844e35b9a14c0f7b.1506714999.git.jonathantanmy@google.com/

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 00/13] WIP Partial clone part 1: object filtering
  2017-10-24 18:53 [PATCH 00/13] WIP Partial clone part 1: object filtering Jeff Hostetler
                   ` (13 preceding siblings ...)
  2017-10-25  4:57 ` [PATCH 00/13] WIP Partial clone part 1: object filtering Jonathan Tan
@ 2017-10-25  5:00 ` Junio C Hamano
  2017-10-25  6:46   ` Jonathan Tan
  14 siblings, 1 reply; 37+ messages in thread
From: Junio C Hamano @ 2017-10-25  5:00 UTC (permalink / raw)
  To: Jeff Hostetler; +Cc: git, peff, jonathantanmy, Jeff Hostetler

Jeff Hostetler <git@jeffhostetler.com> writes:

> From: Jeff Hostetler <jeffhost@microsoft.com>
>
> I've been working with Jonathan Tan to combine our partial clone
> proposals.  This patch series represents a first step in that effort
> and introduces an object filtering mechanism to select unwanted
> objects.
>
> [1] traverse_commit_list and list-objects is extended to allow
>     various filters.
> [2] rev-list is extended to expose filtering.  This allows testing
>     of the filtering options.  And can be used later to predict
>     missing objects before commands like checkout or merge.
> [3] pack-objects is extended to use filtering parameters and build
>     packfiles that omit unwanted objects.
>
> This patch series lays the ground work for subsequent parts which
> will extend clone, fetch, fetch-pack, upload-pack, fsck, and etc.

OK, thanks for working well together.  So does this (1) build on
Jonathan's fsck-squelching series, or (2) ignores that and builds
filtering first, potentially leaving the codebase to a broken state
where it can create fsck-unclean repository until Jonathan's series
is rebased on top of this, or (3) something else?  [*1*]

I also saw a patch marked as "this is from Jonathan's earlier work",
taking the authorship (which to me implies that the changes were
extensive enough), so I am a bit at loss envisioning how this piece
fits in the bigger picture together with the other piece.


[Footnote]

*1* Not having the answer to this question does bother me, but it is
    perfectly fine if the answer is (2), especially while the series
    is in a WIP state.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 01/13] dir: allow exclusions from blob in addition to file
  2017-10-24 18:53 ` [PATCH 01/13] dir: allow exclusions from blob in addition to file Jeff Hostetler
  2017-10-25  4:05   ` Eric Sunshine
@ 2017-10-25  6:43   ` Junio C Hamano
  2017-10-25 14:54     ` Jeff Hostetler
  1 sibling, 1 reply; 37+ messages in thread
From: Junio C Hamano @ 2017-10-25  6:43 UTC (permalink / raw)
  To: Jeff Hostetler; +Cc: git, peff, jonathantanmy, Jeff Hostetler

Jeff Hostetler <git@jeffhostetler.com> writes:

> +static int add_excludes_from_buffer(char *buf, size_t size,
> +				    const char *base, int baselen,
> +				    struct exclude_list *el);
> +
>  /*
>   * Given a file with name "fname", read it (either from disk, or from
>   * an index if 'istate' is non-null), parse it and store the
> @@ -754,9 +758,9 @@ static int add_excludes(const char *fname, const char *base, int baselen,
>  			struct sha1_stat *sha1_stat)
>  {
>  	struct stat st;
> -	int fd, i, lineno = 1;
> +	int fd;
>  	size_t size = 0;
> -	char *buf, *entry;
> +	char *buf;
>  
>  	fd = open(fname, O_RDONLY);
>  	if (fd < 0 || fstat(fd, &st) < 0) {

The post-context of this hunk is quite interesting in that there is
a call to read_skip_worktree_file_from_index(); which essentially 
pretends as if we read from the filesystem but in fact it grabs the
blob object name registered in the index and reads from it.

The reason why it is interesting is because this patch adds yet
nother "let's instead read from a blob object" function and there is
no sign to make the existing one take advantage of the new function
seen in this patch.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 00/13] WIP Partial clone part 1: object filtering
  2017-10-25  5:00 ` Junio C Hamano
@ 2017-10-25  6:46   ` Jonathan Tan
  2017-10-25 15:39     ` Jeff Hostetler
                       ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: Jonathan Tan @ 2017-10-25  6:46 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jeff Hostetler, Git mailing list, Jeff King, Jeff Hostetler

On Tue, Oct 24, 2017 at 10:00 PM, Junio C Hamano <gitster@pobox.com> wrote:
> OK, thanks for working well together.  So does this (1) build on
> Jonathan's fsck-squelching series, or (2) ignores that and builds
> filtering first, potentially leaving the codebase to a broken state
> where it can create fsck-unclean repository until Jonathan's series
> is rebased on top of this, or (3) something else?  [*1*]

Excluding the partialclone patch (patch 9), I think that the answer is
(2), but I don't think that it leaves the codebase in a broken state.
In particular, none of the code modifies the repo, so it can't create
a fsck-unclean one.

Maybe one could say that this leaves the codebase with features that
no one will ever use in the absence of partial clone, but I don't
think that's true - rev-list with blob-size/sparse-specification
filter seems independently useful, at least (for example, when
desiring to operate on a repo in a sparse way without going through a
workdir), and if we're planning to allow listing of objects, we
probably should allow packing as well (especially since this doesn't
add much code).

The above is relevant only if we can exclude the partialclone patch,
but I think that we can and we should, as I wrote in my reply to Jeff
Hostetler [1].

As for how this patch set (excluding the partialclone patch) interacts
with my fsck series, they are relatively independent, as far as I can
tell. I'll rebase my fsck, gc, and lazy object fetch patches (but not
the fetch and clone parts, which we plan to instead adapt from Jeff
Hostetler's patches, as far as I know) on top of these and resend
those out once discussion on this has settled.

[1] https://public-inbox.org/git/CAGf8dg+8AR=XfSV92ODAtKTNjBnD1+oVZp9rs4Y4Otz_eZyTfg@mail.gmail.com/

> I also saw a patch marked as "this is from Jonathan's earlier work",
> taking the authorship (which to me implies that the changes were
> extensive enough), so I am a bit at loss envisioning how this piece
> fits in the bigger picture together with the other piece.

The patch you mentioned is the partialclone patch, which I think can
be considered separately from the rest (as I said above).

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 02/13] list-objects-filter-map: extend oidmap to collect omitted objects
  2017-10-24 18:53 ` [PATCH 02/13] list-objects-filter-map: extend oidmap to collect omitted objects Jeff Hostetler
@ 2017-10-25  7:10   ` Junio C Hamano
  2017-10-25 19:22     ` Jeff Hostetler
  0 siblings, 1 reply; 37+ messages in thread
From: Junio C Hamano @ 2017-10-25  7:10 UTC (permalink / raw)
  To: Jeff Hostetler; +Cc: git, peff, jonathantanmy, Jeff Hostetler

Jeff Hostetler <git@jeffhostetler.com> writes:

> From: Jeff Hostetler <jeffhost@microsoft.com>
>
> Create helper class to extend oidmap to collect a list of
> omitted or missing objects during traversal.

The reason why oidmap itself cannot be used is because the code
wants to record not just the object name but something else about
the object.  And attributes that the code may care about we can see
in this patch are the object type and the path it found.  

Is the plan to extend this set of attributes over time as different
"omitter"s are added?  Why was "path" chosen as a member of the
initial set and how it will be useful (also, what path would we
record for tags and commits)?

These "future plans" needs revealed upfront, instead of (or in
addition to) "will be used in a later commit".  As it is hard to
judge if "filter map" is an appropriate name for this thing without
knowing _how_ it is envisioned to be used.  "filter map" sounds more
like a map function that is consulted when we decide if we want to
drop the object, but from the looks of the code, it is used more to
record what was done to these objects.

Is it really a "map" (i.e. whose primary focus is to find out what
an object name is "mapped to" when we get an object name---e.g. we
notice an otherwise connected object is missing, and consult this
"map" to learn what the type/path is because we want to do X)?  Or
is it more like a "set of known-to-be-missing object" (i.e. whose
primary point is to serve as a set of object names and what a name
maps to is primarily for debugging)?  These are easier to answer if
we know how it will be used.

> This will be used in a later commit by the list-object filtering
> code.
>
> Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
> ---
> diff --git a/list-objects-filter-map.c b/list-objects-filter-map.c
> new file mode 100644
> index 0000000..7e496b3
> --- /dev/null
> +++ b/list-objects-filter-map.c
> @@ -0,0 +1,63 @@
> +#include "cache.h"
> +#include "list-objects-filter-map.h"
> +
> +int list_objects_filter_map_insert(struct oidmap *map,
> +				   const struct object_id *oid,
> +				   const char *pathname, enum object_type type)
> +{
> +	size_t len, size;
> +	struct list_objects_filter_map_entry *e;
> +
> +	if (oidmap_get(map, oid))
> +		return 1;

It is OK for the existing entry to record a path that is totally
different from what the caller has.  It is hard to judge without
knowing what pathname the callers are expected to call this function
with, but I am guessing that it is similar to the path shown in the
output from "rev-list --objects"---and if that is the case, it is
correct that the same object may be reached at different paths
depending on what tree the traversal begins at, so pathname recorded
in the map is merely "there is one tree somewhere that has this
object at this path".

For that matter, the caller may have a completely different type
from the object we saw earlier; not checking and flagging it as a
possible error makes me feel somewhat uneasy, but there probably is
little you can do at this layer of the code if you noticed such a
discrepancy so it may be OK to punt.

> +	len = ((pathname && *pathname) ? strlen(pathname) : 0);
> +	size = (offsetof(struct list_objects_filter_map_entry, pathname) + len + 1);
> +	e = xcalloc(1, size);
> +
> +	oidcpy(&e->entry.oid, oid);
> +	e->type = type;
> +	if (pathname && *pathname)
> +		strcpy(e->pathname, pathname);
> +
> +	oidmap_put(map, e);
> +	return 0;
> +}

The return value from the function needs to be documented in the
header to help callers.  It is not apparent why "we did already have
one" and "we now newly added" is interesting to the callers, for
example.  An obvious alternative implementation of this function
would return the pointer to an entry that records the object id
(i.e. either the one that was already there, or the one we created
because we saw this object for the first time), so that the caller
can do something interesting to it---again, because the reason why
we want this "filter map" is not explained at this stage, it is hard
to tell what that "sometehing interesting" would be.

> +static int my_cmp(const void *a, const void *b)
> +{
> +	const struct oidmap_entry *ea, *eb;
> +
> +	ea = *(const struct oidmap_entry **)a;
> +	eb = *(const struct oidmap_entry **)b;
> +
> +	return oidcmp(&ea->oid, &eb->oid);
> +}
> +
> +void list_objects_filter_map_foreach(struct oidmap *map,
> +				     list_objects_filter_map_foreach_cb cb,

Name a typedef of a function as something_fn, not something_cb;
something_cb is often the type of a struct to be fed to the callback
function.  And call such a parameter of type something_fn just fn.

> +				     void *cb_data)
> +{
> +	struct hashmap_iter iter;
> +	struct list_objects_filter_map_entry **array;
> +	struct list_objects_filter_map_entry *e;
> +	int k, nr;
> +
> +	nr = hashmap_get_size(&map->map);
> +	if (!nr)
> +		return;
> +
> +	array = xcalloc(nr, sizeof(*e));
> +
> +	k = 0;
> +	hashmap_iter_init(&map->map, &iter);
> +	while ((e = hashmap_iter_next(&iter)))
> +		array[k++] = e;
> +
> +	QSORT(array, nr, my_cmp);

It is entirely unclear why foreach() must return the object names in
order.

> +	for (k = 0; k < nr; k++)
> +		cb(k, nr, array[k], cb_data);

Also it is not clear if you wanted to expose the type of the
entry to the callback function.

An obvious alternative

	fn(&array[k].entry.oid, cb_data);

would allow you to keep the type of map-entry private to the map,
and also the callback does not need to know about k or nr.

I guess you are giving k and nr to allow the callers to do a
progress bar?  If that is the case, that's fine by me.  I still do
not understand why we uncondtionally sort, though.

> +
> +	free(array);
> +}
> diff --git a/list-objects-filter-map.h b/list-objects-filter-map.h
> new file mode 100644
> index 0000000..794fc81
> --- /dev/null
> +++ b/list-objects-filter-map.h
> @@ -0,0 +1,26 @@
> +#ifndef LIST_OBJECTS_FILTER_MAP_H
> +#define LIST_OBJECTS_FILTER_MAP_H
> +
> +#include "oidmap.h"
> +
> +struct list_objects_filter_map_entry {
> +	struct oidmap_entry entry; /* must be first */
> +	enum object_type type;
> +	char pathname[FLEX_ARRAY];
> +};
> +
> +extern int list_objects_filter_map_insert(
> +	struct oidmap *map,
> +	const struct object_id *oid,
> +	const char *pathname, enum object_type type);
> +
> +typedef void (*list_objects_filter_map_foreach_cb)(
> +	int i, int i_limit,
> +	struct list_objects_filter_map_entry *e, void *cb_data);
> +
> +extern void list_objects_filter_map_foreach(
> +	struct oidmap *map,
> +	list_objects_filter_map_foreach_cb cb,
> +	void *cb_data);
> +
> +#endif /* LIST_OBJECTS_FILTER_MAP_H */

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 01/13] dir: allow exclusions from blob in addition to file
  2017-10-25  6:43   ` Junio C Hamano
@ 2017-10-25 14:54     ` Jeff Hostetler
  2017-10-26  3:47       ` Junio C Hamano
  0 siblings, 1 reply; 37+ messages in thread
From: Jeff Hostetler @ 2017-10-25 14:54 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, peff, jonathantanmy, Jeff Hostetler



On 10/25/2017 2:43 AM, Junio C Hamano wrote:
> Jeff Hostetler <git@jeffhostetler.com> writes:
> 
>> +static int add_excludes_from_buffer(char *buf, size_t size,
>> +				    const char *base, int baselen,
>> +				    struct exclude_list *el);
>> +
>>   /*
>>    * Given a file with name "fname", read it (either from disk, or from
>>    * an index if 'istate' is non-null), parse it and store the
>> @@ -754,9 +758,9 @@ static int add_excludes(const char *fname, const char *base, int baselen,
>>   			struct sha1_stat *sha1_stat)
>>   {
>>   	struct stat st;
>> -	int fd, i, lineno = 1;
>> +	int fd;
>>   	size_t size = 0;
>> -	char *buf, *entry;
>> +	char *buf;
>>   
>>   	fd = open(fname, O_RDONLY);
>>   	if (fd < 0 || fstat(fd, &st) < 0) {
> 
> The post-context of this hunk is quite interesting in that there is
> a call to read_skip_worktree_file_from_index(); which essentially
> pretends as if we read from the filesystem but in fact it grabs the
> blob object name registered in the index and reads from it.
> 
> The reason why it is interesting is because this patch adds yet
> nother "let's instead read from a blob object" function and there is
> no sign to make the existing one take advantage of the new function
> seen in this patch.
> 

The existing code handles use cases where you want to read the
exclusion list from a pathname in the worktree -- or from blob
named in the index when the pathname is not populated (presumably
because of the skip-worktree bit).

I was wanting to add a more general case (and perhaps my commit
message should be improved).  I want to be able to read it from
a blob not necessarily associated with the current commit or
not necessarily available on the local client, but yet known to
exist.  I'm thinking of the case the client could ask the server
to do a partial clone using a sparse-checkout specification stored
in a well-known location on the server.  The reason for this is
that, in this case, the client is pre-clone and doesn't have a
worktree or index.

With my "add_excludes_from_blob_to_list()", we can request a
blob-ish expression, such as "master:enlistments/foo".  In my
later commits associated with clone and fetch, we can use this
mechanism to let the client ask the server to filter using the
blob associated with this blob-ish.  If the client has the blob
(such as during a later fetch) and can resolve it, then it can
and send the server the OID, but it can also send the blob-ish
to the server and let it resolve it.

Jeff



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 00/13] WIP Partial clone part 1: object filtering
  2017-10-25  6:46   ` Jonathan Tan
@ 2017-10-25 15:39     ` Jeff Hostetler
  2017-10-26  2:09       ` Junio C Hamano
  2017-10-26  2:01     ` Junio C Hamano
  2017-10-30 22:27     ` Jonathan Nieder
  2 siblings, 1 reply; 37+ messages in thread
From: Jeff Hostetler @ 2017-10-25 15:39 UTC (permalink / raw)
  To: Jonathan Tan, Junio C Hamano; +Cc: Git mailing list, Jeff King, Jeff Hostetler



On 10/25/2017 2:46 AM, Jonathan Tan wrote:
> On Tue, Oct 24, 2017 at 10:00 PM, Junio C Hamano <gitster@pobox.com> wrote:
>> OK, thanks for working well together.  So does this (1) build on
>> Jonathan's fsck-squelching series, or (2) ignores that and builds
>> filtering first, potentially leaving the codebase to a broken state
>> where it can create fsck-unclean repository until Jonathan's series
>> is rebased on top of this, or (3) something else?  [*1*]
> 
> Excluding the partialclone patch (patch 9), I think that the answer is
> (2), but I don't think that it leaves the codebase in a broken state.
> In particular, none of the code modifies the repo, so it can't create
> a fsck-unclean one.

My part 1 series starts with filtering, rev-list, and pack-objects.
So, yes, it add new features that no one will use yet.  But it is useful
by itself.  For example, you can use rev-list to ask for the set of
missing objects that you would need to do a checkout (assuming you had
commits and trees, but no blobs or no large blobs) *before* actually
starting the checkout.

I was then going to lay Jonathan's fsck/gc/dynamic object fetch
on top of that.  I started that here:
	https://github.com/jeffhostetler/git/pull/7

Patch 9 just adds the "extensions.partialclone*" fields and is prep
for my rev-list and his fsck changes.  I included it sooner rather than
later so I can test rev-list on a repo with hand-deleted blobs.
But yes, it can be omitted from this series and included with the fsck
changes.


> 
> Maybe one could say that this leaves the codebase with features that
> no one will ever use in the absence of partial clone, but I don't
> think that's true - rev-list with blob-size/sparse-specification
> filter seems independently useful, at least (for example, when
> desiring to operate on a repo in a sparse way without going through a
> workdir), and if we're planning to allow listing of objects, we
> probably should allow packing as well (especially since this doesn't
> add much code).
> 
> The above is relevant only if we can exclude the partialclone patch,
> but I think that we can and we should, as I wrote in my reply to Jeff
> Hostetler [1].
> 
> As for how this patch set (excluding the partialclone patch) interacts
> with my fsck series, they are relatively independent, as far as I can
> tell. I'll rebase my fsck, gc, and lazy object fetch patches (but not
> the fetch and clone parts, which we plan to instead adapt from Jeff
> Hostetler's patches, as far as I know) on top of these and resend
> those out once discussion on this has settled.

Yes, I want to get Jonathan's fsck/gc/lazy changes built into part 2.
They came over easily and are independent of how/why there are missing
objects.

For part 3, I'd like to take my version of clone, fetch, fetch-pack,
and upload-pack (that talks with the filters from part 1) and incorporate
Jonathan's promisor concepts.  That merge is a little messier, so I'd
like to parts 1 and 2 a chance to get some feedback first.


> [1] https://public-inbox.org/git/CAGf8dg+8AR=XfSV92ODAtKTNjBnD1+oVZp9rs4Y4Otz_eZyTfg@mail.gmail.com/
> 
>> I also saw a patch marked as "this is from Jonathan's earlier work",
>> taking the authorship (which to me implies that the changes were
>> extensive enough), so I am a bit at loss envisioning how this piece
>> fits in the bigger picture together with the other piece.
> 
> The patch you mentioned is the partialclone patch, which I think can
> be considered separately from the rest (as I said above).

A question of mailing-list etiquette: in patch 9, I took Jonathan's
ideas for adding the "extensions.partialclone" setting and extended it
with some helper functions.  His change was part of a larger change
with other code (fsck, IIRC) that I wasn't ready for.  What is the
preferred way to give credit for something like this?


Thanks
Jeff



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 02/13] list-objects-filter-map: extend oidmap to collect omitted objects
  2017-10-25  7:10   ` Junio C Hamano
@ 2017-10-25 19:22     ` Jeff Hostetler
  2017-10-26  4:12       ` Junio C Hamano
  0 siblings, 1 reply; 37+ messages in thread
From: Jeff Hostetler @ 2017-10-25 19:22 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, peff, jonathantanmy, Jeff Hostetler



On 10/25/2017 3:10 AM, Junio C Hamano wrote:
> Jeff Hostetler <git@jeffhostetler.com> writes:
> 
>> From: Jeff Hostetler <jeffhost@microsoft.com>
>>
>> Create helper class to extend oidmap to collect a list of
>> omitted or missing objects during traversal.
> 
> The reason why oidmap itself cannot be used is because the code
> wants to record not just the object name but something else about
> the object.  And attributes that the code may care about we can see
> in this patch are the object type and the path it found.

I recently simplified the code in this version to not completely
sub-class oidmap, but to just use it along with a custom
_insert method that takes care of allocating the _entry
data.  I should update the commit message to reflect that.

> 
> Is the plan to extend this set of attributes over time as different
> "omitter"s are added?  Why was "path" chosen as a member of the
> initial set and how it will be useful (also, what path would we
> record for tags and commits)?

I envisioned this to let rev-list print the pathname of omitted
objects -- like "rev-list --objects" does for regular blobs.
I would leave the pathname NULL for tags and commits.

The pathname helps with debugging and testing, but also is
used by the sparse filter to avoid some expensive duplicate
is-excluded lookups.

Currently the 3 filters I have defined all use the same extra
data.  I suppose a future filter could want additional fields,
so maybe it would be better to refactor my "map-entry" to be
per-filter specific.

> 
> These "future plans" needs revealed upfront, instead of (or in
> addition to) "will be used in a later commit".  As it is hard to
> judge if "filter map" is an appropriate name for this thing without
> knowing _how_ it is envisioned to be used.  "filter map" sounds more
> like a map function that is consulted when we decide if we want to
> drop the object, but from the looks of the code, it is used more to
> record what was done to these objects.

Sorry, I meant a later commit in this patch series.  It is used by
commits 4, 5, 6, and 10 to actually do the filtering and collect a
list of omitted or missing objects.

> 
> Is it really a "map" (i.e. whose primary focus is to find out what
> an object name is "mapped to" when we get an object name---e.g. we
> notice an otherwise connected object is missing, and consult this
> "map" to learn what the type/path is because we want to do X)?  Or
> is it more like a "set of known-to-be-missing object" (i.e. whose
> primary point is to serve as a set of object names and what a name
> maps to is primarily for debugging)?  These are easier to answer if
> we know how it will be used.

I think of a "set" as a member? or not-member? class.
I think of a "map" as a member? or not-member? class but where each
member also has a value.  Sometimes map lookups just want to know
membership and sometimes the lookup wants the value.

Granted, having the key and value data stuffed into the same entry
(from hashmap's point of view, rather than a key having a pointer
to a value) does kind of blur the line, but I was thinking about
a map here.  (And I was building on oidmap which builds on hashmap,
so it seemed appropriate.)

> 
>> This will be used in a later commit by the list-object filtering
>> code.
>>
>> Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
>> ---
>> diff --git a/list-objects-filter-map.c b/list-objects-filter-map.c
>> new file mode 100644
>> index 0000000..7e496b3
>> --- /dev/null
>> +++ b/list-objects-filter-map.c
>> @@ -0,0 +1,63 @@
>> +#include "cache.h"
>> +#include "list-objects-filter-map.h"
>> +
>> +int list_objects_filter_map_insert(struct oidmap *map,
>> +				   const struct object_id *oid,
>> +				   const char *pathname, enum object_type type)
>> +{
>> +	size_t len, size;
>> +	struct list_objects_filter_map_entry *e;
>> +
>> +	if (oidmap_get(map, oid))
>> +		return 1;
> 
> It is OK for the existing entry to record a path that is totally
> different from what the caller has.  It is hard to judge without
> knowing what pathname the callers are expected to call this function
> with, but I am guessing that it is similar to the path shown in the
> output from "rev-list --objects"---and if that is the case, it is
> correct that the same object may be reached at different paths
> depending on what tree the traversal begins at, so pathname recorded
> in the map is merely "there is one tree somewhere that has this
> object at this path".

Right, the first observed pathname is as good as any.

> 
> For that matter, the caller may have a completely different type
> from the object we saw earlier; not checking and flagging it as a
> possible error makes me feel somewhat uneasy, but there probably is
> little you can do at this layer of the code if you noticed such a
> discrepancy so it may be OK to punt.

I could assert() that the types match, but right there's not much
we can do about it at this layer.

> 
>> +	len = ((pathname && *pathname) ? strlen(pathname) : 0);
>> +	size = (offsetof(struct list_objects_filter_map_entry, pathname) + len + 1);
>> +	e = xcalloc(1, size);
>> +
>> +	oidcpy(&e->entry.oid, oid);
>> +	e->type = type;
>> +	if (pathname && *pathname)
>> +		strcpy(e->pathname, pathname);
>> +
>> +	oidmap_put(map, e);
>> +	return 0;
>> +}
> 
> The return value from the function needs to be documented in the
> header to help callers.  It is not apparent why "we did already have
> one" and "we now newly added" is interesting to the callers, for
> example.  An obvious alternative implementation of this function
> would return the pointer to an entry that records the object id
> (i.e. either the one that was already there, or the one we created
> because we saw this object for the first time), so that the caller
> can do something interesting to it---again, because the reason why
> we want this "filter map" is not explained at this stage, it is hard
> to tell what that "sometehing interesting" would be.

good point.  thanks.

> 
>> +static int my_cmp(const void *a, const void *b)
>> +{
>> +	const struct oidmap_entry *ea, *eb;
>> +
>> +	ea = *(const struct oidmap_entry **)a;
>> +	eb = *(const struct oidmap_entry **)b;
>> +
>> +	return oidcmp(&ea->oid, &eb->oid);
>> +}
>> +
>> +void list_objects_filter_map_foreach(struct oidmap *map,
>> +				     list_objects_filter_map_foreach_cb cb,
> 
> Name a typedef of a function as something_fn, not something_cb;
> something_cb is often the type of a struct to be fed to the callback
> function.  And call such a parameter of type something_fn just fn.
> 

ok.


>> +				     void *cb_data)
>> +{
>> +	struct hashmap_iter iter;
>> +	struct list_objects_filter_map_entry **array;
>> +	struct list_objects_filter_map_entry *e;
>> +	int k, nr;
>> +
>> +	nr = hashmap_get_size(&map->map);
>> +	if (!nr)
>> +		return;
>> +
>> +	array = xcalloc(nr, sizeof(*e));
>> +
>> +	k = 0;
>> +	hashmap_iter_init(&map->map, &iter);
>> +	while ((e = hashmap_iter_next(&iter)))
>> +		array[k++] = e;
>> +
>> +	QSORT(array, nr, my_cmp);
> 
> It is entirely unclear why foreach() must return the object names in
> order.
> 
>> +	for (k = 0; k < nr; k++)
>> +		cb(k, nr, array[k], cb_data);
> 
> Also it is not clear if you wanted to expose the type of the
> entry to the callback function.
> 
> An obvious alternative
> 
> 	fn(&array[k].entry.oid, cb_data);
> 
> would allow you to keep the type of map-entry private to the map,
> and also the callback does not need to know about k or nr.
> 
> I guess you are giving k and nr to allow the callers to do a
> progress bar?  If that is the case, that's fine by me.  I still do
> not understand why we uncondtionally sort, though.

The thought was that we would sort the OIDs so that things
like rev-list could print the omitted/missing objects in OID
order.  Not critical that we do it here, but I thought it would
help callers.

I included the {k, nr} so that the callback could dump header/trailer
information when reporting the results or pre-allocate an array.
I'll look at refactoring this -- I never quite liked how it turned
out anyway -- especially with the oidmap simplifications.

>> +
>> +	free(array);
>> +}
>> diff --git a/list-objects-filter-map.h b/list-objects-filter-map.h
>> new file mode 100644
>> index 0000000..794fc81
>> --- /dev/null
>> +++ b/list-objects-filter-map.h
>> @@ -0,0 +1,26 @@
>> +#ifndef LIST_OBJECTS_FILTER_MAP_H
>> +#define LIST_OBJECTS_FILTER_MAP_H
>> +
>> +#include "oidmap.h"
>> +
>> +struct list_objects_filter_map_entry {
>> +	struct oidmap_entry entry; /* must be first */
>> +	enum object_type type;
>> +	char pathname[FLEX_ARRAY];
>> +};
>> +
>> +extern int list_objects_filter_map_insert(
>> +	struct oidmap *map,
>> +	const struct object_id *oid,
>> +	const char *pathname, enum object_type type);
>> +
>> +typedef void (*list_objects_filter_map_foreach_cb)(
>> +	int i, int i_limit,
>> +	struct list_objects_filter_map_entry *e, void *cb_data);
>> +
>> +extern void list_objects_filter_map_foreach(
>> +	struct oidmap *map,
>> +	list_objects_filter_map_foreach_cb cb,
>> +	void *cb_data);
>> +
>> +#endif /* LIST_OBJECTS_FILTER_MAP_H */

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 03/13] list-objects: filter objects in traverse_commit_list
  2017-10-25  4:05   ` Jonathan Tan
@ 2017-10-25 19:25     ` Jeff Hostetler
  0 siblings, 0 replies; 37+ messages in thread
From: Jeff Hostetler @ 2017-10-25 19:25 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Git mailing list, Junio C Hamano, Jeff King, Jeff Hostetler



On 10/25/2017 12:05 AM, Jonathan Tan wrote:
> On Tue, Oct 24, 2017 at 11:53 AM, Jeff Hostetler <git@jeffhostetler.com> wrote:
> 
>> +enum list_objects_filter_result {
>> +       LOFR_ZERO      = 0,
>> +       LOFR_MARK_SEEN = 1<<0,
> 
> Probably worth documenting, something like /* Mark this object so that
> it is skipped for the rest of the traversal. */
> 
>> +       LOFR_SHOW      = 1<<1,
> 
> And something like /* Invoke show_object_fn on this object. This
> object may be revisited unless LOFR_MARK_SEEN is also set. */
> 
>> +};
>> +
>> +/* See object.h and revision.h */
>> +#define FILTER_REVISIT (1<<25)
> 
> I think this should be declared closer to its use - in the sparse
> filter code or in the file that uses it. Wherever it is, also update
> the chart in object.h to indicate that we're using this 25th bit.
> 
>> +
>> +enum list_objects_filter_type {
>> +       LOFT_BEGIN_TREE,
>> +       LOFT_END_TREE,
>> +       LOFT_BLOB
>> +};
>> +
>> +typedef enum list_objects_filter_result list_objects_filter_result;
>> +typedef enum list_objects_filter_type list_objects_filter_type;
> 
> I don't think we typedef enums in Git code.
> 
>> +
>> +typedef list_objects_filter_result (*filter_object_fn)(
>> +       list_objects_filter_type filter_type,
>> +       struct object *obj,
>> +       const char *pathname,
>> +       const char *filename,
>> +       void *filter_data);
>> +
>> +void traverse_commit_list_worker(
>> +       struct rev_info *,
>> +       show_commit_fn, show_object_fn, void *show_data,
>> +       filter_object_fn filter, void *filter_data);
> 
> I think things would be much clearer if a default filter was declared
> (matching the behavior as of this patch when filter == NULL), say:
> static inline default_filter(args) { switch(filter_type) { case
> LOFT_BEGIN_TREE: return LOFR_MARK_SEEN | LOFR_SHOW; case
> LOFT_END_TREE: return LOFT_ZERO; ...
> 
> And inline traverse_commit_list() instead of putting it in the .c file.
> 
> This would reduce or eliminate the need to document
> traverse_commit_list_worker, including what happens if filter is NULL,
> and explain how a user would make their own filter_object_fn.
> 
>> +
>> +#endif /* LIST_OBJECTS_H */
>> --
>> 2.9.3
>>

I'll give that a try.  Thanks!

Jeff

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 07/13] list-objects-filter-options: common argument parsing
  2017-10-25  4:14   ` Jonathan Tan
@ 2017-10-25 19:28     ` Jeff Hostetler
  0 siblings, 0 replies; 37+ messages in thread
From: Jeff Hostetler @ 2017-10-25 19:28 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Git mailing list, Junio C Hamano, Jeff King, Jeff Hostetler



On 10/25/2017 12:14 AM, Jonathan Tan wrote:
> On Tue, Oct 24, 2017 at 11:53 AM, Jeff Hostetler <git@jeffhostetler.com> wrote:
>> + * <arg> ::= blob:none
>> + *           blob:limit:<n>[kmg]
>> + *           sparse:oid:<oid-expression>
>> + *           sparse:path:<pathname>
> 
> I notice in the code below that there are some usages of "=" instead
> of ":" - could you clarify which one it is? (Ideally this would point
> to one point of documentation which serves as both user and technical
> documentation.)

good catch.  thanks!
  
>> + */
>> +int parse_list_objects_filter(struct list_objects_filter_options *filter_options,
>> +                             const char *arg)
>> +{
>> +       struct object_context oc;
>> +       struct object_id sparse_oid;
>> +       const char *v0;
>> +       const char *v1;
>> +
>> +       if (filter_options->choice)
>> +               die(_("multiple object filter types cannot be combined"));
>> +
>> +       /*
>> +        * TODO consider rejecting 'arg' if it contains any
>> +        * TODO injection characters (since we might send this
>> +        * TODO to a sub-command or to the server and we don't
>> +        * TODO want to deal with legacy quoting/escaping for
>> +        * TODO a new feature).
>> +        */
>> +
>> +       filter_options->raw_value = strdup(arg);
>> +
>> +       if (skip_prefix(arg, "blob:", &v0) || skip_prefix(arg, "blobs:", &v0)) {
> 
> I know that some people prefer leniency, but I think it's better to
> standardize on one form ("blob" instead of both "blob" and "blobs").

I could go either way on this.  (I kept mistyping it during interactive testing,
so I added both cases...)

> 
>> +               if (!strcmp(v0, "none")) {
>> +                       filter_options->choice = LOFC_BLOB_NONE;
>> +                       return 0;
>> +               }
>> +
>> +               if (skip_prefix(v0, "limit=", &v1) &&
>> +                   git_parse_ulong(v1, &filter_options->blob_limit_value)) {
>> +                       filter_options->choice = LOFC_BLOB_LIMIT;
>> +                       return 0;
>> +               }
>> +       }
>> +       else if (skip_prefix(arg, "sparse:", &v0)) {
>> +               if (skip_prefix(v0, "oid=", &v1)) {
>> +                       filter_options->choice = LOFC_SPARSE_OID;
>> +                       if (!get_oid_with_context(v1, GET_OID_BLOB,
>> +                                                 &sparse_oid, &oc)) {
>> +                               /*
>> +                                * We successfully converted the <oid-expr>
>> +                                * into an actual OID.  Rewrite the raw_value
>> +                                * in canonoical form with just the OID.
>> +                                * (If we send this request to the server, we
>> +                                * want an absolute expression rather than a
>> +                                * local-ref-relative expression.)
>> +                                */
>> +                               free((char *)filter_options->raw_value);
>> +                               filter_options->raw_value =
>> +                                       xstrfmt("sparse:oid=%s",
>> +                                               oid_to_hex(&sparse_oid));
>> +                               filter_options->sparse_oid_value =
>> +                                       oiddup(&sparse_oid);
>> +                       } else {
>> +                               /*
>> +                                * We could not turn the <oid-expr> into an
>> +                                * OID.  Leave the raw_value as is in case
>> +                                * the server can parse it.  (It may refer to
>> +                                * a branch, commit, or blob we don't have.)
>> +                                */
>> +                       }
>> +                       return 0;
>> +               }
>> +
>> +               if (skip_prefix(v0, "path=", &v1)) {
>> +                       filter_options->choice = LOFC_SPARSE_PATH;
>> +                       filter_options->sparse_path_value = strdup(v1);
>> +                       return 0;
>> +               }
>> +       }
>> +
>> +       die(_("invalid filter expression '%s'"), arg);
>> +       return 0;
>> +}
>> +
>> +int opt_parse_list_objects_filter(const struct option *opt,
>> +                                 const char *arg, int unset)
>> +{
>> +       struct list_objects_filter_options *filter_options = opt->value;
>> +
>> +       assert(arg);
>> +       assert(!unset);
>> +
>> +       return parse_list_objects_filter(filter_options, arg);
>> +}
>> diff --git a/list-objects-filter-options.h b/list-objects-filter-options.h
>> new file mode 100644
>> index 0000000..23bd68e
>> --- /dev/null
>> +++ b/list-objects-filter-options.h
>> @@ -0,0 +1,50 @@
>> +#ifndef LIST_OBJECTS_FILTER_OPTIONS_H
>> +#define LIST_OBJECTS_FILTER_OPTIONS_H
>> +
>> +#include "parse-options.h"
>> +
>> +/*
>> + * Common declarations and utilities for filtering objects (such as omitting
>> + * large blobs) in list_objects:traverse_commit_list() and git-rev-list.
>> + */
>> +
>> +enum list_objects_filter_choice {
>> +       LOFC_DISABLED = 0,
>> +       LOFC_BLOB_NONE,
>> +       LOFC_BLOB_LIMIT,
>> +       LOFC_SPARSE_OID,
>> +       LOFC_SPARSE_PATH,
>> +};
>> +
>> +struct list_objects_filter_options {
>> +       /*
>> +        * The raw argument value given on the command line or
>> +        * protocol request.  (The part after the "--keyword=".)
>> +        */
>> +       char *raw_value;
>> +
>> +       /*
>> +        * Parsed values. Only 1 will be set depending on the flags below.
>> +        */
>> +       struct object_id *sparse_oid_value;
>> +       char *sparse_path_value;
>> +       unsigned long blob_limit_value;
>> +
>> +       enum list_objects_filter_choice choice;
>> +};
>> +
>> +/* Normalized command line arguments */
>> +#define CL_ARG__FILTER "filter"
>> +
>> +int parse_list_objects_filter(struct list_objects_filter_options *filter_options,
>> +                             const char *arg);
>> +
>> +int opt_parse_list_objects_filter(const struct option *opt,
>> +                                 const char *arg, int unset);
>> +
>> +#define OPT_PARSE_LIST_OBJECTS_FILTER(fo) \
>> +       { OPTION_CALLBACK, 0, CL_ARG__FILTER, fo, N_("args"), \
>> +         N_("object filtering"), PARSE_OPT_NONEG, \
>> +         opt_parse_list_objects_filter }
> 
> Thanks - this does make it easier to have a standard argument name and
> description everywhere.
> 
>> +
>> +#endif /* LIST_OBJECTS_FILTER_OPTIONS_H */
>> --
>> 2.9.3
>>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 08/13] list-objects: add traverse_commit_list_filtered method
  2017-10-25  4:24   ` Jonathan Tan
@ 2017-10-25 19:29     ` Jeff Hostetler
  0 siblings, 0 replies; 37+ messages in thread
From: Jeff Hostetler @ 2017-10-25 19:29 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Git mailing list, Junio C Hamano, Jeff King, Jeff Hostetler



On 10/25/2017 12:24 AM, Jonathan Tan wrote:
> On Tue, Oct 24, 2017 at 11:53 AM, Jeff Hostetler <git@jeffhostetler.com> wrote:
>> +void traverse_commit_list_filtered(
>> +       struct list_objects_filter_options *filter_options,
>> +       struct rev_info *revs,
>> +       show_commit_fn show_commit,
>> +       show_object_fn show_object,
>> +       list_objects_filter_map_foreach_cb print_omitted_object,
>> +       void *show_data);
> 
> So the function call chain, if we wanted a filtered traversal, is:
> traverse_commit_list_filtered -> traverse_commit_list__sparse_path
> (and friends, and each algorithm is in its own file) ->
> traverse_commit_list_worker
> 
> This makes the implementation of each algorithm more easily understood
> (since they are all in their own files), but also increases the number
> of global functions and code files. I personally would combine the
> traverse_commit_list__* functions into one file
> (list-objects-filtered.c), make them static, and also put
> traverse_commit_list_filtered in there, but I understand that other
> people in the Git project may differ on this.
> 

I'll do a round of refactoring to include your suggestion of
a default null filter.  Then with that see what collapsing this
looks like.

Thanks,
Jeff

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 10/13] rev-list: add list-objects filtering support
  2017-10-25  4:41   ` Jonathan Tan
@ 2017-10-25 19:37     ` Jeff Hostetler
  0 siblings, 0 replies; 37+ messages in thread
From: Jeff Hostetler @ 2017-10-25 19:37 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Git mailing list, Junio C Hamano, Jeff King, Jeff Hostetler



On 10/25/2017 12:41 AM, Jonathan Tan wrote:
> On Tue, Oct 24, 2017 at 11:53 AM, Jeff Hostetler <git@jeffhostetler.com> wrote:
>>   static void finish_object(struct object *obj, const char *name, void *cb_data)
>>   {
>>          struct rev_list_info *info = cb_data;
>> -       if (obj->type == OBJ_BLOB && !has_object_file(&obj->oid))
>> +       if (obj->type == OBJ_BLOB && !has_object_file(&obj->oid)) {
>> +               if (arg_print_missing) {
>> +                       list_objects_filter_map_insert(
>> +                               &missing_objects, &obj->oid, name, obj->type);
>> +                       return;
>> +               }
>> +
>> +               /*
>> +                * Relax consistency checks when we expect missing
>> +                * objects because of partial-clone or a previous
>> +                * partial-fetch.
>> +                *
>> +                * Note that this is independent of any filtering that
>> +                * we are doing in this run.
>> +                */
>> +               if (is_partial_clone_registered())
>> +                       return;
>> +
>>                  die("missing blob object '%s'", oid_to_hex(&obj->oid));
> 
> I'm fine with arg_print_missing suppressing lazy fetching (when I
> rebase my patches on this, I'll have to ensure that fetch_if_missing
> is set to 0 if arg_print_missing is true), but I think that the
> behavior when arg_print_missing is false should be the opposite - we
> should let has_object_file() perform the lazy fetching, and die if it
> returns false (that is, if the fetching failed).

Right. This is a point where our different approaches need
to come together.  My "is_partial_clone_registered" is essentially
a placeholder for your lazy fetching.  so we can delete this call
when your changes are in.  Basically, you set:
	fetch_if_missing = !arg_print_missing
at the top.

> 
>> +       }
>>          if (info->revs->verify_objects && !obj->parsed && obj->type != OBJ_COMMIT)
>>                  parse_object(&obj->oid);
>>   }

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 00/13] WIP Partial clone part 1: object filtering
  2017-10-25  6:46   ` Jonathan Tan
  2017-10-25 15:39     ` Jeff Hostetler
@ 2017-10-26  2:01     ` Junio C Hamano
  2017-10-30 22:27     ` Jonathan Nieder
  2 siblings, 0 replies; 37+ messages in thread
From: Junio C Hamano @ 2017-10-26  2:01 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: Jeff Hostetler, Git mailing list, Jeff King, Jeff Hostetler

Jonathan Tan <jonathantanmy@google.com> writes:

> On Tue, Oct 24, 2017 at 10:00 PM, Junio C Hamano <gitster@pobox.com> wrote:
>> OK, thanks for working well together.  So does this (1) build on
>> Jonathan's fsck-squelching series, or (2) ignores that and builds
>> filtering first, potentially leaving the codebase to a broken state
>> where it can create fsck-unclean repository until Jonathan's series
>> is rebased on top of this, or (3) something else?  [*1*]
>
> Excluding the partialclone patch (patch 9), I think that the answer is
> (2), but I don't think that it leaves the codebase in a broken state.
> In particular, none of the code modifies the repo, so it can't create
> a fsck-unclean one.

OK.  It is not dangerous enough to matter until we start using the
updated features in repack->rev-list|pack-objects ;-)  As I said, I
was mostly interested in learning what the big-picture direction was
and also seeing you two were more-or-less in agreement.

> The above is relevant only if we can exclude the partialclone patch,
> but I think that we can and we should, as I wrote in my reply to Jeff
> Hostetler [1].

OK.

> As for how this patch set (excluding the partialclone patch) interacts
> with my fsck series, they are relatively independent, as far as I can
> tell. I'll rebase my fsck, gc, and lazy object fetch patches (but not
> the fetch and clone parts, which we plan to instead adapt from Jeff
> Hostetler's patches, as far as I know) on top of these and resend
> those out once discussion on this has settled.

OK.  Thanks, I think tht is a reasonable way forward.

> [1] https://public-inbox.org/git/CAGf8dg+8AR=XfSV92ODAtKTNjBnD1+oVZp9rs4Y4Otz_eZyTfg@mail.gmail.com/
>
>> I also saw a patch marked as "this is from Jonathan's earlier work",
>> taking the authorship (which to me implies that the changes were
>> extensive enough), so I am a bit at loss envisioning how this piece
>> fits in the bigger picture together with the other piece.
>
> The patch you mentioned is the partialclone patch, which I think can
> be considered separately from the rest (as I said above).

Good, that lets us sidestep Jeff's question about "how should the
credits for this change attributed?", too.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 00/13] WIP Partial clone part 1: object filtering
  2017-10-25 15:39     ` Jeff Hostetler
@ 2017-10-26  2:09       ` Junio C Hamano
  0 siblings, 0 replies; 37+ messages in thread
From: Junio C Hamano @ 2017-10-26  2:09 UTC (permalink / raw)
  To: Jeff Hostetler; +Cc: Jonathan Tan, Git mailing list, Jeff King, Jeff Hostetler

Jeff Hostetler <git@jeffhostetler.com> writes:

> A question of mailing-list etiquette: in patch 9, I took Jonathan's
> ideas for adding the "extensions.partialclone" setting and extended it
> with some helper functions.  His change was part of a larger change
> with other code (fsck, IIRC) that I wasn't ready for.  What is the
> preferred way to give credit for something like this?

I think the note you left in the proposed log message

    This patch is part of a patch originally authored by:
    Jonathan Tan <jonathantanmy@google.com>

was a bit misleading.  The phrasing makes it sound as if it is
more-or-less verbatim copy (either of the whole thing or just a
subset) of Jonathan's patch, in which case, keeping the authorship
intact, i.e.

	From: Jonathan Tan <jonathantanmy@google.com>

	... log message taken from the original, with additional
        ... note to describe any adjustment you did

	Signed-off-by: Jonathan Tan <jonathantanmy@google.com>
	Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>

would have been more appropriate.  But if you just were inspired by
the idea in his patch and wrote a one that is similar to but
different from it that suits the need of your series better, then a
note left in the log that instead does s/is part of/was inspired by/
would have been perfectly fine.

Thanks.




^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 01/13] dir: allow exclusions from blob in addition to file
  2017-10-25 14:54     ` Jeff Hostetler
@ 2017-10-26  3:47       ` Junio C Hamano
  2017-10-26 18:11         ` Jeff Hostetler
  0 siblings, 1 reply; 37+ messages in thread
From: Junio C Hamano @ 2017-10-26  3:47 UTC (permalink / raw)
  To: Jeff Hostetler; +Cc: git, peff, jonathantanmy, Jeff Hostetler

Jeff Hostetler <git@jeffhostetler.com> writes:

> The existing code handles use cases where you want to read the
> exclusion list from a pathname in the worktree -- or from blob
> named in the index when the pathname is not populated (presumably
> because of the skip-worktree bit).
>
> I was wanting to add a more general case (and perhaps my commit
> message should be improved).  I want to be able to read it from
> a blob not necessarily associated with the current commit or
> not necessarily available on the local client, but yet known to
> exist.  

Oh, I understand the above two paragraphs perfectly well, and I
agree with you that such a helper to read from an arbitrary blob is
a worthy thing to have.  I was merely commenting on the fact that
such a helper that is meant to be able to handle more general cases
is not used to help the more specific case that we already have,
which was a bit curious.

I guess the reason why it is not done is (besides expediency)
because the model the new helper operates in would not fit well with
the existing logic flow, where everything is loaded into core
(either from the filesystem or from a blob) and then a common code
parses and registers; the helper wants to do the reading (only) from
the blob, the parsing and the registration all by itself, so there
is not much that can be shared even if the existing code wanted to
reuse what the helper offers.

The new helper mimicks the read_skip_worktree_file_from_index()
codepath to massage the data it reads from the blob to buf[] but not
really (e.g. even though it copies and pastes a lot, it forgets to
call skip_utf8_bom(), for example).  We may still want to see if we
can share more so that we do not have to worry about these tiny
differences between codepaths.

> With my "add_excludes_from_blob_to_list()", we can request a
> blob-ish expression, such as "master:enlistments/foo".  In my
> later commits associated with clone and fetch, we can use this
> mechanism to let the client ask the server to filter using the
> blob associated with this blob-ish.  If the client has the blob
> (such as during a later fetch) and can resolve it, then it can
> and send the server the OID, but it can also send the blob-ish
> to the server and let it resolve it.

Security-minded people may want to keep an eye or two open for these
later patches---extended SHA-1 expressions is a new attack surface
we would want to carefully polish and protect.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 02/13] list-objects-filter-map: extend oidmap to collect omitted objects
  2017-10-25 19:22     ` Jeff Hostetler
@ 2017-10-26  4:12       ` Junio C Hamano
  0 siblings, 0 replies; 37+ messages in thread
From: Junio C Hamano @ 2017-10-26  4:12 UTC (permalink / raw)
  To: Jeff Hostetler; +Cc: git, peff, jonathantanmy, Jeff Hostetler

Jeff Hostetler <git@jeffhostetler.com> writes:

> Sorry, I meant a later commit in this patch series.  It is used by
> commits 4, 5, 6, and 10 to actually do the filtering and collect a
> list of omitted or missing objects.

I know you meant "later commits in the series" ;-).  

It does not change the fact that readers of 02/13 haven't seen them
yet to understand patch 02/13, if the changes that drove the design
of this step is in the same series or if they are not yet posted.

> I think of a "set" as a member? or not-member? class.
> I think of a "map" as a member? or not-member? class but where each
> member also has a value.  Sometimes map lookups just want to know
> membership and sometimes the lookup wants the value.
>
> Granted, having the key and value data stuffed into the same entry
> (from hashmap's point of view, rather than a key having a pointer
> to a value) does kind of blur the line, but I was thinking about
> a map here.  (And I was building on oidmap which builds on hashmap,
> so it seemed appropriate.)

My question was mostly about "if this is a map, then a caller that
queries the map with an oid does so because it wants to know the
data associated to the oid; if this is just a set, it is mostly
interested in the membership" and "I cannot quite tell which was
meant without the caller".  

It seems that some callers do care about the "path" name from your
response above, so calling this "map" sounds more appropriate.

The answer "it can be used to speed up 'is this path excluded?'
check" is a bit worrisome, though.  A blob can appear at more than
one path, and unless all the appearances of it are in an excluded
path, omitting the blob from the repository would lead to an aborted
"rev-list --objects" run, and this "map" can record at most one path
per each object; we need to wait until seeing the optimization code
to actually see how effectively this data helps optimization and
comment on the code ;-)

>>> +	len = ((pathname && *pathname) ? strlen(pathname) : 0);
>>> +	size = (offsetof(struct list_objects_filter_map_entry, pathname) + len + 1);
>>> +	e = xcalloc(1, size);
>>> +
>>> +	oidcpy(&e->entry.oid, oid);
>>> +	e->type = type;
>>> +	if (pathname && *pathname)
>>> +		strcpy(e->pathname, pathname);
>>> +
>>> +	oidmap_put(map, e);
>>> +	return 0;
>>> +}
>>
>> The return value from the function needs to be documented in the
>> header to help callers.  It is not apparent why "we did already have
>> one" and "we now newly added" is interesting to the callers, for
>> example.  An obvious alternative implementation of this function
>> would return the pointer to an entry that records the object id
>> (i.e. either the one that was already there, or the one we created
>> because we saw this object for the first time), so that the caller
>> can do something interesting to it---again, because the reason why
>> we want this "filter map" is not explained at this stage, it is hard
>> to tell what that "sometehing interesting" would be.
>
> good point.  thanks.

I am more confused by the response ;-) But as we established that
this is a map (not a set that borrows the implementation of map),
where the data recorded in 'e' is quite useful to the caller, it
probably makes sense to make 'e' available to the caller?  It is
still unclear if the caller finds "it is the first time I saw the
object you gave me" vs "I've seen that object before already"
useful.

>>> +	for (k = 0; k < nr; k++)
>>> +		cb(k, nr, array[k], cb_data);
>>
>> Also it is not clear if you wanted to expose the type of the
>> entry to the callback function.
>
> The thought was that we would sort the OIDs so that things
> like rev-list could print the omitted/missing objects in OID
> order.  Not critical that we do it here, but I thought it would
> help callers.

I can foresee some callers would want sorted, while others do not.
I was primarily wondering why "my_cmp" is not a parameter that can
be NULL (in which case we do not sort at all).

>> An obvious alternative
>>
>> 	fn(&array[k].entry.oid, cb_data);
>>
>> would allow you to keep the type of map-entry private to the map,
>> and also the callback does not need to know about k or nr.
>> ...
> I included the {k, nr} so that the callback could dump header/trailer
> information when reporting the results or pre-allocate an array.
> I'll look at refactoring this -- I never quite liked how it turned
> out anyway -- especially with the oidmap simplifications.

And as we established that this is a map, where the data associated
with each oid is interesting to the caller, we do not want to hide
the type of array[] element by passing only &array[k].entry.oid, I
guess?

Thanks.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 01/13] dir: allow exclusions from blob in addition to file
  2017-10-26  3:47       ` Junio C Hamano
@ 2017-10-26 18:11         ` Jeff Hostetler
  0 siblings, 0 replies; 37+ messages in thread
From: Jeff Hostetler @ 2017-10-26 18:11 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, peff, jonathantanmy, Jeff Hostetler



On 10/25/2017 11:47 PM, Junio C Hamano wrote:
> Jeff Hostetler <git@jeffhostetler.com> writes:
> 
>> The existing code handles use cases where you want to read the
>> exclusion list from a pathname in the worktree -- or from blob
>> named in the index when the pathname is not populated (presumably
>> because of the skip-worktree bit).
>>
>> I was wanting to add a more general case (and perhaps my commit
>> message should be improved).  I want to be able to read it from
>> a blob not necessarily associated with the current commit or
>> not necessarily available on the local client, but yet known to
>> exist.
> 
> Oh, I understand the above two paragraphs perfectly well, and I
> agree with you that such a helper to read from an arbitrary blob is
> a worthy thing to have.  I was merely commenting on the fact that
> such a helper that is meant to be able to handle more general cases
> is not used to help the more specific case that we already have,
> which was a bit curious.
> 
> I guess the reason why it is not done is (besides expediency)
> because the model the new helper operates in would not fit well with
> the existing logic flow, where everything is loaded into core
> (either from the filesystem or from a blob) and then a common code
> parses and registers; the helper wants to do the reading (only) from
> the blob, the parsing and the registration all by itself, so there
> is not much that can be shared even if the existing code wanted to
> reuse what the helper offers.
> 
> The new helper mimicks the read_skip_worktree_file_from_index()
> codepath to massage the data it reads from the blob to buf[] but not
> really (e.g. even though it copies and pastes a lot, it forgets to
> call skip_utf8_bom(), for example).  We may still want to see if we
> can share more so that we do not have to worry about these tiny
> differences between codepaths.

I'm going to extract this commit, refactor it to try to share
more code with the existing read_skip_worktree_file_from_index()
and submit it as a separate patch series so that we can discuss
it in isolation without the rest of the partial-clone code getting
in the way.

The call to skip_utf8_bom() was captured in the new
add_excludes_from_buffer() routine that both add_excludes()
and my new add_excludes_from_blob_to_list() call.

> 
>> With my "add_excludes_from_blob_to_list()", we can request a
>> blob-ish expression, such as "master:enlistments/foo".  In my
>> later commits associated with clone and fetch, we can use this
>> mechanism to let the client ask the server to filter using the
>> blob associated with this blob-ish.  If the client has the blob
>> (such as during a later fetch) and can resolve it, then it can
>> and send the server the OID, but it can also send the blob-ish
>> to the server and let it resolve it.
> 
> Security-minded people may want to keep an eye or two open for these
> later patches---extended SHA-1 expressions is a new attack surface
> we would want to carefully polish and protect.
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 00/13] WIP Partial clone part 1: object filtering
  2017-10-25  6:46   ` Jonathan Tan
  2017-10-25 15:39     ` Jeff Hostetler
  2017-10-26  2:01     ` Junio C Hamano
@ 2017-10-30 22:27     ` Jonathan Nieder
  2 siblings, 0 replies; 37+ messages in thread
From: Jonathan Nieder @ 2017-10-30 22:27 UTC (permalink / raw)
  To: Jonathan Tan
  Cc: Junio C Hamano, Jeff Hostetler, Git mailing list, Jeff King,
	Jeff Hostetler

Hi,

Jonathan Tan wrote:

> As for how this patch set (excluding the partialclone patch) interacts
> with my fsck series, they are relatively independent, as far as I can
> tell. I'll rebase my fsck, gc, and lazy object fetch patches (but not
> the fetch and clone parts, which we plan to instead adapt from Jeff
> Hostetler's patches, as far as I know) on top of these and resend
> those out once discussion on this has settled.

Selfishly, I'll make a request here.  The only part of the series that
overlaps is the max-blob-bytes part, right?  Would you mind re-sending
the remainder of the series so it can go through the "next" ->
"master" -> etc process in the usual way?

My selfishness comes in because this would reduce the set of patches I
have to apply locally to just the max-blob-bytes part.  If I
understood correctly, the rest of the series was something everyone
agreed about, so there's no reason not to pursue including it in
"next".

I'd have the same request for this object filtering series, but I
think it's already happening: the patches in this thread so far do not
allow omitting some blobs from the local object store, so they should
be able to go through the "next" -> "master" -> etc process as well
without having to wait for the fsck patches.

Thanks,
Jonathan

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2017-10-30 22:27 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-24 18:53 [PATCH 00/13] WIP Partial clone part 1: object filtering Jeff Hostetler
2017-10-24 18:53 ` [PATCH 01/13] dir: allow exclusions from blob in addition to file Jeff Hostetler
2017-10-25  4:05   ` Eric Sunshine
2017-10-25  6:43   ` Junio C Hamano
2017-10-25 14:54     ` Jeff Hostetler
2017-10-26  3:47       ` Junio C Hamano
2017-10-26 18:11         ` Jeff Hostetler
2017-10-24 18:53 ` [PATCH 02/13] list-objects-filter-map: extend oidmap to collect omitted objects Jeff Hostetler
2017-10-25  7:10   ` Junio C Hamano
2017-10-25 19:22     ` Jeff Hostetler
2017-10-26  4:12       ` Junio C Hamano
2017-10-24 18:53 ` [PATCH 03/13] list-objects: filter objects in traverse_commit_list Jeff Hostetler
2017-10-25  4:05   ` Jonathan Tan
2017-10-25 19:25     ` Jeff Hostetler
2017-10-24 18:53 ` [PATCH 04/13] list-objects-filter-blobs-none: add filter to omit all blobs Jeff Hostetler
2017-10-24 18:53 ` [PATCH 05/13] list-objects-filter-blobs-limit: add large blob filtering Jeff Hostetler
2017-10-24 18:53 ` [PATCH 06/13] list-objects-filter-sparse: add sparse filter Jeff Hostetler
2017-10-24 18:53 ` [PATCH 07/13] list-objects-filter-options: common argument parsing Jeff Hostetler
2017-10-25  4:14   ` Jonathan Tan
2017-10-25 19:28     ` Jeff Hostetler
2017-10-24 18:53 ` [PATCH 08/13] list-objects: add traverse_commit_list_filtered method Jeff Hostetler
2017-10-25  4:24   ` Jonathan Tan
2017-10-25 19:29     ` Jeff Hostetler
2017-10-24 18:53 ` [PATCH 09/13] extension.partialclone: introduce partial clone extension Jeff Hostetler
2017-10-24 18:53 ` [PATCH 10/13] rev-list: add list-objects filtering support Jeff Hostetler
2017-10-25  4:41   ` Jonathan Tan
2017-10-25 19:37     ` Jeff Hostetler
2017-10-24 18:53 ` [PATCH 11/13] t6112: rev-list object filtering test Jeff Hostetler
2017-10-24 18:53 ` [PATCH 12/13] pack-objects: add list-objects filtering Jeff Hostetler
2017-10-24 18:53 ` [PATCH 13/13] t5317: pack-objects object filtering test Jeff Hostetler
2017-10-25  4:57 ` [PATCH 00/13] WIP Partial clone part 1: object filtering Jonathan Tan
2017-10-25  5:00 ` Junio C Hamano
2017-10-25  6:46   ` Jonathan Tan
2017-10-25 15:39     ` Jeff Hostetler
2017-10-26  2:09       ` Junio C Hamano
2017-10-26  2:01     ` Junio C Hamano
2017-10-30 22:27     ` Jonathan Nieder

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).