[PATCH 00/13] RFC object filtering for parital clone

git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed

* [PATCH 00/13] RFC object filtering for parital clone
@ 2017-09-22 20:26 Jeff Hostetler
  2017-09-22 20:26 ` [PATCH 01/13] dir: refactor add_excludes() Jeff Hostetler
                   ` (4 more replies)
  0 siblings, 5 replies; 16+ messages in thread
From: Jeff Hostetler @ 2017-09-22 20:26 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, jonathantanmy, jeffhost

From: Jeff Hostetler <jeffhost@microsoft.com>

This patch series contains WIP code demonstrating object (blob) filtering
in rev-list and pack-objects using a common filtering API in
list-objects and traverse-commit-list that allows both commands
to perform the same type of filter operations.  And serve as the
basis of partial-clone and partial-fetch.

This draft contains filters to:
() omit all blobs
() omit blobs larger than some size
() omit blobs using a sparse-checkout specification

In addition to specifying the filter criteria, the rev-list command
was updated to include options to:
() print a list of the omitted objects (due to the current filtering
   criteria)
() print a list of missing objects (probably from a prior partial
   clone/fetch).

This latter print option can be used with or without a new filter
criteria allowing it to be used with a pre-checkout bulk pre-fetch
command.

For example, if blobs were omitted during the clone or a fetch, the
client can do:

   git rev-list --quiet --objects --filter-print-missing NEWBRANCH

and get a list of just the objects that are required to checkout
NEWBRANCH.

Or if a sparse-checkout is in effect, the client can specify the
same criteria to look for just the missing blobs needed to do the
sparse-checkout:

   git rev-list --quiet --objects --filter-print-missing \
       --filter-use-path=./git/info/sparse-checkout NEWBRANCH

It does not matter why a blob is missing; that is, what filter
criteria was used during the clone or fetch.  All that matters
is the blob is missing and is now needed.

These commands output a list of missing blobs that can be fed
into a bulk fetch object request.  The goal here is to minimize
the need for dynamic object fetch mechanisms currently being
discussed.  (We cannot eliminate the need for dynamic fetching,
but we can use this to precompute/prefetch in bulk.)

Pack-objects was updated to allow the server to build incomplete
packfiles without unwanted blobs.

This is the first step to support partial-clone and -fetch. I've
omitted from this patch series corresponding changes to fetch-pack,
upload-pack, index-pack, verify-pack, fsck, gc, and the git protocol.
I can make these available if there is interest.  I omit them from
this RFC to not distract from the basic filtering ideas.

It also does not address the promisor/promised ideas currently
being discussed [2,3].  These should be considered independently.

The code in this patch series can be seen here [1].

[1] https://github.com/jeffhostetler/git/pull/3
[2] https://public-inbox.org/git/xmqq8thbqlqf.fsf@gitster.mtv.corp.google.com/t/
[3] https://github.com/jonathantanmy/git/commits/partialclone2

Jeff Hostetler (13):
  dir: refactor add_excludes()
  oidset2: create oidset subclass with object length and pathname
  list-objects: filter objects in traverse_commit_list
  list-objects-filter-all: add filter to omit all blobs
  list-objects-filter-large: add large blob filter to list-objects
  list-objects-filter-sparse: add sparse-checkout based filter
  object-filter: common declarations for object filtering
  list-objects: add traverse_commit_list_filtered method
  rev-list: add object filtering support
  rev-list: add filtering help text
  t6112: rev-list object filtering test
  pack-objects: add object filtering support
  pack-objects: add filtering help text

 Documentation/git-pack-objects.txt  |  17 +++
 Documentation/git-rev-list.txt      |   9 +-
 Documentation/rev-list-options.txt  |  32 +++++
 Makefile                            |   5 +
 builtin/pack-objects.c              |  24 +++-
 builtin/rev-list.c                  |  73 +++++++++-
 dir.c                               |  53 ++++++-
 dir.h                               |   4 +
 list-objects-filter-all.c           |  85 ++++++++++++
 list-objects-filter-all.h           |  18 +++
 list-objects-filter-large.c         | 108 +++++++++++++++
 list-objects-filter-large.h         |  18 +++
 list-objects-filter-sparse.c        | 221 +++++++++++++++++++++++++++++
 list-objects-filter-sparse.h        |  30 ++++
 list-objects.c                      | 100 +++++++++++---
 list-objects.h                      |  41 ++++++
 object-filter.c                     | 269 ++++++++++++++++++++++++++++++++++++
 object-filter.h                     | 173 +++++++++++++++++++++++
 oidset2.c                           | 104 ++++++++++++++
 oidset2.h                           |  58 ++++++++
 t/t6112-rev-list-filters-objects.sh | 237 +++++++++++++++++++++++++++++++
 21 files changed, 1657 insertions(+), 22 deletions(-)
 create mode 100644 list-objects-filter-all.c
 create mode 100644 list-objects-filter-all.h
 create mode 100644 list-objects-filter-large.c
 create mode 100644 list-objects-filter-large.h
 create mode 100644 list-objects-filter-sparse.c
 create mode 100644 list-objects-filter-sparse.h
 create mode 100644 object-filter.c
 create mode 100644 object-filter.h
 create mode 100644 oidset2.c
 create mode 100644 oidset2.h
 create mode 100755 t/t6112-rev-list-filters-objects.sh

-- 
2.9.3

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 01/13] dir: refactor add_excludes()
  2017-09-22 20:26 [PATCH 00/13] RFC object filtering for parital clone Jeff Hostetler
@ 2017-09-22 20:26 ` Jeff Hostetler
  2017-09-22 20:26 ` [PATCH 02/13] oidset2: create oidset subclass with object length and pathname Jeff Hostetler
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: Jeff Hostetler @ 2017-09-22 20:26 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, jonathantanmy, jeffhost

From: Jeff Hostetler <jeffhost@microsoft.com>

Refactor add_excludes() to separate the reading of the
exclude file into a buffer and the parsing of the buffer
into exclude_list items.

Add add_excludes_from_blob_to_list() to allow an exclude
file be specified with an OID.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 dir.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 dir.h |  4 ++++
 2 files changed, 55 insertions(+), 2 deletions(-)

diff --git a/dir.c b/dir.c
index ae6f5c9..3aeae06 100644
--- a/dir.c
+++ b/dir.c
@@ -739,6 +739,11 @@ static void invalidate_directory(struct untracked_cache *uc,
 		dir->dirs[i]->recurse = 0;
 }
 
+static int add_excludes_from_buffer(
+	char *buf, size_t size,
+	const char *base, int baselen,
+	struct exclude_list *el);
+
 /*
  * Given a file with name "fname", read it (either from disk, or from
  * an index if 'istate' is non-null), parse it and store the
@@ -754,9 +759,9 @@ static int add_excludes(const char *fname, const char *base, int baselen,
 			struct sha1_stat *sha1_stat)
 {
 	struct stat st;
-	int fd, i, lineno = 1;
+	int fd;
 	size_t size = 0;
-	char *buf, *entry;
+	char *buf;
 
 	fd = open(fname, O_RDONLY);
 	if (fd < 0 || fstat(fd, &st) < 0) {
@@ -813,6 +818,18 @@ static int add_excludes(const char *fname, const char *base, int baselen,
 		}
 	}
 
+	add_excludes_from_buffer(buf, size, base, baselen, el);
+	return 0;
+}
+
+static int add_excludes_from_buffer(
+	char *buf, size_t size,
+	const char *base, int baselen,
+	struct exclude_list *el)
+{
+	int i, lineno = 1;
+	char *entry;
+
 	el->filebuf = buf;
 
 	if (skip_utf8_bom(&buf, size))
@@ -841,6 +858,38 @@ int add_excludes_from_file_to_list(const char *fname, const char *base,
 	return add_excludes(fname, base, baselen, el, istate, NULL);
 }
 
+int add_excludes_from_blob_to_list(
+	struct object_id *oid,
+	const char *base, int baselen,
+	struct exclude_list *el)
+{
+	char *buf;
+	unsigned long size;
+	enum object_type type;
+
+	buf = read_sha1_file(oid->hash, &type, &size);
+	if (!buf)
+		return -1;
+
+	if (type != OBJ_BLOB) {
+		free(buf);
+		return -1;
+	}
+
+	if (size == 0) {
+		free(buf);
+		return 0;
+	}
+
+	if (buf[size - 1] != '\n') {
+		buf = xrealloc(buf, st_add(size, 1));
+		buf[size++] = '\n';
+	}
+
+	add_excludes_from_buffer(buf, size, base, baselen, el);
+	return 0;
+}
+
 struct exclude_list *add_exclude_list(struct dir_struct *dir,
 				      int group_type, const char *src)
 {
diff --git a/dir.h b/dir.h
index e371705..242de63 100644
--- a/dir.h
+++ b/dir.h
@@ -256,6 +256,10 @@ extern struct exclude_list *add_exclude_list(struct dir_struct *dir,
 extern int add_excludes_from_file_to_list(const char *fname, const char *base, int baselen,
 					  struct exclude_list *el, struct  index_state *istate);
 extern void add_excludes_from_file(struct dir_struct *, const char *fname);
+extern int add_excludes_from_blob_to_list(
+	struct object_id *oid,
+	const char *base, int baselen,
+	struct exclude_list *el);
 extern void parse_exclude_pattern(const char **string, int *patternlen, unsigned *flags, int *nowildcardlen);
 extern void add_exclude(const char *string, const char *base,
 			int baselen, struct exclude_list *el, int srcpos);
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 02/13] oidset2: create oidset subclass with object length and pathname
  2017-09-22 20:26 [PATCH 00/13] RFC object filtering for parital clone Jeff Hostetler
  2017-09-22 20:26 ` [PATCH 01/13] dir: refactor add_excludes() Jeff Hostetler
@ 2017-09-22 20:26 ` Jeff Hostetler
  2017-09-22 20:42   ` Brandon Williams
  2017-09-26 22:20   ` Jonathan Tan
  2017-09-22 20:26 ` [PATCH 03/13] list-objects: filter objects in traverse_commit_list Jeff Hostetler
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 16+ messages in thread
From: Jeff Hostetler @ 2017-09-22 20:26 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, jonathantanmy, jeffhost

From: Jeff Hostetler <jeffhost@microsoft.com>

Create subclass of oidset where each entry has a
field to store the length of the object's content
and an optional pathname.

This will be used in a future commit to build a
manifest of omitted objects in a partial/narrow
clone/fetch.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 Makefile  |   1 +
 oidset2.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 oidset2.h |  58 +++++++++++++++++++++++++++++++++++
 3 files changed, 163 insertions(+)
 create mode 100644 oidset2.c
 create mode 100644 oidset2.h

diff --git a/Makefile b/Makefile
index 461c845..4e0cc39 100644
--- a/Makefile
+++ b/Makefile
@@ -816,6 +816,7 @@ LIB_OBJS += notes-merge.o
 LIB_OBJS += notes-utils.o
 LIB_OBJS += object.o
 LIB_OBJS += oidset.o
+LIB_OBJS += oidset2.o
 LIB_OBJS += pack-bitmap.o
 LIB_OBJS += pack-bitmap-write.o
 LIB_OBJS += pack-check.o
diff --git a/oidset2.c b/oidset2.c
new file mode 100644
index 0000000..43bc9ee
--- /dev/null
+++ b/oidset2.c
@@ -0,0 +1,104 @@
+#include "cache.h"
+#include "oidset2.h"
+
+static int oidset2_hashcmp(const void *unused_cmp_data,
+			   const void *va, const void *vb,
+			   const void *vkey)
+{
+	const struct oidset2_entry *a = va, *b = vb;
+	const struct object_id *key = vkey;
+	return oidcmp(&a->oid, key ? key : &b->oid);
+}
+
+struct oidset2_entry *oidset2_get(const struct oidset2 *set, const struct object_id *oid)
+{
+	struct hashmap_entry key;
+	struct oidset2_entry *value;
+
+	if (!set->map.cmpfn)
+		return NULL;
+
+	hashmap_entry_init(&key, sha1hash(oid->hash));
+	value = hashmap_get(&set->map, &key, oid);
+
+	return value;
+}
+
+int oidset2_contains(const struct oidset2 *set, const struct object_id *oid)
+{
+	return !!oidset2_get(set, oid);
+}
+
+int oidset2_insert(struct oidset2 *set, const struct object_id *oid,
+		   enum object_type type, int64_t object_length,
+		   const char *pathname)
+{
+	struct oidset2_entry *entry;
+
+	if (!set->map.cmpfn)
+		hashmap_init(&set->map, oidset2_hashcmp, NULL, 0);
+
+	if (oidset2_contains(set, oid))
+		return 1;
+
+	entry = xcalloc(1, sizeof(*entry));
+	hashmap_entry_init(&entry->hash, sha1hash(oid->hash));
+	oidcpy(&entry->oid, oid);
+
+	entry->type = type;
+	entry->object_length = object_length;
+	if (pathname)
+	    entry->pathname = strdup(pathname);
+
+	hashmap_add(&set->map, entry);
+	return 0;
+}
+
+void oidset2_remove(struct oidset2 *set, const struct object_id *oid)
+{
+	struct hashmap_entry key;
+	struct oidset2_entry *e;
+
+	hashmap_entry_init(&key, sha1hash(oid->hash));
+	e = hashmap_remove(&set->map, &key, oid);
+
+	free(e->pathname);
+	free(e);
+}
+
+void oidset2_clear(struct oidset2 *set)
+{
+	hashmap_free(&set->map, 1);
+}
+
+static int oidset2_cmp(const void *a, const void *b)
+{
+	const struct oidset2_entry *ae = *((const struct oidset2_entry **)a);
+	const struct oidset2_entry *be = *((const struct oidset2_entry **)b);
+
+	return oidcmp(&ae->oid, &be->oid);
+}
+
+void oidset2_foreach(struct oidset2 *set, oidset2_foreach_cb cb, void *cb_data)
+{
+	struct hashmap_iter iter;
+	struct oidset2_entry **array;
+	struct oidset2_entry *e;
+	int j, k;
+
+	array = xcalloc(set->map.size, sizeof(*e));
+
+	hashmap_iter_init(&set->map, &iter);
+	k = 0;
+	while ((e = hashmap_iter_next(&iter)))
+		array[k++] = e;
+
+	QSORT(array, k, oidset2_cmp);
+
+	for (j = 0; j < k; j++) {
+		e = array[j];
+		cb(j, k, e, cb_data);
+	}
+
+	free(array);
+}
diff --git a/oidset2.h b/oidset2.h
new file mode 100644
index 0000000..67d8a5a
--- /dev/null
+++ b/oidset2.h
@@ -0,0 +1,58 @@
+#ifndef OIDSET2_H
+#define OIDSET2_H
+
+/**
+ * oidset2 is a variant of oidset, but allows additional fields for each object.
+ */
+
+/**
+ * A single oidset2; should be zero-initialized (or use OIDSET2_INIT).
+ */
+struct oidset2 {
+	struct hashmap map;
+};
+
+#define OIDSET2_INIT { { NULL } }
+
+struct oidset2_entry {
+	struct hashmap_entry hash;
+	struct object_id oid;
+
+	enum object_type type;
+	int64_t object_length;	/* This is SIGNED. Use -1 when unknown. */
+	char *pathname;
+};
+
+struct oidset2_entry *oidset2_get(const struct oidset2 *set, const struct object_id *oid);
+
+/**
+ * Returns true iff `set` contains `oid`.
+ */
+int oidset2_contains(const struct oidset2 *set, const struct object_id *oid);
+
+/**
+ * Insert the oid into the set; a copy is made, so "oid" does not need
+ * to persist after this function is called.
+ *
+ * Returns 1 if the oid was already in the set, 0 otherwise. This can be used
+ * to perform an efficient check-and-add.
+ */
+int oidset2_insert(struct oidset2 *set, const struct object_id *oid,
+		   enum object_type type, int64_t object_length,
+		   const char *pathname);
+
+void oidset2_remove(struct oidset2 *set, const struct object_id *oid);
+
+typedef void (*oidset2_foreach_cb)(
+	int i, int i_limit,
+	struct oidset2_entry *e, void *cb_data);
+
+void oidset2_foreach(struct oidset2 *set, oidset2_foreach_cb cb, void *cb_data);
+
+/**
+ * Remove all entries from the oidset2, freeing any resources associated with
+ * it.
+ */
+void oidset2_clear(struct oidset2 *set);
+
+#endif /* OIDSET2_H */
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 03/13] list-objects: filter objects in traverse_commit_list
  2017-09-22 20:26 [PATCH 00/13] RFC object filtering for parital clone Jeff Hostetler
  2017-09-22 20:26 ` [PATCH 01/13] dir: refactor add_excludes() Jeff Hostetler
  2017-09-22 20:26 ` [PATCH 02/13] oidset2: create oidset subclass with object length and pathname Jeff Hostetler
@ 2017-09-22 20:26 ` Jeff Hostetler
  2017-09-26 22:31   ` Jonathan Tan
  2017-09-22 20:26 ` [PATCH 04/13] list-objects-filter-all: add filter to omit all blobs Jeff Hostetler
  2017-09-23  0:39 ` [PATCH 00/13] RFC object filtering for parital clone Jonathan Tan
  4 siblings, 1 reply; 16+ messages in thread
From: Jeff Hostetler @ 2017-09-22 20:26 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, jonathantanmy, jeffhost

From: Jeff Hostetler <jeffhost@microsoft.com>

Create traverse_commit_list_filtered() and add filtering
interface to allow certain objects to be omitted (not shown)
during a traversal.

Update traverse_commit_list() to be a wrapper for the above.

Filtering will be used in a future commit by rev-list and
pack-objects for narrow/partial clone/fetch to omit certain
blobs from the output.

traverse_bitmap_commit_list() does not work with filtering.
If a packfile bitmap is present, it will not be used.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 list-objects.c | 66 ++++++++++++++++++++++++++++++++++++++++++++--------------
 list-objects.h | 30 ++++++++++++++++++++++++++
 2 files changed, 80 insertions(+), 16 deletions(-)

diff --git a/list-objects.c b/list-objects.c
index b3931fa..3e86008 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -13,10 +13,13 @@ static void process_blob(struct rev_info *revs,
 			 show_object_fn show,
 			 struct strbuf *path,
 			 const char *name,
-			 void *cb_data)
+			 void *cb_data,
+			 filter_object_fn filter,
+			 void *filter_data)
 {
 	struct object *obj = &blob->object;
 	size_t pathlen;
+	list_objects_filter_result r = LOFR_MARK_SEEN | LOFR_SHOW;
 
 	if (!revs->blob_objects)
 		return;
@@ -24,11 +27,15 @@ static void process_blob(struct rev_info *revs,
 		die("bad blob object");
 	if (obj->flags & (UNINTERESTING | SEEN))
 		return;
-	obj->flags |= SEEN;
 
 	pathlen = path->len;
 	strbuf_addstr(path, name);
-	show(obj, path->buf, cb_data);
+	if (filter)
+		r = filter(LOFT_BLOB, obj, path->buf, &path->buf[pathlen], filter_data);
+	if (r & LOFR_MARK_SEEN)
+		obj->flags |= SEEN;
+	if (r & LOFR_SHOW)
+		show(obj, path->buf, cb_data);
 	strbuf_setlen(path, pathlen);
 }
 
@@ -69,7 +76,9 @@ static void process_tree(struct rev_info *revs,
 			 show_object_fn show,
 			 struct strbuf *base,
 			 const char *name,
-			 void *cb_data)
+			 void *cb_data,
+			 filter_object_fn filter,
+			 void *filter_data)
 {
 	struct object *obj = &tree->object;
 	struct tree_desc desc;
@@ -77,6 +86,7 @@ static void process_tree(struct rev_info *revs,
 	enum interesting match = revs->diffopt.pathspec.nr == 0 ?
 		all_entries_interesting: entry_not_interesting;
 	int baselen = base->len;
+	list_objects_filter_result r = LOFR_MARK_SEEN | LOFR_SHOW;
 
 	if (!revs->tree_objects)
 		return;
@@ -90,9 +100,13 @@ static void process_tree(struct rev_info *revs,
 		die("bad tree object %s", oid_to_hex(&obj->oid));
 	}
 
-	obj->flags |= SEEN;
 	strbuf_addstr(base, name);
-	show(obj, base->buf, cb_data);
+	if (filter)
+		r = filter(LOFT_BEGIN_TREE, obj, base->buf, &base->buf[baselen], filter_data);
+	if (r & LOFR_MARK_SEEN)
+		obj->flags |= SEEN;
+	if (r & LOFR_SHOW)
+		show(obj, base->buf, cb_data);
 	if (base->len)
 		strbuf_addch(base, '/');
 
@@ -112,7 +126,7 @@ static void process_tree(struct rev_info *revs,
 			process_tree(revs,
 				     lookup_tree(entry.oid),
 				     show, base, entry.path,
-				     cb_data);
+				     cb_data, filter, filter_data);
 		else if (S_ISGITLINK(entry.mode))
 			process_gitlink(revs, entry.oid->hash,
 					show, base, entry.path,
@@ -121,8 +135,17 @@ static void process_tree(struct rev_info *revs,
 			process_blob(revs,
 				     lookup_blob(entry.oid),
 				     show, base, entry.path,
-				     cb_data);
+				     cb_data, filter, filter_data);
 	}
+
+	if (filter) {
+		r = filter(LOFT_END_TREE, obj, base->buf, &base->buf[baselen], filter_data);
+		if (r & LOFR_MARK_SEEN)
+			obj->flags |= SEEN;
+		if (r & LOFR_SHOW)
+			show(obj, base->buf, cb_data);
+	}
+
 	strbuf_setlen(base, baselen);
 	free_tree_buffer(tree);
 }
@@ -183,10 +206,10 @@ static void add_pending_tree(struct rev_info *revs, struct tree *tree)
 	add_pending_object(revs, &tree->object, "");
 }
 
-void traverse_commit_list(struct rev_info *revs,
-			  show_commit_fn show_commit,
-			  show_object_fn show_object,
-			  void *data)
+void traverse_commit_list_worker(
+	struct rev_info *revs,
+	show_commit_fn show_commit, show_object_fn show_object, void *show_data,
+	filter_object_fn filter, void *filter_data)
 {
 	int i;
 	struct commit *commit;
@@ -200,7 +223,7 @@ void traverse_commit_list(struct rev_info *revs,
 		 */
 		if (commit->tree)
 			add_pending_tree(revs, commit->tree);
-		show_commit(commit, data);
+		show_commit(commit, show_data);
 	}
 	for (i = 0; i < revs->pending.nr; i++) {
 		struct object_array_entry *pending = revs->pending.objects + i;
@@ -211,19 +234,19 @@ void traverse_commit_list(struct rev_info *revs,
 			continue;
 		if (obj->type == OBJ_TAG) {
 			obj->flags |= SEEN;
-			show_object(obj, name, data);
+			show_object(obj, name, show_data);
 			continue;
 		}
 		if (!path)
 			path = "";
 		if (obj->type == OBJ_TREE) {
 			process_tree(revs, (struct tree *)obj, show_object,
-				     &base, path, data);
+				     &base, path, show_data, filter, filter_data);
 			continue;
 		}
 		if (obj->type == OBJ_BLOB) {
 			process_blob(revs, (struct blob *)obj, show_object,
-				     &base, path, data);
+				     &base, path, show_data, filter, filter_data);
 			continue;
 		}
 		die("unknown pending object %s (%s)",
@@ -232,3 +255,14 @@ void traverse_commit_list(struct rev_info *revs,
 	object_array_clear(&revs->pending);
 	strbuf_release(&base);
 }
+
+void traverse_commit_list(struct rev_info *revs,
+			  show_commit_fn show_commit,
+			  show_object_fn show_object,
+			  void *show_data)
+{
+	traverse_commit_list_worker(
+		revs,
+		show_commit, show_object, show_data,
+		NULL, NULL);
+}
diff --git a/list-objects.h b/list-objects.h
index 0cebf85..39fcbb5 100644
--- a/list-objects.h
+++ b/list-objects.h
@@ -8,4 +8,34 @@ void traverse_commit_list(struct rev_info *, show_commit_fn, show_object_fn, voi
 typedef void (*show_edge_fn)(struct commit *);
 void mark_edges_uninteresting(struct rev_info *, show_edge_fn);
 
+enum list_objects_filter_result {
+	LOFR_ZERO      = 0,
+	LOFR_MARK_SEEN = 1<<0,
+	LOFR_SHOW      = 1<<1,
+};
+
+/* See object.h and revision.h */
+#define FILTER_REVISIT (1<<25)
+
+enum list_objects_filter_type {
+	LOFT_BEGIN_TREE,
+	LOFT_END_TREE,
+	LOFT_BLOB
+};
+
+typedef enum list_objects_filter_result list_objects_filter_result;
+typedef enum list_objects_filter_type list_objects_filter_type;
+
+typedef list_objects_filter_result (*filter_object_fn)(
+	list_objects_filter_type filter_type,
+	struct object *obj,
+	const char *pathname,
+	const char *filename,
+	void *filter_data);
+
+void traverse_commit_list_worker(
+	struct rev_info *,
+	show_commit_fn, show_object_fn, void *show_data,
+	filter_object_fn filter, void *filter_data);
+
 #endif
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 04/13] list-objects-filter-all: add filter to omit all blobs
  2017-09-22 20:26 [PATCH 00/13] RFC object filtering for parital clone Jeff Hostetler
                   ` (2 preceding siblings ...)
  2017-09-22 20:26 ` [PATCH 03/13] list-objects: filter objects in traverse_commit_list Jeff Hostetler
@ 2017-09-22 20:26 ` Jeff Hostetler
  2017-09-23  0:39 ` [PATCH 00/13] RFC object filtering for parital clone Jonathan Tan
  4 siblings, 0 replies; 16+ messages in thread
From: Jeff Hostetler @ 2017-09-22 20:26 UTC (permalink / raw)
  To: git; +Cc: gitster, peff, jonathantanmy, jeffhost

From: Jeff Hostetler <jeffhost@microsoft.com>

Create a simple filter for traverse_commit_list_worker() to omit
all blobs from the result.

This filter will be used in a future commit by rev-list and pack-objects
to create a "commits and trees" result.  This is intended for partial
clone and fetch support.

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---
 Makefile                  |  1 +
 list-objects-filter-all.c | 85 +++++++++++++++++++++++++++++++++++++++++++++++
 list-objects-filter-all.h | 18 ++++++++++
 3 files changed, 104 insertions(+)
 create mode 100644 list-objects-filter-all.c
 create mode 100644 list-objects-filter-all.h

diff --git a/Makefile b/Makefile
index 4e0cc39..b98e3dc 100644
--- a/Makefile
+++ b/Makefile
@@ -798,6 +798,7 @@ LIB_OBJS += levenshtein.o
 LIB_OBJS += line-log.o
 LIB_OBJS += line-range.o
 LIB_OBJS += list-objects.o
+LIB_OBJS += list-objects-filter-all.o
 LIB_OBJS += ll-merge.o
 LIB_OBJS += lockfile.o
 LIB_OBJS += log-tree.o
diff --git a/list-objects-filter-all.c b/list-objects-filter-all.c
new file mode 100644
index 0000000..2faccb3
--- /dev/null
+++ b/list-objects-filter-all.c
@@ -0,0 +1,85 @@
+#include "cache.h"
+#include "dir.h"
+#include "tag.h"
+#include "commit.h"
+#include "tree.h"
+#include "blob.h"
+#include "diff.h"
+#include "tree-walk.h"
+#include "revision.h"
+#include "list-objects.h"
+#include "list-objects-filter-all.h"
+
+/*
+ * A filter for list-objects to omit ALL blobs from the traversal.
+ */
+struct filter_omit_all_blobs_data {
+	struct oidset2 omits;
+};
+
+static list_objects_filter_result filter_omit_all_blobs(
+	list_objects_filter_type filter_type,
+	struct object *obj,
+	const char *pathname,
+	const char *filename,
+	void *filter_data_)
+{
+	struct filter_omit_all_blobs_data *filter_data = filter_data_;
+	int64_t object_length = -1;
+	unsigned long s;
+	enum object_type t;
+
+	switch (filter_type) {
+	default:
+		die("unkown filter_type");
+		return LOFR_ZERO;
+
+	case LOFT_BEGIN_TREE:
+		assert(obj->type == OBJ_TREE);
+		/* always include all tree objects */
+		return LOFR_MARK_SEEN | LOFR_SHOW;
+
+	case LOFT_END_TREE:
+		assert(obj->type == OBJ_TREE);
+		return LOFR_ZERO;
+
+	case LOFT_BLOB:
+		assert(obj->type == OBJ_BLOB);
+		assert((obj->flags & SEEN) == 0);
+
+		/*
+		 * Since we always omit all blobs (and never provisionally omit),
+		 * we should never see a blob twice.
+		 */
+		assert(!oidset2_contains(&filter_data->omits, &obj->oid));
+
+		t = sha1_object_info(obj->oid.hash, &s);
+		assert(t == OBJ_BLOB);
+		object_length = (int64_t)((uint64_t)(s));
+
+		/* Insert OID into the omitted list. No need for a pathname. */
+		oidset2_insert(&filter_data->omits, &obj->oid, t, object_length,
+			       NULL);
+		return LOFR_MARK_SEEN; /* but not LOFR_SHOW (hard omit) */
+	}
+}
+
+void traverse_commit_list_omit_all_blobs(
+	struct rev_info *revs,
+	show_commit_fn show_commit,
+	show_object_fn show_object,
+	oidset2_foreach_cb print_omitted_object,
+	void *ctx_data)
+{
+	struct filter_omit_all_blobs_data d;
+
+	memset(&d, 0, sizeof(d));
+
+	traverse_commit_list_worker(revs, show_commit, show_object, ctx_data,
+				    filter_omit_all_blobs, &d);
+
+	if (print_omitted_object)
+		oidset2_foreach(&d.omits, print_omitted_object, ctx_data);
+
+	oidset2_clear(&d.omits);
+}
diff --git a/list-objects-filter-all.h b/list-objects-filter-all.h
new file mode 100644
index 0000000..591589f
--- /dev/null
+++ b/list-objects-filter-all.h
@@ -0,0 +1,18 @@
+#ifndef LIST_OBJECTS_FILTER_ALL_H
+#define LIST_OBJECTS_FILTER_ALL_H
+
+#include "oidset2.h"
+
+/*
+ * A filter for list-objects to omit ALL blobs
+ * from the traversal.
+ */
+void traverse_commit_list_omit_all_blobs(
+	struct rev_info *revs,
+	show_commit_fn show_commit,
+	show_object_fn show_object,
+	oidset2_foreach_cb print_omitted_object,
+	void *ctx_data);
+
+#endif /* LIST_OBJECTS_FILTER_ALL_H */
+
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 02/13] oidset2: create oidset subclass with object length and pathname
  2017-09-22 20:26 ` [PATCH 02/13] oidset2: create oidset subclass with object length and pathname Jeff Hostetler
@ 2017-09-22 20:42   ` Brandon Williams
  2017-09-26 22:20   ` Jonathan Tan
  1 sibling, 0 replies; 16+ messages in thread
From: Brandon Williams @ 2017-09-22 20:42 UTC (permalink / raw)
  To: Jeff Hostetler; +Cc: git, gitster, peff, jonathantanmy, jeffhost

On 09/22, Jeff Hostetler wrote:
> From: Jeff Hostetler <jeffhost@microsoft.com>
> 
> Create subclass of oidset where each entry has a
> field to store the length of the object's content
> and an optional pathname.
> 
> This will be used in a future commit to build a
> manifest of omitted objects in a partial/narrow
> clone/fetch.
> 
> Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>

I think it may be better to create an 'oidmap' instead of an
'oidset+some specific other bits' that way, in the future, we have a
generalized data structure which we can use to provide a mapping
between oids and some arbitrary data.

-- 
Brandon Williams

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 00/13] RFC object filtering for parital clone
  2017-09-22 20:26 [PATCH 00/13] RFC object filtering for parital clone Jeff Hostetler
                   ` (3 preceding siblings ...)
  2017-09-22 20:26 ` [PATCH 04/13] list-objects-filter-all: add filter to omit all blobs Jeff Hostetler
@ 2017-09-23  0:39 ` Jonathan Tan
  2017-09-26 14:55   ` Jeff Hostetler
  4 siblings, 1 reply; 16+ messages in thread
From: Jonathan Tan @ 2017-09-23  0:39 UTC (permalink / raw)
  To: Jeff Hostetler; +Cc: git, gitster, peff, jeffhost

On Fri, 22 Sep 2017 20:26:19 +0000
Jeff Hostetler <git@jeffhostetler.com> wrote:

> This draft contains filters to:
> () omit all blobs
> () omit blobs larger than some size
> () omit blobs using a sparse-checkout specification
> 
> In addition to specifying the filter criteria, the rev-list command
> was updated to include options to:
> () print a list of the omitted objects (due to the current filtering
>    criteria)
> () print a list of missing objects (probably from a prior partial
>    clone/fetch).

Thanks, this last point seems useful.

I tried applying your patches and it doesn't apply cleanly on master.
Could you try rebasing? In particular, the code references
get_sha1_with_context(), which no longer exists (it was replaced with
get_oid_with_context()).

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 00/13] RFC object filtering for parital clone
  2017-09-23  0:39 ` [PATCH 00/13] RFC object filtering for parital clone Jonathan Tan
@ 2017-09-26 14:55   ` Jeff Hostetler
  2017-09-26 19:23     ` Jeff Hostetler
  0 siblings, 1 reply; 16+ messages in thread
From: Jeff Hostetler @ 2017-09-26 14:55 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, gitster, peff, jeffhost



On 9/22/2017 8:39 PM, Jonathan Tan wrote:
> On Fri, 22 Sep 2017 20:26:19 +0000
> Jeff Hostetler <git@jeffhostetler.com> wrote:
> 
>> This draft contains filters to:
>> () omit all blobs
>> () omit blobs larger than some size
>> () omit blobs using a sparse-checkout specification
>>
>> In addition to specifying the filter criteria, the rev-list command
>> was updated to include options to:
>> () print a list of the omitted objects (due to the current filtering
>>     criteria)
>> () print a list of missing objects (probably from a prior partial
>>     clone/fetch).
> 
> Thanks, this last point seems useful.
> 
> I tried applying your patches and it doesn't apply cleanly on master.
> Could you try rebasing? In particular, the code references
> get_sha1_with_context(), which no longer exists (it was replaced with
> get_oid_with_context()).
> 

I usually build and submit patches relative to the latest
available tag (v2.14.1) rather than a moving target, but
yeah I'll look thru the other comments and rebase it forward.

Thanks
Jeff

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 00/13] RFC object filtering for parital clone
  2017-09-26 14:55   ` Jeff Hostetler
@ 2017-09-26 19:23     ` Jeff Hostetler
  0 siblings, 0 replies; 16+ messages in thread
From: Jeff Hostetler @ 2017-09-26 19:23 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, gitster, peff, jeffhost



On 9/26/2017 10:55 AM, Jeff Hostetler wrote:
> 
> 
> On 9/22/2017 8:39 PM, Jonathan Tan wrote:
>> On Fri, 22 Sep 2017 20:26:19 +0000
>> Jeff Hostetler <git@jeffhostetler.com> wrote:
>> ...
>>
>> I tried applying your patches and it doesn't apply cleanly on master.
>> Could you try rebasing? In particular, the code references
>> get_sha1_with_context(), which no longer exists (it was replaced with
>> get_oid_with_context()).
>>
> 
> I usually build and submit patches relative to the latest
> available tag (v2.14.1) rather than a moving target, but
> yeah I'll look thru the other comments and rebase it forward.

https://github.com/jeffhostetler/git/tree/core/rev_list_filtering

I rebased this branch onto a current master and fixed
the get_sha1_with_context() problems.

Let me know if you have problems with it (or would prefer that I
send it to the mailing list).

Jeff

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 02/13] oidset2: create oidset subclass with object length and pathname
  2017-09-22 20:26 ` [PATCH 02/13] oidset2: create oidset subclass with object length and pathname Jeff Hostetler
  2017-09-22 20:42   ` Brandon Williams
@ 2017-09-26 22:20   ` Jonathan Tan
  2017-09-27 14:47     ` Jeff Hostetler
  1 sibling, 1 reply; 16+ messages in thread
From: Jonathan Tan @ 2017-09-26 22:20 UTC (permalink / raw)
  To: Jeff Hostetler; +Cc: git, gitster, peff, jeffhost

On Fri, 22 Sep 2017 20:26:21 +0000
Jeff Hostetler <git@jeffhostetler.com> wrote:

> From: Jeff Hostetler <jeffhost@microsoft.com>
> 
> Create subclass of oidset where each entry has a
> field to store the length of the object's content
> and an optional pathname.
> 
> This will be used in a future commit to build a
> manifest of omitted objects in a partial/narrow
> clone/fetch.

As Brandon mentioned, I think "oidmap" should be the new data structure
of choice (with "oidset" modified to use it).

> +struct oidset2_entry {
> +	struct hashmap_entry hash;
> +	struct object_id oid;
> +
> +	enum object_type type;
> +	int64_t object_length;	/* This is SIGNED. Use -1 when unknown. */
> +	char *pathname;
> +};

object_length is defined to be "unsigned long" in Git code, I think.
When is object_length not known, and in those cases, would it be better
to use a separate data structure to store what we need?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 03/13] list-objects: filter objects in traverse_commit_list
  2017-09-22 20:26 ` [PATCH 03/13] list-objects: filter objects in traverse_commit_list Jeff Hostetler
@ 2017-09-26 22:31   ` Jonathan Tan
  2017-09-27 17:04     ` Jeff Hostetler
  0 siblings, 1 reply; 16+ messages in thread
From: Jonathan Tan @ 2017-09-26 22:31 UTC (permalink / raw)
  To: Jeff Hostetler; +Cc: git, gitster, peff, jeffhost

On Fri, 22 Sep 2017 20:26:22 +0000
Jeff Hostetler <git@jeffhostetler.com> wrote:

> From: Jeff Hostetler <jeffhost@microsoft.com>
> 
> Create traverse_commit_list_filtered() and add filtering

You mention _filtered() here, but this patch contains _worker().

>  list-objects.h | 30 ++++++++++++++++++++++++++
>  2 files changed, 80 insertions(+), 16 deletions(-)
> 
> diff --git a/list-objects.c b/list-objects.c
> index b3931fa..3e86008 100644
> --- a/list-objects.c
> +++ b/list-objects.c
> @@ -13,10 +13,13 @@ static void process_blob(struct rev_info *revs,
>  			 show_object_fn show,
>  			 struct strbuf *path,
>  			 const char *name,
> -			 void *cb_data)
> +			 void *cb_data,
> +			 filter_object_fn filter,
> +			 void *filter_data)

I had some thoughts about modifying "show" (instead of adding "filter",
as you have done in this patch) to indicate to the caller that the
corresponding object should not be marked "seen". That does have the
advantage that we don't have so many callbacks flying around, but it
also makes things more complicated, so I don't know which is better.

> +	if (filter) {
> +		r = filter(LOFT_END_TREE, obj, base->buf, &base->buf[baselen], filter_data);
> +		if (r & LOFR_MARK_SEEN)
> +			obj->flags |= SEEN;
> +		if (r & LOFR_SHOW)
> +			show(obj, base->buf, cb_data);
> +	}

This LOFT_END_TREE was added to support the code in
list-objects-filter-sparse - would it be OK to completely remove the
optimization involving the "provisional" omission of blobs? (I don't
think the exact same tree object will occur multiple times often.) If
yes, then I think this block can be removed too and will simplify the
code.

> +/* See object.h and revision.h */
> +#define FILTER_REVISIT (1<<25)

If you do add this, also update object.h. But I don't think this is the
right place to do it - it's probably better in
list-objects-filter-sparse.

> +typedef enum list_objects_filter_result list_objects_filter_result;
> +typedef enum list_objects_filter_type list_objects_filter_type;

I don't think Git style is to typedef enums.

> +void traverse_commit_list_worker(
> +	struct rev_info *,
> +	show_commit_fn, show_object_fn, void *show_data,
> +	filter_object_fn filter, void *filter_data);

Here (and throughout), as far as I can tell, the style is to indent to
the character after the opening parenthesis.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 02/13] oidset2: create oidset subclass with object length and pathname
  2017-09-26 22:20   ` Jonathan Tan
@ 2017-09-27 14:47     ` Jeff Hostetler
  0 siblings, 0 replies; 16+ messages in thread
From: Jeff Hostetler @ 2017-09-27 14:47 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, gitster, peff, jeffhost

On 9/26/2017 6:20 PM, Jonathan Tan wrote:
> On Fri, 22 Sep 2017 20:26:21 +0000
> Jeff Hostetler <git@jeffhostetler.com> wrote:
> 
>> From: Jeff Hostetler <jeffhost@microsoft.com>
>>
>> Create subclass of oidset where each entry has a
>> field to store the length of the object's content
>> and an optional pathname.
>>
>> This will be used in a future commit to build a
>> manifest of omitted objects in a partial/narrow
>> clone/fetch.
> 
> As Brandon mentioned, I think "oidmap" should be the new data structure
> of choice (with "oidset" modified to use it).

I'll take a look at that. I'm not exactly happy with
my oidset2, but it works and it minimized touching other
things.  But yes, it may clear up a few things.

>> +struct oidset2_entry {
>> +	struct hashmap_entry hash;
>> +	struct object_id oid;
>> +
>> +	enum object_type type;
>> +	int64_t object_length;	/* This is SIGNED. Use -1 when unknown. */
>> +	char *pathname;
>> +};
> 
> object_length is defined to be "unsigned long" in Git code, I think.
> When is object_length not known, and in those cases, would it be better
> to use a separate data structure to store what we need?

Yeah, I struggled with that one.  Git currently treats file size as
a 32-bit unsigned value throughout the code.  I assume eventually there
will be a round of changes to support 64-bit values, so this anticipates
that.

I could change it to be an unknown flag, rather assuming -1, but in an
earlier draft I was printing -1 in the rev-list output.  I can change this.

WRT a separate structure, the SET I create will contain entries for items
where we may or may not know the size and that depends on the context.
When building a list of already-missing blobs (with the --filter-print-missing)
we never know the size.  But when building a list of to-be-omitted blobs
(from the current set of filter options), we may or may not know.  I'm
not sure we need 2 _entry definitions right now.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 03/13] list-objects: filter objects in traverse_commit_list
  2017-09-26 22:31   ` Jonathan Tan
@ 2017-09-27 17:04     ` Jeff Hostetler
  2017-09-27 18:00       ` Jonathan Tan
  0 siblings, 1 reply; 16+ messages in thread
From: Jeff Hostetler @ 2017-09-27 17:04 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, gitster, peff, jeffhost

On 9/26/2017 6:31 PM, Jonathan Tan wrote:
> On Fri, 22 Sep 2017 20:26:22 +0000
> Jeff Hostetler <git@jeffhostetler.com> wrote:
> 
>> From: Jeff Hostetler <jeffhost@microsoft.com>
>>
>> Create traverse_commit_list_filtered() and add filtering
> 
> You mention _filtered() here, but this patch contains _worker().

thanks.

>>   list-objects.h | 30 ++++++++++++++++++++++++++
>>   2 files changed, 80 insertions(+), 16 deletions(-)
>>
>> diff --git a/list-objects.c b/list-objects.c
>> index b3931fa..3e86008 100644
>> --- a/list-objects.c
>> +++ b/list-objects.c
>> @@ -13,10 +13,13 @@ static void process_blob(struct rev_info *revs,
>>   			 show_object_fn show,
>>   			 struct strbuf *path,
>>   			 const char *name,
>> -			 void *cb_data)
>> +			 void *cb_data,
>> +			 filter_object_fn filter,
>> +			 void *filter_data)
> 
> I had some thoughts about modifying "show" (instead of adding "filter",
> as you have done in this patch) to indicate to the caller that the
> corresponding object should not be marked "seen". That does have the
> advantage that we don't have so many callbacks flying around, but it
> also makes things more complicated, so I don't know which is better.

I wanted the filtering function to be independent of the showing function.
As you can see in pack-objects.c and rev-list.c:  All I needed to change
there was to call the new version of the traverse code and not alter the
show function.  The filtering is completely isolated inside the traverse
code.

It also means that the filtering code can do higher-level filtering.
Currently, the show-object callback gets called with each terminal node
(blob) and it has already lost any chance for tree-level optimizations.
The filtering code participates in the treewalk on the commit and can do
subtree elimination.

> 
>> +	if (filter) {
>> +		r = filter(LOFT_END_TREE, obj, base->buf, &base->buf[baselen], filter_data);
>> +		if (r & LOFR_MARK_SEEN)
>> +			obj->flags |= SEEN;
>> +		if (r & LOFR_SHOW)
>> +			show(obj, base->buf, cb_data);
>> +	}
> 
> This LOFT_END_TREE was added to support the code in
> list-objects-filter-sparse - would it be OK to completely remove the
> optimization involving the "provisional" omission of blobs? (I don't
> think the exact same tree object will occur multiple times often.) If
> yes, then I think this block can be removed too and will simplify the
> code.

The sparse filter is looking at pathnames and using the same rules
as sparse-checkout to decide which to *include* in the result.  This
is essentially backwards from the other filters which are looking for
reasons to *exclude* a blob.  If I see a {pathname, sha} pair and the
pathname is not wanted (by the sparse-checkout rules), I still don't
know anything about the object -- since the same SHA may appear later
in the treewalk but with a different pathname that *does* match the
patterns and *is* wanted.

The net-net is that I have to mark the blob as "provisionally omitted"
until I've seen all the pathnames associated with it.  Only then can I
say it should be omitted.

Likewise, there are things about the tree object that we cannot
decide until we've seen all possible directory paths that reference it.
For example, if you rename a tree/directory between 2 commits, but make no
other changes within the directory, it will/should have the same SHA in the
second commit.  If one of those paths is included in the sparse-checkout
and one is not, then you need include those blobs (and the tree object)
in the result.  If the treewalk visits the excluded case first, you don't
want to discard the tree object (and shortcut future treewalks) because
the filter won't get a chance to see the included directory path case.

Also, the current code does not attempt to omit tree objects, but a
future version may.  And having the _BEGIN_ and _END_ events means the
filter can keep track of the nesting without the expense of having to
discover it by parsing the pathname looking for slashes as we do elsewhere.

> 
>> +/* See object.h and revision.h */
>> +#define FILTER_REVISIT (1<<25)
> 
> If you do add this, also update object.h. But I don't think this is the
> right place to do it - it's probably better in
> list-objects-filter-sparse.
> 
>> +typedef enum list_objects_filter_result list_objects_filter_result;
>> +typedef enum list_objects_filter_type list_objects_filter_type;
> 
> I don't think Git style is to typedef enums.
> 
>> +void traverse_commit_list_worker(
>> +	struct rev_info *,
>> +	show_commit_fn, show_object_fn, void *show_data,
>> +	filter_object_fn filter, void *filter_data);
> 
> Here (and throughout), as far as I can tell, the style is to indent to
> the character after the opening parenthesis.
> 

right. thanks.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 03/13] list-objects: filter objects in traverse_commit_list
  2017-09-27 17:04     ` Jeff Hostetler
@ 2017-09-27 18:00       ` Jonathan Tan
  2017-09-27 19:09         ` Jeff Hostetler
  0 siblings, 1 reply; 16+ messages in thread
From: Jonathan Tan @ 2017-09-27 18:00 UTC (permalink / raw)
  To: Jeff Hostetler; +Cc: git, gitster, peff, jeffhost

On Wed, 27 Sep 2017 13:04:42 -0400
Jeff Hostetler <git@jeffhostetler.com> wrote:

> The sparse filter is looking at pathnames and using the same rules
> as sparse-checkout to decide which to *include* in the result.  This
> is essentially backwards from the other filters which are looking for
> reasons to *exclude* a blob.  If I see a {pathname, sha} pair and the
> pathname is not wanted (by the sparse-checkout rules), I still don't
> know anything about the object -- since the same SHA may appear later
> in the treewalk but with a different pathname that *does* match the
> patterns and *is* wanted.
> 
> The net-net is that I have to mark the blob as "provisionally omitted"
> until I've seen all the pathnames associated with it.  Only then can I
> say it should be omitted.

How is this different from refraining from marking the blob as
LOFR_MARK_SEEN? When you would provisionally omit the blob, return
LOFR_ZERO so that a future iteration will revisit the blob again, and
when you would include it in the output, return
LOFR_MARK_SEEN|LOFR_SHOW.

> Likewise, there are things about the tree object that we cannot
> decide until we've seen all possible directory paths that reference it.
> For example, if you rename a tree/directory between 2 commits, but make no
> other changes within the directory, it will/should have the same SHA in the
> second commit.  If one of those paths is included in the sparse-checkout
> and one is not, then you need include those blobs (and the tree object)
> in the result.  If the treewalk visits the excluded case first, you don't
> want to discard the tree object (and shortcut future treewalks) because
> the filter won't get a chance to see the included directory path case.

For trees, I guess it's slightly different in that you do need an extra
flag to keep track of whether the tree has been shown. So mark SHOWN and
return LOFR_SHOW on the first time the tree is shown, and LOFR_ZERO
otherwise. And trees must never be marked as LOFR_MARK_SEEN.

(This SHOWN flag might play a similar role to your FILTER_REVISIT.)

Until now, it seems to me that the _END event is not required.

> Also, the current code does not attempt to omit tree objects, but a
> future version may.  And having the _BEGIN_ and _END_ events means the
> filter can keep track of the nesting without the expense of having to
> discover it by parsing the pathname looking for slashes as we do elsewhere.

A feature that omits tree objects would need _END, true. But until
then, I personally don't think we should add such infrastructure until
we have a feature that needs it.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 03/13] list-objects: filter objects in traverse_commit_list
  2017-09-27 18:00       ` Jonathan Tan
@ 2017-09-27 19:09         ` Jeff Hostetler
  2017-09-27 20:49           ` Jonathan Tan
  0 siblings, 1 reply; 16+ messages in thread
From: Jeff Hostetler @ 2017-09-27 19:09 UTC (permalink / raw)
  To: Jonathan Tan; +Cc: git, gitster, peff, jeffhost

On 9/27/2017 2:00 PM, Jonathan Tan wrote:
> On Wed, 27 Sep 2017 13:04:42 -0400
> Jeff Hostetler <git@jeffhostetler.com> wrote:
> 
>> The sparse filter is looking at pathnames and using the same rules
>> as sparse-checkout to decide which to *include* in the result.  This
>> is essentially backwards from the other filters which are looking for
>> reasons to *exclude* a blob.  If I see a {pathname, sha} pair and the
>> pathname is not wanted (by the sparse-checkout rules), I still don't
>> know anything about the object -- since the same SHA may appear later
>> in the treewalk but with a different pathname that *does* match the
>> patterns and *is* wanted.
>>
>> The net-net is that I have to mark the blob as "provisionally omitted"
>> until I've seen all the pathnames associated with it.  Only then can I
>> say it should be omitted.
> 
> How is this different from refraining from marking the blob as
> LOFR_MARK_SEEN? When you would provisionally omit the blob, return
> LOFR_ZERO so that a future iteration will revisit the blob again, and
> when you would include it in the output, return
> LOFR_MARK_SEEN|LOFR_SHOW.

By adding it to the set of provisionally omitted objects, we
have the option to capture a little extra information with it
and refer to that the next time we see the object in the traversal.
For example, in the sparse-checkout case, the first time we see the
object we know the pathname and know that it does not need to be
included.  The second time we see that object, we can see if the
new pathname is the same as the previous one with a simple strcmp
and avoid the expensive is_excluded_from_list() computation.  Keep
in mind that rev-list or pack-objects could be called be on something
like HEAD~100000..HEAD or that there may be 50,000 tips.  So a file
that doesn't change across that range will be visited many times
with the same {pathname, sha}.

Then when the traversal is finished, we have a resulting set of
(now actually) omitted objects.  Which we can iterate over if the
caller is interested in.

> 
>> Likewise, there are things about the tree object that we cannot
>> decide until we've seen all possible directory paths that reference it.
>> For example, if you rename a tree/directory between 2 commits, but make no
>> other changes within the directory, it will/should have the same SHA in the
>> second commit.  If one of those paths is included in the sparse-checkout
>> and one is not, then you need include those blobs (and the tree object)
>> in the result.  If the treewalk visits the excluded case first, you don't
>> want to discard the tree object (and shortcut future treewalks) because
>> the filter won't get a chance to see the included directory path case.
> 
> For trees, I guess it's slightly different in that you do need an extra
> flag to keep track of whether the tree has been shown. So mark SHOWN and
> return LOFR_SHOW on the first time the tree is shown, and LOFR_ZERO
> otherwise. And trees must never be marked as LOFR_MARK_SEEN.

Right now I want to force the tree to be shown the first time it is
visited (because I don't want to do tree filtering yet).  I don't mark
it SEEN yet because we may want to revisit blobs within (say, after a
folder rename like I described previously).

I do, however, mark the tree object as SEEN (in the _END event) when I
can verify that I've included ALL of the children.

So it might be possible that I could change the flags and not use
FILTER_REVISIT on tree objects, I hesitate to do that right now.

Having the FILTER_REVISIT flag on blob objects means I can avoid
doing a hash/oidset lookup on subsequent visits.

> (This SHOWN flag might play a similar role to your FILTER_REVISIT.)
> 
> Until now, it seems to me that the _END event is not required.
> 
>> Also, the current code does not attempt to omit tree objects, but a
>> future version may.  And having the _BEGIN_ and _END_ events means the
>> filter can keep track of the nesting without the expense of having to
>> discover it by parsing the pathname looking for slashes as we do elsewhere.
> 
> A feature that omits tree objects would need _END, true. But until
> then, I personally don't think we should add such infrastructure until
> we have a feature that needs it.

The sparse filter is using the _END now to try to help shortcut the
treewalk (when all referenced objects within it are known to already
be included).

The _END event's role may expand with a 4th filter that also filters
tree objects, but I don't want to eliminate it right now.

Thanks
Jeff

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 03/13] list-objects: filter objects in traverse_commit_list
  2017-09-27 19:09         ` Jeff Hostetler
@ 2017-09-27 20:49           ` Jonathan Tan
  0 siblings, 0 replies; 16+ messages in thread
From: Jonathan Tan @ 2017-09-27 20:49 UTC (permalink / raw)
  To: Jeff Hostetler; +Cc: git, gitster, peff, jeffhost

On Wed, 27 Sep 2017 15:09:43 -0400
Jeff Hostetler <git@jeffhostetler.com> wrote:

> By adding it to the set of provisionally omitted objects, we
> have the option to capture a little extra information with it
> and refer to that the next time we see the object in the traversal.
> For example, in the sparse-checkout case, the first time we see the
> object we know the pathname and know that it does not need to be
> included.  The second time we see that object, we can see if the
> new pathname is the same as the previous one with a simple strcmp
> and avoid the expensive is_excluded_from_list() computation.  Keep
> in mind that rev-list or pack-objects could be called be on something
> like HEAD~100000..HEAD or that there may be 50,000 tips.  So a file
> that doesn't change across that range will be visited many times
> with the same {pathname, sha}.

Ah, capturing the extra information makes sense. I missed that detail.

> Right now I want to force the tree to be shown the first time it is
> visited (because I don't want to do tree filtering yet).  I don't mark
> it SEEN yet because we may want to revisit blobs within (say, after a
> folder rename like I described previously).
> 
> I do, however, mark the tree object as SEEN (in the _END event) when I
> can verify that I've included ALL of the children.

This optimization makes sense too.

> So it might be possible that I could change the flags and not use
> FILTER_REVISIT on tree objects, I hesitate to do that right now.

You're probably right that we need some sort of flag on tree objects,
and FILTER_REVISIT can do the job. (My suggestion SHOWN plays a similar
role anyway.)

> Having the FILTER_REVISIT flag on blob objects means I can avoid
> doing a hash/oidset lookup on subsequent visits.

By the hash/oidset lookup, I presume you mean the lookup on the set of
provisionally omitted objects? If yes, this makes sense.

Thanks for your clarifications - I'll take another look at the code
here.

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2017-09-27 20:49 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-22 20:26 [PATCH 00/13] RFC object filtering for parital clone Jeff Hostetler
2017-09-22 20:26 ` [PATCH 01/13] dir: refactor add_excludes() Jeff Hostetler
2017-09-22 20:26 ` [PATCH 02/13] oidset2: create oidset subclass with object length and pathname Jeff Hostetler
2017-09-22 20:42   ` Brandon Williams
2017-09-26 22:20   ` Jonathan Tan
2017-09-27 14:47     ` Jeff Hostetler
2017-09-22 20:26 ` [PATCH 03/13] list-objects: filter objects in traverse_commit_list Jeff Hostetler
2017-09-26 22:31   ` Jonathan Tan
2017-09-27 17:04     ` Jeff Hostetler
2017-09-27 18:00       ` Jonathan Tan
2017-09-27 19:09         ` Jeff Hostetler
2017-09-27 20:49           ` Jonathan Tan
2017-09-22 20:26 ` [PATCH 04/13] list-objects-filter-all: add filter to omit all blobs Jeff Hostetler
2017-09-23  0:39 ` [PATCH 00/13] RFC object filtering for parital clone Jonathan Tan
2017-09-26 14:55   ` Jeff Hostetler
2017-09-26 19:23     ` Jeff Hostetler

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).