git@vger.kernel.org mailing list mirror (one of many)
 help / color / mirror / code / Atom feed
* [RFC PATCH 0/3] support for filtering trees and blobs based on depth
@ 2018-10-11 23:08 Matthew DeVore
  2018-10-11 23:09 ` [RFC PATCH 1/3] list-objects: support for skipping tree traversal Matthew DeVore
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Matthew DeVore @ 2018-10-11 23:08 UTC (permalink / raw)
  To: git
  Cc: Matthew DeVore, sbeller, git, jeffhost, peff, stefanbeller,
	jonathantanmy, gitster, pclouds

This adds support for depth >0 in the tree:<depth> filter. Before this patch,
only <depth>=0 is supported, which means all trees and blobs are filtered.

The purpose of this is to allow fetching of entire directories in a partial
clone use case. If I do a partial clone of a repo with no objects and then want
to do something like "make" it will be quite slow of we initiate a separate
fetch for every file needed. Alternatively, fetching directories at a time -
as soon as any file in a directory is accessed - is a reasonable approach.

Thank you,

Matthew DeVore (3):
  list-objects: support for skipping tree traversal
  Documentation/git-rev-list: s/<commit>/<object>/
  list-objects-filter: teach tree:# how to handle >0

 Documentation/git-rev-list.txt      | 21 ++++---
 Documentation/rev-list-options.txt  | 24 +++++---
 builtin/rev-list.c                  |  2 +-
 list-objects-filter-options.c       |  6 +-
 list-objects-filter-options.h       |  1 +
 list-objects-filter.c               | 52 +++++++++++++---
 list-objects-filter.h               |  6 ++
 list-objects.c                      |  5 +-
 t/t6112-rev-list-filters-objects.sh | 94 +++++++++++++++++++++++++++++
 9 files changed, 178 insertions(+), 33 deletions(-)

-- 
2.19.1.331.ge82ca0e54c-goog


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [RFC PATCH 1/3] list-objects: support for skipping tree traversal
  2018-10-11 23:08 [RFC PATCH 0/3] support for filtering trees and blobs based on depth Matthew DeVore
@ 2018-10-11 23:09 ` Matthew DeVore
  2018-10-14 23:15   ` Junio C Hamano
  2018-10-11 23:09 ` [RFC PATCH 2/3] Documentation/git-rev-list: s/<commit>/<object>/ Matthew DeVore
  2018-10-11 23:09 ` [RFC PATCH 3/3] list-objects-filter: teach tree:# how to handle >0 Matthew DeVore
  2 siblings, 1 reply; 14+ messages in thread
From: Matthew DeVore @ 2018-10-11 23:09 UTC (permalink / raw)
  To: git
  Cc: Matthew DeVore, sbeller, git, jeffhost, peff, stefanbeller,
	jonathantanmy, gitster, pclouds

The tree:0 filter does not need to traverse the trees that it has
filtered out, so optimize list-objects and list-objects-filter to skip
traversing the trees entirely. Before this patch, we iterated over all
children of the tree, and did nothing for all of them, which was
wasteful.

Signed-off-by: Matthew DeVore <matvore@google.com>
---
 list-objects-filter.c               | 11 +++++++++--
 list-objects-filter.h               |  6 ++++++
 list-objects.c                      |  5 ++++-
 t/t6112-rev-list-filters-objects.sh | 10 ++++++++++
 4 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/list-objects-filter.c b/list-objects-filter.c
index 09b2b05d5..37fba456d 100644
--- a/list-objects-filter.c
+++ b/list-objects-filter.c
@@ -102,9 +102,16 @@ static enum list_objects_filter_result filter_trees_none(
 
 	case LOFS_BEGIN_TREE:
 	case LOFS_BLOB:
-		if (filter_data->omits)
+		if (filter_data->omits) {
 			oidset_insert(filter_data->omits, &obj->oid);
-		return LOFR_MARK_SEEN; /* but not LOFR_DO_SHOW (hard omit) */
+			/* _MARK_SEEN but not _DO_SHOW (hard omit) */
+			return LOFR_MARK_SEEN;
+		}
+		else
+			/*
+			 * Not collecting omits so no need to to traverse tree.
+			 */
+			return LOFR_SKIP_TREE | LOFR_MARK_SEEN;
 
 	case LOFS_END_TREE:
 		assert(obj->type == OBJ_TREE);
diff --git a/list-objects-filter.h b/list-objects-filter.h
index a6f6b4990..52b4a84da 100644
--- a/list-objects-filter.h
+++ b/list-objects-filter.h
@@ -24,6 +24,11 @@ struct oidset;
  *              In general, objects should only be shown once, but
  *              this result DOES NOT imply that we mark it SEEN.
  *
+ * _SKIP_TREE : Used in LOFS_BEGIN_TREE situation - indicates that
+ *              the tree's children should not be iterated over. This
+ *              is used as an optimization when all children will
+ *              definitely be ignored.
+ *
  * Most of the time, you want the combination (_MARK_SEEN | _DO_SHOW)
  * but they can be used independently, such as when sparse-checkout
  * pattern matching is being applied.
@@ -45,6 +50,7 @@ enum list_objects_filter_result {
 	LOFR_ZERO      = 0,
 	LOFR_MARK_SEEN = 1<<0,
 	LOFR_DO_SHOW   = 1<<1,
+	LOFR_SKIP_TREE = 1<<2,
 };
 
 enum list_objects_filter_situation {
diff --git a/list-objects.c b/list-objects.c
index 7a1a0929d..d1e3d217c 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -11,6 +11,7 @@
 #include "list-objects-filter-options.h"
 #include "packfile.h"
 #include "object-store.h"
+#include "trace.h"
 
 struct traversal_context {
 	struct rev_info *revs;
@@ -184,7 +185,9 @@ static void process_tree(struct traversal_context *ctx,
 	if (base->len)
 		strbuf_addch(base, '/');
 
-	if (!failed_parse)
+	if (r & LOFR_SKIP_TREE)
+		trace_printf("Skipping contents of tree %s...\n", base->buf);
+	else if (!failed_parse)
 		process_tree_contents(ctx, tree, base);
 
 	if ((obj->flags & NOT_USER_GIVEN) && ctx->filter_fn) {
diff --git a/t/t6112-rev-list-filters-objects.sh b/t/t6112-rev-list-filters-objects.sh
index 08e0c7db6..efb1bee2e 100755
--- a/t/t6112-rev-list-filters-objects.sh
+++ b/t/t6112-rev-list-filters-objects.sh
@@ -244,6 +244,16 @@ test_expect_success 'verify tree:0 includes trees in "filtered" output' '
 	test_cmp expected filtered_types
 '
 
+# Make sure tree:0 does not iterate through any trees.
+
+test_expect_success 'filter a GIANT tree through tree:0' '
+	GIT_TRACE=1 git -C r3 rev-list \
+		--objects --filter=tree:0 HEAD 2>filter_trace &&
+	grep "Skipping contents of tree [.][.][.]" filter_trace >actual &&
+	# One line for each commit traversed.
+	test_line_count = 2 actual
+'
+
 # Delete some loose objects and use rev-list, but WITHOUT any filtering.
 # This models previously omitted objects that we did not receive.
 
-- 
2.19.1.331.ge82ca0e54c-goog


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC PATCH 2/3] Documentation/git-rev-list: s/<commit>/<object>/
  2018-10-11 23:08 [RFC PATCH 0/3] support for filtering trees and blobs based on depth Matthew DeVore
  2018-10-11 23:09 ` [RFC PATCH 1/3] list-objects: support for skipping tree traversal Matthew DeVore
@ 2018-10-11 23:09 ` Matthew DeVore
  2018-10-14 23:25   ` Junio C Hamano
  2018-10-11 23:09 ` [RFC PATCH 3/3] list-objects-filter: teach tree:# how to handle >0 Matthew DeVore
  2 siblings, 1 reply; 14+ messages in thread
From: Matthew DeVore @ 2018-10-11 23:09 UTC (permalink / raw)
  To: git
  Cc: Matthew DeVore, sbeller, git, jeffhost, peff, stefanbeller,
	jonathantanmy, gitster, pclouds

git-rev-list has a mode where it works on the granularity of trees and
blobs, rather than commits only. When discussing this mode in
documenation, it can get awkward to refer to the list of arguments that
may include blobs and trees as <commit>. It is especially awkward in a
follow-up patch, so prepare for that patch by renaming the argument.

In addition to simply renaming the argument, also reword documentation
in some places such that we include non-commit objects in our
terminology. In other words, s/commit/object/ in any prose where the
context obviously applies to trees and blobs in a non-pathological way.

Signed-off-by: Matthew DeVore <matvore@google.com>
---
 Documentation/git-rev-list.txt     | 21 ++++++++++++---------
 Documentation/rev-list-options.txt | 16 ++++++++--------
 builtin/rev-list.c                 |  2 +-
 3 files changed, 21 insertions(+), 18 deletions(-)

diff --git a/Documentation/git-rev-list.txt b/Documentation/git-rev-list.txt
index 88609ff43..b3357932c 100644
--- a/Documentation/git-rev-list.txt
+++ b/Documentation/git-rev-list.txt
@@ -60,20 +60,23 @@ SYNOPSIS
 	     [ --no-walk ] [ --do-walk ]
 	     [ --count ]
 	     [ --use-bitmap-index ]
-	     <commit>... [ \-- <paths>... ]
+	     <object>... [ \-- <paths>... ]
 
 DESCRIPTION
 -----------
 
-List commits that are reachable by following the `parent` links from the
-given commit(s), but exclude commits that are reachable from the one(s)
-given with a '{caret}' in front of them.  The output is given in reverse
-chronological order by default.
+List objects that are reachable by following references from the given
+object(s), but exclude objects that are reachable from the one(s) given
+with a '{caret}' in front of them.
 
-You can think of this as a set operation.  Commits given on the command
-line form a set of commits that are reachable from any of them, and then
-commits reachable from any of the ones given with '{caret}' in front are
-subtracted from that set.  The remaining commits are what comes out in the
+By default, only commit objects are shown, and the commits are shown in
+reverse chronological order. The '--object' flag causes non-commit objects
+to also be shown.
+
+You can think of this as a set operation.  Objects given on the command
+line form a set of objects that are reachable from any of them, and then
+objects reachable from any of the ones given with '{caret}' in front are
+subtracted from that set.  The remaining objects are what come out in the
 command's output.  Various other options and paths parameters can be used
 to further limit the result.
 
diff --git a/Documentation/rev-list-options.txt b/Documentation/rev-list-options.txt
index 5f1672913..c2c1c40e6 100644
--- a/Documentation/rev-list-options.txt
+++ b/Documentation/rev-list-options.txt
@@ -139,29 +139,29 @@ parents) and `--max-parents=-1` (negative numbers denote no upper limit).
 
 --all::
 	Pretend as if all the refs in `refs/`, along with `HEAD`, are
-	listed on the command line as '<commit>'.
+	listed on the command line as '<object>'.
 
 --branches[=<pattern>]::
 	Pretend as if all the refs in `refs/heads` are listed
-	on the command line as '<commit>'. If '<pattern>' is given, limit
+	on the command line as '<object>'. If '<pattern>' is given, limit
 	branches to ones matching given shell glob. If pattern lacks '?',
 	'{asterisk}', or '[', '/{asterisk}' at the end is implied.
 
 --tags[=<pattern>]::
 	Pretend as if all the refs in `refs/tags` are listed
-	on the command line as '<commit>'. If '<pattern>' is given, limit
+	on the command line as '<object>'. If '<pattern>' is given, limit
 	tags to ones matching given shell glob. If pattern lacks '?', '{asterisk}',
 	or '[', '/{asterisk}' at the end is implied.
 
 --remotes[=<pattern>]::
 	Pretend as if all the refs in `refs/remotes` are listed
-	on the command line as '<commit>'. If '<pattern>' is given, limit
+	on the command line as '<object>'. If '<pattern>' is given, limit
 	remote-tracking branches to ones matching given shell glob.
 	If pattern lacks '?', '{asterisk}', or '[', '/{asterisk}' at the end is implied.
 
 --glob=<glob-pattern>::
 	Pretend as if all the refs matching shell glob '<glob-pattern>'
-	are listed on the command line as '<commit>'. Leading 'refs/',
+	are listed on the command line as '<object>'. Leading 'refs/',
 	is automatically prepended if missing. If pattern lacks '?', '{asterisk}',
 	or '[', '/{asterisk}' at the end is implied.
 
@@ -182,7 +182,7 @@ explicitly.
 
 --reflog::
 	Pretend as if all objects mentioned by reflogs are listed on the
-	command line as `<commit>`.
+	command line as `<object>`.
 
 --single-worktree::
 	By default, all working trees will be examined by the
@@ -205,9 +205,9 @@ ifndef::git-rev-list[]
 endif::git-rev-list[]
 
 --stdin::
-	In addition to the '<commit>' listed on the command
+	In addition to the '<object>' listed on the command
 	line, read them from the standard input. If a `--` separator is
-	seen, stop reading commits and start reading paths to limit the
+	seen, stop reading objects and start reading paths to limit the
 	result.
 
 ifdef::git-rev-list[]
diff --git a/builtin/rev-list.c b/builtin/rev-list.c
index 49d6deed7..9817e6747 100644
--- a/builtin/rev-list.c
+++ b/builtin/rev-list.c
@@ -21,7 +21,7 @@
 #include "object-store.h"
 
 static const char rev_list_usage[] =
-"git rev-list [OPTION] <commit-id>... [ -- paths... ]\n"
+"git rev-list [OPTION] <object-id>... [ -- paths... ]\n"
 "  limiting output:\n"
 "    --max-count=<n>\n"
 "    --max-age=<epoch>\n"
-- 
2.19.1.331.ge82ca0e54c-goog


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [RFC PATCH 3/3] list-objects-filter: teach tree:# how to handle >0
  2018-10-11 23:08 [RFC PATCH 0/3] support for filtering trees and blobs based on depth Matthew DeVore
  2018-10-11 23:09 ` [RFC PATCH 1/3] list-objects: support for skipping tree traversal Matthew DeVore
  2018-10-11 23:09 ` [RFC PATCH 2/3] Documentation/git-rev-list: s/<commit>/<object>/ Matthew DeVore
@ 2018-10-11 23:09 ` Matthew DeVore
  2018-10-15  2:31   ` Junio C Hamano
  2 siblings, 1 reply; 14+ messages in thread
From: Matthew DeVore @ 2018-10-11 23:09 UTC (permalink / raw)
  To: git
  Cc: Matthew DeVore, sbeller, git, jeffhost, peff, stefanbeller,
	jonathantanmy, gitster, pclouds

Implement positive values for <depth> in the tree:<depth> filter. The
exact semantics are described in Documentation/rev-list-options.txt.

The long-term goal at the end of this is to allow a partial clone to
eagerly fetch an entire directory of files by fetching a tree and
specifying <depth>=1. This, for instance, would make a build operation
fast and convenient. It is fast because the partial clone does not need
to fetch each file individually, and convenient because the user does
not need to supply a sparse-checkout specification.

Signed-off-by: Matthew DeVore <matvore@google.com>
---
 Documentation/rev-list-options.txt  |  8 ++-
 list-objects-filter-options.c       |  6 +--
 list-objects-filter-options.h       |  1 +
 list-objects-filter.c               | 49 +++++++++++++----
 t/t6112-rev-list-filters-objects.sh | 84 +++++++++++++++++++++++++++++
 5 files changed, 132 insertions(+), 16 deletions(-)

diff --git a/Documentation/rev-list-options.txt b/Documentation/rev-list-options.txt
index c2c1c40e6..c78985c41 100644
--- a/Documentation/rev-list-options.txt
+++ b/Documentation/rev-list-options.txt
@@ -734,8 +734,12 @@ specification contained in <path>.
 +
 The form '--filter=tree:<depth>' omits all blobs and trees whose depth
 from the root tree is >= <depth> (minimum depth if an object is located
-at multiple depths in the commits traversed). Currently, only <depth>=0
-is supported, which omits all blobs and trees.
+at multiple depths in the commits traversed). <depth>=0 will not include
+any trees or blobs unless included explicitly in <object>. <depth>=1
+will include only the tree and blobs which are referenced directly by a
+commit reachable from <object> or an object given in <object>. <depth>=2
+is like <depth>=1 while also including trees and blobs one more level
+removed from <object> or a reachable commit.
 
 --no-filter::
 	Turn off any previous `--filter=` argument.
diff --git a/list-objects-filter-options.c b/list-objects-filter-options.c
index e8da2e858..9dc61d6e6 100644
--- a/list-objects-filter-options.c
+++ b/list-objects-filter-options.c
@@ -50,12 +50,12 @@ static int gently_parse_list_objects_filter(
 		}
 
 	} else if (skip_prefix(arg, "tree:", &v0)) {
-		unsigned long depth;
-		if (!git_parse_ulong(v0, &depth) || depth != 0) {
+		if (!git_parse_ulong(v0,
+				     &filter_options->tree_depth_limit_value)) {
 			if (errbuf) {
 				strbuf_addstr(
 					errbuf,
-					_("only 'tree:0' is supported"));
+					_("expected 'tree:<int>'"));
 			}
 			return 1;
 		}
diff --git a/list-objects-filter-options.h b/list-objects-filter-options.h
index af64e5c66..c1ae70cd8 100644
--- a/list-objects-filter-options.h
+++ b/list-objects-filter-options.h
@@ -44,6 +44,7 @@ struct list_objects_filter_options {
 	struct object_id *sparse_oid_value;
 	char *sparse_path_value;
 	unsigned long blob_limit_value;
+	unsigned long tree_depth_limit_value;
 };
 
 /* Normalized command line arguments */
diff --git a/list-objects-filter.c b/list-objects-filter.c
index 37fba456d..e69fb9b82 100644
--- a/list-objects-filter.c
+++ b/list-objects-filter.c
@@ -83,53 +83,80 @@ static void *filter_blobs_none__init(
  * A filter for list-objects to omit ALL trees and blobs from the traversal.
  * Can OPTIONALLY collect a list of the omitted OIDs.
  */
-struct filter_trees_none_data {
+struct filter_trees_depth_data {
 	struct oidset *omits;
+	unsigned long max_depth;
+	unsigned long current_depth;
 };
 
-static enum list_objects_filter_result filter_trees_none(
+static enum list_objects_filter_result filter_trees_depth(
 	enum list_objects_filter_situation filter_situation,
 	struct object *obj,
 	const char *pathname,
 	const char *filename,
 	void *filter_data_)
 {
-	struct filter_trees_none_data *filter_data = filter_data_;
+	struct filter_trees_depth_data *filter_data = filter_data_;
+
+	int too_deep = filter_data->current_depth >= filter_data->max_depth;
+
+	/*
+	 * Note that we do not use _MARK_SEEN in order to allow re-traversal in
+	 * case we encounter a tree or blob again at a shallower depth.
+	 */
 
 	switch (filter_situation) {
 	default:
 		BUG("unknown filter_situation: %d", filter_situation);
 
-	case LOFS_BEGIN_TREE:
 	case LOFS_BLOB:
+		if (!too_deep) goto include_it;
+
+		if (filter_data->omits)
+			oidset_insert(filter_data->omits, &obj->oid);
+
+		return LOFR_ZERO;
+
+	case LOFS_BEGIN_TREE:
+		filter_data->current_depth++;
+
+		if (!too_deep) goto include_it;
+
 		if (filter_data->omits) {
 			oidset_insert(filter_data->omits, &obj->oid);
-			/* _MARK_SEEN but not _DO_SHOW (hard omit) */
-			return LOFR_MARK_SEEN;
+			return LOFR_ZERO;
 		}
 		else
 			/*
 			 * Not collecting omits so no need to to traverse tree.
 			 */
-			return LOFR_SKIP_TREE | LOFR_MARK_SEEN;
+			return LOFR_SKIP_TREE;
 
 	case LOFS_END_TREE:
 		assert(obj->type == OBJ_TREE);
+		filter_data->current_depth--;
 		return LOFR_ZERO;
 
 	}
+
+include_it:
+	if (filter_data->omits)
+		oidset_remove(filter_data->omits, &obj->oid);
+	return LOFR_MARK_SEEN | LOFR_DO_SHOW;
 }
 
-static void* filter_trees_none__init(
+static void* filter_trees_depth__init(
 	struct oidset *omitted,
 	struct list_objects_filter_options *filter_options,
 	filter_object_fn *filter_fn,
 	filter_free_fn *filter_free_fn)
 {
-	struct filter_trees_none_data *d = xcalloc(1, sizeof(*d));
+	struct filter_trees_depth_data *d = xcalloc(1, sizeof(*d));
 	d->omits = omitted;
+	d->max_depth = filter_options->tree_depth_limit_value;
+	d->current_depth = 0;
 
-	*filter_fn = filter_trees_none;
+	*filter_fn = filter_trees_depth;
 	*filter_free_fn = free;
 	return d;
 }
@@ -426,7 +453,7 @@ static filter_init_fn s_filters[] = {
 	NULL,
 	filter_blobs_none__init,
 	filter_blobs_limit__init,
-	filter_trees_none__init,
+	filter_trees_depth__init,
 	filter_sparse_oid__init,
 	filter_sparse_path__init,
 };
diff --git a/t/t6112-rev-list-filters-objects.sh b/t/t6112-rev-list-filters-objects.sh
index efb1bee2e..43ee0df80 100755
--- a/t/t6112-rev-list-filters-objects.sh
+++ b/t/t6112-rev-list-filters-objects.sh
@@ -254,6 +254,90 @@ test_expect_success 'filter a GIANT tree through tree:0' '
 	test_line_count = 2 actual
 '
 
+# Test tree:# filters.
+
+expect_has () {
+	commit=$1 &&
+	name=$2 &&
+
+	hash=$(git -C r3 rev-parse $commit:$name) &&
+	grep "^$hash $name$" actual
+}
+
+test_expect_success 'verify tree:1 includes root trees' '
+	git -C r3 rev-list --objects --filter=tree:1 HEAD >actual &&
+
+	# We should get two root directories and two commits.
+	expect_has HEAD "" &&
+	expect_has HEAD~1 ""  &&
+	test_line_count = 4 actual
+'
+
+test_expect_success 'verify tree:2 includes root trees and immediate children' '
+	git -C r3 rev-list --objects --filter=tree:2 HEAD >actual &&
+
+	expect_has HEAD "" &&
+	expect_has HEAD~1 "" &&
+	expect_has HEAD dir1 &&
+	expect_has HEAD pattern &&
+	expect_has HEAD sparse1 &&
+	expect_has HEAD sparse2 &&
+
+	# There are also 2 commit objects
+	test_line_count = 8 actual
+'
+
+test_expect_success 'verify tree:3 includes everything expected' '
+	git -C r3 rev-list --objects --filter=tree:3 HEAD >actual &&
+
+	expect_has HEAD "" &&
+	expect_has HEAD~1 "" &&
+	expect_has HEAD dir1 &&
+	expect_has HEAD dir1/sparse1 &&
+	expect_has HEAD dir1/sparse2 &&
+	expect_has HEAD pattern &&
+	expect_has HEAD sparse1 &&
+	expect_has HEAD sparse2 &&
+
+	# There are also 2 commit objects
+	test_line_count = 10 actual
+'
+
+# Test provisional omit collection logic with a repo that has objects appearing
+# at multiple depths - first deeper than the filter's threshold, then shallow.
+
+test_expect_success 'setup r4' '
+	git init r4 &&
+
+	echo foo > r4/foo &&
+	mkdir r4/subdir &&
+	echo bar > r4/subdir/bar &&
+
+	mkdir r4/filt &&
+	cp -r r4/foo r4/subdir r4/filt &&
+
+	git -C r4 add foo subdir filt &&
+	git -C r4 commit -m "commit msg"
+'
+
+expect_has_with_different_name () {
+	commit=$1 &&
+	name=$2 &&
+
+	hash=$(git -C r4 rev-parse $commit:$name) &&
+	! grep "^$hash $name$" actual &&
+	grep "^$hash " actual &&
+	! grep "~$hash" actual
+}
+
+test_expect_success 'test tree:# filter provisional omit for blob and tree' '
+	git -C r4 rev-list --objects --filter-print-omitted --filter=tree:2 \
+		HEAD >actual &&
+
+	expect_has_with_different_name HEAD filt/foo &&
+	expect_has_with_different_name HEAD filt/subdir
+'
+
 # Delete some loose objects and use rev-list, but WITHOUT any filtering.
 # This models previously omitted objects that we did not receive.
 
-- 
2.19.1.331.ge82ca0e54c-goog


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 1/3] list-objects: support for skipping tree traversal
  2018-10-11 23:09 ` [RFC PATCH 1/3] list-objects: support for skipping tree traversal Matthew DeVore
@ 2018-10-14 23:15   ` Junio C Hamano
  2018-10-18  0:25     ` Matthew DeVore
  0 siblings, 1 reply; 14+ messages in thread
From: Junio C Hamano @ 2018-10-14 23:15 UTC (permalink / raw)
  To: Matthew DeVore
  Cc: git, sbeller, git, jeffhost, peff, stefanbeller, jonathantanmy,
	pclouds

Matthew DeVore <matvore@google.com> writes:

> The tree:0 filter does not need to traverse the trees that it has
> filtered out, so optimize list-objects and list-objects-filter to skip
> traversing the trees entirely. Before this patch, we iterated over all
> children of the tree, and did nothing for all of them, which was
> wasteful.
>
> Signed-off-by: Matthew DeVore <matvore@google.com>
> ---
>  list-objects-filter.c               | 11 +++++++++--
>  list-objects-filter.h               |  6 ++++++
>  list-objects.c                      |  5 ++++-
>  t/t6112-rev-list-filters-objects.sh | 10 ++++++++++
>  4 files changed, 29 insertions(+), 3 deletions(-)

This step looks more like "ow, we could have done the tree:0 support
that is in 'next' better" than a part of "here is a series to do
tree:N for non zero value of N".

If that is the case, I'd prefer to see this step polished enough
before [2-3/3] of this RFC is worked on, so that we can merge the
tree:0 (but not yet tree:N) support that is solid down to 'master'
soonish.

> diff --git a/list-objects-filter.c b/list-objects-filter.c
> index 09b2b05d5..37fba456d 100644
> --- a/list-objects-filter.c
> +++ b/list-objects-filter.c
> @@ -102,9 +102,16 @@ static enum list_objects_filter_result filter_trees_none(
>  
>  	case LOFS_BEGIN_TREE:
>  	case LOFS_BLOB:
> -		if (filter_data->omits)
> +		if (filter_data->omits) {
>  			oidset_insert(filter_data->omits, &obj->oid);
> -		return LOFR_MARK_SEEN; /* but not LOFR_DO_SHOW (hard omit) */
> +			/* _MARK_SEEN but not _DO_SHOW (hard omit) */
> +			return LOFR_MARK_SEEN;
> +		}
> +		else
> +			/*
> +			 * Not collecting omits so no need to to traverse tree.
> +			 */
> +			return LOFR_SKIP_TREE | LOFR_MARK_SEEN;

OK, so "not collecting omits" is synonymous to "N==0, we aren't
doing tree at any level", and at this point in the series before the
support for N>0 is introduced, we'd always take this "else" clause
because tree:0 is the only thing we support?

Style: our modern style is to use {} around the body which is a
single statement on the else clause when the body of the
corresponding if clause needs {} around (and vice versa), so

	if (filter_data->omits) {
		oidset_isnert(...);
		return ...;
	} else {
		return ...;
	}

> diff --git a/list-objects-filter.h b/list-objects-filter.h
> index a6f6b4990..52b4a84da 100644
> --- a/list-objects-filter.h
> +++ b/list-objects-filter.h
> @@ -24,6 +24,11 @@ struct oidset;
>   *              In general, objects should only be shown once, but
>   *              this result DOES NOT imply that we mark it SEEN.
>   *
> + * _SKIP_TREE : Used in LOFS_BEGIN_TREE situation - indicates that
> + *              the tree's children should not be iterated over. This
> + *              is used as an optimization when all children will
> + *              definitely be ignored.
> + *
>   * Most of the time, you want the combination (_MARK_SEEN | _DO_SHOW)
>   * but they can be used independently, such as when sparse-checkout
>   * pattern matching is being applied.
> @@ -45,6 +50,7 @@ enum list_objects_filter_result {
>  	LOFR_ZERO      = 0,
>  	LOFR_MARK_SEEN = 1<<0,
>  	LOFR_DO_SHOW   = 1<<1,
> +	LOFR_SKIP_TREE = 1<<2,
>  };
>  
>  enum list_objects_filter_situation {
> diff --git a/list-objects.c b/list-objects.c
> index 7a1a0929d..d1e3d217c 100644
> --- a/list-objects.c
> +++ b/list-objects.c
> @@ -11,6 +11,7 @@
>  #include "list-objects-filter-options.h"
>  #include "packfile.h"
>  #include "object-store.h"
> +#include "trace.h"
>  
>  struct traversal_context {
>  	struct rev_info *revs;
> @@ -184,7 +185,9 @@ static void process_tree(struct traversal_context *ctx,
>  	if (base->len)
>  		strbuf_addch(base, '/');
>  
> -	if (!failed_parse)
> +	if (r & LOFR_SKIP_TREE)
> +		trace_printf("Skipping contents of tree %s...\n", base->buf);
> +	else if (!failed_parse)
>  		process_tree_contents(ctx, tree, base);

Even when failed_parse==true, i.e. we found that the tree object's
data cannot be understood, if we have skip-tree bit set, that means
we do not care---we won't be descending into its children anyway.

Makes sense.

> diff --git a/t/t6112-rev-list-filters-objects.sh b/t/t6112-rev-list-filters-objects.sh
> index 08e0c7db6..efb1bee2e 100755
> --- a/t/t6112-rev-list-filters-objects.sh
> +++ b/t/t6112-rev-list-filters-objects.sh
> @@ -244,6 +244,16 @@ test_expect_success 'verify tree:0 includes trees in "filtered" output' '
>  	test_cmp expected filtered_types
>  '
>  
> +# Make sure tree:0 does not iterate through any trees.
> +
> +test_expect_success 'filter a GIANT tree through tree:0' '
> +	GIT_TRACE=1 git -C r3 rev-list \
> +		--objects --filter=tree:0 HEAD 2>filter_trace &&
> +	grep "Skipping contents of tree [.][.][.]" filter_trace >actual &&

Here you are not jus tmaking sure SKIP_TREE bit is set for some
tree, but it is set when base->buf is an empty string (i.e. the top
level tree)?  Which makes sense, and the next text makes sure that
between the two commits, the number of total "top level" trees is 2,
but I wonder if it is more direct to also make sure that the code is
not even seeing or skipping any tree inside these top level trees
(i.e. the same message but for ""!=base->buf should never appear in
the trace).

> +	# One line for each commit traversed.
> +	test_line_count = 2 actual
> +'
> +
>  # Delete some loose objects and use rev-list, but WITHOUT any filtering.
>  # This models previously omitted objects that we did not receive.

Thanks.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 2/3] Documentation/git-rev-list: s/<commit>/<object>/
  2018-10-11 23:09 ` [RFC PATCH 2/3] Documentation/git-rev-list: s/<commit>/<object>/ Matthew DeVore
@ 2018-10-14 23:25   ` Junio C Hamano
  2018-10-20  0:03     ` Matthew DeVore
  0 siblings, 1 reply; 14+ messages in thread
From: Junio C Hamano @ 2018-10-14 23:25 UTC (permalink / raw)
  To: Matthew DeVore
  Cc: git, sbeller, git, jeffhost, peff, stefanbeller, jonathantanmy,
	pclouds

Matthew DeVore <matvore@google.com> writes:

> -List commits that are reachable by following the `parent` links from the
> -given commit(s), but exclude commits that are reachable from the one(s)
> -given with a '{caret}' in front of them.  The output is given in reverse
> -chronological order by default.
> +List objects that are reachable by following references from the given
> +object(s), but exclude objects that are reachable from the one(s) given
> +with a '{caret}' in front of them.
>  
> +By default, only commit objects are shown, and the commits are shown in
> +reverse chronological order. The '--object' flag causes non-commit objects
> +to also be shown.
>
> +You can think of this as a set operation.  Objects given on the command
> +line form a set of objects that are reachable from any of them, and then
> +objects reachable from any of the ones given with '{caret}' in front are
> +subtracted from that set.  The remaining objects are what come out in the
>  command's output.  Various other options and paths parameters can be used
>  to further limit the result.

I am not sure if this is a good rewrite.  It gives a false
impression as if you'd not see anything if I did this:

	git rev-list --objects ^master md/filter-trees:t/valgrind

especially if you change the SYNOPSIS and make it claim that the
command takes <object> as starting point.  Updating <commit> to
<object> for those that are _shown_ is OK, but the s/commit/object/
for those that are given as the starting points is not a good change
at all, if I understand what the code is designed to do correctly.

It is more like "this is a set operation across commits.  We also
show objects that are reachable from the commits in the resulting
set and are not reachable from the commits in the set that were
excluded when --objects option is given".


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 3/3] list-objects-filter: teach tree:# how to handle >0
  2018-10-11 23:09 ` [RFC PATCH 3/3] list-objects-filter: teach tree:# how to handle >0 Matthew DeVore
@ 2018-10-15  2:31   ` Junio C Hamano
  2018-11-08  0:47     ` Matthew DeVore
  2018-12-10 23:15     ` Matthew DeVore
  0 siblings, 2 replies; 14+ messages in thread
From: Junio C Hamano @ 2018-10-15  2:31 UTC (permalink / raw)
  To: Matthew DeVore
  Cc: git, sbeller, git, jeffhost, peff, stefanbeller, jonathantanmy,
	pclouds

Matthew DeVore <matvore@google.com> writes:

> The long-term goal at the end of this is to allow a partial clone to
> eagerly fetch an entire directory of files by fetching a tree and
> specifying <depth>=1. This, for instance, would make a build operation
> fast and convenient

This would reduce round-trip, as you do not have to fetch the tree
to see what its contents are before issuing another set of fetches
for them.  Those who are building virtual filesystem that let you
mount a specific tree object, perhaps via fuse, may find it useful,
too, even though I suspect that may not be your primary focus.

> diff --git a/Documentation/rev-list-options.txt b/Documentation/rev-list-options.txt
> index c2c1c40e6..c78985c41 100644
> --- a/Documentation/rev-list-options.txt
> +++ b/Documentation/rev-list-options.txt
> @@ -734,8 +734,12 @@ specification contained in <path>.
>  +
>  The form '--filter=tree:<depth>' omits all blobs and trees whose depth
>  from the root tree is >= <depth> (minimum depth if an object is located
> -at multiple depths in the commits traversed). Currently, only <depth>=0
> -is supported, which omits all blobs and trees.
> +at multiple depths in the commits traversed). <depth>=0 will not include
> +any trees or blobs unless included explicitly in <object>. <depth>=1

Here, <object> refers to the objects directly requested on the
command line (or --stdin)?  Triggering this question from me is a
sign that this description may want to say a bit more to avoid the
same question from the real readers.  Perhaps replace "included
explicitly in <object>" with "explicitly requested by listing them
on the command line or feeding them with `--stdin`", or something
like that?

> diff --git a/list-objects-filter-options.c b/list-objects-filter-options.c
> index e8da2e858..9dc61d6e6 100644
> --- a/list-objects-filter-options.c
> +++ b/list-objects-filter-options.c
> @@ -50,12 +50,12 @@ static int gently_parse_list_objects_filter(
>  		}
>  
>  	} else if (skip_prefix(arg, "tree:", &v0)) {
> -		unsigned long depth;
> -		if (!git_parse_ulong(v0, &depth) || depth != 0) {
> +		if (!git_parse_ulong(v0,
> +				     &filter_options->tree_depth_limit_value)) {
>  			if (errbuf) {
>  				strbuf_addstr(
>  					errbuf,
> -					_("only 'tree:0' is supported"));
> +					_("expected 'tree:<int>'"));

We do not accept "tree:-1", even though "-1" is an int.  Is it too
obvious to worry about?  I do not think we want to say tree:<uint>
even if we do want to make it clear we reject "tree:-1"

I am wondering if "expected 'tree:<depth>'" would work better.

> diff --git a/list-objects-filter-options.h b/list-objects-filter-options.h
> index af64e5c66..c1ae70cd8 100644
> --- a/list-objects-filter-options.h
> +++ b/list-objects-filter-options.h
> @@ -44,6 +44,7 @@ struct list_objects_filter_options {
>  	struct object_id *sparse_oid_value;
>  	char *sparse_path_value;
>  	unsigned long blob_limit_value;
> +	unsigned long tree_depth_limit_value;
>  };

This change gets it right by adding "depth" in the field name and it
is not a comment on this patch, but someday not in too distant
future we should rename the "blob_limit_value" to make it clear that
it is filtering by number of bytes, as other filtering criteria on
blobs that can be expressed in ulong are quite possible.

> -static enum list_objects_filter_result filter_trees_none(
> +static enum list_objects_filter_result filter_trees_depth(
>  	enum list_objects_filter_situation filter_situation,
>  	struct object *obj,
>  	const char *pathname,
>  	const char *filename,
>  	void *filter_data_)
>  {
> -	struct filter_trees_none_data *filter_data = filter_data_;
> +	struct filter_trees_depth_data *filter_data = filter_data_;
> +
> +	int too_deep = filter_data->current_depth >= filter_data->max_depth;

Does max mean "maximum allowed", or "this and anything larger are
rejected".  The latter sound wrong, but I offhand do not know if
your current_depth counts from 0 or 1, so there may not be
off-by-one.

 - dir.c::within_depth() that is used by pathspec matching that in turn
   is used by "grep --max-depth=1" does "if (depth > max_depth)", which
   sounds more in line with the usual convention, I would think.

 - pack-objects has max_delta_cache_size, which also is used as
   "maximum allowed", not "this is already too big".  So is its
   max_depth.

There may be other examples.  One existing violator I noticed was
the "reject blobs that is this size or larger" in this file; it is
called "max_bytes", but it is apparently not "maximum allowed",
which we probably would want to fix.

> +	/*
> +	 * Note that we do not use _MARK_SEEN in order to allow re-traversal in
> +	 * case we encounter a tree or blob again at a shallower depth.
> +	 */

Hmph.  Earlier tree:0 support never even read the actual tree, so
this was not a problem.  We wouldn't have found a tree in deeper
places first and then at a shallower depth, as we wouldn't have seen
any tree in any depth deeper than the surface anyway.

Now we need to worry about a tree that originally gets seen in a
deeper depth (that is still below the allowed maximum) to reappear
at a shallower place, so a subtree within it that used to be too
deep may now be within the allowed maximum depth.

Step 1 of these three patches made sure trees are not even opened
under "tree:0", so it was not just optimizing/shrinking the output
of rev-list but also optimizing the traversal.  When we are
collecting omits, however, this one now returns _ZERO which means we
still traverse into the tree, even under "tree:0"?  I must be
reading the code incorrectly (in general, when we are seeing a tree
object that itself is at the maximum allowed depth, trees found by
reading its contents will never become eligible for output, even if
they are found at a shallower depth than their other copies were
previously found, I would think).

>  	switch (filter_situation) {
>  	default:
>  		BUG("unknown filter_situation: %d", filter_situation);
>  
> -	case LOFS_BEGIN_TREE:
>  	case LOFS_BLOB:
> +		if (!too_deep) goto include_it;

Style: on two lines, like you did below for the next if() statement.

> +
> +		if (filter_data->omits)
> +			oidset_insert(filter_data->omits, &obj->oid);
> +
> +		return LOFR_ZERO;
> +
> +	case LOFS_BEGIN_TREE:
> +		filter_data->current_depth++;
> +
> +		if (!too_deep) goto include_it;
> +
>  		if (filter_data->omits) {
>  			oidset_insert(filter_data->omits, &obj->oid);
> -			/* _MARK_SEEN but not _DO_SHOW (hard omit) */
> -			return LOFR_MARK_SEEN;
> +			return LOFR_ZERO;
>  		}
>  		else
>  			/*
>  			 * Not collecting omits so no need to to traverse tree.
>  			 */
> -			return LOFR_SKIP_TREE | LOFR_MARK_SEEN;
> +			return LOFR_SKIP_TREE;
>  
>  	case LOFS_END_TREE:
>  		assert(obj->type == OBJ_TREE);
> +		filter_data->current_depth--;
>  		return LOFR_ZERO;
>  
>  	}
> +
> +include_it:
> +	if (filter_data->omits)
> +		oidset_remove(filter_data->omits, &obj->oid);
> +	return LOFR_MARK_SEEN | LOFR_DO_SHOW;
>  }

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 1/3] list-objects: support for skipping tree traversal
  2018-10-14 23:15   ` Junio C Hamano
@ 2018-10-18  0:25     ` Matthew DeVore
  0 siblings, 0 replies; 14+ messages in thread
From: Matthew DeVore @ 2018-10-18  0:25 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Stefan Beller, git, jeffhost, Jeff King, Stefan Beller,
	Jonathan Tan, pclouds

On Sun, Oct 14, 2018 at 4:15 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> This step looks more like "ow, we could have done the tree:0 support
> that is in 'next' better" than a part of "here is a series to do
> tree:N for non zero value of N".
>
> If that is the case, I'd prefer to see this step polished enough
> before [2-3/3] of this RFC is worked on, so that we can merge the
> tree:0 (but not yet tree:N) support that is solid down to 'master'
> soonish.
That's fair. So I will prioritize this patch above the rest of the
patchset and send it separately in the future.

> OK, so "not collecting omits" is synonymous to "N==0, we aren't
> doing tree at any level", and at this point in the series before the
> support for N>0 is introduced, we'd always take this "else" clause
> because tree:0 is the only thing we support?
Actually "not collecting omits" and depth==0 are orthogonal concepts.
"Collect omits" is the logic needed to implement the
--filter-print-omitted flag on git rev-list. You can do this when
depth==0 (it will actually print all the trees and blobs recursively).
"Collect omits" is tested *with* tree:0 in test t6112 in the test
labeled 'verify tree:0 includes trees in "filtered" output'

IOW, both branches here are important even in this patch. If we are
collecting omits, then we can't skip the tree because omits collecting
is recursive. Otherwise, we *can* skip the tree.

But maybe printing omits should not be recursive? The decision was
never discussed. The code to not be recursive is simpler, because we
don't need this if/else. Recursiveness is counter-intuitive since we
would "skip" a tree and at the same time print its contents.

>
> Style: our modern style is to use {} around the body which is a
> single statement on the else clause when the body of the
> corresponding if clause needs {} around (and vice versa), so
>
Fixed, and I didn't realize I was supposed to be hugging "else" with
the curly braces. What you say is what CodingGuidelines says to do.
Thanks for pointing that out.

>
> Even when failed_parse==true, i.e. we found that the tree object's
> data cannot be understood, if we have skip-tree bit set, that means
> we do not care---we won't be descending into its children anyway.
>
Yes.

> > +# Make sure tree:0 does not iterate through any trees.
> > +
> > +test_expect_success 'filter a GIANT tree through tree:0' '
> > +     GIT_TRACE=1 git -C r3 rev-list \
> > +             --objects --filter=tree:0 HEAD 2>filter_trace &&
> > +     grep "Skipping contents of tree [.][.][.]" filter_trace >actual &&
>
> Here you are not jus tmaking sure SKIP_TREE bit is set for some
> tree, but it is set when base->buf is an empty string (i.e. the top
> level tree)?  Which makes sense, and the next text makes sure that
> between the two commits, the number of total "top level" trees is 2,
> but I wonder if it is more direct to also make sure that the code is
> not even seeing or skipping any tree inside these top level trees
> (i.e. the same message but for ""!=base->buf should never appear in
> the trace).
Makes sense. I added another check in this test for other "Skipping" messages.

Thank you for reviewing.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 2/3] Documentation/git-rev-list: s/<commit>/<object>/
  2018-10-14 23:25   ` Junio C Hamano
@ 2018-10-20  0:03     ` Matthew DeVore
  2018-10-22  2:21       ` Junio C Hamano
  0 siblings, 1 reply; 14+ messages in thread
From: Matthew DeVore @ 2018-10-20  0:03 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Stefan Beller, git, jeffhost, Jeff King, Stefan Beller,
	Jonathan Tan, pclouds

On Sun, Oct 14, 2018 at 4:25 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Matthew DeVore <matvore@google.com> writes:
>
> > -List commits that are reachable by following the `parent` links from the
> > -given commit(s), but exclude commits that are reachable from the one(s)
> > -given with a '{caret}' in front of them.  The output is given in reverse
> > -chronological order by default.
> > +List objects that are reachable by following references from the given
> > +object(s), but exclude objects that are reachable from the one(s) given
> > +with a '{caret}' in front of them.
> >
> > +By default, only commit objects are shown, and the commits are shown in
> > +reverse chronological order. The '--object' flag causes non-commit objects
> > +to also be shown.
> >
> > +You can think of this as a set operation.  Objects given on the command
> > +line form a set of objects that are reachable from any of them, and then
> > +objects reachable from any of the ones given with '{caret}' in front are
> > +subtracted from that set.  The remaining objects are what come out in the
> >  command's output.  Various other options and paths parameters can be used
> >  to further limit the result.
>
> I am not sure if this is a good rewrite.  It gives a false
> impression as if you'd not see anything if I did this:
>
>         git rev-list --objects ^master md/filter-trees:t/valgrind
>
Oh that's interesting. So my mental model conflicts with the command's
behavior. It actually is surprising behavior because:

# this shows all files that were modified in the HEAD commit
git rev-list --objects ^HEAD~1^{tree} HEAD:

# but this shows *all* files at HEAD
git rev-list --objects ^HEAD~1 HEAD:

Which means that ^commit and ^non-commit are treated inherently
differently. Maybe I should fix this bug before clarifying this
documentation...

> It is more like "this is a set operation across commits.  We also
> show objects that are reachable from the commits in the resulting
> set and are not reachable from the commits in the set that were
> excluded when --objects option is given".
>
That would be correct though it wouldn't tell that you can use
"--objects ^foo-tree bar-tree." Without fixing the above bug, I could
add to your wording something to the effect of "You can also use trees
to include and exclude sets of objects rather than commits." Which
implies that mixing "--objects ^commit tree" on the command line is
undefined.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 2/3] Documentation/git-rev-list: s/<commit>/<object>/
  2018-10-20  0:03     ` Matthew DeVore
@ 2018-10-22  2:21       ` Junio C Hamano
  2018-12-07 22:55         ` Matthew DeVore
  0 siblings, 1 reply; 14+ messages in thread
From: Junio C Hamano @ 2018-10-22  2:21 UTC (permalink / raw)
  To: Matthew DeVore
  Cc: git, Stefan Beller, git, jeffhost, Jeff King, Stefan Beller,
	Jonathan Tan, pclouds

Matthew DeVore <matvore@google.com> writes:

>> It is more like "this is a set operation across commits.  We also
>> show objects that are reachable from the commits in the resulting
>> set and are not reachable from the commits in the set that were
>> excluded when --objects option is given".
>>
> That would be correct though it wouldn't tell that you can use
> "--objects ^foo-tree bar-tree."

Yeah, but quite honestly, I consider that it is working by accident,
not by design, in the current code (iow, you are looking at a
behaviour of whatever the code happens to do).  "rev-list" is pretty
much set operation across commits, and anything that deals with a
non commit-ish given from the command line is an afterthought at
best, and happenstance in reality.

I do not mean to say that the code must stay that way, though.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 3/3] list-objects-filter: teach tree:# how to handle >0
  2018-10-15  2:31   ` Junio C Hamano
@ 2018-11-08  0:47     ` Matthew DeVore
  2018-12-10 23:15     ` Matthew DeVore
  1 sibling, 0 replies; 14+ messages in thread
From: Matthew DeVore @ 2018-11-08  0:47 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Stefan Beller, git, jeffhost, Jeff King, Stefan Beller,
	Jonathan Tan, pclouds

A quick update: I've read Junio's comments on this patchset and
basically agree with them, but I haven't had a chance to apply them
yet. I plan to pick this patchset up again (as well as the patch
"md/list-lazy-objects-fix") once things settle down in my day job.
On Sun, Oct 14, 2018 at 7:31 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Matthew DeVore <matvore@google.com> writes:
>
> > The long-term goal at the end of this is to allow a partial clone to
> > eagerly fetch an entire directory of files by fetching a tree and
> > specifying <depth>=1. This, for instance, would make a build operation
> > fast and convenient
>
> This would reduce round-trip, as you do not have to fetch the tree
> to see what its contents are before issuing another set of fetches
> for them.  Those who are building virtual filesystem that let you
> mount a specific tree object, perhaps via fuse, may find it useful,
> too, even though I suspect that may not be your primary focus.
>
> > diff --git a/Documentation/rev-list-options.txt b/Documentation/rev-list-options.txt
> > index c2c1c40e6..c78985c41 100644
> > --- a/Documentation/rev-list-options.txt
> > +++ b/Documentation/rev-list-options.txt
> > @@ -734,8 +734,12 @@ specification contained in <path>.
> >  +
> >  The form '--filter=tree:<depth>' omits all blobs and trees whose depth
> >  from the root tree is >= <depth> (minimum depth if an object is located
> > -at multiple depths in the commits traversed). Currently, only <depth>=0
> > -is supported, which omits all blobs and trees.
> > +at multiple depths in the commits traversed). <depth>=0 will not include
> > +any trees or blobs unless included explicitly in <object>. <depth>=1
>
> Here, <object> refers to the objects directly requested on the
> command line (or --stdin)?  Triggering this question from me is a
> sign that this description may want to say a bit more to avoid the
> same question from the real readers.  Perhaps replace "included
> explicitly in <object>" with "explicitly requested by listing them
> on the command line or feeding them with `--stdin`", or something
> like that?
>
> > diff --git a/list-objects-filter-options.c b/list-objects-filter-options.c
> > index e8da2e858..9dc61d6e6 100644
> > --- a/list-objects-filter-options.c
> > +++ b/list-objects-filter-options.c
> > @@ -50,12 +50,12 @@ static int gently_parse_list_objects_filter(
> >               }
> >
> >       } else if (skip_prefix(arg, "tree:", &v0)) {
> > -             unsigned long depth;
> > -             if (!git_parse_ulong(v0, &depth) || depth != 0) {
> > +             if (!git_parse_ulong(v0,
> > +                                  &filter_options->tree_depth_limit_value)) {
> >                       if (errbuf) {
> >                               strbuf_addstr(
> >                                       errbuf,
> > -                                     _("only 'tree:0' is supported"));
> > +                                     _("expected 'tree:<int>'"));
>
> We do not accept "tree:-1", even though "-1" is an int.  Is it too
> obvious to worry about?  I do not think we want to say tree:<uint>
> even if we do want to make it clear we reject "tree:-1"
>
> I am wondering if "expected 'tree:<depth>'" would work better.
>
> > diff --git a/list-objects-filter-options.h b/list-objects-filter-options.h
> > index af64e5c66..c1ae70cd8 100644
> > --- a/list-objects-filter-options.h
> > +++ b/list-objects-filter-options.h
> > @@ -44,6 +44,7 @@ struct list_objects_filter_options {
> >       struct object_id *sparse_oid_value;
> >       char *sparse_path_value;
> >       unsigned long blob_limit_value;
> > +     unsigned long tree_depth_limit_value;
> >  };
>
> This change gets it right by adding "depth" in the field name and it
> is not a comment on this patch, but someday not in too distant
> future we should rename the "blob_limit_value" to make it clear that
> it is filtering by number of bytes, as other filtering criteria on
> blobs that can be expressed in ulong are quite possible.
>
> > -static enum list_objects_filter_result filter_trees_none(
> > +static enum list_objects_filter_result filter_trees_depth(
> >       enum list_objects_filter_situation filter_situation,
> >       struct object *obj,
> >       const char *pathname,
> >       const char *filename,
> >       void *filter_data_)
> >  {
> > -     struct filter_trees_none_data *filter_data = filter_data_;
> > +     struct filter_trees_depth_data *filter_data = filter_data_;
> > +
> > +     int too_deep = filter_data->current_depth >= filter_data->max_depth;
>
> Does max mean "maximum allowed", or "this and anything larger are
> rejected".  The latter sound wrong, but I offhand do not know if
> your current_depth counts from 0 or 1, so there may not be
> off-by-one.
>
>  - dir.c::within_depth() that is used by pathspec matching that in turn
>    is used by "grep --max-depth=1" does "if (depth > max_depth)", which
>    sounds more in line with the usual convention, I would think.
>
>  - pack-objects has max_delta_cache_size, which also is used as
>    "maximum allowed", not "this is already too big".  So is its
>    max_depth.
>
> There may be other examples.  One existing violator I noticed was
> the "reject blobs that is this size or larger" in this file; it is
> called "max_bytes", but it is apparently not "maximum allowed",
> which we probably would want to fix.
>
> > +     /*
> > +      * Note that we do not use _MARK_SEEN in order to allow re-traversal in
> > +      * case we encounter a tree or blob again at a shallower depth.
> > +      */
>
> Hmph.  Earlier tree:0 support never even read the actual tree, so
> this was not a problem.  We wouldn't have found a tree in deeper
> places first and then at a shallower depth, as we wouldn't have seen
> any tree in any depth deeper than the surface anyway.
>
> Now we need to worry about a tree that originally gets seen in a
> deeper depth (that is still below the allowed maximum) to reappear
> at a shallower place, so a subtree within it that used to be too
> deep may now be within the allowed maximum depth.
>
> Step 1 of these three patches made sure trees are not even opened
> under "tree:0", so it was not just optimizing/shrinking the output
> of rev-list but also optimizing the traversal.  When we are
> collecting omits, however, this one now returns _ZERO which means we
> still traverse into the tree, even under "tree:0"?  I must be
> reading the code incorrectly (in general, when we are seeing a tree
> object that itself is at the maximum allowed depth, trees found by
> reading its contents will never become eligible for output, even if
> they are found at a shallower depth than their other copies were
> previously found, I would think).
>
> >       switch (filter_situation) {
> >       default:
> >               BUG("unknown filter_situation: %d", filter_situation);
> >
> > -     case LOFS_BEGIN_TREE:
> >       case LOFS_BLOB:
> > +             if (!too_deep) goto include_it;
>
> Style: on two lines, like you did below for the next if() statement.
>
> > +
> > +             if (filter_data->omits)
> > +                     oidset_insert(filter_data->omits, &obj->oid);
> > +
> > +             return LOFR_ZERO;
> > +
> > +     case LOFS_BEGIN_TREE:
> > +             filter_data->current_depth++;
> > +
> > +             if (!too_deep) goto include_it;
> > +
> >               if (filter_data->omits) {
> >                       oidset_insert(filter_data->omits, &obj->oid);
> > -                     /* _MARK_SEEN but not _DO_SHOW (hard omit) */
> > -                     return LOFR_MARK_SEEN;
> > +                     return LOFR_ZERO;
> >               }
> >               else
> >                       /*
> >                        * Not collecting omits so no need to to traverse tree.
> >                        */
> > -                     return LOFR_SKIP_TREE | LOFR_MARK_SEEN;
> > +                     return LOFR_SKIP_TREE;
> >
> >       case LOFS_END_TREE:
> >               assert(obj->type == OBJ_TREE);
> > +             filter_data->current_depth--;
> >               return LOFR_ZERO;
> >
> >       }
> > +
> > +include_it:
> > +     if (filter_data->omits)
> > +             oidset_remove(filter_data->omits, &obj->oid);
> > +     return LOFR_MARK_SEEN | LOFR_DO_SHOW;
> >  }

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 2/3] Documentation/git-rev-list: s/<commit>/<object>/
  2018-10-22  2:21       ` Junio C Hamano
@ 2018-12-07 22:55         ` Matthew DeVore
  2018-12-08  6:24           ` Junio C Hamano
  0 siblings, 1 reply; 14+ messages in thread
From: Matthew DeVore @ 2018-12-07 22:55 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Stefan Beller, git, jeffhost, Jeff King, Stefan Beller,
	Jonathan Tan, pclouds

On Sun, Oct 21, 2018 at 7:21 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Matthew DeVore <matvore@google.com> writes:
>
> >> It is more like "this is a set operation across commits.  We also
> >> show objects that are reachable from the commits in the resulting
> >> set and are not reachable from the commits in the set that were
> >> excluded when --objects option is given".
> >>
> > That would be correct though it wouldn't tell that you can use
> > "--objects ^foo-tree bar-tree."
>
> Yeah, but quite honestly, I consider that it is working by accident,
> not by design, in the current code (iow, you are looking at a
> behaviour of whatever the code happens to do).  "rev-list" is pretty
> much set operation across commits, and anything that deals with a
> non commit-ish given from the command line is an afterthought at
> best, and happenstance in reality.
>
> I do not mean to say that the code must stay that way, though.

I tried fixing the issue of "--objects ^commitobj treeobj" not
properly excluding objects reachable from commitobj, but this ended up
causing t5616-partial-clone.sh to fail. In the test labeled "manual
prefetch of missing objects", we create a clone of srv.bare without
blobs called "pc1", then push some new commits to srv.bare (via a
separate "local" repo), and try to fetch missing blobs with this
command:

$ git -C pc1 fetch-pack --stdin "file://$(pwd)/srv.bare" <observed.oids

Where observed.oids contains all the blobs that were missing. It tells
the remote that it already has the "refs/heads/master" commit, which
means it is excluded. Before, this worked fine, since it didn't mean
the blobs were excluded, only the commit itself. With the fix I
naively put together (and didn't share), because one of the blobs is
reachable from "refs/heads/master" it wouldn't pull it. I guess I can
change it so that items given directly to stdin or argv are always
fetched, though I'm not sure how easy that will be, and I'm not as
interested in fixing this as I once was. I just wanted the
documentation to outline the object-enumeration capabilities. So I'll
send a re-roll which makes much more modest changes to the
documentation.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 2/3] Documentation/git-rev-list: s/<commit>/<object>/
  2018-12-07 22:55         ` Matthew DeVore
@ 2018-12-08  6:24           ` Junio C Hamano
  0 siblings, 0 replies; 14+ messages in thread
From: Junio C Hamano @ 2018-12-08  6:24 UTC (permalink / raw)
  To: Matthew DeVore
  Cc: git, Stefan Beller, git, jeffhost, Jeff King, Stefan Beller,
	Jonathan Tan, pclouds

Matthew DeVore <matvore@google.com> writes:

> $ git -C pc1 fetch-pack --stdin "file://$(pwd)/srv.bare" <observed.oids
>
> Where observed.oids contains all the blobs that were missing. It tells
> the remote that it already has the "refs/heads/master" commit, which
> means it is excluded. Before, this worked fine, since it didn't mean
> the blobs were excluded, only the commit itself.

So 'master' has some blob in its tree, which is missing from the
repository, in this test?  Which means that such a repository is
"corrupt" and does not pass connectivity check by fsck.

I am of two minds.  If we claim that we have everything that is
needed to complete the commit sitting at the tip of 'master', I
think it is correct for the other side not to send a blob that is in
'master' (or its ancestors), so your "fix" may (at least
technically) be more correct than the status quo.  On the other
hand, if possession of commit 'master' does not defeat an explicit
request for a blob in it, that would actually be a good thing---it
would be a very straight-forward way to recover from such form of
repository corruption.

Fetching isolated objects without walking is also something that
would help backfill a lazily created clone, and I even vaguely
recall an effort to allow objects explicitly requested to be always
fetched regardless of the connectivity, if I am not mistaken (JTan?)

Anyway, thanks for a thoughtful response.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [RFC PATCH 3/3] list-objects-filter: teach tree:# how to handle >0
  2018-10-15  2:31   ` Junio C Hamano
  2018-11-08  0:47     ` Matthew DeVore
@ 2018-12-10 23:15     ` Matthew DeVore
  1 sibling, 0 replies; 14+ messages in thread
From: Matthew DeVore @ 2018-12-10 23:15 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: git, Stefan Beller, git, jeffhost, Jeff King, Stefan Beller,
	Jonathan Tan, pclouds

On Sun, Oct 14, 2018 at 7:31 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Matthew DeVore <matvore@google.com> writes:
>
> > The long-term goal at the end of this is to allow a partial clone to
> > eagerly fetch an entire directory of files by fetching a tree and
> > specifying <depth>=1. This, for instance, would make a build operation
> > fast and convenient
>
> This would reduce round-trip, as you do not have to fetch the tree
> to see what its contents are before issuing another set of fetches
> for them.  Those who are building virtual filesystem that let you
> mount a specific tree object, perhaps via fuse, may find it useful,
> too, even though I suspect that may not be your primary focus.

I added a paragraph mentioning the "roundtrip" aspect to the commit
message, since this *may* help someone find a use for it.

>
> > diff --git a/Documentation/rev-list-options.txt b/Documentation/rev-list-options.txt
> > index c2c1c40e6..c78985c41 100644
> > --- a/Documentation/rev-list-options.txt
> > +++ b/Documentation/rev-list-options.txt
> > @@ -734,8 +734,12 @@ specification contained in <path>.
> >  +
> >  The form '--filter=tree:<depth>' omits all blobs and trees whose depth
> >  from the root tree is >= <depth> (minimum depth if an object is located
> > -at multiple depths in the commits traversed). Currently, only <depth>=0
> > -is supported, which omits all blobs and trees.
> > +at multiple depths in the commits traversed). <depth>=0 will not include
> > +any trees or blobs unless included explicitly in <object>. <depth>=1
>
> Here, <object> refers to the objects directly requested on the
> command line (or --stdin)?  Triggering this question from me is a
> sign that this description may want to say a bit more to avoid the
> same question from the real readers.  Perhaps replace "included
> explicitly in <object>" with "explicitly requested by listing them
> on the command line or feeding them with `--stdin`", or something
> like that?

I have reworded this to be more explicit about "stdin," and also to
avoid directly admitting that trees are allowed in the <commit>
argument, since I've dropped the s/<commit>/<object>/ patch.

Here is the new paragraph:

The form '--filter=tree:<depth>' omits all blobs and trees whose depth
from the root tree is >= <depth> (minimum depth if an object is located
at multiple depths in the commits traversed). <depth>=0 will not include
any trees or blobs unless included explicitly in the command-line (or
standard input when --stdin is used). <depth>=1 will include only the
tree and blobs which are referenced directly by a commit reachable from
<commit> or an explicitly-given object. <depth>=2 is like <depth>=1
while also including trees and blobs one more level removed from an
explicitly-given commit or tree.

> > +                                     _("expected 'tree:<int>'"));
>
> We do not accept "tree:-1", even though "-1" is an int.  Is it too
> obvious to worry about?  I do not think we want to say tree:<uint>
> even if we do want to make it clear we reject "tree:-1"
>
> I am wondering if "expected 'tree:<depth>'" would work better.

Yes, I agree, <depth> is more helpful and not ambiguous in a
significant way. Fixed.

> > +     unsigned long tree_depth_limit_value;
> >  };
>
> This change gets it right by adding "depth" in the field name and it
> is not a comment on this patch, but someday not in too distant
> future we should rename the "blob_limit_value" to make it clear that
> it is filtering by number of bytes, as other filtering criteria on
> blobs that can be expressed in ulong are quite possible.

Agreed.

>
> > -static enum list_objects_filter_result filter_trees_none(
> > +static enum list_objects_filter_result filter_trees_depth(
> >       enum list_objects_filter_situation filter_situation,
> >       struct object *obj,
> >       const char *pathname,
> >       const char *filename,
> >       void *filter_data_)
> >  {
> > -     struct filter_trees_none_data *filter_data = filter_data_;
> > +     struct filter_trees_depth_data *filter_data = filter_data_;
> > +
> > +     int too_deep = filter_data->current_depth >= filter_data->max_depth;
>
> Does max mean "maximum allowed", or "this and anything larger are
> rejected".  The latter sound wrong, but I offhand do not know if
> your current_depth counts from 0 or 1, so there may not be
> off-by-one.
>
It means "this and anything larger are rejected." The documentation
words it similarly:
"
The form '--filter=tree:<depth>' omits all blobs and trees whose depth
from the root tree is >= <depth> ...
"

There is no intuitive phrase to mean "distance from root tree minus
one" (left operand) and I don't want to change the filter option field
to something different from what the user entered (right operand), so
I think we'd best use ">=" and I've renamed the field to
"exclusion_trigger_depth".

> There may be other examples.  One existing violator I noticed was
> the "reject blobs that is this size or larger" in this file; it is
> called "max_bytes", but it is apparently not "maximum allowed",
> which we probably would want to fix.
>

That's not ideal. The documentation suggests it means "maximum
allowed," and JGit apparently is interpreting the value as "maximum
allowed," so we should probably fix the semantics in Git. Or were you
suggesting to keep the behavior the same but fix the naming and docs?

> > +     /*
> > +      * Note that we do not use _MARK_SEEN in order to allow re-traversal in
> > +      * case we encounter a tree or blob again at a shallower depth.
> > +      */
>
> Hmph.  Earlier tree:0 support never even read the actual tree, so
> this was not a problem.  We wouldn't have found a tree in deeper
> places first and then at a shallower depth, as we wouldn't have seen
> any tree in any depth deeper than the surface anyway.
>
> Now we need to worry about a tree that originally gets seen in a
> deeper depth (that is still below the allowed maximum) to reappear
> at a shallower place, so a subtree within it that used to be too
> deep may now be within the allowed maximum depth.

That's true, and it points out the fact that we can't use
LOFR_MARK_SEEN even if we do LOFR_DO_SHOW, since subtrees may or may
not be within the limit in the future. I've fixed that for v+1 of the
patch, and added a corresponding test. I'm also refactoring the
filters_tree_depth function to not use goto. I think it is easier to
follow that way.

>
> Step 1 of these three patches made sure trees are not even opened
> under "tree:0", so it was not just optimizing/shrinking the output
> of rev-list but also optimizing the traversal.  When we are
> collecting omits, however, this one now returns _ZERO which means we
> still traverse into the tree, even under "tree:0"?  I must be
> reading the code incorrectly (in general, when we are seeing a tree
> object that itself is at the maximum allowed depth, trees found by
> reading its contents will never become eligible for output, even if
> they are found at a shallower depth than their other copies were
> previously found, I would think).

You are reading the code right - when the tree is too deep, we
sometimes return LOFR_ZERO, sometimes _SKIP_TREE - the latter if we
are collecting omits. But as you observe, we don't need to traverse an
omitted tree *if it has already been marked as omitted*, since that
suggests that its children have also been marked as omitted. So I've
added a new commit to this patchset which optimizes this case. The
next patchset will have 2 commits (subtracting the documentation
rewrite one and ones that are already scheduled to merge):

Patch 1/2 - implement tree:<depth> for positive depths
Patch 2/2 - stop iterating through trees for collecting omits if the
omits are already marked

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2018-12-10 23:15 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-11 23:08 [RFC PATCH 0/3] support for filtering trees and blobs based on depth Matthew DeVore
2018-10-11 23:09 ` [RFC PATCH 1/3] list-objects: support for skipping tree traversal Matthew DeVore
2018-10-14 23:15   ` Junio C Hamano
2018-10-18  0:25     ` Matthew DeVore
2018-10-11 23:09 ` [RFC PATCH 2/3] Documentation/git-rev-list: s/<commit>/<object>/ Matthew DeVore
2018-10-14 23:25   ` Junio C Hamano
2018-10-20  0:03     ` Matthew DeVore
2018-10-22  2:21       ` Junio C Hamano
2018-12-07 22:55         ` Matthew DeVore
2018-12-08  6:24           ` Junio C Hamano
2018-10-11 23:09 ` [RFC PATCH 3/3] list-objects-filter: teach tree:# how to handle >0 Matthew DeVore
2018-10-15  2:31   ` Junio C Hamano
2018-11-08  0:47     ` Matthew DeVore
2018-12-10 23:15     ` Matthew DeVore

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).