git@vger.kernel.org list mirror (unofficial, one of many)
 help / color / mirror / code / Atom feed
* [PATCH] ls-tree: add --sparse-filter-oid argument
@ 2023-01-11 17:01 William Sprent via GitGitGadget
  2023-01-13 14:17 ` Eric Sunshine
  2023-01-23 11:46 ` [PATCH v2] " William Sprent via GitGitGadget
  0 siblings, 2 replies; 19+ messages in thread
From: William Sprent via GitGitGadget @ 2023-01-11 17:01 UTC (permalink / raw)
  To: git; +Cc: William Sprent, William Sprent

From: William Sprent <williams@unity3d.com>

There is currently no way to ask git the question "which files would be
part of a sparse checkout of commit X with sparse checkout patterns Y".
One use-case would be that tooling may want know whether sparse checkouts
of two commits contain the same content even if the full trees differ.
Another interesting use-case would be for tooling to use in conjunction
with 'git update-index --index-info'.

'rev-list --objects --filter=sparse:oid' comes close, but as rev-list is
concerned with objects rather than directory trees, it leaves files out
when the same blob occurs in at two different paths.

It is possible to ask git about the sparse status of files currently in
the index with 'ls-files -t'. However, this does not work well when the
caller is interested in another commit, intererested in sparsity
patterns that aren't currently in '.git/info/sparse-checkout', or when
working in with bare repo.

To fill this gap, add a new argument to ls-tree '--sparse-filter-oid'
which takes the object id of a blob containing sparse checkout patterns
that filters the output of 'ls-tree'. When filtering with given sparsity
patterns, 'ls-tree' only outputs blobs and commit objects that
match the given patterns.

While it may be valid in some situations to output a tree object -- e.g.
when a cone pattern matches all blobs recursively contained in a tree --
it is less unclear what should be output if a sparse pattern matches
parts of a tree.

To allow for reusing the pattern matching logic found in
'path_in_sparse_checkout_1()' in 'dir.c' with arbitrary patterns,
extract the pattern matching part of the function into its own new
function 'recursively_match_path_with_sparse_patterns()'.

Signed-off-by: William Sprent <williams@unity3d.com>
---
    [RFC] ls-tree: add --sparse-filter-oid argument
    
    Hi,
    
    I found the need for being able to query git about the contents of
    sparse checkouts for arbitrary commits with arbitrary patterns from a
    bare repository. And it seems to me that ls-tree would be the best fit
    for answering those kinds of questions.
    
    I think it makes sense to have git be able to answer these kinds of
    questions, but I haven't been able to find any previous discussions
    about it. And 'sparse-checkout.txt' is mostly around working with the
    patterns in '.git/info/sparse-checkout'. So I figure this feature maybe
    needs a bit of discussion before landing.
    
    I'm also not very well versed in the code base, so please let me know if
    my approach here is off in general. :)
    
    Note that one of the tests only pass when run on top of commit
    5842710dc2 (dir: check for single file cone patterns, 2023-01-03), which
    was submitted separately and is currently is merged to 'next'.
    
    Thanks, William

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1459%2Fwilliams-unity%2Fls-tree-sparse-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1459/williams-unity/ls-tree-sparse-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1459

 Documentation/git-ls-tree.txt      |   5 ++
 builtin/ls-tree.c                  |  87 +++++++++++++++++++++++-
 dir.c                              |  46 +++++++------
 dir.h                              |   5 ++
 t/t3106-ls-tree-sparse-checkout.sh | 103 +++++++++++++++++++++++++++++
 5 files changed, 224 insertions(+), 22 deletions(-)
 create mode 100755 t/t3106-ls-tree-sparse-checkout.sh

diff --git a/Documentation/git-ls-tree.txt b/Documentation/git-ls-tree.txt
index 0240adb8eec..69cbfb278eb 100644
--- a/Documentation/git-ls-tree.txt
+++ b/Documentation/git-ls-tree.txt
@@ -48,6 +48,11 @@ OPTIONS
 	Show tree entries even when going to recurse them. Has no effect
 	if `-r` was not passed. `-d` implies `-t`.
 
+--sparse-filter-oid=<blob-ish>::
+	Omit showing tree objects and paths that do not match the
+	sparse-checkout specification contained in the blob
+	(or blob-expression) <blob-ish>.
+
 -l::
 --long::
 	Show object size of blob (file) entries.
diff --git a/builtin/ls-tree.c b/builtin/ls-tree.c
index c3ea09281af..8f1fd5c26c2 100644
--- a/builtin/ls-tree.c
+++ b/builtin/ls-tree.c
@@ -7,12 +7,14 @@
 #include "config.h"
 #include "object-store.h"
 #include "blob.h"
+#include "repository.h"
 #include "tree.h"
 #include "commit.h"
 #include "quote.h"
 #include "builtin.h"
 #include "parse-options.h"
 #include "pathspec.h"
+#include "dir.h"
 
 static int line_termination = '\n';
 #define LS_RECURSIVE 1
@@ -329,12 +331,79 @@ static struct ls_tree_cmdmode_to_fmt ls_tree_cmdmode_format[] = {
 	},
 };
 
+struct sparse_filter_data {
+	read_tree_fn_t fn;
+	struct pattern_list pl;
+	struct strbuf full_path_buf;
+};
+
+static void sparse_filter_data__init(
+	struct sparse_filter_data **d,
+	const char *sparse_oid_name, read_tree_fn_t fn)
+{
+	struct object_id sparse_oid;
+	struct object_context oc;
+
+	(*d) = xcalloc(1, sizeof(**d));
+	(*d)->fn = fn;
+	(*d)->pl.use_cone_patterns = core_sparse_checkout_cone;
+
+	strbuf_init(&(*d)->full_path_buf, 0);
+
+
+	if (get_oid_with_context(the_repository,
+				 sparse_oid_name,
+				 GET_OID_BLOB, &sparse_oid, &oc))
+		die(_("unable to access sparse blob in '%s'"), sparse_oid_name);
+	if (add_patterns_from_blob_to_list(&sparse_oid, "", 0, &(*d)->pl) < 0)
+		die(_("unable to parse sparse filter data in %s"),
+		    oid_to_hex(&sparse_oid));
+}
+
+static void sparse_filter_data__free(struct sparse_filter_data *d)
+{
+	clear_pattern_list(&d->pl);
+	strbuf_release(&d->full_path_buf);
+	free(d);
+}
+
+static int path_matches_sparse_checkout_patterns(struct strbuf *full_path, struct pattern_list *pl, int dtype)
+{
+	enum pattern_match_result match = recursively_match_path_with_sparse_patterns(full_path->buf, the_repository->index, dtype, pl);
+	return match > 0;
+}
+
+
+static int filter_sparse(const struct object_id *oid, struct strbuf *base,
+			 const char *pathname, unsigned mode, void *context)
+{
+
+	struct sparse_filter_data *data = context;
+
+	int do_recurse = show_recursive(base->buf, base->len, pathname);
+	if (object_type(mode) == OBJ_TREE)
+		return do_recurse;
+
+	strbuf_reset(&data->full_path_buf);
+	strbuf_addbuf(&data->full_path_buf, base);
+	strbuf_addstr(&data->full_path_buf, pathname);
+
+	if (!path_matches_sparse_checkout_patterns(&data->full_path_buf, &data->pl, DT_REG))
+		return 0;
+
+	return data->fn(oid, base, pathname, mode, context);
+}
+
+
 int cmd_ls_tree(int argc, const char **argv, const char *prefix)
 {
 	struct object_id oid;
 	struct tree *tree;
-	int i, full_tree = 0;
+	int ret, i, full_tree = 0;
 	read_tree_fn_t fn = NULL;
+	char *sparse_oid_name = NULL;
+	void *read_tree_context = NULL;
+	struct sparse_filter_data *sparse_filter_data = NULL;
 	const struct option ls_tree_options[] = {
 		OPT_BIT('d', NULL, &ls_options, N_("only show trees"),
 			LS_TREE_ONLY),
@@ -360,6 +429,9 @@ int cmd_ls_tree(int argc, const char **argv, const char *prefix)
 		OPT_STRING_F(0, "format", &format, N_("format"),
 					 N_("format to use for the output"),
 					 PARSE_OPT_NONEG),
+		OPT_STRING_F(0, "filter-sparse-oid", &sparse_oid_name, N_("filter-sparse-oid"),
+			   N_("filter output with sparse-checkout specification contained in the blob"),
+			     PARSE_OPT_NONEG),
 		OPT__ABBREV(&abbrev),
 		OPT_END()
 	};
@@ -433,5 +505,16 @@ int cmd_ls_tree(int argc, const char **argv, const char *prefix)
 		break;
 	}
 
-	return !!read_tree(the_repository, tree, &pathspec, fn, NULL);
+	if (sparse_oid_name) {
+		sparse_filter_data__init(&sparse_filter_data, sparse_oid_name, fn);
+		read_tree_context = sparse_filter_data;
+		fn = filter_sparse;
+	}
+
+	ret = !!read_tree(the_repository, tree, &pathspec, fn, read_tree_context);
+
+	if (sparse_filter_data)
+		sparse_filter_data__free(sparse_filter_data);
+
+	return ret;
 }
diff --git a/dir.c b/dir.c
index fbdb24fc819..cf1f843ceac 100644
--- a/dir.c
+++ b/dir.c
@@ -1450,45 +1450,51 @@ int init_sparse_checkout_patterns(struct index_state *istate)
 	return 0;
 }
 
-static int path_in_sparse_checkout_1(const char *path,
-				     struct index_state *istate,
-				     int require_cone_mode)
+int recursively_match_path_with_sparse_patterns(const char *path,
+						struct index_state *istate,
+						int dtype,
+						struct pattern_list *pl)
 {
-	int dtype = DT_REG;
 	enum pattern_match_result match = UNDECIDED;
 	const char *end, *slash;
-
-	/*
-	 * We default to accepting a path if the path is empty, there are no
-	 * patterns, or the patterns are of the wrong type.
-	 */
-	if (!*path ||
-	    init_sparse_checkout_patterns(istate) ||
-	    (require_cone_mode &&
-	     !istate->sparse_checkout_patterns->use_cone_patterns))
-		return 1;
-
 	/*
 	 * If UNDECIDED, use the match from the parent dir (recursively), or
 	 * fall back to NOT_MATCHED at the topmost level. Note that cone mode
 	 * never returns UNDECIDED, so we will execute only one iteration in
 	 * this case.
 	 */
-	for (end = path + strlen(path);
-	     end > path && match == UNDECIDED;
+	for (end = path + strlen(path); end > path && match == UNDECIDED;
 	     end = slash) {
-
 		for (slash = end - 1; slash > path && *slash != '/'; slash--)
 			; /* do nothing */
 
 		match = path_matches_pattern_list(path, end - path,
 				slash > path ? slash + 1 : path, &dtype,
-				istate->sparse_checkout_patterns, istate);
+				pl, istate);
 
 		/* We are going to match the parent dir now */
 		dtype = DT_DIR;
 	}
-	return match > 0;
+
+	return match;
+}
+
+static int path_in_sparse_checkout_1(const char *path,
+				     struct index_state *istate,
+				     int require_cone_mode)
+{
+
+	/*
+	 * We default to accepting a path if the path is empty, there are no
+	 * patterns, or the patterns are of the wrong type.
+	 */
+	if (!*path ||
+	    init_sparse_checkout_patterns(istate) ||
+	    (require_cone_mode &&
+	     !istate->sparse_checkout_patterns->use_cone_patterns))
+		return 1;
+
+	return recursively_match_path_with_sparse_patterns(path, istate, DT_REG, istate->sparse_checkout_patterns) > 0;
 }
 
 int path_in_sparse_checkout(const char *path,
diff --git a/dir.h b/dir.h
index 8acfc044181..d186d271606 100644
--- a/dir.h
+++ b/dir.h
@@ -403,6 +403,11 @@ int path_in_sparse_checkout(const char *path,
 int path_in_cone_mode_sparse_checkout(const char *path,
 				      struct index_state *istate);
 
+int recursively_match_path_with_sparse_patterns(const char *path,
+						struct index_state *istate,
+						int dtype,
+						struct pattern_list *pl);
+
 struct dir_entry *dir_add_ignored(struct dir_struct *dir,
 				  struct index_state *istate,
 				  const char *pathname, int len);
diff --git a/t/t3106-ls-tree-sparse-checkout.sh b/t/t3106-ls-tree-sparse-checkout.sh
new file mode 100755
index 00000000000..d850950d651
--- /dev/null
+++ b/t/t3106-ls-tree-sparse-checkout.sh
@@ -0,0 +1,103 @@
+#!/bin/sh
+
+test_description='ls-tree with sparse filter patterns'
+
+. ./test-lib.sh
+
+check_agrees_with_ls_files () {
+	REPO=repo
+	git -C $REPO submodule deinit -f --all
+	git -C $REPO cat-file -p ${filter_oid} >${REPO}/.git/info/sparse-checkout
+	git -C $REPO sparse-checkout init --cone 2>err
+	git -C $REPO submodule init
+	git -C $REPO ls-files -t| grep -v "^S "|cut -d" " -f2 >ls-files
+	test_cmp ls-files actual
+}
+
+check_same_result_in_bare_repo () {
+	FULL=repo
+	BARE=bare
+	FILTER=$1
+	git -C repo cat-file -p ${filter_oid}| git -C bare hash-object -w --stdin
+	git -C bare ls-tree --name-only --filter-sparse-oid=${filter_oid} -r HEAD >bare-result
+	test_cmp expect bare-result
+}
+
+test_expect_success 'setup' '
+	git init submodule &&
+	(
+		cd submodule &&
+		test_commit file
+	) &&
+
+	git init repo &&
+	(
+		cd repo &&
+		mkdir dir &&
+		test_commit dir/sub-file &&
+		test_commit dir/sub-file2 &&
+		mkdir dir2 &&
+		test_commit dir2/sub-file1 &&
+		test_commit dir2/sub-file2 &&
+		test_commit top-file &&
+		git clone ../submodule submodule &&
+		git submodule add ./submodule &&
+		git submodule absorbgitdirs &&
+		git commit -m"add submodule" &&
+		git sparse-checkout init --cone
+	) &&
+	git clone --bare ./repo bare
+'
+
+test_expect_success 'toplevel filter only shows toplevel file' '
+	filter_oid=$(git -C repo hash-object -w --stdin <<-\EOF
+	/*
+	!/*/
+	EOF
+	) &&
+	cat >expect <<-EOF &&
+	.gitmodules
+	submodule
+	top-file.t
+	EOF
+	git -C repo ls-tree --name-only --filter-sparse-oid=${filter_oid} -r HEAD >actual &&
+	test_cmp expect actual &&
+	check_agrees_with_ls_files &&
+	check_same_result_in_bare_repo ${filter_oid}
+'
+
+test_expect_success 'non cone single file filter' '
+	filter_oid=$(git -C repo hash-object -w --stdin <<-\EOF
+	/dir/sub-file.t
+	EOF
+	) &&
+	cat >expect <<-EOF &&
+	dir/sub-file.t
+	EOF
+	git -C repo ls-tree --name-only --filter-sparse-oid=${filter_oid} -r HEAD >actual &&
+	test_cmp expect actual &&
+	check_agrees_with_ls_files &&
+	check_same_result_in_bare_repo ${filter_oid}
+'
+
+test_expect_success 'cone filter matching one dir' '
+	filter_oid=$(git -C repo hash-object -w --stdin <<-\EOF
+	/*
+	!/*/
+	/dir/
+	EOF
+	) &&
+	cat >expect <<-EOF &&
+	.gitmodules
+	dir/sub-file.t
+	dir/sub-file2.t
+	submodule
+	top-file.t
+	EOF
+	git -C repo ls-tree --name-only --filter-sparse-oid=${filter_oid} -r HEAD >actual &&
+	test_cmp expect actual &&
+	check_agrees_with_ls_files &&
+	check_same_result_in_bare_repo ${filter_oid}
+'
+
+test_done

base-commit: a38d39a4c50d1275833aba54c4dbdfce9e2e9ca1
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH] ls-tree: add --sparse-filter-oid argument
  2023-01-11 17:01 [PATCH] ls-tree: add --sparse-filter-oid argument William Sprent via GitGitGadget
@ 2023-01-13 14:17 ` Eric Sunshine
  2023-01-13 20:01   ` Junio C Hamano
  2023-01-16 12:14   ` William Sprent
  2023-01-23 11:46 ` [PATCH v2] " William Sprent via GitGitGadget
  1 sibling, 2 replies; 19+ messages in thread
From: Eric Sunshine @ 2023-01-13 14:17 UTC (permalink / raw)
  To: William Sprent via GitGitGadget; +Cc: git, William Sprent

On Wed, Jan 11, 2023 at 12:05 PM William Sprent via GitGitGadget
<gitgitgadget@gmail.com> wrote:
> [...]
> To fill this gap, add a new argument to ls-tree '--sparse-filter-oid'
> which takes the object id of a blob containing sparse checkout patterns
> that filters the output of 'ls-tree'. When filtering with given sparsity
> patterns, 'ls-tree' only outputs blobs and commit objects that
> match the given patterns.
> [...]
> Signed-off-by: William Sprent <williams@unity3d.com>

This is not a proper review, but rather just some comments about
issues I noticed while quickly running my eye over the patch. Many are
just micro-nits about style; a few regarding the tests are probably
actionable.

>     Note that one of the tests only pass when run on top of commit
>     5842710dc2 (dir: check for single file cone patterns, 2023-01-03), which
>     was submitted separately and is currently is merged to 'next'.

Thanks for mentioning this. It's exactly the sort of information the
maintainer needs when applying your patch to his tree. And it can be
helpful for reviewers too.

> diff --git a/Documentation/git-ls-tree.txt b/Documentation/git-ls-tree.txt
> @@ -48,6 +48,11 @@ OPTIONS
> +--sparse-filter-oid=<blob-ish>::
> +       Omit showing tree objects and paths that do not match the
> +       sparse-checkout specification contained in the blob
> +       (or blob-expression) <blob-ish>.

Good to see a documentation update. The SYNOPSIS probably deserves an
update too.

> diff --git a/builtin/ls-tree.c b/builtin/ls-tree.c
> @@ -329,12 +331,79 @@ static struct ls_tree_cmdmode_to_fmt ls_tree_cmdmode_format[] = {
> +static void sparse_filter_data__init(
> +       struct sparse_filter_data **d,
> +       const char *sparse_oid_name, read_tree_fn_t fn)
> +{
> +       struct object_id sparse_oid;
> +       struct object_context oc;
> +
> +       (*d) = xcalloc(1, sizeof(**d));
> +       (*d)->fn = fn;
> +       (*d)->pl.use_cone_patterns = core_sparse_checkout_cone;
> +
> +       strbuf_init(&(*d)->full_path_buf, 0);
> +
> +

Nit: too many blank lines.

> +       if (get_oid_with_context(the_repository,
> +                                sparse_oid_name,
> +                                GET_OID_BLOB, &sparse_oid, &oc))

Pure nit: somewhat odd choice of wrapping; I'd have expected something
along the lines of:

    if (get_oid_with_context(the_repository, sparse_oid_name, GET_OID_BLOB,
                    &sparse_oid, &oc))

or:

    if (get_oid_with_context(the_repository, sparse_oid_name,
                    GET_OID_BLOB, &sparse_oid, &oc))

> +static void sparse_filter_data__free(struct sparse_filter_data *d)
> +{
> +       clear_pattern_list(&d->pl);
> +       strbuf_release(&d->full_path_buf);
> +       free(d);
> +}

Is the double-underscore convention in function names imported from
elsewhere? I don't recall seeing it used in this codebase.

> +static int path_matches_sparse_checkout_patterns(struct strbuf *full_path, struct pattern_list *pl, int dtype)
> +{
> +       enum pattern_match_result match = recursively_match_path_with_sparse_patterns(full_path->buf, the_repository->index, dtype, pl);
> +       return match > 0;
> +}
> +
> +

Nit: too many blank lines

> +static int filter_sparse(const struct object_id *oid, struct strbuf *base,
> +                        const char *pathname, unsigned mode, void *context)
> +{
> +
> +       struct sparse_filter_data *data = context;

Nit: unnecessary blank line after "{"

> +       int do_recurse = show_recursive(base->buf, base->len, pathname);
> +       if (object_type(mode) == OBJ_TREE)
> +               return do_recurse;
> +
> +       strbuf_reset(&data->full_path_buf);
> +       strbuf_addbuf(&data->full_path_buf, base);
> +       strbuf_addstr(&data->full_path_buf, pathname);
> +
> +       if (!path_matches_sparse_checkout_patterns(&data->full_path_buf, &data->pl, DT_REG))
> +               return 0;
> +
> +       return data->fn(oid, base, pathname, mode, context);
> +}
> +
> +

Nit: too many blank lines

> diff --git a/dir.c b/dir.c
> @@ -1450,45 +1450,51 @@ int init_sparse_checkout_patterns(struct index_state *istate)
> +static int path_in_sparse_checkout_1(const char *path,
> +                                    struct index_state *istate,
> +                                    int require_cone_mode)
> +{
> +
> +       /*

Nit: unnecessary blank line after "{"

> diff --git a/t/t3106-ls-tree-sparse-checkout.sh b/t/t3106-ls-tree-sparse-checkout.sh
> new file mode 100755

We often try to avoid introducing a new test script if the tests being
added can fit well into an existing script. If you didn't find any
existing script where these tests would fit, then creating a new
script may be fine.

> @@ -0,0 +1,103 @@
> +check_agrees_with_ls_files () {
> +       REPO=repo
> +       git -C $REPO submodule deinit -f --all
> +       git -C $REPO cat-file -p ${filter_oid} >${REPO}/.git/info/sparse-checkout
> +       git -C $REPO sparse-checkout init --cone 2>err
> +       git -C $REPO submodule init
> +       git -C $REPO ls-files -t| grep -v "^S "|cut -d" " -f2 >ls-files
> +       test_cmp ls-files actual
> +}

Several comments:

Since the return code of this function is significant and callers care
about it, you should &&-chain all of the code in the function itself
(just like you do within a test) so that failure of any command in the
function is noticed and causes the calling test to fail. It's a good
idea to &&-chain the variable assignments at the top of the function
too (just in case someone later inserts code above the assignments).

It's odd to see a mixture of $VAR and ${VAR}. It's better to be
consistent. We typically use the $VAR form (though it's not exclusive
by any means).

Some shells complain when the pathname following ">" redirection
operator is not quoted, so:

    git -C $REPO cat-file -p ${filter_oid} >"$REPO/.git/info/sparse-checkout" &&

Style: add space around "|" pipe operator

We usually avoid having a Git command upstream of the pipe since its
exit code gets swallowed by the pipe, so we usually do this instead:

    git -C $REPO ls-files -t >out &&
    grep -v "^S " out | cut -d" " -f2 >ls-files &&

Minor: The two-command pipeline `grep -v "^S " | cut -d" " -f2
>ls-files` could be expressed via a single `sed` invocation:

    sed -n "/^S /!s/^. //p" out &&

Nit: The first argument to test_cmp() is typically named "expect".

Error output is captured to file "err" but that file is never consulted.

> +check_same_result_in_bare_repo () {
> +       FULL=repo
> +       BARE=bare
> +       FILTER=$1
> +       git -C repo cat-file -p ${filter_oid}| git -C bare hash-object -w --stdin
> +       git -C bare ls-tree --name-only --filter-sparse-oid=${filter_oid} -r HEAD >bare-result
> +       test_cmp expect bare-result
> +}

Same comments as above, plus:

What is the purpose of the variables FULL, BARE, FILTER? They don't
seem to be used in the function.

I suspect that the function should be using $FILTER internally rather
than $filter_oid.

> +test_expect_success 'setup' '
> +       git init submodule &&
> +       (
> +               cd submodule &&
> +               test_commit file
> +       ) &&

Minor: test_commit() accepts a -C option, so this could be done
without the subshell:

    test_commit -C submodule file

> +       git init repo &&
> +       (
> +               cd repo &&
> +               mkdir dir &&
> +               test_commit dir/sub-file &&
> +               test_commit dir/sub-file2 &&
> +               mkdir dir2 &&
> +               test_commit dir2/sub-file1 &&
> +               test_commit dir2/sub-file2 &&
> +               test_commit top-file &&
> +               git clone ../submodule submodule &&
> +               git submodule add ./submodule &&
> +               git submodule absorbgitdirs &&
> +               git commit -m"add submodule" &&
> +               git sparse-checkout init --cone
> +       ) &&

Here the subshell makes sense since so many commands are run in
directory "repo." Fine.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] ls-tree: add --sparse-filter-oid argument
  2023-01-13 14:17 ` Eric Sunshine
@ 2023-01-13 20:01   ` Junio C Hamano
  2023-01-16 15:13     ` William Sprent
  2023-01-16 12:14   ` William Sprent
  1 sibling, 1 reply; 19+ messages in thread
From: Junio C Hamano @ 2023-01-13 20:01 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: William Sprent via GitGitGadget, git, William Sprent

Eric Sunshine <sunshine@sunshineco.com> writes:

>>     Note that one of the tests only pass when run on top of commit
>>     5842710dc2 (dir: check for single file cone patterns, 2023-01-03), which
>>     was submitted separately and is currently is merged to 'next'.
>
> Thanks for mentioning this. It's exactly the sort of information the
> maintainer needs when applying your patch to his tree. And it can be
> helpful for reviewers too.

Indeed it is.  Thanks for pointing it out---I just wish other
contributors see and follow such a wise piece of advice buried in
review of a patch by others.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] ls-tree: add --sparse-filter-oid argument
  2023-01-13 14:17 ` Eric Sunshine
  2023-01-13 20:01   ` Junio C Hamano
@ 2023-01-16 12:14   ` William Sprent
  1 sibling, 0 replies; 19+ messages in thread
From: William Sprent @ 2023-01-16 12:14 UTC (permalink / raw)
  To: Eric Sunshine, William Sprent via GitGitGadget; +Cc: git

On 13/01/2023 15.17, Eric Sunshine wrote:
> On Wed, Jan 11, 2023 at 12:05 PM William Sprent via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>> [...]
>> To fill this gap, add a new argument to ls-tree '--sparse-filter-oid'
>> which takes the object id of a blob containing sparse checkout patterns
>> that filters the output of 'ls-tree'. When filtering with given sparsity
>> patterns, 'ls-tree' only outputs blobs and commit objects that
>> match the given patterns.
>> [...]
>> Signed-off-by: William Sprent <williams@unity3d.com>
> 
> This is not a proper review, but rather just some comments about
> issues I noticed while quickly running my eye over the patch. Many are
> just micro-nits about style; a few regarding the tests are probably
> actionable.
> 
>>      Note that one of the tests only pass when run on top of commit
>>      5842710dc2 (dir: check for single file cone patterns, 2023-01-03), which
>>      was submitted separately and is currently is merged to 'next'.
> 
> Thanks for mentioning this. It's exactly the sort of information the
> maintainer needs when applying your patch to his tree. And it can be
> helpful for reviewers too.
> 
>> diff --git a/Documentation/git-ls-tree.txt b/Documentation/git-ls-tree.txt
>> @@ -48,6 +48,11 @@ OPTIONS
>> +--sparse-filter-oid=<blob-ish>::
>> +       Omit showing tree objects and paths that do not match the
>> +       sparse-checkout specification contained in the blob
>> +       (or blob-expression) <blob-ish>.
> 
> Good to see a documentation update. The SYNOPSIS probably deserves an
> update too.
> 

Nice catch. I'll update it.

>> diff --git a/builtin/ls-tree.c b/builtin/ls-tree.c
>> @@ -329,12 +331,79 @@ static struct ls_tree_cmdmode_to_fmt ls_tree_cmdmode_format[] = {
>> +static void sparse_filter_data__init(
>> +       struct sparse_filter_data **d,
>> +       const char *sparse_oid_name, read_tree_fn_t fn)
>> +{
>> +       struct object_id sparse_oid;
>> +       struct object_context oc;
>> +
>> +       (*d) = xcalloc(1, sizeof(**d));
>> +       (*d)->fn = fn;
>> +       (*d)->pl.use_cone_patterns = core_sparse_checkout_cone;
>> +
>> +       strbuf_init(&(*d)->full_path_buf, 0);
>> +
>> +
> 
> Nit: too many blank lines.
> 
>> +       if (get_oid_with_context(the_repository,
>> +                                sparse_oid_name,
>> +                                GET_OID_BLOB, &sparse_oid, &oc))
> 
> Pure nit: somewhat odd choice of wrapping; I'd have expected something
> along the lines of:
> 
>      if (get_oid_with_context(the_repository, sparse_oid_name, GET_OID_BLOB,
>                      &sparse_oid, &oc))
> 
> or:
> 
>      if (get_oid_with_context(the_repository, sparse_oid_name,
>                      GET_OID_BLOB, &sparse_oid, &oc))
> 
>> +static void sparse_filter_data__free(struct sparse_filter_data *d)
>> +{
>> +       clear_pattern_list(&d->pl);
>> +       strbuf_release(&d->full_path_buf);
>> +       free(d);
>> +}
> 
> Is the double-underscore convention in function names imported from
> elsewhere? I don't recall seeing it used in this codebase.
> 

I took inspiration from the similar '--filter:sparse' argument for rev-list
which is implemented in 'list-objects-filter.c'. I've just given it a
search  and it looks like it that convention isn't really used outside the
list-objects files, which makes it a bit awkward here.

I get a bunch more hits for '(init|free)_', so I will change them to that.


>> +static int path_matches_sparse_checkout_patterns(struct strbuf *full_path, struct pattern_list *pl, int dtype)
>> +{
>> +       enum pattern_match_result match = recursively_match_path_with_sparse_patterns(full_path->buf, the_repository->index, dtype, pl);
>> +       return match > 0;
>> +}
>> +
>> +
> 
> Nit: too many blank lines
> 
>> +static int filter_sparse(const struct object_id *oid, struct strbuf *base,
>> +                        const char *pathname, unsigned mode, void *context)
>> +{
>> +
>> +       struct sparse_filter_data *data = context;
> 
> Nit: unnecessary blank line after "{"
> 
>> +       int do_recurse = show_recursive(base->buf, base->len, pathname);
>> +       if (object_type(mode) == OBJ_TREE)
>> +               return do_recurse;
>> +
>> +       strbuf_reset(&data->full_path_buf);
>> +       strbuf_addbuf(&data->full_path_buf, base);
>> +       strbuf_addstr(&data->full_path_buf, pathname);
>> +
>> +       if (!path_matches_sparse_checkout_patterns(&data->full_path_buf, &data->pl, DT_REG))
>> +               return 0;
>> +
>> +       return data->fn(oid, base, pathname, mode, context);	
>> +}
>> +
>> +
> 
> Nit: too many blank lines
> 
>> diff --git a/dir.c b/dir.c
>> @@ -1450,45 +1450,51 @@ int init_sparse_checkout_patterns(struct index_state *istate)
>> +static int path_in_sparse_checkout_1(const char *path,
>> +                                    struct index_state *istate,
>> +                                    int require_cone_mode)
>> +{
>> +
>> +       /*
> 
> Nit: unnecessary blank line after "{"
> 
>> diff --git a/t/t3106-ls-tree-sparse-checkout.sh b/t/t3106-ls-tree-sparse-checkout.sh
>> new file mode 100755
> 
> We often try to avoid introducing a new test script if the tests being
> added can fit well into an existing script. If you didn't find any
> existing script where these tests would fit, then creating a new
> script may be fine.
> 

Well there's a couple of things. The only ls-tree related test file
that isn't too specific is 't3103-ls-tree-misc.sh'. However, the
'sparse-checkout' command leaks memory, so that would require setting
'TEST_PASSES_SANITIZE_LEAK' to false.

For tests related to sparse-checkout, we both have the large
't1092-sparse-checkout-compatibility.sh' file. But there is also a
collection of files like 't3705-add-sparse-checkout.sh'. I guess
both could work fine in this case. I think I went with the latter
because I personally find the smaller test files a bit easier to
parse.

If someone feels strongly the other way, I'll gladly change it.

>> @@ -0,0 +1,103 @@
>> +check_agrees_with_ls_files () {
>> +       REPO=repo
>> +       git -C $REPO submodule deinit -f --all
>> +       git -C $REPO cat-file -p ${filter_oid} >${REPO}/.git/info/sparse-checkout
>> +       git -C $REPO sparse-checkout init --cone 2>err
>> +       git -C $REPO submodule init
>> +       git -C $REPO ls-files -t| grep -v "^S "|cut -d" " -f2 >ls-files
>> +       test_cmp ls-files actual
>> +}
> 
> Several comments:
> 
> Since the return code of this function is significant and callers care
> about it, you should &&-chain all of the code in the function itself
> (just like you do within a test) so that failure of any command in the
> function is noticed and causes the calling test to fail. It's a good
> idea to &&-chain the variable assignments at the top of the function
> too (just in case someone later inserts code above the assignments).
> 

That was an oversight. Thanks for noticing.

> It's odd to see a mixture of $VAR and ${VAR}. It's better to be
> consistent. We typically use the $VAR form (though it's not exclusive
> by any means).
> 

I'll remove the braces.

> Some shells complain when the pathname following ">" redirection
> operator is not quoted, so:
> 
>      git -C $REPO cat-file -p ${filter_oid} >"$REPO/.git/info/sparse-checkout" &&
> 
> Style: add space around "|" pipe operator
> 
> We usually avoid having a Git command upstream of the pipe since its
> exit code gets swallowed by the pipe, so we usually do this instead:
> 
>      git -C $REPO ls-files -t >out &&
>      grep -v "^S " out | cut -d" " -f2 >ls-files &&
> 

Good point. I'll do that.

> Minor: The two-command pipeline `grep -v "^S " | cut -d" " -f2
>> ls-files` could be expressed via a single `sed` invocation:
> 
>      sed -n "/^S /!s/^. //p" out &&
> 
> Nit: The first argument to test_cmp() is typically named "expect".
> 
> Error output is captured to file "err" but that file is never consulted.
> 

I'll remove the redirection.

>> +check_same_result_in_bare_repo () {
>> +       FULL=repo
>> +       BARE=bare
>> +       FILTER=$1
>> +       git -C repo cat-file -p ${filter_oid}| git -C bare hash-object -w --stdin
>> +       git -C bare ls-tree --name-only --filter-sparse-oid=${filter_oid} -r HEAD >bare-result
>> +       test_cmp expect bare-result
>> +}
> 
> Same comments as above, plus:
> 
> What is the purpose of the variables FULL, BARE, FILTER? They don't
> seem to be used in the function.
> 
> I suspect that the function should be using $FILTER internally rather
> than $filter_oid.
> 

I think I probably had plans with them and then forgot. Thanks for noticing.
I'll keep (and use) FILTER, but remove the others.

>> +test_expect_success 'setup' '
>> +       git init submodule &&
>> +       (
>> +               cd submodule &&
>> +               test_commit file
>> +       ) &&
> 
> Minor: test_commit() accepts a -C option, so this could be done
> without the subshell:
> 
>      test_commit -C submodule file
> 
>> +       git init repo &&
>> +       (
>> +               cd repo &&
>> +               mkdir dir &&
>> +               test_commit dir/sub-file &&
>> +               test_commit dir/sub-file2 &&
>> +               mkdir dir2 &&
>> +               test_commit dir2/sub-file1 &&
>> +               test_commit dir2/sub-file2 &&
>> +               test_commit top-file &&
>> +               git clone ../submodule submodule &&
>> +               git submodule add ./submodule &&
>> +               git submodule absorbgitdirs &&
>> +               git commit -m"add submodule" &&
>> +               git sparse-checkout init --cone
>> +       ) &&
> 
> Here the subshell makes sense since so many commands are run in
> directory "repo." Fine.

Thanks for taking the time to give some comments. I think they all
make sense to me. Besides the comments I've explicitly acknowledged above,
I've also applied all of the minor/nit/style comments to my local tree.



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] ls-tree: add --sparse-filter-oid argument
  2023-01-13 20:01   ` Junio C Hamano
@ 2023-01-16 15:13     ` William Sprent
  0 siblings, 0 replies; 19+ messages in thread
From: William Sprent @ 2023-01-16 15:13 UTC (permalink / raw)
  To: Junio C Hamano, Eric Sunshine; +Cc: William Sprent via GitGitGadget, git

On 13/01/2023 21.01, Junio C Hamano wrote:
> Eric Sunshine <sunshine@sunshineco.com> writes:
> 
>>>      Note that one of the tests only pass when run on top of commit
>>>      5842710dc2 (dir: check for single file cone patterns, 2023-01-03), which
>>>      was submitted separately and is currently is merged to 'next'.
>>
>> Thanks for mentioning this. It's exactly the sort of information the
>> maintainer needs when applying your patch to his tree. And it can be
>> helpful for reviewers too.
> 
> Indeed it is.  Thanks for pointing it out---I just wish other
> contributors see and follow such a wise piece of advice buried in
> review of a patch by others.

Apropos where to apply the patch: This patch conflicts with the topic
'ls-tree.c: clean-up works'[1], which refactors parts of 'ls-tree.c'.
It has recently been merged to 'next' in [2].

Should I go ahead and rebase my changes on top of 'tl/ls-tree-code-clean-up'
and then resubmit a new patch, or does it make sure sense to wait?

1: https://public-inbox.org/git/20230112091135.20050-1-tenglong.tl@alibaba-inc.com
2: Commit f7238037fd (Merge branch 'tl/ls-tree-code-clean-up' into next, 2023-01-14)


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v2] ls-tree: add --sparse-filter-oid argument
  2023-01-11 17:01 [PATCH] ls-tree: add --sparse-filter-oid argument William Sprent via GitGitGadget
  2023-01-13 14:17 ` Eric Sunshine
@ 2023-01-23 11:46 ` William Sprent via GitGitGadget
  2023-01-23 13:00   ` Ævar Arnfjörð Bjarmason
                     ` (3 more replies)
  1 sibling, 4 replies; 19+ messages in thread
From: William Sprent via GitGitGadget @ 2023-01-23 11:46 UTC (permalink / raw)
  To: git
  Cc: Eric Sunshine, Ævar Arnfjörð Bjarmason,
	Derrick Stolee, Victoria Dye, Elijah Newren, William Sprent,
	William Sprent

From: William Sprent <williams@unity3d.com>

There is currently no way to ask git the question "which files would be
part of a sparse checkout of commit X with sparse checkout patterns Y".
One use-case would be that tooling may want know whether sparse checkouts
of two commits contain the same content even if the full trees differ.
Another interesting use-case would be for tooling to use in conjunction
with 'git update-index --index-info'.

'rev-list --objects --filter=sparse:oid' comes close, but as rev-list is
concerned with objects rather than directory trees, it leaves files out
when the same blob occurs in at two different paths.

It is possible to ask git about the sparse status of files currently in
the index with 'ls-files -t'. However, this does not work well when the
caller is interested in another commit, intererested in sparsity
patterns that aren't currently in '.git/info/sparse-checkout', or when
working in with bare repo.

To fill this gap, add a new argument to ls-tree '--sparse-filter-oid'
which takes the object id of a blob containing sparse checkout patterns
that filters the output of 'ls-tree'. When filtering with given sparsity
patterns, 'ls-tree' only outputs blobs and commit objects that
match the given patterns.

While it may be valid in some situations to output a tree object -- e.g.
when a cone pattern matches all blobs recursively contained in a tree --
it is less unclear what should be output if a sparse pattern matches
parts of a tree.

To allow for reusing the pattern matching logic found in
'path_in_sparse_checkout_1()' in 'dir.c' with arbitrary patterns,
extract the pattern matching part of the function into its own new
function 'recursively_match_path_with_sparse_patterns()'.

Signed-off-by: William Sprent <williams@unity3d.com>
---
    ls-tree: add --sparse-filter-oid argument
    
    I'm resubmitting this change as rebased on top of 'master', as it
    conflicted with the topic 'ls-tree.c: clean-up works' 1
    [https://public-inbox.org/git/20230112091135.20050-1-tenglong.tl@alibaba-inc.com],
    which was merged to 'master' recently.
    
    This versions also incorporates changes based on the comments made in 2
    [https://public-inbox.org/git/CAPig+cRgZ0CrkqY7mufuWmhf6BC8yXjXXuOTEQjuz+Y0NA+N7Q@mail.gmail.com/].
    
    I'm also looping in contributors that have touched ls-tree and/or
    sparse-checkouts recently. I hope that's okay.
    
    William

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1459%2Fwilliams-unity%2Fls-tree-sparse-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1459/williams-unity/ls-tree-sparse-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/1459

Contributor requested no range-diff. You can review it using these commands:
   git fetch https://github.com/gitgitgadget/git a38d39a4 07a4e24e
   git range-diff <options> a38d39a4..8891b110 56c8fb1e..07a4e24e

 Documentation/git-ls-tree.txt      |  6 ++
 builtin/ls-tree.c                  | 79 +++++++++++++++++++++++-
 dir.c                              | 45 ++++++++------
 dir.h                              |  5 ++
 t/t3106-ls-tree-sparse-checkout.sh | 99 ++++++++++++++++++++++++++++++
 5 files changed, 213 insertions(+), 21 deletions(-)
 create mode 100755 t/t3106-ls-tree-sparse-checkout.sh

diff --git a/Documentation/git-ls-tree.txt b/Documentation/git-ls-tree.txt
index 0240adb8eec..724bb1f9dbd 100644
--- a/Documentation/git-ls-tree.txt
+++ b/Documentation/git-ls-tree.txt
@@ -11,6 +11,7 @@ SYNOPSIS
 [verse]
 'git ls-tree' [-d] [-r] [-t] [-l] [-z]
 	    [--name-only] [--name-status] [--object-only] [--full-name] [--full-tree] [--abbrev[=<n>]] [--format=<format>]
+	    [--sparse-filter-oid=<blob-ish>]
 	    <tree-ish> [<path>...]
 
 DESCRIPTION
@@ -48,6 +49,11 @@ OPTIONS
 	Show tree entries even when going to recurse them. Has no effect
 	if `-r` was not passed. `-d` implies `-t`.
 
+--sparse-filter-oid=<blob-ish>::
+	Omit showing tree objects and paths that do not match the
+	sparse-checkout specification contained in the blob
+	(or blob-expression) <blob-ish>.
+
 -l::
 --long::
 	Show object size of blob (file) entries.
diff --git a/builtin/ls-tree.c b/builtin/ls-tree.c
index 72eb70823d5..46a815fc7dc 100644
--- a/builtin/ls-tree.c
+++ b/builtin/ls-tree.c
@@ -13,6 +13,7 @@
 #include "builtin.h"
 #include "parse-options.h"
 #include "pathspec.h"
+#include "dir.h"
 
 static const char * const ls_tree_usage[] = {
 	N_("git ls-tree [<options>] <tree-ish> [<path>...]"),
@@ -364,6 +365,65 @@ static struct ls_tree_cmdmode_to_fmt ls_tree_cmdmode_format[] = {
 	},
 };
 
+struct sparse_filter_data {
+	read_tree_fn_t fn;
+	struct pattern_list pl;
+	struct strbuf full_path_buf;
+	struct ls_tree_options *options;
+};
+
+static void init_sparse_filter_data(struct sparse_filter_data **d, struct ls_tree_options *options,
+	const char *sparse_oid_name, read_tree_fn_t fn)
+{
+	struct object_id sparse_oid;
+	struct object_context oc;
+
+	(*d) = xcalloc(1, sizeof(**d));
+	(*d)->fn = fn;
+	(*d)->pl.use_cone_patterns = core_sparse_checkout_cone;
+	(*d)->options = options;
+	strbuf_init(&(*d)->full_path_buf, 0);
+
+	if (get_oid_with_context(the_repository, sparse_oid_name, GET_OID_BLOB,
+			&sparse_oid, &oc))
+		die(_("unable to access sparse blob in '%s'"), sparse_oid_name);
+	if (add_patterns_from_blob_to_list(&sparse_oid, "", 0, &(*d)->pl) < 0)
+		die(_("unable to parse sparse filter data in %s"),
+		    oid_to_hex(&sparse_oid));
+}
+
+static void free_sparse_filter_data(struct sparse_filter_data *d)
+{
+	clear_pattern_list(&d->pl);
+	strbuf_release(&d->full_path_buf);
+	free(d);
+}
+
+static int path_matches_sparse_checkout_patterns(struct strbuf *full_path, struct pattern_list *pl, int dtype)
+{
+	enum pattern_match_result match = recursively_match_path_with_sparse_patterns(full_path->buf, the_repository->index, dtype, pl);
+	return match > 0;
+}
+
+static int filter_sparse(const struct object_id *oid, struct strbuf *base,
+			 const char *pathname, unsigned mode, void *context)
+{
+	struct sparse_filter_data *data = context;
+
+	int do_recurse = show_recursive(data->options, base->buf, base->len, pathname);
+	if (object_type(mode) == OBJ_TREE)
+		return do_recurse;
+
+	strbuf_reset(&data->full_path_buf);
+	strbuf_addbuf(&data->full_path_buf, base);
+	strbuf_addstr(&data->full_path_buf, pathname);
+
+	if (!path_matches_sparse_checkout_patterns(&data->full_path_buf, &data->pl, DT_REG))
+			return 0;
+
+	return data->fn(oid, base, pathname, mode, data->options);
+}
+
 int cmd_ls_tree(int argc, const char **argv, const char *prefix)
 {
 	struct object_id oid;
@@ -372,7 +432,11 @@ int cmd_ls_tree(int argc, const char **argv, const char *prefix)
 	read_tree_fn_t fn = NULL;
 	enum ls_tree_cmdmode cmdmode = MODE_DEFAULT;
 	int null_termination = 0;
+
 	struct ls_tree_options options = { 0 };
+	char *sparse_oid_name = NULL;
+	void *read_tree_context = NULL;
+	struct sparse_filter_data *sparse_filter_data = NULL;
 	const struct option ls_tree_options[] = {
 		OPT_BIT('d', NULL, &options.ls_options, N_("only show trees"),
 			LS_TREE_ONLY),
@@ -399,6 +463,9 @@ int cmd_ls_tree(int argc, const char **argv, const char *prefix)
 					 N_("format to use for the output"),
 					 PARSE_OPT_NONEG),
 		OPT__ABBREV(&options.abbrev),
+		OPT_STRING_F(0, "filter-sparse-oid", &sparse_oid_name, N_("filter-sparse-oid"),
+			   N_("filter output with sparse-checkout specification contained in the blob"),
+			     PARSE_OPT_NONEG),
 		OPT_END()
 	};
 	struct ls_tree_cmdmode_to_fmt *m2f = ls_tree_cmdmode_format;
@@ -474,7 +541,17 @@ int cmd_ls_tree(int argc, const char **argv, const char *prefix)
 		break;
 	}
 
-	ret = !!read_tree(the_repository, tree, &options.pathspec, fn, &options);
+	if (sparse_oid_name) {
+		init_sparse_filter_data(&sparse_filter_data, &options, sparse_oid_name, fn);
+		read_tree_context = sparse_filter_data;
+		fn = filter_sparse;
+	} else
+		read_tree_context = &options;
+
+	ret = !!read_tree(the_repository, tree, &options.pathspec, fn, read_tree_context);
 	clear_pathspec(&options.pathspec);
+	if (sparse_filter_data)
+		free_sparse_filter_data(sparse_filter_data);
+
 	return ret;
 }
diff --git a/dir.c b/dir.c
index 4e99f0c868f..122ebced08e 100644
--- a/dir.c
+++ b/dir.c
@@ -1457,45 +1457,50 @@ int init_sparse_checkout_patterns(struct index_state *istate)
 	return 0;
 }
 
-static int path_in_sparse_checkout_1(const char *path,
-				     struct index_state *istate,
-				     int require_cone_mode)
+int recursively_match_path_with_sparse_patterns(const char *path,
+						struct index_state *istate,
+						int dtype,
+						struct pattern_list *pl)
 {
-	int dtype = DT_REG;
 	enum pattern_match_result match = UNDECIDED;
 	const char *end, *slash;
-
-	/*
-	 * We default to accepting a path if the path is empty, there are no
-	 * patterns, or the patterns are of the wrong type.
-	 */
-	if (!*path ||
-	    init_sparse_checkout_patterns(istate) ||
-	    (require_cone_mode &&
-	     !istate->sparse_checkout_patterns->use_cone_patterns))
-		return 1;
-
 	/*
 	 * If UNDECIDED, use the match from the parent dir (recursively), or
 	 * fall back to NOT_MATCHED at the topmost level. Note that cone mode
 	 * never returns UNDECIDED, so we will execute only one iteration in
 	 * this case.
 	 */
-	for (end = path + strlen(path);
-	     end > path && match == UNDECIDED;
+	for (end = path + strlen(path); end > path && match == UNDECIDED;
 	     end = slash) {
-
 		for (slash = end - 1; slash > path && *slash != '/'; slash--)
 			; /* do nothing */
 
 		match = path_matches_pattern_list(path, end - path,
 				slash > path ? slash + 1 : path, &dtype,
-				istate->sparse_checkout_patterns, istate);
+				pl, istate);
 
 		/* We are going to match the parent dir now */
 		dtype = DT_DIR;
 	}
-	return match > 0;
+
+	return match;
+}
+
+static int path_in_sparse_checkout_1(const char *path,
+				     struct index_state *istate,
+				     int require_cone_mode)
+{
+	/*
+	 * We default to accepting a path if the path is empty, there are no
+	 * patterns, or the patterns are of the wrong type.
+	 */
+	if (!*path ||
+	    init_sparse_checkout_patterns(istate) ||
+	    (require_cone_mode &&
+	     !istate->sparse_checkout_patterns->use_cone_patterns))
+		return 1;
+
+	return recursively_match_path_with_sparse_patterns(path, istate, DT_REG, istate->sparse_checkout_patterns) > 0;
 }
 
 int path_in_sparse_checkout(const char *path,
diff --git a/dir.h b/dir.h
index 8acfc044181..d186d271606 100644
--- a/dir.h
+++ b/dir.h
@@ -403,6 +403,11 @@ int path_in_sparse_checkout(const char *path,
 int path_in_cone_mode_sparse_checkout(const char *path,
 				      struct index_state *istate);
 
+int recursively_match_path_with_sparse_patterns(const char *path,
+						struct index_state *istate,
+						int dtype,
+						struct pattern_list *pl);
+
 struct dir_entry *dir_add_ignored(struct dir_struct *dir,
 				  struct index_state *istate,
 				  const char *pathname, int len);
diff --git a/t/t3106-ls-tree-sparse-checkout.sh b/t/t3106-ls-tree-sparse-checkout.sh
new file mode 100755
index 00000000000..945f3ed1888
--- /dev/null
+++ b/t/t3106-ls-tree-sparse-checkout.sh
@@ -0,0 +1,99 @@
+#!/bin/sh
+
+test_description='ls-tree with sparse filter patterns'
+
+. ./test-lib.sh
+
+check_agrees_with_ls_files () {
+	REPO=repo  &&
+	git -C $REPO submodule deinit -f --all &&
+	git -C $REPO cat-file -p $filter_oid >"$REPO/.git/info/sparse-checkout" &&
+	git -C $REPO sparse-checkout init --cone &&
+	git -C $REPO submodule init &&
+	git -C $REPO ls-files -t >out &&
+	sed -n "/^S /!s/^. //p" out >expect &&
+	test_cmp expect actual
+}
+
+check_same_result_in_bare_repo () {
+	FILTER_OID=$1 &&
+	git -C repo cat-file -p $FILTER_OID >out &&
+	git -C bare hash-object -w --stdin <out &&
+	git -C bare ls-tree --name-only --filter-sparse-oid=$FILTER_OID -r HEAD >bare-result &&
+	test_cmp expect bare-result
+}
+
+test_expect_success 'setup' '
+	git init submodule &&
+	test_commit -C submodule file &&
+	git init repo &&
+	(
+		cd repo &&
+		mkdir dir &&
+		test_commit dir/sub-file &&
+		test_commit dir/sub-file2 &&
+		mkdir dir2 &&
+		test_commit dir2/sub-file1 &&
+		test_commit dir2/sub-file2 &&
+		test_commit top-file &&
+		git clone ../submodule submodule &&
+		git submodule add ./submodule &&
+		git submodule absorbgitdirs &&
+		git commit -m"add submodule" &&
+		git sparse-checkout init --cone
+	) &&
+	git clone --bare ./repo bare
+'
+
+test_expect_success 'toplevel filter only shows toplevel file' '
+	filter_oid=$(git -C repo hash-object -w --stdin <<-\EOF
+	/*
+	!/*/
+	EOF
+	) &&
+	cat >expect <<-EOF &&
+	.gitmodules
+	submodule
+	top-file.t
+	EOF
+	git -C repo ls-tree --name-only --filter-sparse-oid=$filter_oid -r HEAD >actual &&
+	test_cmp expect actual &&
+	check_agrees_with_ls_files &&
+	check_same_result_in_bare_repo $filter_oid
+'
+
+test_expect_success 'non cone single file filter' '
+	filter_oid=$(git -C repo hash-object -w --stdin <<-\EOF
+	/dir/sub-file.t
+	EOF
+	) &&
+	cat >expect <<-EOF &&
+	dir/sub-file.t
+	EOF
+	git -C repo ls-tree --name-only --filter-sparse-oid=$filter_oid -r HEAD >actual &&
+	test_cmp expect actual &&
+	check_agrees_with_ls_files &&
+	check_same_result_in_bare_repo $filter_oid
+'
+
+test_expect_success 'cone filter matching one dir' '
+	filter_oid=$(git -C repo hash-object -w --stdin <<-\EOF
+	/*
+	!/*/
+	/dir/
+	EOF
+	) &&
+	cat >expect <<-EOF &&
+	.gitmodules
+	dir/sub-file.t
+	dir/sub-file2.t
+	submodule
+	top-file.t
+	EOF
+	git -C repo ls-tree --name-only --filter-sparse-oid=$filter_oid -r HEAD >actual &&
+	test_cmp expect actual &&
+	check_agrees_with_ls_files &&
+	check_same_result_in_bare_repo $filter_oid
+'
+
+test_done

base-commit: 56c8fb1e95377900ec9d53c07886022af0a5d3c2
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] ls-tree: add --sparse-filter-oid argument
  2023-01-23 11:46 ` [PATCH v2] " William Sprent via GitGitGadget
@ 2023-01-23 13:00   ` Ævar Arnfjörð Bjarmason
  2023-01-24 15:30     ` William Sprent
  2023-01-23 13:06   ` Ævar Arnfjörð Bjarmason
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 19+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2023-01-23 13:00 UTC (permalink / raw)
  To: William Sprent via GitGitGadget
  Cc: git, Eric Sunshine, Derrick Stolee, Victoria Dye, Elijah Newren,
	William Sprent


On Mon, Jan 23 2023, William Sprent via GitGitGadget wrote:

> From: William Sprent <williams@unity3d.com>
>
> There is currently no way to ask git the question "which files would be
> part of a sparse checkout of commit X with sparse checkout patterns Y".
> One use-case would be that tooling may want know whether sparse checkouts
> of two commits contain the same content even if the full trees differ.
> Another interesting use-case would be for tooling to use in conjunction
> with 'git update-index --index-info'.

Rather than commenting on individual points, I checked this out and
produced something squashable on-top, it fixes various issues (some
aspects of which still remain):

 * You need to wrap your code at 79 chars (and some of that could be
   helped by picking less verbose identifiers & variable names,
   e.g. just use "context" rather than "read_tree_context" in
   cmd_ls_tree(), it has no other "context"...)
 * You're making the memory management aroind init_sparse_filter_data()
   needlessly hard, just put it on the stack instead. That also allows
   for init-ing most of it right away, or via a macro in the assignment.
 * You have a stray refactoring of dir.c in your proposed change, this
   changes it back, let's leave that sort of thing out.
 * You're adding a function that returns an enum, but you declare it as
   returning "int", let's retain that type in the header & declaration.


diff --git a/builtin/ls-tree.c b/builtin/ls-tree.c
index 46a815fc7dc..68d6ef675f2 100644
--- a/builtin/ls-tree.c
+++ b/builtin/ls-tree.c
@@ -372,36 +372,37 @@ struct sparse_filter_data {
 	struct ls_tree_options *options;
 };
 
-static void init_sparse_filter_data(struct sparse_filter_data **d, struct ls_tree_options *options,
-	const char *sparse_oid_name, read_tree_fn_t fn)
+static void init_sparse_filter_data(struct sparse_filter_data *d,
+				    const char *sparse_oid_name)
 {
 	struct object_id sparse_oid;
 	struct object_context oc;
 
-	(*d) = xcalloc(1, sizeof(**d));
-	(*d)->fn = fn;
-	(*d)->pl.use_cone_patterns = core_sparse_checkout_cone;
-	(*d)->options = options;
-	strbuf_init(&(*d)->full_path_buf, 0);
-
 	if (get_oid_with_context(the_repository, sparse_oid_name, GET_OID_BLOB,
 			&sparse_oid, &oc))
-		die(_("unable to access sparse blob in '%s'"), sparse_oid_name);
-	if (add_patterns_from_blob_to_list(&sparse_oid, "", 0, &(*d)->pl) < 0)
+		die(_("unable to access sparse blob in '%s'"),
+		    sparse_oid_name);
+	if (add_patterns_from_blob_to_list(&sparse_oid, "", 0, &d->pl) < 0)
 		die(_("unable to parse sparse filter data in %s"),
 		    oid_to_hex(&sparse_oid));
 }
 
-static void free_sparse_filter_data(struct sparse_filter_data *d)
+static void release_sparse_filter_data(struct sparse_filter_data *d)
 {
 	clear_pattern_list(&d->pl);
 	strbuf_release(&d->full_path_buf);
-	free(d);
 }
 
-static int path_matches_sparse_checkout_patterns(struct strbuf *full_path, struct pattern_list *pl, int dtype)
+static int path_matches_sparse_checkout_patterns(struct strbuf *full_path,
+						 struct pattern_list *pl,
+						 int dtype)
 {
-	enum pattern_match_result match = recursively_match_path_with_sparse_patterns(full_path->buf, the_repository->index, dtype, pl);
+	enum pattern_match_result match;
+
+	match = recursively_match_path_with_sparse_patterns(full_path->buf,
+							    the_repository->index,
+							    dtype, pl);
+
 	return match > 0;
 }
 
@@ -436,7 +437,10 @@ int cmd_ls_tree(int argc, const char **argv, const char *prefix)
 	struct ls_tree_options options = { 0 };
 	char *sparse_oid_name = NULL;
 	void *read_tree_context = NULL;
-	struct sparse_filter_data *sparse_filter_data = NULL;
+	struct sparse_filter_data sparse_filter_data = {
+		.options = &options,
+		.full_path_buf = STRBUF_INIT,
+	};
 	const struct option ls_tree_options[] = {
 		OPT_BIT('d', NULL, &options.ls_options, N_("only show trees"),
 			LS_TREE_ONLY),
@@ -542,16 +546,17 @@ int cmd_ls_tree(int argc, const char **argv, const char *prefix)
 	}
 
 	if (sparse_oid_name) {
-		init_sparse_filter_data(&sparse_filter_data, &options, sparse_oid_name, fn);
-		read_tree_context = sparse_filter_data;
+		sparse_filter_data.fn = fn;
+		init_sparse_filter_data(&sparse_filter_data, sparse_oid_name);
+		read_tree_context = &sparse_filter_data;
 		fn = filter_sparse;
 	} else
 		read_tree_context = &options;
 
-	ret = !!read_tree(the_repository, tree, &options.pathspec, fn, read_tree_context);
+	ret = !!read_tree(the_repository, tree, &options.pathspec, fn,
+			  read_tree_context);
 	clear_pathspec(&options.pathspec);
-	if (sparse_filter_data)
-		free_sparse_filter_data(sparse_filter_data);
+	release_sparse_filter_data(&sparse_filter_data);
 
 	return ret;
 }
diff --git a/dir.c b/dir.c
index 122ebced08e..57ec6404901 100644
--- a/dir.c
+++ b/dir.c
@@ -1457,10 +1457,10 @@ int init_sparse_checkout_patterns(struct index_state *istate)
 	return 0;
 }
 
-int recursively_match_path_with_sparse_patterns(const char *path,
-						struct index_state *istate,
-						int dtype,
-						struct pattern_list *pl)
+enum pattern_match_result recursively_match_path_with_sparse_patterns(const char *path,
+								      struct index_state *istate,
+								      int dtype,
+								      struct pattern_list *pl)
 {
 	enum pattern_match_result match = UNDECIDED;
 	const char *end, *slash;
@@ -1470,7 +1470,8 @@ int recursively_match_path_with_sparse_patterns(const char *path,
 	 * never returns UNDECIDED, so we will execute only one iteration in
 	 * this case.
 	 */
-	for (end = path + strlen(path); end > path && match == UNDECIDED;
+	for (end = path + strlen(path);
+	     end > path && match == UNDECIDED;
 	     end = slash) {
 		for (slash = end - 1; slash > path && *slash != '/'; slash--)
 			; /* do nothing */
diff --git a/dir.h b/dir.h
index d186d271606..3c65e68ca40 100644
--- a/dir.h
+++ b/dir.h
@@ -403,10 +403,10 @@ int path_in_sparse_checkout(const char *path,
 int path_in_cone_mode_sparse_checkout(const char *path,
 				      struct index_state *istate);
 
-int recursively_match_path_with_sparse_patterns(const char *path,
-						struct index_state *istate,
-						int dtype,
-						struct pattern_list *pl);
+enum pattern_match_result recursively_match_path_with_sparse_patterns(const char *path,
+								      struct index_state *istate,
+								      int dtype,
+								      struct pattern_list *pl);
 
 struct dir_entry *dir_add_ignored(struct dir_struct *dir,
 				  struct index_state *istate,

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] ls-tree: add --sparse-filter-oid argument
  2023-01-23 11:46 ` [PATCH v2] " William Sprent via GitGitGadget
  2023-01-23 13:00   ` Ævar Arnfjörð Bjarmason
@ 2023-01-23 13:06   ` Ævar Arnfjörð Bjarmason
  2023-01-24 15:30     ` William Sprent
  2023-01-24 20:11   ` Victoria Dye
  2023-01-25  5:11   ` Elijah Newren
  3 siblings, 1 reply; 19+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2023-01-23 13:06 UTC (permalink / raw)
  To: William Sprent via GitGitGadget
  Cc: git, Eric Sunshine, Derrick Stolee, Victoria Dye, Elijah Newren,
	William Sprent


On Mon, Jan 23 2023, William Sprent via GitGitGadget wrote:

> From: William Sprent <williams@unity3d.com>

...some further inline commentary, in addition to my proposed squash-in...

> diff --git a/Documentation/git-ls-tree.txt b/Documentation/git-ls-tree.txt
> index 0240adb8eec..724bb1f9dbd 100644
> --- a/Documentation/git-ls-tree.txt
> +++ b/Documentation/git-ls-tree.txt
> @@ -11,6 +11,7 @@ SYNOPSIS
>  [verse]
>  'git ls-tree' [-d] [-r] [-t] [-l] [-z]
>  	    [--name-only] [--name-status] [--object-only] [--full-name] [--full-tree] [--abbrev[=<n>]] [--format=<format>]
> +	    [--sparse-filter-oid=<blob-ish>]
>  	    <tree-ish> [<path>...]
>  
>  DESCRIPTION
> @@ -48,6 +49,11 @@ OPTIONS
>  	Show tree entries even when going to recurse them. Has no effect
>  	if `-r` was not passed. `-d` implies `-t`.
>  
> +--sparse-filter-oid=<blob-ish>::
> +	Omit showing tree objects and paths that do not match the
> +	sparse-checkout specification contained in the blob
> +	(or blob-expression) <blob-ish>.
> +
>  -l::
>  --long::
>  	Show object size of blob (file) entries.
> diff --git a/builtin/ls-tree.c b/builtin/ls-tree.c
> index 72eb70823d5..46a815fc7dc 100644
> --- a/builtin/ls-tree.c
> +++ b/builtin/ls-tree.c
> @@ -13,6 +13,7 @@
>  #include "builtin.h"
>  #include "parse-options.h"
>  #include "pathspec.h"
> +#include "dir.h"
>  
>  static const char * const ls_tree_usage[] = {
>  	N_("git ls-tree [<options>] <tree-ish> [<path>...]"),
> @@ -364,6 +365,65 @@ static struct ls_tree_cmdmode_to_fmt ls_tree_cmdmode_format[] = {
>  	},
>  };

Okey, here yo uupdate the *.txt, but not the "-h", but it's one of those
where way say <options>.

I for one wouldn't mind some preceding change like my 423be1f83c5 (doc
txt & -h consistency: make "commit" consistent, 2022-10-13), which in
turn would make t/t0450-txt-doc-vs-help.sh pass for this command, but
maybe it's too much of a digression...

> +	(*d) = xcalloc(1, sizeof(**d));
> +	(*d)->fn = fn;
> +	(*d)->pl.use_cone_patterns = core_sparse_checkout_cone;

I forgot to note in my proposed fix-up that I omitted this, but your
tests still pass, which suggests it's either not needed, or your tests
are lacking....

> +	(*d)->options = options;
> +	strbuf_init(&(*d)->full_path_buf, 0);
> +
> +	if (get_oid_with_context(the_repository, sparse_oid_name, GET_OID_BLOB,
> +			&sparse_oid, &oc))
> +		die(_("unable to access sparse blob in '%s'"), sparse_oid_name);
> +	if (add_patterns_from_blob_to_list(&sparse_oid, "", 0, &(*d)->pl) < 0)

As noted you can avoid the malloc here, which also sidesteps this (IMO
at last) ugly &(*v)->m syntax.

> +		die(_("unable to parse sparse filter data in %s"),
> +		    oid_to_hex(&sparse_oid));
> +}
> +
> +static void free_sparse_filter_data(struct sparse_filter_data *d)
> +{
> +	clear_pattern_list(&d->pl);
> +	strbuf_release(&d->full_path_buf);
> +	free(d);
> +}
> +
> +static int path_matches_sparse_checkout_patterns(struct strbuf *full_path, struct pattern_list *pl, int dtype)
> +{
> +	enum pattern_match_result match = recursively_match_path_with_sparse_patterns(full_path->buf, the_repository->index, dtype, pl);

I for one would welcome e.g. abbreviating "sparse_checkout_patterns" as
"scp" or whatever throughout, with resulting shortening of very long
lines & identifiers. E.g. for this one "patch_match_scp" or whatever.


> +	struct sparse_filter_data *data = context;
> +

Style: Don't add a \n\n between variable decls,

> +	int do_recurse = show_recursive(data->options, base->buf, base->len, pathname);

Instead add it here, before the code proper.


> +	if (object_type(mode) == OBJ_TREE)
> +		return do_recurse;
> +
> +	strbuf_reset(&data->full_path_buf);

I wondered if we need this after, but we don't, I for one would welcome a fix-up like:
	
	diff --git a/builtin/ls-tree.c b/builtin/ls-tree.c
	index 68d6ef675f2..74dfa9a4580 100644
	--- a/builtin/ls-tree.c
	+++ b/builtin/ls-tree.c
	@@ -410,19 +410,21 @@ static int filter_sparse(const struct object_id *oid, struct strbuf *base,
	 			 const char *pathname, unsigned mode, void *context)
	 {
	 	struct sparse_filter_data *data = context;
	-
	 	int do_recurse = show_recursive(data->options, base->buf, base->len, pathname);
	+	int ret = 0;
	+
	 	if (object_type(mode) == OBJ_TREE)
	 		return do_recurse;
	 
	-	strbuf_reset(&data->full_path_buf);
	 	strbuf_addbuf(&data->full_path_buf, base);
	 	strbuf_addstr(&data->full_path_buf, pathname);
	 
	 	if (!path_matches_sparse_checkout_patterns(&data->full_path_buf, &data->pl, DT_REG))
	-			return 0;
	-
	-	return data->fn(oid, base, pathname, mode, data->options);
	+		goto cleanup;
	+	ret = data->fn(oid, base, pathname, mode, data->options);
	+cleanup:
	+	strbuf_reset(&data->full_path_buf);
	+	return ret;
	 }
	 
	 int cmd_ls_tree(int argc, const char **argv, const char *prefix)
	
It's slightly more verbose, and we do end up needlessly reset-ing the
"last" one, but makes it clear that we don't have need for this after
this.

Which to me, also raises the question of why you have this
"full_path_buf" at all. It looks like a memory optimization, as you're
trying to avoid a malloc()/free() in the loop.

That's fair, but you could also do that with:

	const size_t oldlen = base->len;
	strbuf_addstr(base, pathname);
        /* do the path_matches_sparse_checkout_patterns() call here */
	/* before "cleanup" */
	strbuf_setlen(base, oldlen);

Or whatever, those snippets are untested, but we use similar tricks
elsewhere, and as it's contained to a few lines here I think it's better
than carrying another buffer.

> +	strbuf_addbuf(&data->full_path_buf, base);
> +	strbuf_addstr(&data->full_path_buf, pathname);

Nit: Shorter as: strbuf_addf(&sb, "%s%s", base.buf, pathname) (or equivalent);

> +
> +	if (!path_matches_sparse_checkout_patterns(&data->full_path_buf, &data->pl, DT_REG))
> +			return 0;
> +
> +	return data->fn(oid, base, pathname, mode, data->options);
> +}
> +
>  int cmd_ls_tree(int argc, const char **argv, const char *prefix)
>  {
>  	struct object_id oid;
> @@ -372,7 +432,11 @@ int cmd_ls_tree(int argc, const char **argv, const char *prefix)
>  	read_tree_fn_t fn = NULL;
>  	enum ls_tree_cmdmode cmdmode = MODE_DEFAULT;
>  	int null_termination = 0;
> +

Don't add this stray \n.

>  	struct ls_tree_options options = { 0 };
> +	char *sparse_oid_name = NULL;
> +	void *read_tree_context = NULL;
> +	struct sparse_filter_data *sparse_filter_data = NULL;
>  	const struct option ls_tree_options[] = {
>  		OPT_BIT('d', NULL, &options.ls_options, N_("only show trees"),
>  			LS_TREE_ONLY),
> @@ -399,6 +463,9 @@ int cmd_ls_tree(int argc, const char **argv, const char *prefix)
>  					 N_("format to use for the output"),
>  					 PARSE_OPT_NONEG),
>  		OPT__ABBREV(&options.abbrev),
> +		OPT_STRING_F(0, "filter-sparse-oid", &sparse_oid_name, N_("filter-sparse-oid"),

Make that N_(...) be "object-id", "oid", "hash" or something?

I.e. the RHS with the <type> should be <thingy>, not
<thingy-for-some-purpose>.

> +			   N_("filter output with sparse-checkout specification contained in the blob"),
> +			     PARSE_OPT_NONEG),
>  		OPT_END()
>  	};
>  	struct ls_tree_cmdmode_to_fmt *m2f = ls_tree_cmdmode_format;
> @@ -474,7 +541,17 @@ int cmd_ls_tree(int argc, const char **argv, const char *prefix)
>  		break;
>  	}
>  
> -	ret = !!read_tree(the_repository, tree, &options.pathspec, fn, &options);
> +	if (sparse_oid_name) {
> +		init_sparse_filter_data(&sparse_filter_data, &options, sparse_oid_name, fn);
> +		read_tree_context = sparse_filter_data;
> +		fn = filter_sparse;
> +	} else
> +		read_tree_context = &options;

You're missing a {} here for the "else".

Better yet, delete that "else" and change the decl above to be:

	void *context = &options;

Then just keep the "if" here, where you might clobber the "context".

> +
> +	ret = !!read_tree(the_repository, tree, &options.pathspec, fn, read_tree_context);
>  	clear_pathspec(&options.pathspec);
> +	if (sparse_filter_data)
> +		free_sparse_filter_data(sparse_filter_data);

I suggested a fixup for this, but as an aside don't make a free_*()
function that's not happy to accept NULL, it should work like the free()
itself.

> +
>  	return ret;
>  }
> diff --git a/dir.c b/dir.c
> index 4e99f0c868f..122ebced08e 100644
> --- a/dir.c
> +++ b/dir.c
> @@ -1457,45 +1457,50 @@ int init_sparse_checkout_patterns(struct index_state *istate)
>  	return 0;
>  }
>  
> -static int path_in_sparse_checkout_1(const char *path,
> -				     struct index_state *istate,
> -				     int require_cone_mode)
> +int recursively_match_path_with_sparse_patterns(const char *path,
> +						struct index_state *istate,
> +						int dtype,
> +						struct pattern_list *pl)
>  {
> -	int dtype = DT_REG;
>  	enum pattern_match_result match = UNDECIDED;
>  	const char *end, *slash;
> -
> -	/*
> -	 * We default to accepting a path if the path is empty, there are no
> -	 * patterns, or the patterns are of the wrong type.
> -	 */
> -	if (!*path ||
> -	    init_sparse_checkout_patterns(istate) ||
> -	    (require_cone_mode &&
> -	     !istate->sparse_checkout_patterns->use_cone_patterns))
> -		return 1;
> -
>  	/*
>  	 * If UNDECIDED, use the match from the parent dir (recursively), or
>  	 * fall back to NOT_MATCHED at the topmost level. Note that cone mode
>  	 * never returns UNDECIDED, so we will execute only one iteration in
>  	 * this case.
>  	 */
> -	for (end = path + strlen(path);
> -	     end > path && match == UNDECIDED;
> +	for (end = path + strlen(path); end > path && match == UNDECIDED;

All good, aside from this whitespace refactoring, as noted.

>  	     end = slash) {
> -
>  		for (slash = end - 1; slash > path && *slash != '/'; slash--)
>  			; /* do nothing */
>  
>  		match = path_matches_pattern_list(path, end - path,
>  				slash > path ? slash + 1 : path, &dtype,
> -				istate->sparse_checkout_patterns, istate);
> +				pl, istate);
>  
>  		/* We are going to match the parent dir now */
>  		dtype = DT_DIR;
>  	}
> -	return match > 0;
> +
> +	return match;
> +}
> +
> +static int path_in_sparse_checkout_1(const char *path,
> +				     struct index_state *istate,
> +				     int require_cone_mode)
> +{
> +	/*
> +	 * We default to accepting a path if the path is empty, there are no
> +	 * patterns, or the patterns are of the wrong type.
> +	 */
> +	if (!*path ||
> +	    init_sparse_checkout_patterns(istate) ||
> +	    (require_cone_mode &&
> +	     !istate->sparse_checkout_patterns->use_cone_patterns))
> +		return 1;
> +
> +	return recursively_match_path_with_sparse_patterns(path, istate, DT_REG, istate->sparse_checkout_patterns) > 0;

All good, except for the really long line, aside from the previous
suggestion, maybe pull "istate->sparse_checkout_patterns" into a
variable, as it's now used twice here in this function?

> +check_agrees_with_ls_files () {
> +	REPO=repo  &&
> +	git -C $REPO submodule deinit -f --all &&
> +	git -C $REPO cat-file -p $filter_oid >"$REPO/.git/info/sparse-checkout" &&
> +	git -C $REPO sparse-checkout init --cone &&
> +	git -C $REPO submodule init &&
> +	git -C $REPO ls-files -t >out &&
> +	sed -n "/^S /!s/^. //p" out >expect &&
> +	test_cmp expect actual

Instead of this last "sed" munging, why not use the "--format" option to
"ls-tree" to just make the output the same?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] ls-tree: add --sparse-filter-oid argument
  2023-01-23 13:00   ` Ævar Arnfjörð Bjarmason
@ 2023-01-24 15:30     ` William Sprent
  0 siblings, 0 replies; 19+ messages in thread
From: William Sprent @ 2023-01-24 15:30 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, William Sprent via GitGitGadget
  Cc: git, Eric Sunshine, Derrick Stolee, Victoria Dye, Elijah Newren

On 23/01/2023 14.00, Ævar Arnfjörð Bjarmason wrote:
> 
> On Mon, Jan 23 2023, William Sprent via GitGitGadget wrote:
> 
>> From: William Sprent <williams@unity3d.com>
>>
>> There is currently no way to ask git the question "which files would be
>> part of a sparse checkout of commit X with sparse checkout patterns Y".
>> One use-case would be that tooling may want know whether sparse checkouts
>> of two commits contain the same content even if the full trees differ.
>> Another interesting use-case would be for tooling to use in conjunction
>> with 'git update-index --index-info'.
> 
> Rather than commenting on individual points, I checked this out and
> produced something squashable on-top, it fixes various issues (some
> aspects of which still remain):

Thanks. The changes all make sense to me. I'll apply them for v3 and
shorten the remaining long lines.

> 
>   * You need to wrap your code at 79 chars (and some of that could be
>     helped by picking less verbose identifiers & variable names,
>     e.g. just use "context" rather than "read_tree_context" in
>     cmd_ls_tree(), it has no other "context"...)>   * You're making the memory management aroind init_sparse_filter_data()
>     needlessly hard, just put it on the stack instead. That also allows
>     for init-ing most of it right away, or via a macro in the assignment.

Fair point. I do think that having all the initialisation of the struct
in one place makes the codes a bit more readable. But I guess it doesn't
make a big difference in this case.

>   * You have a stray refactoring of dir.c in your proposed change, this
>     changes it back, let's leave that sort of thing out.
>   * You're adding a function that returns an enum, but you declare it as
>     returning "int", let's retain that type in the header & declaration.

These were oversights. Thanks for catching them.

> 
> diff --git a/builtin/ls-tree.c b/builtin/ls-tree.c
> index 46a815fc7dc..68d6ef675f2 100644
> --- a/builtin/ls-tree.c
> +++ b/builtin/ls-tree.c
> @@ -372,36 +372,37 @@ struct sparse_filter_data {
>   	struct ls_tree_options *options;
>   };
>   
> -static void init_sparse_filter_data(struct sparse_filter_data **d, struct ls_tree_options *options,
> -	const char *sparse_oid_name, read_tree_fn_t fn)
> +static void init_sparse_filter_data(struct sparse_filter_data *d,
> +				    const char *sparse_oid_name)
>   {
>   	struct object_id sparse_oid;
>   	struct object_context oc;
>   
> -	(*d) = xcalloc(1, sizeof(**d));
> -	(*d)->fn = fn;
> -	(*d)->pl.use_cone_patterns = core_sparse_checkout_cone;
> -	(*d)->options = options;
> -	strbuf_init(&(*d)->full_path_buf, 0);
> -
>   	if (get_oid_with_context(the_repository, sparse_oid_name, GET_OID_BLOB,
>   			&sparse_oid, &oc))
> -		die(_("unable to access sparse blob in '%s'"), sparse_oid_name);
> -	if (add_patterns_from_blob_to_list(&sparse_oid, "", 0, &(*d)->pl) < 0)
> +		die(_("unable to access sparse blob in '%s'"),
> +		    sparse_oid_name);
> +	if (add_patterns_from_blob_to_list(&sparse_oid, "", 0, &d->pl) < 0)
>   		die(_("unable to parse sparse filter data in %s"),
>   		    oid_to_hex(&sparse_oid));
>   }
>   
> -static void free_sparse_filter_data(struct sparse_filter_data *d)
> +static void release_sparse_filter_data(struct sparse_filter_data *d)
>   {
>   	clear_pattern_list(&d->pl);
>   	strbuf_release(&d->full_path_buf);
> -	free(d);
>   }
>   
> -static int path_matches_sparse_checkout_patterns(struct strbuf *full_path, struct pattern_list *pl, int dtype)
> +static int path_matches_sparse_checkout_patterns(struct strbuf *full_path,
> +						 struct pattern_list *pl,
> +						 int dtype)
>   {
> -	enum pattern_match_result match = recursively_match_path_with_sparse_patterns(full_path->buf, the_repository->index, dtype, pl);
> +	enum pattern_match_result match;
> +
> +	match = recursively_match_path_with_sparse_patterns(full_path->buf,
> +							    the_repository->index,
> +							    dtype, pl);
> +
>   	return match > 0;
>   }
>   
> @@ -436,7 +437,10 @@ int cmd_ls_tree(int argc, const char **argv, const char *prefix)
>   	struct ls_tree_options options = { 0 };
>   	char *sparse_oid_name = NULL;
>   	void *read_tree_context = NULL;
> -	struct sparse_filter_data *sparse_filter_data = NULL;
> +	struct sparse_filter_data sparse_filter_data = {
> +		.options = &options,
> +		.full_path_buf = STRBUF_INIT,
> +	};
>   	const struct option ls_tree_options[] = {
>   		OPT_BIT('d', NULL, &options.ls_options, N_("only show trees"),
>   			LS_TREE_ONLY),
> @@ -542,16 +546,17 @@ int cmd_ls_tree(int argc, const char **argv, const char *prefix)
>   	}
>   
>   	if (sparse_oid_name) {
> -		init_sparse_filter_data(&sparse_filter_data, &options, sparse_oid_name, fn);
> -		read_tree_context = sparse_filter_data;
> +		sparse_filter_data.fn = fn;
> +		init_sparse_filter_data(&sparse_filter_data, sparse_oid_name);
> +		read_tree_context = &sparse_filter_data;
>   		fn = filter_sparse;
>   	} else
>   		read_tree_context = &options;
>   
> -	ret = !!read_tree(the_repository, tree, &options.pathspec, fn, read_tree_context);
> +	ret = !!read_tree(the_repository, tree, &options.pathspec, fn,
> +			  read_tree_context);
>   	clear_pathspec(&options.pathspec);
> -	if (sparse_filter_data)
> -		free_sparse_filter_data(sparse_filter_data);
> +	release_sparse_filter_data(&sparse_filter_data);
>   
>   	return ret;
>   }
> diff --git a/dir.c b/dir.c
> index 122ebced08e..57ec6404901 100644
> --- a/dir.c
> +++ b/dir.c
> @@ -1457,10 +1457,10 @@ int init_sparse_checkout_patterns(struct index_state *istate)
>   	return 0;
>   }
>   
> -int recursively_match_path_with_sparse_patterns(const char *path,
> -						struct index_state *istate,
> -						int dtype,
> -						struct pattern_list *pl)
> +enum pattern_match_result recursively_match_path_with_sparse_patterns(const char *path,
> +								      struct index_state *istate,
> +								      int dtype,
> +								      struct pattern_list *pl)
>   {
>   	enum pattern_match_result match = UNDECIDED;
>   	const char *end, *slash;
> @@ -1470,7 +1470,8 @@ int recursively_match_path_with_sparse_patterns(const char *path,
>   	 * never returns UNDECIDED, so we will execute only one iteration in
>   	 * this case.
>   	 */
> -	for (end = path + strlen(path); end > path && match == UNDECIDED;
> +	for (end = path + strlen(path);
> +	     end > path && match == UNDECIDED;
>   	     end = slash) {
>   		for (slash = end - 1; slash > path && *slash != '/'; slash--)
>   			; /* do nothing */
> diff --git a/dir.h b/dir.h
> index d186d271606..3c65e68ca40 100644
> --- a/dir.h
> +++ b/dir.h
> @@ -403,10 +403,10 @@ int path_in_sparse_checkout(const char *path,
>   int path_in_cone_mode_sparse_checkout(const char *path,
>   				      struct index_state *istate);
>   
> -int recursively_match_path_with_sparse_patterns(const char *path,
> -						struct index_state *istate,
> -						int dtype,
> -						struct pattern_list *pl);
> +enum pattern_match_result recursively_match_path_with_sparse_patterns(const char *path,
> +								      struct index_state *istate,
> +								      int dtype,
> +								      struct pattern_list *pl);
>   
>   struct dir_entry *dir_add_ignored(struct dir_struct *dir,
>   				  struct index_state *istate,


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] ls-tree: add --sparse-filter-oid argument
  2023-01-23 13:06   ` Ævar Arnfjörð Bjarmason
@ 2023-01-24 15:30     ` William Sprent
  0 siblings, 0 replies; 19+ messages in thread
From: William Sprent @ 2023-01-24 15:30 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason, William Sprent via GitGitGadget
  Cc: git, Eric Sunshine, Derrick Stolee, Victoria Dye, Elijah Newren

On 23/01/2023 14.06, Ævar Arnfjörð Bjarmason wrote:
> 
> On Mon, Jan 23 2023, William Sprent via GitGitGadget wrote:
> 
>> From: William Sprent <williams@unity3d.com>
> 
> ...some further inline commentary, in addition to my proposed squash-in...
> 

Thanks for taking the time. :)

>> diff --git a/Documentation/git-ls-tree.txt b/Documentation/git-ls-tree.txt
>> index 0240adb8eec..724bb1f9dbd 100644
>> --- a/Documentation/git-ls-tree.txt
>> +++ b/Documentation/git-ls-tree.txt
>> @@ -11,6 +11,7 @@ SYNOPSIS
>>   [verse]
>>   'git ls-tree' [-d] [-r] [-t] [-l] [-z]
>>   	    [--name-only] [--name-status] [--object-only] [--full-name] [--full-tree] [--abbrev[=<n>]] [--format=<format>]
>> +	    [--sparse-filter-oid=<blob-ish>]
>>   	    <tree-ish> [<path>...]
>>   
>>   DESCRIPTION
>> @@ -48,6 +49,11 @@ OPTIONS
>>   	Show tree entries even when going to recurse them. Has no effect
>>   	if `-r` was not passed. `-d` implies `-t`.
>>   
>> +--sparse-filter-oid=<blob-ish>::
>> +	Omit showing tree objects and paths that do not match the
>> +	sparse-checkout specification contained in the blob
>> +	(or blob-expression) <blob-ish>.
>> +
>>   -l::
>>   --long::
>>   	Show object size of blob (file) entries.
>> diff --git a/builtin/ls-tree.c b/builtin/ls-tree.c
>> index 72eb70823d5..46a815fc7dc 100644
>> --- a/builtin/ls-tree.c
>> +++ b/builtin/ls-tree.c
>> @@ -13,6 +13,7 @@
>>   #include "builtin.h"
>>   #include "parse-options.h"
>>   #include "pathspec.h"
>> +#include "dir.h"
>>   
>>   static const char * const ls_tree_usage[] = {
>>   	N_("git ls-tree [<options>] <tree-ish> [<path>...]"),
>> @@ -364,6 +365,65 @@ static struct ls_tree_cmdmode_to_fmt ls_tree_cmdmode_format[] = {
>>   	},
>>   };
> 
> Okey, here yo uupdate the *.txt, but not the "-h", but it's one of those
> where way say <options>.
> 
> I for one wouldn't mind some preceding change like my 423be1f83c5 (doc
> txt & -h consistency: make "commit" consistent, 2022-10-13), which in
> turn would make t/t0450-txt-doc-vs-help.sh pass for this command, but
> maybe it's too much of a digression...
> 

I don't mind doing that.

>> +	(*d) = xcalloc(1, sizeof(**d));
>> +	(*d)->fn = fn;
>> +	(*d)->pl.use_cone_patterns = core_sparse_checkout_cone;
> 
> I forgot to note in my proposed fix-up that I omitted this, but your
> tests still pass, which suggests it's either not needed, or your tests
> are lacking....
> 

Well.. The line exists to enable the cone mode optimisations. The only 
observable differences are performance and that a warning is emitted 
when cone mode is enabled in the config but the patterns given aren't in 
the cone mode pattern set.

I can look into writing a performance test that captures the performance 
differences.

I can also test for the latter behaviour. But that is testing for the 
side effect of a main behaviour to ensure that the main behaviour is 
there. Which isn't the nicest thing I can think of, but it might be 
better than nothing.

>> +	(*d)->options = options;
>> +	strbuf_init(&(*d)->full_path_buf, 0);
>> +
>> +	if (get_oid_with_context(the_repository, sparse_oid_name, GET_OID_BLOB,
>> +			&sparse_oid, &oc))
>> +		die(_("unable to access sparse blob in '%s'"), sparse_oid_name);
>> +	if (add_patterns_from_blob_to_list(&sparse_oid, "", 0, &(*d)->pl) < 0)
> 
> As noted you can avoid the malloc here, which also sidesteps this (IMO
> at last) ugly &(*v)->m syntax.
> 

Right.

>> +		die(_("unable to parse sparse filter data in %s"),
>> +		    oid_to_hex(&sparse_oid));
>> +}
>> +
>> +static void free_sparse_filter_data(struct sparse_filter_data *d)
>> +{
>> +	clear_pattern_list(&d->pl);
>> +	strbuf_release(&d->full_path_buf);
>> +	free(d);
>> +}
>> +
>> +static int path_matches_sparse_checkout_patterns(struct strbuf *full_path, struct pattern_list *pl, int dtype)
>> +{
>> +	enum pattern_match_result match = recursively_match_path_with_sparse_patterns(full_path->buf, the_repository->index, dtype, pl);
> 
> I for one would welcome e.g. abbreviating "sparse_checkout_patterns" as
> "scp" or whatever throughout, with resulting shortening of very long
> lines & identifiers. E.g. for this one "patch_match_scp" or whatever.
> 
> 

I don't mind either way, but a quick search doesn't reveal any of uses 
of an abbreviation like, though.

Either way, I could also drop the '_checkout_' from 
'sparse_checkout_patterns'. I don't think that would make it less clear 
what we are talking about.

>> +	struct sparse_filter_data *data = context;
>> +
> 
> Style: Don't add a \n\n between variable decls,
> 
>> +	int do_recurse = show_recursive(data->options, base->buf, base->len, pathname);
> 
> Instead add it here, before the code proper.
> 
> 

Alright. I'll apply that.

>> +	if (object_type(mode) == OBJ_TREE)
>> +		return do_recurse;
>> +
>> +	strbuf_reset(&data->full_path_buf);
> 
> I wondered if we need this after, but we don't, I for one would welcome a fix-up like:
> 	
> 	diff --git a/builtin/ls-tree.c b/builtin/ls-tree.c
> 	index 68d6ef675f2..74dfa9a4580 100644
> 	--- a/builtin/ls-tree.c
> 	+++ b/builtin/ls-tree.c
> 	@@ -410,19 +410,21 @@ static int filter_sparse(const struct object_id *oid, struct strbuf *base,
> 	 			 const char *pathname, unsigned mode, void *context)
> 	 {
> 	 	struct sparse_filter_data *data = context;
> 	-
> 	 	int do_recurse = show_recursive(data->options, base->buf, base->len, pathname);
> 	+	int ret = 0;
> 	+
> 	 	if (object_type(mode) == OBJ_TREE)
> 	 		return do_recurse;
> 	
> 	-	strbuf_reset(&data->full_path_buf);
> 	 	strbuf_addbuf(&data->full_path_buf, base);
> 	 	strbuf_addstr(&data->full_path_buf, pathname);
> 	
> 	 	if (!path_matches_sparse_checkout_patterns(&data->full_path_buf, &data->pl, DT_REG))
> 	-			return 0;
> 	-
> 	-	return data->fn(oid, base, pathname, mode, data->options);
> 	+		goto cleanup;
> 	+	ret = data->fn(oid, base, pathname, mode, data->options);
> 	+cleanup:
> 	+	strbuf_reset(&data->full_path_buf);
> 	+	return ret;
> 	 }
> 	
> 	 int cmd_ls_tree(int argc, const char **argv, const char *prefix)
> 	
> It's slightly more verbose, and we do end up needlessly reset-ing the
> "last" one, but makes it clear that we don't have need for this after
> this.
> 
> Which to me, also raises the question of why you have this
> "full_path_buf" at all. It looks like a memory optimization, as you're
> trying to avoid a malloc()/free() in the loop.
> 
> That's fair, but you could also do that with:
> 
> 	const size_t oldlen = base->len;
> 	strbuf_addstr(base, pathname);
>          /* do the path_matches_sparse_checkout_patterns() call here */
> 	/* before "cleanup" */
> 	strbuf_setlen(base, oldlen);
> 
> Or whatever, those snippets are untested, but we use similar tricks
> elsewhere, and as it's contained to a few lines here I think it's better
> than carrying another buffer.
> 

Okay. That makes sense. And if it is a common trick then I guess it 
isn't too mysterious.

>> +	strbuf_addbuf(&data->full_path_buf, base);
>> +	strbuf_addstr(&data->full_path_buf, pathname);
> 
> Nit: Shorter as: strbuf_addf(&sb, "%s%s", base.buf, pathname) (or equivalent);
> 
>> +
>> +	if (!path_matches_sparse_checkout_patterns(&data->full_path_buf, &data->pl, DT_REG))
>> +			return 0;
>> +
>> +	return data->fn(oid, base, pathname, mode, data->options);
>> +}
>> +
>>   int cmd_ls_tree(int argc, const char **argv, const char *prefix)
>>   {
>>   	struct object_id oid;
>> @@ -372,7 +432,11 @@ int cmd_ls_tree(int argc, const char **argv, const char *prefix)
>>   	read_tree_fn_t fn = NULL;
>>   	enum ls_tree_cmdmode cmdmode = MODE_DEFAULT;
>>   	int null_termination = 0;
>> +
> 
> Don't add this stray \n.
> 

Ok.

>>   	struct ls_tree_options options = { 0 };
>> +	char *sparse_oid_name = NULL;
>> +	void *read_tree_context = NULL;
>> +	struct sparse_filter_data *sparse_filter_data = NULL;
>>   	const struct option ls_tree_options[] = {
>>   		OPT_BIT('d', NULL, &options.ls_options, N_("only show trees"),
>>   			LS_TREE_ONLY),
>> @@ -399,6 +463,9 @@ int cmd_ls_tree(int argc, const char **argv, const char *prefix)
>>   					 N_("format to use for the output"),
>>   					 PARSE_OPT_NONEG),
>>   		OPT__ABBREV(&options.abbrev),
>> +		OPT_STRING_F(0, "filter-sparse-oid", &sparse_oid_name, N_("filter-sparse-oid"),
> 
> Make that N_(...) be "object-id", "oid", "hash" or something?
> 
> I.e. the RHS with the <type> should be <thingy>, not
> <thingy-for-some-purpose>.
> 

Right. I've used the "blob-ish" wording in the .txt file, I think it 
makes sense to reuse it here.

>> +			   N_("filter output with sparse-checkout specification contained in the blob"),
>> +			     PARSE_OPT_NONEG),
>>   		OPT_END()
>>   	};
>>   	struct ls_tree_cmdmode_to_fmt *m2f = ls_tree_cmdmode_format;
>> @@ -474,7 +541,17 @@ int cmd_ls_tree(int argc, const char **argv, const char *prefix)
>>   		break;
>>   	}
>>   
>> -	ret = !!read_tree(the_repository, tree, &options.pathspec, fn, &options);
>> +	if (sparse_oid_name) {
>> +		init_sparse_filter_data(&sparse_filter_data, &options, sparse_oid_name, fn);
>> +		read_tree_context = sparse_filter_data;
>> +		fn = filter_sparse;
>> +	} else
>> +		read_tree_context = &options;
> 
> You're missing a {} here for the "else".
> 
> Better yet, delete that "else" and change the decl above to be:
> 
> 	void *context = &options;
> 
> Then just keep the "if" here, where you might clobber the "context".
> 

I'll delete the else as you suggest.

>> +
>> +	ret = !!read_tree(the_repository, tree, &options.pathspec, fn, read_tree_context);
>>   	clear_pathspec(&options.pathspec);
>> +	if (sparse_filter_data)
>> +		free_sparse_filter_data(sparse_filter_data);
> 
> I suggested a fixup for this, but as an aside don't make a free_*()
> function that's not happy to accept NULL, it should work like the free()
> itself.
> 

Ah. That's a fair point. Thanks.

>> +
>>   	return ret;
>>   }
>> diff --git a/dir.c b/dir.c
>> index 4e99f0c868f..122ebced08e 100644
>> --- a/dir.c
>> +++ b/dir.c
>> @@ -1457,45 +1457,50 @@ int init_sparse_checkout_patterns(struct index_state *istate)
>>   	return 0;
>>   }
>>   
>> -static int path_in_sparse_checkout_1(const char *path,
>> -				     struct index_state *istate,
>> -				     int require_cone_mode)
>> +int recursively_match_path_with_sparse_patterns(const char *path,
>> +						struct index_state *istate,
>> +						int dtype,
>> +						struct pattern_list *pl)
>>   {
>> -	int dtype = DT_REG;
>>   	enum pattern_match_result match = UNDECIDED;
>>   	const char *end, *slash;
>> -
>> -	/*
>> -	 * We default to accepting a path if the path is empty, there are no
>> -	 * patterns, or the patterns are of the wrong type.
>> -	 */
>> -	if (!*path ||
>> -	    init_sparse_checkout_patterns(istate) ||
>> -	    (require_cone_mode &&
>> -	     !istate->sparse_checkout_patterns->use_cone_patterns))
>> -		return 1;
>> -
>>   	/*
>>   	 * If UNDECIDED, use the match from the parent dir (recursively), or
>>   	 * fall back to NOT_MATCHED at the topmost level. Note that cone mode
>>   	 * never returns UNDECIDED, so we will execute only one iteration in
>>   	 * this case.
>>   	 */
>> -	for (end = path + strlen(path);
>> -	     end > path && match == UNDECIDED;
>> +	for (end = path + strlen(path); end > path && match == UNDECIDED;
> 
> All good, aside from this whitespace refactoring, as noted.
> 

Ok.

>>   	     end = slash) {
>> -
>>   		for (slash = end - 1; slash > path && *slash != '/'; slash--)
>>   			; /* do nothing */
>>   
>>   		match = path_matches_pattern_list(path, end - path,
>>   				slash > path ? slash + 1 : path, &dtype,
>> -				istate->sparse_checkout_patterns, istate);
>> +				pl, istate);
>>   
>>   		/* We are going to match the parent dir now */
>>   		dtype = DT_DIR;
>>   	}
>> -	return match > 0;
>> +
>> +	return match;
>> +}
>> +
>> +static int path_in_sparse_checkout_1(const char *path,
>> +				     struct index_state *istate,
>> +				     int require_cone_mode)
>> +{
>> +	/*
>> +	 * We default to accepting a path if the path is empty, there are no
>> +	 * patterns, or the patterns are of the wrong type.
>> +	 */
>> +	if (!*path ||
>> +	    init_sparse_checkout_patterns(istate) ||
>> +	    (require_cone_mode &&
>> +	     !istate->sparse_checkout_patterns->use_cone_patterns))
>> +		return 1;
>> +
>> +	return recursively_match_path_with_sparse_patterns(path, istate, DT_REG, istate->sparse_checkout_patterns) > 0;
> 
> All good, except for the really long line, aside from the previous
> suggestion, maybe pull "istate->sparse_checkout_patterns" into a
> variable, as it's now used twice here in this function?
> 

I think that makes sense.

>> +check_agrees_with_ls_files () {
>> +	REPO=repo  &&
>> +	git -C $REPO submodule deinit -f --all &&
>> +	git -C $REPO cat-file -p $filter_oid >"$REPO/.git/info/sparse-checkout" &&
>> +	git -C $REPO sparse-checkout init --cone &&
>> +	git -C $REPO submodule init &&
>> +	git -C $REPO ls-files -t >out &&
>> +	sed -n "/^S /!s/^. //p" out >expect &&
>> +	test_cmp expect actual
> 
> Instead of this last "sed" munging, why not use the "--format" option to
> "ls-tree" to just make the output the same?

I hadn't thought about changing that side of the output. I would have to 
call ls-tree once more instead of relying on the 'expect' file to be 
there when the function is called. But that might be nicer anyway.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] ls-tree: add --sparse-filter-oid argument
  2023-01-23 11:46 ` [PATCH v2] " William Sprent via GitGitGadget
  2023-01-23 13:00   ` Ævar Arnfjörð Bjarmason
  2023-01-23 13:06   ` Ævar Arnfjörð Bjarmason
@ 2023-01-24 20:11   ` Victoria Dye
  2023-01-25 13:47     ` William Sprent
  2023-01-25  5:11   ` Elijah Newren
  3 siblings, 1 reply; 19+ messages in thread
From: Victoria Dye @ 2023-01-24 20:11 UTC (permalink / raw)
  To: William Sprent via GitGitGadget, git
  Cc: Eric Sunshine, Ævar Arnfjörð Bjarmason,
	Derrick Stolee, Elijah Newren, William Sprent

William Sprent via GitGitGadget wrote:
> From: William Sprent <williams@unity3d.com>
> 
> There is currently no way to ask git the question "which files would be
> part of a sparse checkout of commit X with sparse checkout patterns Y".
> One use-case would be that tooling may want know whether sparse checkouts
> of two commits contain the same content even if the full trees differ.
> Another interesting use-case would be for tooling to use in conjunction
> with 'git update-index --index-info'.

These types of use cases align nicely with "Behavior A" as described in
'Documentation/technical/sparse-checkout.txt' [1]. I think adding a
'--scope=(sparse|all)' option to 'ls-trees' would be a good way to make
progress on the goals of that design document as well as serve the needs
described above.

[1] https://lore.kernel.org/git/pull.1367.v4.git.1667714666810.gitgitgadget@gmail.com/

> 
> 'rev-list --objects --filter=sparse:oid' comes close, but as rev-list is
> concerned with objects rather than directory trees, it leaves files out
> when the same blob occurs in at two different paths.
> 
> It is possible to ask git about the sparse status of files currently in
> the index with 'ls-files -t'. However, this does not work well when the
> caller is interested in another commit, intererested in sparsity
> patterns that aren't currently in '.git/info/sparse-checkout', or when
> working in with bare repo.

I'm a bit confused by your described use cases here, though. Right now,
'sparse-checkout' is a local repo-only optimization (for performance and/or
scoping user workspaces), but you seem to be implying a need for
"sparse-checkout patterns" as a general-purpose data format (like a config
file or 'rebase-todo') that can be used to manually configure command
behavior.

If that is what you're getting at, it seems like a pretty substantial
expansion to the scope of what we consider "sparse checkout". That's not to
say it isn't a good idea - I can definitely imagine tooling where this type
of functionality is useful - just that it warrants careful consideration so
we don't over-complicate the typical sparse-checkout user experience. 

> 
> To fill this gap, add a new argument to ls-tree '--sparse-filter-oid'
> which takes the object id of a blob containing sparse checkout patterns
> that filters the output of 'ls-tree'. When filtering with given sparsity
> patterns, 'ls-tree' only outputs blobs and commit objects that
> match the given patterns.

I don't think a blob OID is the best way to specify an arbitrary pattern set
in this case. OIDs will only be created for blobs that are tracked in Git;
it's possible that your project tracks potential sparse-checkout patterns in
Git, but it seems overly restrictive. You could instead specify the file
with a path on-disk (like the '--file' options in various commands, e.g.
'git commit'); if you did need to get that file from the object store, you
could first get its contents with 'git cat-file'. 

> 
> While it may be valid in some situations to output a tree object -- e.g.
> when a cone pattern matches all blobs recursively contained in a tree --
> it is less unclear what should be output if a sparse pattern matches
> parts of a tree.
> 
> To allow for reusing the pattern matching logic found in
> 'path_in_sparse_checkout_1()' in 'dir.c' with arbitrary patterns,
> extract the pattern matching part of the function into its own new
> function 'recursively_match_path_with_sparse_patterns()'.
> 
> Signed-off-by: William Sprent <williams@unity3d.com>
> ---
>     ls-tree: add --sparse-filter-oid argument
>     
>     I'm resubmitting this change as rebased on top of 'master', as it
>     conflicted with the topic 'ls-tree.c: clean-up works' 1
>     [https://public-inbox.org/git/20230112091135.20050-1-tenglong.tl@alibaba-inc.com],
>     which was merged to 'master' recently.
>     
>     This versions also incorporates changes based on the comments made in 2
>     [https://public-inbox.org/git/CAPig+cRgZ0CrkqY7mufuWmhf6BC8yXjXXuOTEQjuz+Y0NA+N7Q@mail.gmail.com/].
>     
>     I'm also looping in contributors that have touched ls-tree and/or
>     sparse-checkouts recently. I hope that's okay.

Definitely okay, thanks for CC-ing!

Overall, this is an interesting extension to 'sparse-checkout', and opens up
some opportunities for expanded tooling. However, at an implementation
level, I think it's addressing two separate needs ("constrain 'ls-files' to
display files matching patterns" and "specify sparse patterns not in
'.git/info/sparse-checkout'") in one option, which makes for a somewhat
confusing and inflexible user experience. What about instead breaking this
into two options:

* --scope=(sparse|all): limits the scope of the files output by ("Behavior
  A" vs. "Behavior B" from the document linked earlier). 
* --sparse-patterns=<file>: specifies "override" patterns to use instead of
  those in '.git/info/sparse-checkout' (whether 'sparse-checkout' is enabled
  for the repository or not).

I haven't looked at your implementation in detail yet, but I did want to
offer a recommendation in case you hadn't considered it: if you want to
allow the use of patterns from a user-specified specific file, it would be
nice to do it in a way that fully replaces the "default" sparse-checkout
settings at the lowest level (i.e., override the values of
'core_apply_sparse_checkout', 'core_sparse_checkout_cone', and
'get_sparse_checkout_filename()'). Doing it that way would both make it
easier for other commands to add a '--sparse-patterns' option, and avoid the
separate code path ('path_in_sparse_checkout_1()' vs.
'recursively_match_path_with_sparse_patterns()', for example) when dealing
with '.git/info/sparse-checkout' patterns vs. manually-specified patterns.

>     
>     William
> 
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1459%2Fwilliams-unity%2Fls-tree-sparse-v2
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1459/williams-unity/ls-tree-sparse-v2
> Pull-Request: https://github.com/gitgitgadget/git/pull/1459


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] ls-tree: add --sparse-filter-oid argument
  2023-01-23 11:46 ` [PATCH v2] " William Sprent via GitGitGadget
                     ` (2 preceding siblings ...)
  2023-01-24 20:11   ` Victoria Dye
@ 2023-01-25  5:11   ` Elijah Newren
  2023-01-25 16:16     ` William Sprent
  3 siblings, 1 reply; 19+ messages in thread
From: Elijah Newren @ 2023-01-25  5:11 UTC (permalink / raw)
  To: William Sprent via GitGitGadget
  Cc: git, Eric Sunshine, Ævar Arnfjörð Bjarmason,
	Derrick Stolee, Victoria Dye, William Sprent

It looks like Ævar and Victoria have both given really good reviews
already, but I think I spotted some additional things to comment on.

On Mon, Jan 23, 2023 at 3:46 AM William Sprent via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: William Sprent <williams@unity3d.com>
>
> There is currently no way to ask git the question "which files would be
> part of a sparse checkout of commit X with sparse checkout patterns Y".
> One use-case would be that tooling may want know whether sparse checkouts
> of two commits contain the same content even if the full trees differ.

Could you say more about this usecase?  Why does tooling need or want
to know this; won't a checkout of the new commit end up being quick
and simple?  (I'm not saying your usecase is bad, just curious that it
had never occurred to me, and I'm afraid I'm still not sure what your
purpose might be.)

> Another interesting use-case would be for tooling to use in conjunction
> with 'git update-index --index-info'.

Sorry, I'm not following.  Could you expound on this a bit?

> 'rev-list --objects --filter=sparse:oid' comes close, but as rev-list is
> concerned with objects rather than directory trees, it leaves files out
> when the same blob occurs in at two different paths.

s/in at/at/ ?

> It is possible to ask git about the sparse status of files currently in
> the index with 'ls-files -t'. However, this does not work well when the
> caller is interested in another commit, intererested in sparsity

s/intererested/interested/

> patterns that aren't currently in '.git/info/sparse-checkout', or when
> working in with bare repo.

s/in with bare/with a bare/ or s/in with bare/in a bare/?

> To fill this gap, add a new argument to ls-tree '--sparse-filter-oid'
> which takes the object id of a blob containing sparse checkout patterns
> that filters the output of 'ls-tree'. When filtering with given sparsity
> patterns, 'ls-tree' only outputs blobs and commit objects that
> match the given patterns.

This seems slightly unfortunate in that it makes things difficult for
cone mode users to take advantage of.  They will have to figure out
how to translate their directory list into sparse checkout patterns
before passing it along, and currently the only way to do that is via
`git sparse-checkout set <patterns>` and reading the patterns from
$GIT_DIR/info/sparse-checkout, but that toggles the sparsity of the
current working tree and avoiding changing the current sparse-checkout
was something you listed in your commit message as something you
wanted to avoid.

> While it may be valid in some situations to output a tree object -- e.g.
> when a cone pattern matches all blobs recursively contained in a tree --
> it is less unclear what should be output if a sparse pattern matches
> parts of a tree.
>
> To allow for reusing the pattern matching logic found in
> 'path_in_sparse_checkout_1()' in 'dir.c' with arbitrary patterns,
> extract the pattern matching part of the function into its own new
> function 'recursively_match_path_with_sparse_patterns()'.
>
> Signed-off-by: William Sprent <williams@unity3d.com>
> ---
>     ls-tree: add --sparse-filter-oid argument
>
>     I'm resubmitting this change as rebased on top of 'master', as it
>     conflicted with the topic 'ls-tree.c: clean-up works' 1
>     [https://public-inbox.org/git/20230112091135.20050-1-tenglong.tl@alibaba-inc.com],
>     which was merged to 'master' recently.
>
>     This versions also incorporates changes based on the comments made in 2
>     [https://public-inbox.org/git/CAPig+cRgZ0CrkqY7mufuWmhf6BC8yXjXXuOTEQjuz+Y0NA+N7Q@mail.gmail.com/].
>
>     I'm also looping in contributors that have touched ls-tree and/or
>     sparse-checkouts recently. I hope that's okay.

Of course!  It's encouraged, even.

[...]
> diff --git a/builtin/ls-tree.c b/builtin/ls-tree.c
[...]
> +static void init_sparse_filter_data(struct sparse_filter_data **d, struct ls_tree_options *options,
> +       const char *sparse_oid_name, read_tree_fn_t fn)
> +{
> +       struct object_id sparse_oid;
> +       struct object_context oc;
> +
> +       (*d) = xcalloc(1, sizeof(**d));
> +       (*d)->fn = fn;
> +       (*d)->pl.use_cone_patterns = core_sparse_checkout_cone;

Hmm, so the behavior still depends upon the current sparse-checkout
(or lack thereof), despite the documentation and rationale of your
feature as being there to check how a different sparse checkout would
behave?

I would hate to unconditionally turn cone_patterns off, since that
would come with a huge performance penalty for the biggest repos.  But
turning it unconditionally on wouldn't be good for the non-cone users.
This probably suggests we need something like another flag, or perhaps
separate flags for each mode.  Separate flags might provide the
benefit of allowing cone mode users to specify directories rather than
patterns, which would make it much easier for them to use.

[...]
> +static int path_matches_sparse_checkout_patterns(struct strbuf *full_path, struct pattern_list *pl, int dtype)
> +{
> +       enum pattern_match_result match = recursively_match_path_with_sparse_patterns(full_path->buf, the_repository->index, dtype, pl);
> +       return match > 0;

So your new caller doesn't care about the pattern_match_result, it
just wants to know if it got MATCHED or MATCHED_RECURSIVELY...

[...]
> diff --git a/dir.c b/dir.c
> index 4e99f0c868f..122ebced08e 100644
> --- a/dir.c
> +++ b/dir.c
> @@ -1457,45 +1457,50 @@ int init_sparse_checkout_patterns(struct index_state *istate)
>         return 0;
>  }
>
> -static int path_in_sparse_checkout_1(const char *path,
> -                                    struct index_state *istate,
> -                                    int require_cone_mode)
> +int recursively_match_path_with_sparse_patterns(const char *path,

You claim it returns an int here, but previously you presumed an enum
pattern_match_result from the new caller.

> +                                               struct index_state *istate,
> +                                               int dtype,
> +                                               struct pattern_list *pl)
>  {
> -       int dtype = DT_REG;
>         enum pattern_match_result match = UNDECIDED;
>         const char *end, *slash;
> -
> -       /*
> -        * We default to accepting a path if the path is empty, there are no
> -        * patterns, or the patterns are of the wrong type.
> -        */
> -       if (!*path ||
> -           init_sparse_checkout_patterns(istate) ||
> -           (require_cone_mode &&
> -            !istate->sparse_checkout_patterns->use_cone_patterns))
> -               return 1;
> -
>         /*
>          * If UNDECIDED, use the match from the parent dir (recursively), or
>          * fall back to NOT_MATCHED at the topmost level. Note that cone mode
>          * never returns UNDECIDED, so we will execute only one iteration in
>          * this case.
>          */
> -       for (end = path + strlen(path);
> -            end > path && match == UNDECIDED;
> +       for (end = path + strlen(path); end > path && match == UNDECIDED;
>              end = slash) {
> -
>                 for (slash = end - 1; slash > path && *slash != '/'; slash--)
>                         ; /* do nothing */
>
>                 match = path_matches_pattern_list(path, end - path,
>                                 slash > path ? slash + 1 : path, &dtype,
> -                               istate->sparse_checkout_patterns, istate);
> +                               pl, istate);
>
>                 /* We are going to match the parent dir now */
>                 dtype = DT_DIR;
>         }
> -       return match > 0;
> +
> +       return match;

Um, this last line seems like a potentially scary change in behavior.
Why should UNDECIDED return a non-zero value?  Previously, we returned
a 0 value for both UNDECIDED and NOT_MATCHED, but you've changed that
here.  If the change in this last line is actually correct, it should
be split out into its own commit and explained in detail in the commit
message.

> +}
> +
> +static int path_in_sparse_checkout_1(const char *path,
> +                                    struct index_state *istate,
> +                                    int require_cone_mode)
> +{
> +       /*
> +        * We default to accepting a path if the path is empty, there are no
> +        * patterns, or the patterns are of the wrong type.
> +        */
> +       if (!*path ||
> +           init_sparse_checkout_patterns(istate) ||
> +           (require_cone_mode &&
> +            !istate->sparse_checkout_patterns->use_cone_patterns))
> +               return 1;
> +
> +       return recursively_match_path_with_sparse_patterns(path, istate, DT_REG, istate->sparse_checkout_patterns) > 0;

Oh, you compare to > 0 here...and digging around your only other
caller just immediately compares to > 0 as well.

Why not just have recursively_match_path_with_sparse_patterns() do the
> 0 check?  If it does, returning int is fine.  If it doesn't, it
should be declared to return enum pattern_match_result.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] ls-tree: add --sparse-filter-oid argument
  2023-01-24 20:11   ` Victoria Dye
@ 2023-01-25 13:47     ` William Sprent
  2023-01-25 18:32       ` Victoria Dye
  0 siblings, 1 reply; 19+ messages in thread
From: William Sprent @ 2023-01-25 13:47 UTC (permalink / raw)
  To: Victoria Dye, William Sprent via GitGitGadget, git
  Cc: Eric Sunshine, Ævar Arnfjörð Bjarmason,
	Derrick Stolee, Elijah Newren

On 24/01/2023 21.11, Victoria Dye wrote:
> William Sprent via GitGitGadget wrote:
>> From: William Sprent <williams@unity3d.com>
>>
>> There is currently no way to ask git the question "which files would be
>> part of a sparse checkout of commit X with sparse checkout patterns Y".
>> One use-case would be that tooling may want know whether sparse checkouts
>> of two commits contain the same content even if the full trees differ.
>> Another interesting use-case would be for tooling to use in conjunction
>> with 'git update-index --index-info'.
> 
> These types of use cases align nicely with "Behavior A" as described in
> 'Documentation/technical/sparse-checkout.txt' [1]. I think adding a
> '--scope=(sparse|all)' option to 'ls-trees' would be a good way to make
> progress on the goals of that design document as well as serve the needs
> described above.
> 
> [1] https://lore.kernel.org/git/pull.1367.v4.git.1667714666810.gitgitgadget@gmail.com/
> >>
>> 'rev-list --objects --filter=sparse:oid' comes close, but as rev-list is
>> concerned with objects rather than directory trees, it leaves files out
>> when the same blob occurs in at two different paths.
>>
>> It is possible to ask git about the sparse status of files currently in
>> the index with 'ls-files -t'. However, this does not work well when the
>> caller is interested in another commit, intererested in sparsity
>> patterns that aren't currently in '.git/info/sparse-checkout', or when
>> working in with bare repo.
> 
> I'm a bit confused by your described use cases here, though. Right now,
> 'sparse-checkout' is a local repo-only optimization (for performance and/or
> scoping user workspaces), but you seem to be implying a need for
> "sparse-checkout patterns" as a general-purpose data format (like a config
> file or 'rebase-todo') that can be used to manually configure command
> behavior.
> 
> If that is what you're getting at, it seems like a pretty substantial
> expansion to the scope of what we consider "sparse checkout". That's not to
> say it isn't a good idea - I can definitely imagine tooling where this type
> of functionality is useful - just that it warrants careful consideration so
> we don't over-complicate the typical sparse-checkout user experience.
> 

I guess it is sort of what I'm getting at. I want to enable tooling (in 
particular) to be able to interrogate git about sparse checkouts without 
having to checkout a commit and load a set of patterns. There's already 
a deterministic relationship between a (commit * sparse checkout 
patterns) and a list of included files, so I think it makes sense to be 
able to ask git about it.

I agree with your concerns about not over complicating things for the 
average user. I think the cone/non-cone distinctions are already a bit 
hard to grasp (at least it was for me). FWIW, it isn't my intention that 
an average user should use the "explicitly pass patterns to ls-tree" 
functionality. But of course we should still get it right.


>>
>> To fill this gap, add a new argument to ls-tree '--sparse-filter-oid'
>> which takes the object id of a blob containing sparse checkout patterns
>> that filters the output of 'ls-tree'. When filtering with given sparsity
>> patterns, 'ls-tree' only outputs blobs and commit objects that
>> match the given patterns.
> 
> I don't think a blob OID is the best way to specify an arbitrary pattern set
> in this case. OIDs will only be created for blobs that are tracked in Git;
> it's possible that your project tracks potential sparse-checkout patterns in
> Git, but it seems overly restrictive. You could instead specify the file
> with a path on-disk (like the '--file' options in various commands, e.g.
> 'git commit'); if you did need to get that file from the object store, you
> could first get its contents with 'git cat-file'.
> 

I'm fine with changing it to be a file. I picked it to align with

  'git rev-list --filter=sparse:<blob-oid>'

which I think is the only other place in the code base where you can 
query about arbitrary filters. But I agree, it is a bit awkward to use.

>>
>> While it may be valid in some situations to output a tree object -- e.g.
>> when a cone pattern matches all blobs recursively contained in a tree --
>> it is less unclear what should be output if a sparse pattern matches
>> parts of a tree.
>>
>> To allow for reusing the pattern matching logic found in
>> 'path_in_sparse_checkout_1()' in 'dir.c' with arbitrary patterns,
>> extract the pattern matching part of the function into its own new
>> function 'recursively_match_path_with_sparse_patterns()'.
>>
>> Signed-off-by: William Sprent <williams@unity3d.com>
>> ---
>>      ls-tree: add --sparse-filter-oid argument
>>      
>>      I'm resubmitting this change as rebased on top of 'master', as it
>>      conflicted with the topic 'ls-tree.c: clean-up works' 1
>>      [https://public-inbox.org/git/20230112091135.20050-1-tenglong.tl@alibaba-inc.com],
>>      which was merged to 'master' recently.
>>      
>>      This versions also incorporates changes based on the comments made in 2
>>      [https://public-inbox.org/git/CAPig+cRgZ0CrkqY7mufuWmhf6BC8yXjXXuOTEQjuz+Y0NA+N7Q@mail.gmail.com/].
>>      
>>      I'm also looping in contributors that have touched ls-tree and/or
>>      sparse-checkouts recently. I hope that's okay.
> 
> Definitely okay, thanks for CC-ing!
> 
> Overall, this is an interesting extension to 'sparse-checkout', and opens up
> some opportunities for expanded tooling. However, at an implementation
> level, I think it's addressing two separate needs ("constrain 'ls-files' to
> display files matching patterns" and "specify sparse patterns not in
> '.git/info/sparse-checkout'") in one option, which makes for a somewhat
> confusing and inflexible user experience. What about instead breaking this
> into two options:
> 
> * --scope=(sparse|all): limits the scope of the files output by ("Behavior
>    A" vs. "Behavior B" from the document linked earlier).
> * --sparse-patterns=<file>: specifies "override" patterns to use instead of
>    those in '.git/info/sparse-checkout' (whether 'sparse-checkout' is enabled
>    for the repository or not).
> 

That makes sense to me. I'll look into making those changes.

> I haven't looked at your implementation in detail yet, but I did want to
> offer a recommendation in case you hadn't considered it: if you want to
> allow the use of patterns from a user-specified specific file, it would be
> nice to do it in a way that fully replaces the "default" sparse-checkout
> settings at the lowest level (i.e., override the values of
> 'core_apply_sparse_checkout', 'core_sparse_checkout_cone', and
> 'get_sparse_checkout_filename()'). Doing it that way would both make it
> easier for other commands to add a '--sparse-patterns' option, and avoid the
> separate code path ('path_in_sparse_checkout_1()' vs.
> 'recursively_match_path_with_sparse_patterns()', for example) when dealing
> with '.git/info/sparse-checkout' patterns vs. manually-specified patterns.
> 

Thanks for the pointers. I'll see what I can do. Do you mean something 
along the line of the following?

    set_sparse_checkout_file(filename);
    init_sparse_checkout_patterns(istate);
    _ = path_in_sparse_checkout_1(some_path, istate, ...);


I do I think I like the explicitness of first loading the pattern_list 
and then passing it to the matching function, though. But maybe it is 
too cumbersome.

>>      
>>      William
>>
>> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1459%2Fwilliams-unity%2Fls-tree-sparse-v2
>> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1459/williams-unity/ls-tree-sparse-v2
>> Pull-Request: https://github.com/gitgitgadget/git/pull/1459
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] ls-tree: add --sparse-filter-oid argument
  2023-01-25  5:11   ` Elijah Newren
@ 2023-01-25 16:16     ` William Sprent
  2023-01-26  3:25       ` Elijah Newren
  0 siblings, 1 reply; 19+ messages in thread
From: William Sprent @ 2023-01-25 16:16 UTC (permalink / raw)
  To: Elijah Newren, William Sprent via GitGitGadget
  Cc: git, Eric Sunshine, Ævar Arnfjörð Bjarmason,
	Derrick Stolee, Victoria Dye

On 25/01/2023 06.11, Elijah Newren wrote:
> It looks like Ævar and Victoria have both given really good reviews
> already, but I think I spotted some additional things to comment on.
> 
> On Mon, Jan 23, 2023 at 3:46 AM William Sprent via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>>
>> From: William Sprent <williams@unity3d.com>
>>
>> There is currently no way to ask git the question "which files would be
>> part of a sparse checkout of commit X with sparse checkout patterns Y".
>> One use-case would be that tooling may want know whether sparse checkouts
>> of two commits contain the same content even if the full trees differ.
> 
> Could you say more about this usecase?  Why does tooling need or want
> to know this; won't a checkout of the new commit end up being quick
> and simple?  (I'm not saying your usecase is bad, just curious that it
> had never occurred to me, and I'm afraid I'm still not sure what your
> purpose might be.)
> 

I'm thinking mainly about a monorepo context where there are a number of 
distinct 'units' that can be described with sparse checkout patterns. 
And perhaps there's some tooling that only wants to perform an action if 
the content of a 'unit' changes.

Depending on the repo, it won't necessarily be quick to check out the 
commit with the given patterns. However, it is more about it being 
inconvenient to have to have a working directory, especially so if you 
want use the tooling in some kind of service or query rapidly about 
different revisions/patterns.

>> Another interesting use-case would be for tooling to use in conjunction
>> with 'git update-index --index-info'.
> 
> Sorry, I'm not following.  Could you expound on this a bit?
> 

I was imagining something along the lines of being able to generate new 
tree objects based on what matches the given sparse checkout patterns. 
Not that I have a specific use case for it right now.

I think what I'm trying to evoke with that paragraph is that this 
enables integrations with git that seem interesting and weren't possible 
before.

>> 'rev-list --objects --filter=sparse:oid' comes close, but as rev-list is
>> concerned with objects rather than directory trees, it leaves files out
>> when the same blob occurs in at two different paths.
> 
> s/in at/at/ ?
> >> It is possible to ask git about the sparse status of files currently in
>> the index with 'ls-files -t'. However, this does not work well when the
>> caller is interested in another commit, intererested in sparsity
> 
> s/intererested/interested/
> 
>> patterns that aren't currently in '.git/info/sparse-checkout', or when
>> working in with bare repo.
> 
> s/in with bare/with a bare/ or s/in with bare/in a bare/?
> 

Ah. Thanks for catching those.

>> To fill this gap, add a new argument to ls-tree '--sparse-filter-oid'
>> which takes the object id of a blob containing sparse checkout patterns
>> that filters the output of 'ls-tree'. When filtering with given sparsity
>> patterns, 'ls-tree' only outputs blobs and commit objects that
>> match the given patterns.
> 
> This seems slightly unfortunate in that it makes things difficult for
> cone mode users to take advantage of.  They will have to figure out
> how to translate their directory list into sparse checkout patterns
> before passing it along, and currently the only way to do that is via
> `git sparse-checkout set <patterns>` and reading the patterns from
> $GIT_DIR/info/sparse-checkout, but that toggles the sparsity of the
> current working tree and avoiding changing the current sparse-checkout
> was something you listed in your commit message as something you
> wanted to avoid.
>  >> While it may be valid in some situations to output a tree object -- e.g.
>> when a cone pattern matches all blobs recursively contained in a tree --
>> it is less unclear what should be output if a sparse pattern matches
>> parts of a tree.
>>
>> To allow for reusing the pattern matching logic found in
>> 'path_in_sparse_checkout_1()' in 'dir.c' with arbitrary patterns,
>> extract the pattern matching part of the function into its own new
>> function 'recursively_match_path_with_sparse_patterns()'.
>>
>> Signed-off-by: William Sprent <williams@unity3d.com>
>> ---
>>      ls-tree: add --sparse-filter-oid argument
>>
>>      I'm resubmitting this change as rebased on top of 'master', as it
>>      conflicted with the topic 'ls-tree.c: clean-up works' 1
>>      [https://public-inbox.org/git/20230112091135.20050-1-tenglong.tl@alibaba-inc.com],
>>      which was merged to 'master' recently.
>>
>>      This versions also incorporates changes based on the comments made in 2
>>      [https://public-inbox.org/git/CAPig+cRgZ0CrkqY7mufuWmhf6BC8yXjXXuOTEQjuz+Y0NA+N7Q@mail.gmail.com/].
>>
>>      I'm also looping in contributors that have touched ls-tree and/or
>>      sparse-checkouts recently. I hope that's okay.
> 
> Of course!  It's encouraged, even.
> 
> [...]
>> diff --git a/builtin/ls-tree.c b/builtin/ls-tree.c
> [...]
>> +static void init_sparse_filter_data(struct sparse_filter_data **d, struct ls_tree_options *options,
>> +       const char *sparse_oid_name, read_tree_fn_t fn)
>> +{
>> +       struct object_id sparse_oid;
>> +       struct object_context oc;
>> +
>> +       (*d) = xcalloc(1, sizeof(**d));
>> +       (*d)->fn = fn;
>> +       (*d)->pl.use_cone_patterns = core_sparse_checkout_cone;
> 
> Hmm, so the behavior still depends upon the current sparse-checkout
> (or lack thereof), despite the documentation and rationale of your
> feature as being there to check how a different sparse checkout would
> behave?
> 
> I would hate to unconditionally turn cone_patterns off, since that
> would come with a huge performance penalty for the biggest repos.  But
> turning it unconditionally on wouldn't be good for the non-cone users.
> This probably suggests we need something like another flag, or perhaps
> separate flags for each mode.  Separate flags might provide the
> benefit of allowing cone mode users to specify directories rather than
> patterns, which would make it much easier for them to use.
>
I used 'core_sparse_checkout_cone' because I wanted to allow for the 
cone mode optimisations, but I also figured that I should respect the 
configuration. It doesn't change how the patterns are parsed in this case.

I agree that it is a bit awkward to have to "translate" the directories 
into patterns when wanting to use cone mode. I can try adding
'--[no]-cone' flags and see how that feels. Together with Victoria's 
suggestions that would result in having the following flags:

* --scope=(sparse|all)
* --sparse-patterns-file=<path>
* --[no]-cone: used together with --sparse-patterns-file to tell git
   whether to interpret the patterns given as directories (cone) or
   patterns (no-cone).

Which seems like a lot at first glance. But it allows for passing 
directories instead of patterns for cone mode, and is similar to the 
behaviour of 'sparse-checkout set'.

Does that seem like something that would make sense?

> [...]
>> +static int path_matches_sparse_checkout_patterns(struct strbuf *full_path, struct pattern_list *pl, int dtype)
>> +{
>> +       enum pattern_match_result match = recursively_match_path_with_sparse_patterns(full_path->buf, the_repository->index, dtype, pl);
>> +       return match > 0;
> 
> So your new caller doesn't care about the pattern_match_result, it
> just wants to know if it got MATCHED or MATCHED_RECURSIVELY...
> 
> [...]
>> diff --git a/dir.c b/dir.c
>> index 4e99f0c868f..122ebced08e 100644
>> --- a/dir.c
>> +++ b/dir.c
>> @@ -1457,45 +1457,50 @@ int init_sparse_checkout_patterns(struct index_state *istate)
>>          return 0;
>>   }
>>
>> -static int path_in_sparse_checkout_1(const char *path,
>> -                                    struct index_state *istate,
>> -                                    int require_cone_mode)
>> +int recursively_match_path_with_sparse_patterns(const char *path,
> 
> You claim it returns an int here, but previously you presumed an enum
> pattern_match_result from the new caller.
> 
>> +                                               struct index_state *istate,
>> +                                               int dtype,
>> +                                               struct pattern_list *pl)
>>   {
>> -       int dtype = DT_REG;
>>          enum pattern_match_result match = UNDECIDED;
>>          const char *end, *slash;
>> -
>> -       /*
>> -        * We default to accepting a path if the path is empty, there are no
>> -        * patterns, or the patterns are of the wrong type.
>> -        */
>> -       if (!*path ||
>> -           init_sparse_checkout_patterns(istate) ||
>> -           (require_cone_mode &&
>> -            !istate->sparse_checkout_patterns->use_cone_patterns))
>> -               return 1;
>> -
>>          /*
>>           * If UNDECIDED, use the match from the parent dir (recursively), or
>>           * fall back to NOT_MATCHED at the topmost level. Note that cone mode
>>           * never returns UNDECIDED, so we will execute only one iteration in
>>           * this case.
>>           */
>> -       for (end = path + strlen(path);
>> -            end > path && match == UNDECIDED;
>> +       for (end = path + strlen(path); end > path && match == UNDECIDED;
>>               end = slash) {
>> -
>>                  for (slash = end - 1; slash > path && *slash != '/'; slash--)
>>                          ; /* do nothing */
>>
>>                  match = path_matches_pattern_list(path, end - path,
>>                                  slash > path ? slash + 1 : path, &dtype,
>> -                               istate->sparse_checkout_patterns, istate);
>> +                               pl, istate);
>>
>>                  /* We are going to match the parent dir now */
>>                  dtype = DT_DIR;
>>          }
>> -       return match > 0;
>> +
>> +       return match;
> 
> Um, this last line seems like a potentially scary change in behavior.
> Why should UNDECIDED return a non-zero value?  Previously, we returned
> a 0 value for both UNDECIDED and NOT_MATCHED, but you've changed that
> here.  If the change in this last line is actually correct, it should
> be split out into its own commit and explained in detail in the commit
> message.
> 
>> +}
>> +
>> +static int path_in_sparse_checkout_1(const char *path,
>> +                                    struct index_state *istate,
>> +                                    int require_cone_mode)
>> +{
>> +       /*
>> +        * We default to accepting a path if the path is empty, there are no
>> +        * patterns, or the patterns are of the wrong type.
>> +        */
>> +       if (!*path ||
>> +           init_sparse_checkout_patterns(istate) ||
>> +           (require_cone_mode &&
>> +            !istate->sparse_checkout_patterns->use_cone_patterns))
>> +               return 1;
>> +
>> +       return recursively_match_path_with_sparse_patterns(path, istate, DT_REG, istate->sparse_checkout_patterns) > 0;
> 
> Oh, you compare to > 0 here...and digging around your only other
> caller just immediately compares to > 0 as well.
> 
> Why not just have recursively_match_path_with_sparse_patterns() do the
>> 0 check?  If it does, returning int is fine.  If it doesn't, it
> should be declared to return enum pattern_match_result.

I think my thinking was that it made sense in general for 
'recursively_match_path_with_sparse_patterns()' to expose the match 
result, but then failed to declare it that way. Thanks for catching it.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] ls-tree: add --sparse-filter-oid argument
  2023-01-25 13:47     ` William Sprent
@ 2023-01-25 18:32       ` Victoria Dye
  2023-01-26 14:55         ` William Sprent
  0 siblings, 1 reply; 19+ messages in thread
From: Victoria Dye @ 2023-01-25 18:32 UTC (permalink / raw)
  To: William Sprent, William Sprent via GitGitGadget, git
  Cc: Eric Sunshine, Ævar Arnfjörð Bjarmason,
	Derrick Stolee, Elijah Newren

William Sprent wrote:
> On 24/01/2023 21.11, Victoria Dye wrote>> I haven't looked at your implementation in detail yet, but I did want to
>> offer a recommendation in case you hadn't considered it: if you want to
>> allow the use of patterns from a user-specified specific file, it would be
>> nice to do it in a way that fully replaces the "default" sparse-checkout
>> settings at the lowest level (i.e., override the values of
>> 'core_apply_sparse_checkout', 'core_sparse_checkout_cone', and
>> 'get_sparse_checkout_filename()'). Doing it that way would both make it
>> easier for other commands to add a '--sparse-patterns' option, and avoid the
>> separate code path ('path_in_sparse_checkout_1()' vs.
>> 'recursively_match_path_with_sparse_patterns()', for example) when dealing
>> with '.git/info/sparse-checkout' patterns vs. manually-specified patterns.
>>
> 
> Thanks for the pointers. I'll see what I can do. Do you mean something
> along the line of the following?
> 
>    set_sparse_checkout_file(filename);
>    init_sparse_checkout_patterns(istate);
>    _ = path_in_sparse_checkout_1(some_path, istate, ...);
> 

Sort of. I mentioned separating the options into "specify the sparse pattern
file" and "restrict the displayed files to the active pattern set, if there
is one". For the former, you might add an option like:

	OPT_FILENAME(0, "sparse-patterns", &sparse_pattern_file, 
		     N_("override sparse-checkout behavior using patterns from <file>"))

Then do something like what you have, right after option parsing:

	if (sparse_pattern_file) {
		core_apply_sparse_checkout = 1;
		core_sparse_checkout_cone = <???>;
		set_sparse_checkout_file(filename);
	}

If this option is specified, but the repo already has sparse
patterns/settings of its own, you'll need to (carefully) override the repo's
existing configuration:

* 'core_apply_sparse_checkout' & 'core_sparse_checkout_cone' are set based
  on the repo config. You'll need to make sure those values are overridden
  before loading the sparse-checkout patterns, and also that they're set
  *after* loading the config.
* Speaking of 'core_sparse_checkout_cone', there are a bunch of ways you
  might configure "cone-" vs. "non-cone" for your patterns (user-specified
  with yet another option, always assume one or the other, try deriving from
  the patterns). My preference would be to always assume "cone mode" - it's
  the direction Git has been moving with sparse-checkout over the past year,
  and still automatically "falls back" on non-cone patterns if the patterns
  can't be used in cone mode (with a warning from
  'add_pattern_to_hashsets()': "disabling cone pattern matching").
* If the repo is using a sparse index, the existing sparse directories may
  not be compatible with the custom patterns. Probably easiest to force use
  of a full index, e.g. with 'command_requires_full_index = 1'.

Fair warning: this probably isn't an exhaustive list of things that would
need updating, and it'll need thorough testing to make sure there are no
regressions. But, extending the underlying sparse-checkout infrastructure
will (ideally) avoid duplicating code and make this behavior reusable across
other commands.

For the other desired behavior ("limit the files to the active
sparse-checkout patterns"), you could add an option:

	OPT_CALLBACK_F(0, "scope", &sparse_scope, "(sparse|all)",
		       N_("specify the scope of results with respect to the sparse-checkout"),
		       PARSE_OPT_NONEG, option_parse_scope),

...whose callback parses the string arg into a 'restrict_scope'
boolean/enum/something. Then, wherever in 'ls-files' a tree or the index are
iterated over, you can gate the per-file operation on:

	if (!restrict_scope || path_in_sparse_checkout(<path>, istate)) {
		/* do something with <path> */
	}

Note that you should use 'path_in_sparse_checkout()', *not* the
internal/private function 'path_in_sparse_checkout_1()'; you also don't need
to explicitly 'init_sparse_checkout_patterns()'. Regardless of whether you
specified custom patterns or are using the ones in
'.git/info/sparse-checkout', 'path_in_sparse_checkout()' will initialize &
use the appropriate patterns.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] ls-tree: add --sparse-filter-oid argument
  2023-01-25 16:16     ` William Sprent
@ 2023-01-26  3:25       ` Elijah Newren
  2023-01-27 11:58         ` William Sprent
  0 siblings, 1 reply; 19+ messages in thread
From: Elijah Newren @ 2023-01-26  3:25 UTC (permalink / raw)
  To: William Sprent
  Cc: William Sprent via GitGitGadget, git, Eric Sunshine,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Victoria Dye

On Wed, Jan 25, 2023 at 8:16 AM William Sprent <williams@unity3d.com> wrote:
>
> On 25/01/2023 06.11, Elijah Newren wrote:
> > It looks like Ævar and Victoria have both given really good reviews
> > already, but I think I spotted some additional things to comment on.
> >
> > On Mon, Jan 23, 2023 at 3:46 AM William Sprent via GitGitGadget
> > <gitgitgadget@gmail.com> wrote:
> >>
> >> From: William Sprent <williams@unity3d.com>
> >>
> >> There is currently no way to ask git the question "which files would be
> >> part of a sparse checkout of commit X with sparse checkout patterns Y".
> >> One use-case would be that tooling may want know whether sparse checkouts
> >> of two commits contain the same content even if the full trees differ.
> >
> > Could you say more about this usecase?  Why does tooling need or want
> > to know this; won't a checkout of the new commit end up being quick
> > and simple?  (I'm not saying your usecase is bad, just curious that it
> > had never occurred to me, and I'm afraid I'm still not sure what your
> > purpose might be.)
> >
>
> I'm thinking mainly about a monorepo context where there are a number of
> distinct 'units' that can be described with sparse checkout patterns.
> And perhaps there's some tooling that only wants to perform an action if
> the content of a 'unit' changes.

So, you're basically wanting to do something like
   git ls-tree --paths-matching-sparsity-file=<pattern-file> $COMMIT1
>sparse-files
   git ls-tree --paths-matching-sparsity-file=<pattern-file> $COMMIT2
>>sparse-files
   sort sparse-files | uniq >relevant-files
   git diff --name-only $COMMIT1 $COMMIT2 >changed-files
and then checking whether relevant-files and changed-files have a
non-empty intersection?

Would that potentially be better handled by
   git diff --name-only $COMMIT1 $COMMIT2 | git check-ignore
--ignore-file=<pattern-file> --stdin
and seeing whether the output is non-empty?  We'd have to add an
"--ignore-file" option to check-ignore to override reading of
.gitignore files and such, and it'd be slightly confusing because the
documentation talks about "ignored" files rather than "selected"
files, but that's a confusion point that has been with us ever since
the gitignore mechanism was repurposed for sparse checkouts.  Or maybe
we could just add a check-sparsity helper, and then allow it to take
directories in-lieu of patterns.

This seems nicer than opening a can of worms about letting every git
command specify a different set of sparsity rules.

> Depending on the repo, it won't necessarily be quick to check out the
> commit with the given patterns. However, it is more about it being
> inconvenient to have to have a working directory, especially so if you
> want use the tooling in some kind of service or query rapidly about
> different revisions/patterns.
>
> >> Another interesting use-case would be for tooling to use in conjunction
> >> with 'git update-index --index-info'.
> >
> > Sorry, I'm not following.  Could you expound on this a bit?
> >
>
> I was imagining something along the lines of being able to generate new
> tree objects based on what matches the given sparse checkout patterns.
> Not that I have a specific use case for it right now.
>
> I think what I'm trying to evoke with that paragraph is that this
> enables integrations with git that seem interesting and weren't possible
> before.

I'm not sure if it's interesting, frightening, or something else.
Hard to say without better descriptions of usecases, which we can't
have if we don't even have a usecase.  I think I'd just strike this
paragraph.

[...]
> >> +       (*d)->pl.use_cone_patterns = core_sparse_checkout_cone;
> >
> > Hmm, so the behavior still depends upon the current sparse-checkout
> > (or lack thereof), despite the documentation and rationale of your
> > feature as being there to check how a different sparse checkout would
> > behave?
> >
> > I would hate to unconditionally turn cone_patterns off, since that
> > would come with a huge performance penalty for the biggest repos.  But
> > turning it unconditionally on wouldn't be good for the non-cone users.
> > This probably suggests we need something like another flag, or perhaps
> > separate flags for each mode.  Separate flags might provide the
> > benefit of allowing cone mode users to specify directories rather than
> > patterns, which would make it much easier for them to use.
> >
> I used 'core_sparse_checkout_cone' because I wanted to allow for the
> cone mode optimisations, but I also figured that I should respect the
> configuration. It doesn't change how the patterns are parsed in this case.
>
> I agree that it is a bit awkward to have to "translate" the directories
> into patterns when wanting to use cone mode. I can try adding
> '--[no]-cone' flags and see how that feels. Together with Victoria's
> suggestions that would result in having the following flags:
>
> * --scope=(sparse|all)
> * --sparse-patterns-file=<path>
> * --[no]-cone: used together with --sparse-patterns-file to tell git
>    whether to interpret the patterns given as directories (cone) or
>    patterns (no-cone).
>
> Which seems like a lot at first glance. But it allows for passing
> directories instead of patterns for cone mode, and is similar to the
> behaviour of 'sparse-checkout set'.
>
> Does that seem like something that would make sense?

--sparse-patterns-file still implies patterns; I think that would need
some rewording.

More importantly, though, based on your usecase description, I wonder
if you might be better served by either extending the check-ignore
subcommand or adding a similar helper ("check-sparsity"?), rather than
tweaking ls-tree.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] ls-tree: add --sparse-filter-oid argument
  2023-01-25 18:32       ` Victoria Dye
@ 2023-01-26 14:55         ` William Sprent
  0 siblings, 0 replies; 19+ messages in thread
From: William Sprent @ 2023-01-26 14:55 UTC (permalink / raw)
  To: Victoria Dye, William Sprent via GitGitGadget, git
  Cc: Eric Sunshine, Ævar Arnfjörð Bjarmason,
	Derrick Stolee, Elijah Newren

On 25/01/2023 19.32, Victoria Dye wrote:
> William Sprent wrote:
>> On 24/01/2023 21.11, Victoria Dye wrote>> I haven't looked at your implementation in detail yet, but I did want to
>>> offer a recommendation in case you hadn't considered it: if you want to
>>> allow the use of patterns from a user-specified specific file, it would be
>>> nice to do it in a way that fully replaces the "default" sparse-checkout
>>> settings at the lowest level (i.e., override the values of
>>> 'core_apply_sparse_checkout', 'core_sparse_checkout_cone', and
>>> 'get_sparse_checkout_filename()'). Doing it that way would both make it
>>> easier for other commands to add a '--sparse-patterns' option, and avoid the
>>> separate code path ('path_in_sparse_checkout_1()' vs.
>>> 'recursively_match_path_with_sparse_patterns()', for example) when dealing
>>> with '.git/info/sparse-checkout' patterns vs. manually-specified patterns.
>>>
>>
>> Thanks for the pointers. I'll see what I can do. Do you mean something
>> along the line of the following?
>>
>>     set_sparse_checkout_file(filename);
>>     init_sparse_checkout_patterns(istate);
>>     _ = path_in_sparse_checkout_1(some_path, istate, ...);
>>
> 
> Sort of. I mentioned separating the options into "specify the sparse pattern
> file" and "restrict the displayed files to the active pattern set, if there
> is one". For the former, you might add an option like:
> 
> 	OPT_FILENAME(0, "sparse-patterns", &sparse_pattern_file,
> 		     N_("override sparse-checkout behavior using patterns from <file>"))
> 
> Then do something like what you have, right after option parsing:
> 
> 	if (sparse_pattern_file) {
> 		core_apply_sparse_checkout = 1;
> 		core_sparse_checkout_cone = <???>;
> 		set_sparse_checkout_file(filename);
> 	}
> 
> If this option is specified, but the repo already has sparse
> patterns/settings of its own, you'll need to (carefully) override the repo's
> existing configuration:
> 
> * 'core_apply_sparse_checkout' & 'core_sparse_checkout_cone' are set based
>    on the repo config. You'll need to make sure those values are overridden
>    before loading the sparse-checkout patterns, and also that they're set
>    *after* loading the config.

This sounds a bit easy to get wrong to me, but I assume I can trust that the
config has been loaded by the time 'cmd_ls_tree()' is invoked.

> * Speaking of 'core_sparse_checkout_cone', there are a bunch of ways you
>    might configure "cone-" vs. "non-cone" for your patterns (user-specified
>    with yet another option, always assume one or the other, try deriving from
>    the patterns). My preference would be to always assume "cone mode" - it's
>    the direction Git has been moving with sparse-checkout over the past year,
>    and still automatically "falls back" on non-cone patterns if the patterns
>    can't be used in cone mode (with a warning from
>    'add_pattern_to_hashsets()': "disabling cone pattern matching").

Alright. I've had similar thoughts. But I ended up deciding to respect the
config value since there wouldn't be any way for the user to silence the warning
when passing non-cone mode patterns to the command. I don't feel too strongly about
it, though.

> * If the repo is using a sparse index, the existing sparse directories may
>    not be compatible with the custom patterns. Probably easiest to force use
>    of a full index, e.g. with 'command_requires_full_index = 1'.
> 

OK.

> Fair warning: this probably isn't an exhaustive list of things that would
> need updating, and it'll need thorough testing to make sure there are no
> regressions. But, extending the underlying sparse-checkout infrastructure
> will (ideally) avoid duplicating code and make this behavior reusable across
> other commands.
> 

Alright. I'll give it a shot.

> For the other desired behavior ("limit the files to the active
> sparse-checkout patterns"), you could add an option:
> 
> 	OPT_CALLBACK_F(0, "scope", &sparse_scope, "(sparse|all)",
> 		       N_("specify the scope of results with respect to the sparse-checkout"),
> 		       PARSE_OPT_NONEG, option_parse_scope),
> 
> ...whose callback parses the string arg into a 'restrict_scope'
> boolean/enum/something. Then, wherever in 'ls-files' a tree or the index are
> iterated over, you can gate the per-file operation on:
> 
> 	if (!restrict_scope || path_in_sparse_checkout(<path>, istate)) {
> 		/* do something with <path> */
> 	}
> 
> Note that you should use 'path_in_sparse_checkout()', *not* the
> internal/private function 'path_in_sparse_checkout_1()'; you also don't need
> to explicitly 'init_sparse_checkout_patterns()'. Regardless of whether you
> specified custom patterns or are using the ones in
> '.git/info/sparse-checkout', 'path_in_sparse_checkout()' will initialize &
> use the appropriate patterns.
> 

Yeah. And I wouldn't need the refactor of 'path_in_sparse_checkout_1()'.

Thanks a lot (again) for the pointers.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] ls-tree: add --sparse-filter-oid argument
  2023-01-26  3:25       ` Elijah Newren
@ 2023-01-27 11:58         ` William Sprent
  2023-01-28 16:45           ` Elijah Newren
  0 siblings, 1 reply; 19+ messages in thread
From: William Sprent @ 2023-01-27 11:58 UTC (permalink / raw)
  To: Elijah Newren
  Cc: William Sprent via GitGitGadget, git, Eric Sunshine,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Victoria Dye

On 26/01/2023 04.25, Elijah Newren wrote:
> On Wed, Jan 25, 2023 at 8:16 AM William Sprent <williams@unity3d.com> wrote:
>>
>> On 25/01/2023 06.11, Elijah Newren wrote:
>>> It looks like Ævar and Victoria have both given really good reviews
>>> already, but I think I spotted some additional things to comment on.
>>>
>>> On Mon, Jan 23, 2023 at 3:46 AM William Sprent via GitGitGadget
>>> <gitgitgadget@gmail.com> wrote:
>>>>
>>>> From: William Sprent <williams@unity3d.com>
>>>>
>>>> There is currently no way to ask git the question "which files would be
>>>> part of a sparse checkout of commit X with sparse checkout patterns Y".
>>>> One use-case would be that tooling may want know whether sparse checkouts
>>>> of two commits contain the same content even if the full trees differ.
>>>
>>> Could you say more about this usecase?  Why does tooling need or want
>>> to know this; won't a checkout of the new commit end up being quick
>>> and simple?  (I'm not saying your usecase is bad, just curious that it
>>> had never occurred to me, and I'm afraid I'm still not sure what your
>>> purpose might be.)
>>>
>>
>> I'm thinking mainly about a monorepo context where there are a number of
>> distinct 'units' that can be described with sparse checkout patterns.
>> And perhaps there's some tooling that only wants to perform an action if
>> the content of a 'unit' changes.
> 
> So, you're basically wanting to do something like
>     git ls-tree --paths-matching-sparsity-file=<pattern-file> $COMMIT1
>> sparse-files
>     git ls-tree --paths-matching-sparsity-file=<pattern-file> $COMMIT2
>>> sparse-files
>     sort sparse-files | uniq >relevant-files
>     git diff --name-only $COMMIT1 $COMMIT2 >changed-files
> and then checking whether relevant-files and changed-files have a
> non-empty intersection?

Well, the concrete use-case I'm exploring is something along the lines
of using the content hashes of sparse checkouts as cache keys for resource
heavy jobs (builds/tests/etc).

So, that would be something along the lines of,

     git ls-tree -r --paths-matching-sparsity-file=<pattern-file> \
     | sha1sum > cache-key

and then performing a lookup before performing an action (which would
then only be done in the context of the sparse checkout). My thinking
is that this only would require git and no additional tooling, which in
turn makes it very easy to reproduce the state where the job took place.


> Would that potentially be better handled by
>     git diff --name-only $COMMIT1 $COMMIT2 | git check-ignore
> --ignore-file=<pattern-file> --stdin
> and seeing whether the output is non-empty?  We'd have to add an
> "--ignore-file" option to check-ignore to override reading of
> .gitignore files and such, and it'd be slightly confusing because the
> documentation talks about "ignored" files rather than "selected"
> files, but that's a confusion point that has been with us ever since
> the gitignore mechanism was repurposed for sparse checkouts.  Or maybe
> we could just add a check-sparsity helper, and then allow it to take
> directories in-lieu of patterns. 

I don't think it necessarily would be better handled by that. But it would
be workable. It would be a matter of collating the output of

   git ls-tree -r <commit>

with

   git ls-tree --name-only -r <commit> | git check-ignore ...

Which is less ergonomic. But it is also a less intrusive change.

Really, the main thing is to expose the sparse filtering logic somehow, and
allow for building tooling on top of it.

> This seems nicer than opening a can of worms about letting every git
> command specify a different set of sparsity rules.

I think you are the better judge of how much of a can of worms that would
be. I don't think it would be too out of line with how git acts in general
though, as we have things like the the 'GIT_INDEX_FILE' env-var.

>> Depending on the repo, it won't necessarily be quick to check out the
>> commit with the given patterns. However, it is more about it being
>> inconvenient to have to have a working directory, especially so if you
>> want use the tooling in some kind of service or query rapidly about
>> different revisions/patterns.
>>
>>>> Another interesting use-case would be for tooling to use in conjunction
>>>> with 'git update-index --index-info'.
>>>
>>> Sorry, I'm not following.  Could you expound on this a bit?
>>>
>>
>> I was imagining something along the lines of being able to generate new
>> tree objects based on what matches the given sparse checkout patterns.
>> Not that I have a specific use case for it right now.
>>
>> I think what I'm trying to evoke with that paragraph is that this
>> enables integrations with git that seem interesting and weren't possible
>> before.
> 
> I'm not sure if it's interesting, frightening, or something else.
> Hard to say without better descriptions of usecases, which we can't
> have if we don't even have a usecase.  I think I'd just strike this
> paragraph.
> 
> [...]

Fair. Will do.

>>>> +       (*d)->pl.use_cone_patterns = core_sparse_checkout_cone;
>>>
>>> Hmm, so the behavior still depends upon the current sparse-checkout
>>> (or lack thereof), despite the documentation and rationale of your
>>> feature as being there to check how a different sparse checkout would
>>> behave?
>>>
>>> I would hate to unconditionally turn cone_patterns off, since that
>>> would come with a huge performance penalty for the biggest repos.  But
>>> turning it unconditionally on wouldn't be good for the non-cone users.
>>> This probably suggests we need something like another flag, or perhaps
>>> separate flags for each mode.  Separate flags might provide the
>>> benefit of allowing cone mode users to specify directories rather than
>>> patterns, which would make it much easier for them to use.
>>>
>> I used 'core_sparse_checkout_cone' because I wanted to allow for the
>> cone mode optimisations, but I also figured that I should respect the
>> configuration. It doesn't change how the patterns are parsed in this case.
>>
>> I agree that it is a bit awkward to have to "translate" the directories
>> into patterns when wanting to use cone mode. I can try adding
>> '--[no]-cone' flags and see how that feels. Together with Victoria's
>> suggestions that would result in having the following flags:
>>
>> * --scope=(sparse|all)
>> * --sparse-patterns-file=<path>
>> * --[no]-cone: used together with --sparse-patterns-file to tell git
>>     whether to interpret the patterns given as directories (cone) or
>>     patterns (no-cone).
>>
>> Which seems like a lot at first glance. But it allows for passing
>> directories instead of patterns for cone mode, and is similar to the
>> behaviour of 'sparse-checkout set'.
>>
>> Does that seem like something that would make sense?
> 
> --sparse-patterns-file still implies patterns; I think that would need
> some rewording.

Yeah. After sleeping on it, I also think that it becomes a difficult
interface to work with, and you'll get different results with the same
patterns whether you pass --cone or --no-cone, which seems error prone
to me.

For better or for worse, both cone and non-cone uses of sparse-checkouts
end up producing pattern files. And those pattern files do unambiguously
describe a filtering of the worktree whether it is in cone-mode or not.

Given that 'ls-tree' is more of a plumbing command, I think it might still
make sense to use the patterns. That would also make the interaction
a bit more logical to me -- e.g. if you want to override the patterns
you have to pass them in the same format as the ones that would be read
by default.

Then maybe it could eventually make sense to expose the translation of
cone-mode patterns as well, e.g.

    git sparse-checkout set --cone --std-out dir1 dir2 dir3

or similar.

> More importantly, though, based on your usecase description, I wonder
> if you might be better served by either extending the check-ignore
> subcommand or adding a similar helper ("check-sparsity"?), rather than
> tweaking ls-tree.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2] ls-tree: add --sparse-filter-oid argument
  2023-01-27 11:58         ` William Sprent
@ 2023-01-28 16:45           ` Elijah Newren
  0 siblings, 0 replies; 19+ messages in thread
From: Elijah Newren @ 2023-01-28 16:45 UTC (permalink / raw)
  To: William Sprent
  Cc: William Sprent via GitGitGadget, git, Eric Sunshine,
	Ævar Arnfjörð Bjarmason, Derrick Stolee,
	Victoria Dye

On Fri, Jan 27, 2023 at 3:59 AM William Sprent <williams@unity3d.com> wrote:
>
> On 26/01/2023 04.25, Elijah Newren wrote:
> > On Wed, Jan 25, 2023 at 8:16 AM William Sprent <williams@unity3d.com> wrote:
> >>
> >> On 25/01/2023 06.11, Elijah Newren wrote:
> >>> It looks like Ævar and Victoria have both given really good reviews
> >>> already, but I think I spotted some additional things to comment on.
> >>>
> >>> On Mon, Jan 23, 2023 at 3:46 AM William Sprent via GitGitGadget
> >>> <gitgitgadget@gmail.com> wrote:
> >>>>
> >>>> From: William Sprent <williams@unity3d.com>
> >>>>
> >>>> There is currently no way to ask git the question "which files would be
> >>>> part of a sparse checkout of commit X with sparse checkout patterns Y".
> >>>> One use-case would be that tooling may want know whether sparse checkouts
> >>>> of two commits contain the same content even if the full trees differ.
> >>>
> >>> Could you say more about this usecase?  Why does tooling need or want
> >>> to know this; won't a checkout of the new commit end up being quick
> >>> and simple?  (I'm not saying your usecase is bad, just curious that it
> >>> had never occurred to me, and I'm afraid I'm still not sure what your
> >>> purpose might be.)
> >>>
> >>
> >> I'm thinking mainly about a monorepo context where there are a number of
> >> distinct 'units' that can be described with sparse checkout patterns.
> >> And perhaps there's some tooling that only wants to perform an action if
> >> the content of a 'unit' changes.
> >
> > So, you're basically wanting to do something like
> >     git ls-tree --paths-matching-sparsity-file=<pattern-file> $COMMIT1
> >> sparse-files
> >     git ls-tree --paths-matching-sparsity-file=<pattern-file> $COMMIT2
> >>> sparse-files
> >     sort sparse-files | uniq >relevant-files
> >     git diff --name-only $COMMIT1 $COMMIT2 >changed-files
> > and then checking whether relevant-files and changed-files have a
> > non-empty intersection?
>
> Well, the concrete use-case I'm exploring is something along the lines
> of using the content hashes of sparse checkouts as cache keys for resource
> heavy jobs (builds/tests/etc).
>
> So, that would be something along the lines of,
>
>      git ls-tree -r --paths-matching-sparsity-file=<pattern-file> \
>      | sha1sum > cache-key
>
> and then performing a lookup before performing an action (which would
> then only be done in the context of the sparse checkout). My thinking
> is that this only would require git and no additional tooling, which in
> turn makes it very easy to reproduce the state where the job took place.
>
>
> > Would that potentially be better handled by
> >     git diff --name-only $COMMIT1 $COMMIT2 | git check-ignore
> > --ignore-file=<pattern-file> --stdin
> > and seeing whether the output is non-empty?  We'd have to add an
> > "--ignore-file" option to check-ignore to override reading of
> > .gitignore files and such, and it'd be slightly confusing because the
> > documentation talks about "ignored" files rather than "selected"
> > files, but that's a confusion point that has been with us ever since
> > the gitignore mechanism was repurposed for sparse checkouts.  Or maybe
> > we could just add a check-sparsity helper, and then allow it to take
> > directories in-lieu of patterns.
>
> I don't think it necessarily would be better handled by that. But it would
> be workable. It would be a matter of collating the output of
>
>    git ls-tree -r <commit>
>
> with
>
>    git ls-tree --name-only -r <commit> | git check-ignore ...
>
> Which is less ergonomic. But it is also a less intrusive change.
>
> Really, the main thing is to expose the sparse filtering logic somehow, and
> allow for building tooling on top of it.
> > This seems nicer than opening a can of worms about letting every git
> > command specify a different set of sparsity rules.
>
> I think you are the better judge of how much of a can of worms that would
> be. I don't think it would be too out of line with how git acts in general
> though, as we have things like the the 'GIT_INDEX_FILE' env-var.

I agree you've got a reasonable usecase here.  The integration you've
described sounds like something that could be independently composable
with several other commands, and you alluded to that in your commit
message.  But is adding it to ls-tree the best spot?  If we add it
there, then folks who want to similarly integrate such capabilities
with other commands (archive, diff, grep, rev-list --objects, etc.),
seem more likely to do so via direct integration.  We already have a
very large can of worms to work on to make commands behave in ways
that are limited to sparse paths (see
Documentation/techncial/sparse-checkout.txt, namely the "behavior A"
stuff).  As can be seen in that document, what to do for limiting
commands to sparse paths is obvious with some commands but has lots of
interesting edge cases for others (even with years of experience with
sparse checkouts, we had 3- and 4- way differing opinions on the
intended behavior for some commands when we started composing that
document a few months ago).  If we had jumped straight to
implementation for some commands, we would have likely painted
ourselves into a corner for other commands.  Adding another layer of
specifying an alternate set of sparsity rules will likely have
interesting knock-on effects that we should think through for all the
commands to ensure we aren't painting ourselves into a similar corner,
if we go down this route.

However, in the cases that sparse-checkout.txt document was
addressing, the behavior fundamentally needs to be integrated into all
the relevant commands to get user experience right.  In your case, you
merely need a separate tool to be able to compose the output of
different commands together.  So, exposing whether sparsity rules
would select various paths in a single separate command (maybe git
check-ignore, or a new sparse-checkout subcommand, or maybe just a new
subcommand similar to check-ignore) would avoid a lot of these issues,
and give us a single place to extend/improve as we learn about further
usecases.

[...]
> >> I agree that it is a bit awkward to have to "translate" the directories
> >> into patterns when wanting to use cone mode. I can try adding
> >> '--[no]-cone' flags and see how that feels. Together with Victoria's
> >> suggestions that would result in having the following flags:
> >>
> >> * --scope=(sparse|all)
> >> * --sparse-patterns-file=<path>
> >> * --[no]-cone: used together with --sparse-patterns-file to tell git
> >>     whether to interpret the patterns given as directories (cone) or
> >>     patterns (no-cone).
> >>
> >> Which seems like a lot at first glance. But it allows for passing
> >> directories instead of patterns for cone mode, and is similar to the
> >> behaviour of 'sparse-checkout set'.
> >>
> >> Does that seem like something that would make sense?
> >
> > --sparse-patterns-file still implies patterns; I think that would need
> > some rewording.
>
> Yeah. After sleeping on it, I also think that it becomes a difficult
> interface to work with, and you'll get different results with the same
> patterns whether you pass --cone or --no-cone, which seems error prone
> to me.
>
> For better or for worse, both cone and non-cone uses of sparse-checkouts
> end up producing pattern files. And those pattern files do unambiguously
> describe a filtering of the worktree whether it is in cone-mode or not.

Back when cone mode was introduced, I objected to reusing the existing
pattern format and/or files...but was assuaged that it was just an
implementation detail that wouldn't be exposed to users (since people
would interact with the 'sparse-checkout' command instead of direct
editing of $GIT_DIR/info/sparse-checkout).  It's still a performance
penalty, because we waste time both turning directory names into
patterns when we write them out, and when we read them in by parsing
the patterns to see if they match cone mode rules and if so discard
the patterns and just build up our hashes.  The patterns are nothing
more than an intermediary format we waste time translating to and
from, though once upon a time they existed so that if someone suddenly
started using an older (now ancient) version of git on their current
checkout then it could still hobble along with degraded performance
instead of providing an error.  (We have introduced other changes to
the config and git index which would cause older git clients to just
fail to parse and throw an error, and even have defined mechanisms for
such erroring out.  We could have taken advantage of that for this
feature too.)

Anyway, long story short, if you're going to start exposing users to
this implementation detail that was meant to be hidden (and do it in a
way that may be copied into several commands to boot), then I
definitely object.

> Given that 'ls-tree' is more of a plumbing command, I think it might still
> make sense to use the patterns. That would also make the interaction
> a bit more logical to me -- e.g. if you want to override the patterns
> you have to pass them in the same format as the ones that would be read
> by default.

No, sparsity specification should be provided by users the same way
they normally specify it (i.e. the way they pass it to
`sparse-checkout {add,set}`), not the way it's stored via some hidden
implementation detail.

I'd make an exception if you ended up using `git check-ignore` because
that command was specifically written about gitignore-style rules, and
git-ignore-style rules just happen to have been reused as-is for
non-cone-mode sparse-checkout rules.  But if you go that route:
   (1) you have to frame any extensions to check-ignore as things that
are useful for gitignore checking, and only incidentally useful to
sparse-checkouts
   (2) you'd have to forgo any cone-mode optimizations
Going the check-ignore route seems like the easiest path to providing
the basic functionality you need right now.  If your usecases remain
niche, that might be good enough for all time too.  If your usecases
become popular, though, I expect someone would eventually tire of
using `check-ignore` and implement similar functionality along with
supporting cone-mode in a `git check-sparsity` or `git sparse-checkout
debug-rules` command or something like that.

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2023-01-28 16:45 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-11 17:01 [PATCH] ls-tree: add --sparse-filter-oid argument William Sprent via GitGitGadget
2023-01-13 14:17 ` Eric Sunshine
2023-01-13 20:01   ` Junio C Hamano
2023-01-16 15:13     ` William Sprent
2023-01-16 12:14   ` William Sprent
2023-01-23 11:46 ` [PATCH v2] " William Sprent via GitGitGadget
2023-01-23 13:00   ` Ævar Arnfjörð Bjarmason
2023-01-24 15:30     ` William Sprent
2023-01-23 13:06   ` Ævar Arnfjörð Bjarmason
2023-01-24 15:30     ` William Sprent
2023-01-24 20:11   ` Victoria Dye
2023-01-25 13:47     ` William Sprent
2023-01-25 18:32       ` Victoria Dye
2023-01-26 14:55         ` William Sprent
2023-01-25  5:11   ` Elijah Newren
2023-01-25 16:16     ` William Sprent
2023-01-26  3:25       ` Elijah Newren
2023-01-27 11:58         ` William Sprent
2023-01-28 16:45           ` Elijah Newren

Code repositories for project(s) associated with this public inbox

	https://80x24.org/mirrors/git.git

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).